LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH V5 0/3] arm64/mm: Enable memory hot remove
@ 2019-05-29  9:16 Anshuman Khandual
  2019-05-29  9:16 ` [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Anshuman Khandual @ 2019-05-29  9:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

This series enables memory hot remove on arm64 after fixing a memblock
removal ordering problem in generic __remove_memory() and one possible
arm64 platform specific kernel page table race condition. This series
is based on latest v5.2-rc2 tag.

Testing:

Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config
options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
combinations. Its only build tested on non-arm64 platforms.

Changes in V5:

- Have some agreement [1] over using memory_hotplug_lock for arm64 ptdump
- Change 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") already merged
- Dropped the above patch from this series
- Fixed indentation problem in arch_[add|remove]_memory() as per David
- Collected all new Acked-by tags
 
Changes in V4: (https://lkml.org/lkml/2019/5/20/19)

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore

- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

[1] https://lkml.org/lkml/2019/5/28/584

Anshuman Khandual (3):
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig             |   3 +
 arch/arm64/mm/mmu.c            | 211 ++++++++++++++++++++++++++++++++++++++++-
 arch/arm64/mm/ptdump_debugfs.c |   3 +
 mm/memory_hotplug.c            |   2 +-
 4 files changed, 216 insertions(+), 3 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-05-29  9:16 [PATCH V5 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-05-29  9:16 ` Anshuman Khandual
  2019-05-30 10:37   ` Mark Rutland
  2019-05-29  9:16 ` [PATCH V5 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-05-29  9:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
entries between memory block and node. It first checks pfn validity with
pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
(arm64 has this enabled) pfn_valid_within() calls pfn_valid().

pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
which scans all mapped memblock regions with memblock_is_map_memory(). This
creates a problem in memory hot remove path which has already removed given
memory range from memory block with memblock_[remove|free] before arriving
at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
of existing sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
[   62.052517] ------------[ cut here ]------------
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : ffff0000168b3ce0
[   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
[   62.061501] x27: 0000000000000000 x26: 0000000000000000
[   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
[   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
[   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
[   62.065558] x19: 0000000000680000 x18: 0000000000000024
[   62.066566] x17: 0000000000000000 x16: 0000000000000000
[   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
[   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
[   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
[   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
[   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
[   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
[   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
[   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
[   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc

Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
problem on arm64 as pfn_valid() behaves correctly and returns positive
as memblock for the address range still exists. arch_remove_memory()
removes applicable memory sections from zone with __remove_pages() and
tears down kernel linear mapping. Removing memblock regions afterwards
is safe because there is no other memblock (bootmem) allocator user that
late. So nobody is going to allocate from the removed range just to blow
up later. Also nobody should be using the bootmem allocated range else
we wouldn't allow to remove it. So reordering is indeed safe.

Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e096c98..67dfdb8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1851,10 +1851,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
+	arch_remove_memory(nid, start, size, NULL);
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(nid, start, size, NULL);
 	__release_memory_resource(start, size);
 
 	try_offline_node(nid);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH V5 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-29  9:16 [PATCH V5 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-29  9:16 ` [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
@ 2019-05-29  9:16 ` Anshuman Khandual
  2019-05-30 10:42   ` Mark Rutland
  2019-05-29  9:16 ` [PATCH V5 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-29 22:06 ` [PATCH V5 0/3] " Andrew Morton
  3 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-05-29  9:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When a leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may by freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/ptdump_debugfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..80171d1 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
+
+	get_online_mems();
 	ptdump_walk_pgd(m, info);
+	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH V5 3/3] arm64/mm: Enable memory hot remove
  2019-05-29  9:16 [PATCH V5 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-29  9:16 ` [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
  2019-05-29  9:16 ` [PATCH V5 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-05-29  9:16 ` Anshuman Khandual
  2019-05-30 15:12   ` Mark Rutland
  2019-05-29 22:06 ` [PATCH V5 0/3] " Andrew Morton
  3 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-05-29  9:16 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear
down either region, and calls it from vmemmap_free() and
___remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

While freeing intermediate level page table pages bail out if any of it's
entries are still valid. This can happen for partially filled kernel page
table either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region. Take
the kernel ptl so that we can safely free potentially-shared tables.

While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired from kernel page table tear down
procedure on X86 architecture.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/mm/mmu.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 697ea05..7f917fe 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -268,6 +268,9 @@ config HAVE_GENERIC_GUP
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a1bfc44..4803624 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -733,6 +733,187 @@ int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, ssize_t size)
+{
+	WARN_ON(PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(ptep[i]))
+			return;
+	}
+
+	page = pmd_page(READ_ONCE(*pmdp));
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
+{
+	struct page *page;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 2)
+		return;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(pmdp[i]))
+			return;
+	}
+
+	page = pud_page(READ_ONCE(*pudp));
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr)
+{
+	struct page *page;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 3)
+		return;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(pudp[i]))
+			return;
+	}
+
+	page = pgd_page(READ_ONCE(*pgdp));
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void
+remove_pte_table(pmd_t *pmdp, unsigned long addr,
+			unsigned long end, bool sparse_vmap)
+{
+	struct page *page;
+	pte_t *ptep, pte;
+	unsigned long start = addr;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+
+		if (pte_none(pte))
+			continue;
+
+		WARN_ON(!pte_present(pte));
+		if (sparse_vmap) {
+			page = pte_page(pte);
+			free_hotplug_page_range(page, PAGE_SIZE);
+		}
+		pte_clear(&init_mm, addr, ptep);
+	}
+	flush_tlb_kernel_range(start, end);
+}
+
+static void
+remove_pmd_table(pud_t *pudp, unsigned long addr,
+			unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pte_t *ptep_base;
+	pmd_t *pmdp, pmd;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd));
+		if (pmd_sect(pmd)) {
+			if (sparse_vmap) {
+				page = pmd_page(pmd);
+				free_hotplug_page_range(page, PMD_SIZE);
+			}
+			pmd_clear(pmdp);
+			continue;
+		}
+		ptep_base = pte_offset_kernel(pmdp, 0UL);
+		remove_pte_table(pmdp, addr, next, sparse_vmap);
+		free_pte_table(ptep_base, pmdp, addr);
+	}
+}
+
+static void
+remove_pud_table(pgd_t *pgdp, unsigned long addr,
+			unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pmd_t *pmdp_base;
+	pud_t *pudp, pud;
+
+	for (; addr < end; addr = next) {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud));
+		if (pud_sect(pud)) {
+			if (sparse_vmap) {
+				page = pud_page(pud);
+				free_hotplug_page_range(page, PUD_SIZE);
+			}
+			pud_clear(pudp);
+			continue;
+		}
+		pmdp_base = pmd_offset(pudp, 0UL);
+		remove_pmd_table(pudp, addr, next, sparse_vmap);
+		free_pmd_table(pmdp_base, pudp, addr);
+	}
+}
+
+static void
+remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
+{
+	unsigned long addr, next;
+	pud_t *pudp_base;
+	pgd_t *pgdp, pgd;
+
+	spin_lock(&init_mm.page_table_lock);
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		pudp_base = pud_offset(pgdp, 0UL);
+		remove_pud_table(pgdp, addr, next, sparse_vmap);
+		free_pud_table(pudp_base, pgdp, addr);
+	}
+	spin_unlock(&init_mm.page_table_lock);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -780,6 +961,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	remove_pagetable(start, end, true);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1070,10 +1254,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	WARN_ON(pgdir != init_mm.pgd);
+	remove_pagetable(start, start + size, false);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1081,7 +1271,24 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   restrictions);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+				     __phys_to_virt(start), size);
+	return ret;
+}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+void arch_remove_memory(int nid, u64 start, u64 size,
+				struct vmem_altmap *altmap)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone = page_zone(pfn_to_page(start_pfn));
+
+	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 #endif
+#endif
-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 0/3] arm64/mm: Enable memory hot remove
  2019-05-29  9:16 [PATCH V5 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
                   ` (2 preceding siblings ...)
  2019-05-29  9:16 ` [PATCH V5 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-05-29 22:06 ` Andrew Morton
  2019-05-30  4:23   ` Anshuman Khandual
  3 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2019-05-29 22:06 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, mark.rutland, mhocko, ira.weiny, david, cai, logang,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, David Hildenbrand

On Wed, 29 May 2019 14:46:24 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:

> This series enables memory hot remove on arm64 after fixing a memblock
> removal ordering problem in generic __remove_memory() and one possible
> arm64 platform specific kernel page table race condition. This series
> is based on latest v5.2-rc2 tag.

Unfortunately this series clashes syntactically and semantically with
David Hildenbrand's series "mm/memory_hotplug: Factor out memory block
devicehandling".  Could you and David please figure out what we should
do here?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 0/3] arm64/mm: Enable memory hot remove
  2019-05-29 22:06 ` [PATCH V5 0/3] " Andrew Morton
@ 2019-05-30  4:23   ` Anshuman Khandual
  2019-05-30  7:23     ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-05-30  4:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, mark.rutland, mhocko, ira.weiny, david, cai, logang,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel



On 05/30/2019 03:36 AM, Andrew Morton wrote:
> On Wed, 29 May 2019 14:46:24 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
>> This series enables memory hot remove on arm64 after fixing a memblock
>> removal ordering problem in generic __remove_memory() and one possible
>> arm64 platform specific kernel page table race condition. This series
>> is based on latest v5.2-rc2 tag.
> 
> Unfortunately this series clashes syntactically and semantically with
> David Hildenbrand's series "mm/memory_hotplug: Factor out memory block
> devicehandling".  Could you and David please figure out what we should
> do here?
> 

Hello Andrew,

I was able to apply the above mentioned V3 series [1] from David with some changes
listed below which tests positively on arm64. These changes assume that the arm64
hot-remove series (current V5) gets applied first.

Changes to David's series

A) Please drop (https://patchwork.kernel.org/patch/10962565/) [v3,04/11]

	- arch_remove_memory() is already being added through hot-remove series

B) Rebase (https://patchwork.kernel.org/patch/10962575/) [v3, 06/11]

	- arm64 hot-remove series adds CONFIG_MEMORY_HOTREMOVE wrapper around
	  arch_remove_memory() which can be dropped in the rebased patch

C) Rebase (https://patchwork.kernel.org/patch/10962589/) [v3, 09/11]

	- hot-remove series moves arch_remove_memory() before memblock_[free|remove]()
	- So remove_memory_block_devices() should be moved before arch_remove_memory()
	  in it's new position

David,

Please do let me know if the plan sounds good or you have some other suggestions.

- Anshuman

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=123133 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 0/3] arm64/mm: Enable memory hot remove
  2019-05-30  4:23   ` Anshuman Khandual
@ 2019-05-30  7:23     ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2019-05-30  7:23 UTC (permalink / raw)
  To: Anshuman Khandual, Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, mark.rutland, mhocko, ira.weiny, cai, logang,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

On 30.05.19 06:23, Anshuman Khandual wrote:
> 
> 
> On 05/30/2019 03:36 AM, Andrew Morton wrote:
>> On Wed, 29 May 2019 14:46:24 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>
>>> This series enables memory hot remove on arm64 after fixing a memblock
>>> removal ordering problem in generic __remove_memory() and one possible
>>> arm64 platform specific kernel page table race condition. This series
>>> is based on latest v5.2-rc2 tag.
>>
>> Unfortunately this series clashes syntactically and semantically with
>> David Hildenbrand's series "mm/memory_hotplug: Factor out memory block
>> devicehandling".  Could you and David please figure out what we should
>> do here?
>>
> 
> Hello Andrew,
> 
> I was able to apply the above mentioned V3 series [1] from David with some changes
> listed below which tests positively on arm64. These changes assume that the arm64
> hot-remove series (current V5) gets applied first.
> 
> Changes to David's series
> 
> A) Please drop (https://patchwork.kernel.org/patch/10962565/) [v3,04/11]
> 
> 	- arch_remove_memory() is already being added through hot-remove series
> 
> B) Rebase (https://patchwork.kernel.org/patch/10962575/) [v3, 06/11]
> 
> 	- arm64 hot-remove series adds CONFIG_MEMORY_HOTREMOVE wrapper around
> 	  arch_remove_memory() which can be dropped in the rebased patch
> 
> C) Rebase (https://patchwork.kernel.org/patch/10962589/) [v3, 09/11]
> 
> 	- hot-remove series moves arch_remove_memory() before memblock_[free|remove]()
> 	- So remove_memory_block_devices() should be moved before arch_remove_memory()
> 	  in it's new position
> 
> David,
> 
> Please do let me know if the plan sounds good or you have some other suggestions.

That's exactly what I had in mind :)

Andrew, you can drop my series and pick up Anshumans series first. I can
then rebase and resend.
Cheers!

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-05-29  9:16 ` [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
@ 2019-05-30 10:37   ` Mark Rutland
  2019-06-07  2:28     ` Anshuman Khandual
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Rutland @ 2019-05-30 10:37 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

On Wed, May 29, 2019 at 02:46:25PM +0530, Anshuman Khandual wrote:
> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.
> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------
> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc
> 
> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is safe because there is no other memblock (bootmem) allocator user that
> late. So nobody is going to allocate from the removed range just to blow
> up later. Also nobody should be using the bootmem allocated range else
> we wouldn't allow to remove it. So reordering is indeed safe.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> ---
>  mm/memory_hotplug.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e096c98..67dfdb8 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1851,10 +1851,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> +	arch_remove_memory(nid, start, size, NULL);
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(nid, start, size, NULL);
>  	__release_memory_resource(start, size);
>  
>  	try_offline_node(nid);
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-29  9:16 ` [PATCH V5 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-05-30 10:42   ` Mark Rutland
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Rutland @ 2019-05-30 10:42 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

On Wed, May 29, 2019 at 02:46:26PM +0530, Anshuman Khandual wrote:
> The arm64 page table dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addresses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove,
> which will be enabled by a subsequent patch. To avoid racing with
> this, take the memory hotplug lock when walking the kernel page table.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/mm/ptdump_debugfs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 064163f..80171d1 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -7,7 +7,10 @@
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
>  	struct ptdump_info *info = m->private;
> +
> +	get_online_mems();
>  	ptdump_walk_pgd(m, info);
> +	put_online_mems();

We need to explicitly include <linux/memory_hotplug.h> to get the
prototypes of {get,put}_online_mems().

With that fixed up:

Acked-by: Mark Rutland <mark.rutland@arm.com>

I understand from [1] that Michal is ok with using these here, and we're
open to reworking this if/when the hotplug locking is changed.

Mark.

[1] https://lkml.kernel.org/r/20190527072001.GB1658@dhcp22.suse.cz

>  	return 0;
>  }
>  DEFINE_SHOW_ATTRIBUTE(ptdump);
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 3/3] arm64/mm: Enable memory hot remove
  2019-05-29  9:16 ` [PATCH V5 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-05-30 15:12   ` Mark Rutland
  2019-06-11 14:43     ` Anshuman Khandual
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Rutland @ 2019-05-30 15:12 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

Hi Anshuman,

From reviwing the below, I can see some major issues that need to be
addressed, which I've commented on below.

Andrew, please do not pick up this patch.

On Wed, May 29, 2019 at 02:46:27PM +0530, Anshuman Khandual wrote:
> The arch code for hot-remove must tear down portions of the linear map and
> vmemmap corresponding to memory being removed. In both cases the page
> tables mapping these regions must be freed, and when sparse vmemmap is in
> use the memory backing the vmemmap must also be freed.
> 
> This patch adds a new remove_pagetable() helper which can be used to tear
> down either region, and calls it from vmemmap_free() and
> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
> backing memory will be freed.
> 
> While freeing intermediate level page table pages bail out if any of it's

Nit: s/it's/its/

> entries are still valid. This can happen for partially filled kernel page
> table either from a previously attempted failed memory hot add or while
> removing an address range which does not span the entire page table page
> range.
> 
> The vmemmap region may share levels of table with the vmalloc region. Take
> the kernel ptl so that we can safely free potentially-shared tables.

AFAICT, this is not sufficient; please see below for details.

> While here update arch_add_memory() to handle __add_pages() failures by
> just unmapping recently added kernel linear mapping. Now enable memory hot
> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> 
> This implementation is overall inspired from kernel page table tear down
> procedure on X86 architecture.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Acked-by: David Hildenbrand <david@redhat.com>

Looking at this some more, I don't think this is quite right, and tI
think that structure of the free_*() and remove_*() functions makes this
unnecessarily hard to follow. We should aim for this to be obviously
correct.

The x86 code is the best template to follow here. As mentioned
previously, I'm fairly certain it's not entirely correct (e.g. due to
missing TLB maintenance), and we've already diverged a fair amount in
fixing up obvious issues, so we shouldn't aim to mirror it.

I think that the structure of unmap_region() is closer to what we want
here -- do one pass to unmap leaf entries (and freeing the associated
memory if unmapping the vmemmap), then do a second pass cleaning up any
empty tables.

In general I'm concerned that we don't strictly follow a
clear->tlbi->free sequence, and free pages before tearing down their
corresponding mapping. It doesn't feel great to leave a cacheable alias
around, even transiently. Further, as commented below, the
remove_p?d_table() functions leave stale leaf entries in the TLBs when
removing section entries.

> ---
>  arch/arm64/Kconfig  |   3 +
>  arch/arm64/mm/mmu.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 212 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 697ea05..7f917fe 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -268,6 +268,9 @@ config HAVE_GENERIC_GUP
>  config ARCH_ENABLE_MEMORY_HOTPLUG
>  	def_bool y
>  
> +config ARCH_ENABLE_MEMORY_HOTREMOVE
> +	def_bool y
> +
>  config SMP
>  	def_bool y
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index a1bfc44..4803624 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -733,6 +733,187 @@ int kern_addr_valid(unsigned long addr)
>  
>  	return pfn_valid(pte_pfn(pte));
>  }
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void free_hotplug_page_range(struct page *page, ssize_t size)

The size argument should never be negative, so size_t would be best.

> +{
> +	WARN_ON(PageReserved(page));
> +	free_pages((unsigned long)page_address(page), get_order(size));
> +}
> +
> +static void free_hotplug_pgtable_page(struct page *page)
> +{
> +	free_hotplug_page_range(page, PAGE_SIZE);
> +}
> +
> +static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		if (!pte_none(ptep[i]))
> +			return;
> +	}
> +
> +	page = pmd_page(READ_ONCE(*pmdp));
> +	pmd_clear(pmdp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +
> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	if (CONFIG_PGTABLE_LEVELS <= 2)
> +		return;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		if (!pmd_none(pmdp[i]))
> +			return;
> +	}
> +
> +	page = pud_page(READ_ONCE(*pudp));
> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +
> +static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	if (CONFIG_PGTABLE_LEVELS <= 3)
> +		return;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		if (!pud_none(pudp[i]))
> +			return;
> +	}
> +
> +	page = pgd_page(READ_ONCE(*pgdp));
> +	pgd_clear(pgdp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +
> +static void
> +remove_pte_table(pmd_t *pmdp, unsigned long addr,

Please put this on a single line.

All the existing functions in this file (and the ones you add above)
have the return type on the same line as the name, and since this
portion of the prototype doesn't encroach 80 columns there's no reason
to flow it.

> +			unsigned long end, bool sparse_vmap)
> +{
> +	struct page *page;
> +	pte_t *ptep, pte;
> +	unsigned long start = addr;
> +
> +	for (; addr < end; addr += PAGE_SIZE) {
> +		ptep = pte_offset_kernel(pmdp, addr);
> +		pte = READ_ONCE(*ptep);
> +
> +		if (pte_none(pte))
> +			continue;
> +
> +		WARN_ON(!pte_present(pte));
> +		if (sparse_vmap) {
> +			page = pte_page(pte);
> +			free_hotplug_page_range(page, PAGE_SIZE);
> +		}
> +		pte_clear(&init_mm, addr, ptep);
> +	}
> +	flush_tlb_kernel_range(start, end);
> +}

For consistency we should use a do { ... } while (..., addr != end) loop
to iterate over the page tables. All the other code in our mmu.c does
that, and diverging from that doesn't save use anything here but does
make review and maintenance harder.

> +
> +static void
> +remove_pmd_table(pud_t *pudp, unsigned long addr,

Same line please.

> +			unsigned long end, bool sparse_vmap)
> +{
> +	unsigned long next;
> +	struct page *page;
> +	pte_t *ptep_base;
> +	pmd_t *pmdp, pmd;
> +
> +	for (; addr < end; addr = next) {
> +		next = pmd_addr_end(addr, end);
> +		pmdp = pmd_offset(pudp, addr);
> +		pmd = READ_ONCE(*pmdp);
> +
> +		if (pmd_none(pmd))
> +			continue;
> +
> +		WARN_ON(!pmd_present(pmd));
> +		if (pmd_sect(pmd)) {
> +			if (sparse_vmap) {
> +				page = pmd_page(pmd);
> +				free_hotplug_page_range(page, PMD_SIZE);
> +			}
> +			pmd_clear(pmdp);

As mentioned above, this has no corresponding TLB maintenance, and I'm
concerned that we free the page before clearing the entry. If the page
gets re-allocated elsewhere, whoever received the page may not be
expecting a cacheable alias to exist other than the linear map.

> +			continue;
> +		}
> +		ptep_base = pte_offset_kernel(pmdp, 0UL);
> +		remove_pte_table(pmdp, addr, next, sparse_vmap);
> +		free_pte_table(ptep_base, pmdp, addr);
> +	}
> +}
> +
> +static void
> +remove_pud_table(pgd_t *pgdp, unsigned long addr,

Same line please

> +			unsigned long end, bool sparse_vmap)
> +{
> +	unsigned long next;
> +	struct page *page;
> +	pmd_t *pmdp_base;
> +	pud_t *pudp, pud;
> +
> +	for (; addr < end; addr = next) {
> +		next = pud_addr_end(addr, end);
> +		pudp = pud_offset(pgdp, addr);
> +		pud = READ_ONCE(*pudp);
> +
> +		if (pud_none(pud))
> +			continue;
> +
> +		WARN_ON(!pud_present(pud));
> +		if (pud_sect(pud)) {
> +			if (sparse_vmap) {
> +				page = pud_page(pud);
> +				free_hotplug_page_range(page, PUD_SIZE);
> +			}
> +			pud_clear(pudp);

Same issue as in remove_pmd_table().

> +			continue;
> +		}
> +		pmdp_base = pmd_offset(pudp, 0UL);
> +		remove_pmd_table(pudp, addr, next, sparse_vmap);
> +		free_pmd_table(pmdp_base, pudp, addr);
> +	}
> +}
> +
> +static void
> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)

Same line please (with the sparse_vmap argument flowed on to the next
line as that will encroach 80 characters).

> +{
> +	unsigned long addr, next;
> +	pud_t *pudp_base;
> +	pgd_t *pgdp, pgd;
> +
> +	spin_lock(&init_mm.page_table_lock);

Please add a comment above this to explain why we need to take the
init_mm ptl. Per the cover letter, this should be something like:

	/*
	 * We may share tables with the vmalloc region, so we must take
	 * the init_mm ptl so that we can safely free any
	 * potentially-shared tables that we have emptied.
	 */

The vmalloc code doesn't hold the init_mm ptl when walking a table; it
only takes the init_mm ptl when populating a none entry in
__p??_alloc(), to avoid a race where two threads need to populate the
entry.

So AFAICT, taking the init_mm ptl here is not sufficient to make this
safe.

Thanks,
Mark.

> +	for (addr = start; addr < end; addr = next) {
> +		next = pgd_addr_end(addr, end);
> +		pgdp = pgd_offset_k(addr);
> +		pgd = READ_ONCE(*pgdp);
> +
> +		if (pgd_none(pgd))
> +			continue;
> +
> +		WARN_ON(!pgd_present(pgd));
> +		pudp_base = pud_offset(pgdp, 0UL);
> +		remove_pud_table(pgdp, addr, next, sparse_vmap);
> +		free_pud_table(pudp_base, pgdp, addr);
> +	}
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +#endif
> +
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  #if !ARM64_SWAPPER_USES_SECTION_MAPS
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> @@ -780,6 +961,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap)
>  {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	remove_pagetable(start, end, true);
> +#endif
>  }
>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
>  
> @@ -1070,10 +1254,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
> +{
> +	WARN_ON(pgdir != init_mm.pgd);
> +	remove_pagetable(start, start + size, false);
> +}
> +
>  int arch_add_memory(int nid, u64 start, u64 size,
>  			struct mhp_restrictions *restrictions)
>  {
> -	int flags = 0;
> +	int ret, flags = 0;
>  
>  	if (rodata_full || debug_pagealloc_enabled())
>  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> @@ -1081,7 +1271,24 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>  			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
>  
> -	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
> +	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
>  			   restrictions);
> +	if (ret)
> +		__remove_pgd_mapping(swapper_pg_dir,
> +				     __phys_to_virt(start), size);
> +	return ret;
> +}
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +void arch_remove_memory(int nid, u64 start, u64 size,
> +				struct vmem_altmap *altmap)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone = page_zone(pfn_to_page(start_pfn));
> +
> +	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
>  }
>  #endif
> +#endif
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-05-30 10:37   ` Mark Rutland
@ 2019-06-07  2:28     ` Anshuman Khandual
  2019-06-07  7:43       ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-06-07  2:28 UTC (permalink / raw)
  To: Mark Rutland, Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel



On 05/30/2019 04:07 PM, Mark Rutland wrote:
> On Wed, May 29, 2019 at 02:46:25PM +0530, Anshuman Khandual wrote:
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>> of existing sysfs entries.
>>
>> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
>> [   62.052517] ------------[ cut here ]------------
>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [   62.054589] Modules linked in:
>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>> [   62.059842] sp : ffff0000168b3ce0
>> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
>> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
>> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
>> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
>> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
>> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
>> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
>> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
>> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
>> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
>> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
>> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
>> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
>> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
>> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
>> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
>> [   62.076930] Call trace:
>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>> [   62.078227]  __add_memory+0x70/0xa8
>> [   62.078901]  probe_store+0xa4/0xc8
>> [   62.079561]  dev_attr_store+0x18/0x28
>> [   62.080270]  sysfs_kf_write+0x40/0x58
>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>> [   62.081744]  __vfs_write+0x18/0x40
>> [   62.082400]  vfs_write+0xa4/0x1b0
>> [   62.083037]  ksys_write+0x5c/0xc0
>> [   62.083681]  __arm64_sys_write+0x18/0x20
>> [   62.084432]  el0_svc_handler+0x88/0x100
>> [   62.085177]  el0_svc+0x8/0xc
>>
>> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
>> problem on arm64 as pfn_valid() behaves correctly and returns positive
>> as memblock for the address range still exists. arch_remove_memory()
>> removes applicable memory sections from zone with __remove_pages() and
>> tears down kernel linear mapping. Removing memblock regions afterwards
>> is safe because there is no other memblock (bootmem) allocator user that
>> late. So nobody is going to allocate from the removed range just to blow
>> up later. Also nobody should be using the bootmem allocated range else
>> we wouldn't allow to remove it. So reordering is indeed safe.
>>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> 
> Acked-by: Mark Rutland <mark.rutland@arm.com>

Hello Andrew,

Will it be possible for this particular patch of the series to be merged alone.
I am still reworking arm64 hot-remove parts as per the suggestions from Mark.
Just wondering if this patch which has been reviewed and acked for a while now
can be out of our way.

Also because this has some conflict with David's series which can be sorted out
earlier before arm64 hot-remove V6 series comes in.

From my previous response on this series last week, the following can resolve
the conflict with David's [v3, 09/11] patch.

C) Rebase (https://patchwork.kernel.org/patch/10962589/) [v3, 09/11]

	- hot-remove series moves arch_remove_memory() before memblock_[free|remove]()
	- So remove_memory_block_devices() should be moved before arch_remove_memory()
	  in it's new position   

It will be great if this patch can be merged alone.

- Anshuman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-06-07  2:28     ` Anshuman Khandual
@ 2019-06-07  7:43       ` David Hildenbrand
  2019-06-11 11:26         ` [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2019-06-07  7:43 UTC (permalink / raw)
  To: Anshuman Khandual, Mark Rutland, Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, mhocko, ira.weiny, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

On 07.06.19 04:28, Anshuman Khandual wrote:
> 
> 
> On 05/30/2019 04:07 PM, Mark Rutland wrote:
>> On Wed, May 29, 2019 at 02:46:25PM +0530, Anshuman Khandual wrote:
>>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>>> entries between memory block and node. It first checks pfn validity with
>>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>>
>>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>>> creates a problem in memory hot remove path which has already removed given
>>> memory range from memory block with memblock_[remove|free] before arriving
>>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>>> of existing sysfs entries.
>>>
>>> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
>>> [   62.052517] ------------[ cut here ]------------
>>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
>>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>>> [   62.054589] Modules linked in:
>>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
>>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>>> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
>>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>>> [   62.059842] sp : ffff0000168b3ce0
>>> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
>>> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
>>> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
>>> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
>>> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
>>> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
>>> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
>>> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
>>> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
>>> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
>>> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
>>> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
>>> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
>>> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
>>> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
>>> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
>>> [   62.076930] Call trace:
>>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>>> [   62.078227]  __add_memory+0x70/0xa8
>>> [   62.078901]  probe_store+0xa4/0xc8
>>> [   62.079561]  dev_attr_store+0x18/0x28
>>> [   62.080270]  sysfs_kf_write+0x40/0x58
>>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>>> [   62.081744]  __vfs_write+0x18/0x40
>>> [   62.082400]  vfs_write+0xa4/0x1b0
>>> [   62.083037]  ksys_write+0x5c/0xc0
>>> [   62.083681]  __arm64_sys_write+0x18/0x20
>>> [   62.084432]  el0_svc_handler+0x88/0x100
>>> [   62.085177]  el0_svc+0x8/0xc
>>>
>>> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
>>> problem on arm64 as pfn_valid() behaves correctly and returns positive
>>> as memblock for the address range still exists. arch_remove_memory()
>>> removes applicable memory sections from zone with __remove_pages() and
>>> tears down kernel linear mapping. Removing memblock regions afterwards
>>> is safe because there is no other memblock (bootmem) allocator user that
>>> late. So nobody is going to allocate from the removed range just to blow
>>> up later. Also nobody should be using the bootmem allocated range else
>>> we wouldn't allow to remove it. So reordering is indeed safe.
>>>
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>
>> Acked-by: Mark Rutland <mark.rutland@arm.com>
> 
> Hello Andrew,
> 
> Will it be possible for this particular patch of the series to be merged alone.
> I am still reworking arm64 hot-remove parts as per the suggestions from Mark.
> Just wondering if this patch which has been reviewed and acked for a while now
> can be out of our way.

I suggest that you just resend a rebased version (linux-next). That
makes it much easier for Andrew to pick up (and for others to review).

Cheers!

> 
> Also because this has some conflict with David's series which can be sorted out
> earlier before arm64 hot-remove V6 series comes in.
> 
> From my previous response on this series last week, the following can resolve
> the conflict with David's [v3, 09/11] patch.
> 
> C) Rebase (https://patchwork.kernel.org/patch/10962589/) [v3, 09/11]
> 
> 	- hot-remove series moves arch_remove_memory() before memblock_[free|remove]()
> 	- So remove_memory_block_devices() should be moved before arch_remove_memory()
> 	  in it's new position   
> 
> It will be great if this patch can be merged alone.
> 
> - Anshuman
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-07  7:43       ` David Hildenbrand
@ 2019-06-11 11:26         ` Anshuman Khandual
  2019-06-11 22:19           ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-06-11 11:26 UTC (permalink / raw)
  To: akpm, linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, ard.biesheuvel
  Cc: osalvador, david, mhocko, mark.rutland

Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
entries between memory block and node. It first checks pfn validity with
pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
(arm64 has this enabled) pfn_valid_within() calls pfn_valid().

pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
which scans all mapped memblock regions with memblock_is_map_memory(). This
creates a problem in memory hot remove path which has already removed given
memory range from memory block with memblock_[remove|free] before arriving
at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
of existing sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
[   62.052517] ------------[ cut here ]------------
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : ffff0000168b3ce0
[   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
[   62.061501] x27: 0000000000000000 x26: 0000000000000000
[   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
[   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
[   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
[   62.065558] x19: 0000000000680000 x18: 0000000000000024
[   62.066566] x17: 0000000000000000 x16: 0000000000000000
[   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
[   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
[   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
[   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
[   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
[   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
[   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
[   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
[   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc

Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
problem on arm64 as pfn_valid() behaves correctly and returns positive
as memblock for the address range still exists. arch_remove_memory()
removes applicable memory sections from zone with __remove_pages() and
tears down kernel linear mapping. Removing memblock regions afterwards
is safe because there is no other memblock (bootmem) allocator user that
late. So nobody is going to allocate from the removed range just to blow
up later. Also nobody should be using the bootmem allocated range else
we wouldn't allow to remove it. So reordering is indeed safe.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in PATCH V5 - Rebased (from V5)

- Rebased on linux-next (next-20190611)
- s/__remove_memory/try_remove_memory in the subject line
- s/arch_remove_memory/memblock_[free|remove] in the subject line
- A small change in the commit message as re-order happens now for
  memblock remove functions not arch_remove_memory()

 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a88c5f334e5a..cfa5facf1b38 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1831,13 +1831,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
-	memblock_free(start, size);
-	memblock_remove(start, size);
 
 	/* remove memory block devices before removing memory */
 	remove_memory_block_devices(start, size);
 
 	arch_remove_memory(nid, start, size, NULL);
+	memblock_free(start, size);
+	memblock_remove(start, size);
 	__release_memory_resource(start, size);
 
 	try_offline_node(nid);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 3/3] arm64/mm: Enable memory hot remove
  2019-05-30 15:12   ` Mark Rutland
@ 2019-06-11 14:43     ` Anshuman Khandual
  0 siblings, 0 replies; 20+ messages in thread
From: Anshuman Khandual @ 2019-06-11 14:43 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel

On 05/30/2019 08:42 PM, Mark Rutland wrote:
> Hi Anshuman,

Hello Mark,

> 
>>From reviwing the below, I can see some major issues that need to be
> addressed, which I've commented on below.
> 
> Andrew, please do not pick up this patch.

I was reworking this patch series and investigating the vmalloc/vmemmap
conflict issues. Hence could not respond earlier.

> 
> On Wed, May 29, 2019 at 02:46:27PM +0530, Anshuman Khandual wrote:
>> The arch code for hot-remove must tear down portions of the linear map and
>> vmemmap corresponding to memory being removed. In both cases the page
>> tables mapping these regions must be freed, and when sparse vmemmap is in
>> use the memory backing the vmemmap must also be freed.
>>
>> This patch adds a new remove_pagetable() helper which can be used to tear
>> down either region, and calls it from vmemmap_free() and
>> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
>> backing memory will be freed.
>>
>> While freeing intermediate level page table pages bail out if any of it's
> 
> Nit: s/it's/its/

Done.

> 
>> entries are still valid. This can happen for partially filled kernel page
>> table either from a previously attempted failed memory hot add or while
>> removing an address range which does not span the entire page table page
>> range.
>>
>> The vmemmap region may share levels of table with the vmalloc region. Take
>> the kernel ptl so that we can safely free potentially-shared tables.
> 
> AFAICT, this is not sufficient; please see below for details.

Sure.

> 
>> While here update arch_add_memory() to handle __add_pages() failures by
>> just unmapping recently added kernel linear mapping. Now enable memory hot
>> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>
>> This implementation is overall inspired from kernel page table tear down
>> procedure on X86 architecture.
>>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> Looking at this some more, I don't think this is quite right, and tI
> think that structure of the free_*() and remove_*() functions makes this
> unnecessarily hard to follow. We should aim for this to be obviously
> correct.

Okay.

> 
> The x86 code is the best template to follow here. As mentioned

Did you mean *not the best* instead.

> previously, I'm fairly certain it's not entirely correct (e.g. due to
> missing TLB maintenance), and we've already diverged a fair amount in
> fixing up obvious issues, so we shouldn't aim to mirror it.

Okay.

> 
> I think that the structure of unmap_region() is closer to what we want
> here -- do one pass to unmap leaf entries (and freeing the associated
> memory if unmapping the vmemmap), then do a second pass cleaning up any
> empty tables.

Done.

> 
> In general I'm concerned that we don't strictly follow a
> clear->tlbi->free sequence, and free pages before tearing down their
> corresponding mapping. It doesn't feel great to leave a cacheable alias
> around, even transiently. Further, as commented below, the
> remove_p?d_table() functions leave stale leaf entries in the TLBs when
> removing section entries.

Fixed these.

> 
>> ---
>>  arch/arm64/Kconfig  |   3 +
>>  arch/arm64/mm/mmu.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 212 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 697ea05..7f917fe 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -268,6 +268,9 @@ config HAVE_GENERIC_GUP
>>  config ARCH_ENABLE_MEMORY_HOTPLUG
>>  	def_bool y
>>  
>> +config ARCH_ENABLE_MEMORY_HOTREMOVE
>> +	def_bool y
>> +
>>  config SMP
>>  	def_bool y
>>  
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index a1bfc44..4803624 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -733,6 +733,187 @@ int kern_addr_valid(unsigned long addr)
>>  
>>  	return pfn_valid(pte_pfn(pte));
>>  }
>> +
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +static void free_hotplug_page_range(struct page *page, ssize_t size)
> 
> The size argument should never be negative, so size_t would be best.

Done.

> 
>> +{
>> +	WARN_ON(PageReserved(page));
>> +	free_pages((unsigned long)page_address(page), get_order(size));
>> +}
>> +
>> +static void free_hotplug_pgtable_page(struct page *page)
>> +{
>> +	free_hotplug_page_range(page, PAGE_SIZE);
>> +}
>> +
>> +static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr)
>> +{
>> +	struct page *page;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		if (!pte_none(ptep[i]))
>> +			return;
>> +	}
>> +
>> +	page = pmd_page(READ_ONCE(*pmdp));
>> +	pmd_clear(pmdp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	free_hotplug_pgtable_page(page);
>> +}
>> +
>> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
>> +{
>> +	struct page *page;
>> +	int i;
>> +
>> +	if (CONFIG_PGTABLE_LEVELS <= 2)
>> +		return;
>> +
>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>> +		if (!pmd_none(pmdp[i]))
>> +			return;
>> +	}
>> +
>> +	page = pud_page(READ_ONCE(*pudp));
>> +	pud_clear(pudp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	free_hotplug_pgtable_page(page);
>> +}
>> +
>> +static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr)
>> +{
>> +	struct page *page;
>> +	int i;
>> +
>> +	if (CONFIG_PGTABLE_LEVELS <= 3)
>> +		return;
>> +
>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>> +		if (!pud_none(pudp[i]))
>> +			return;
>> +	}
>> +
>> +	page = pgd_page(READ_ONCE(*pgdp));
>> +	pgd_clear(pgdp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	free_hotplug_pgtable_page(page);
>> +}
>> +
>> +static void
>> +remove_pte_table(pmd_t *pmdp, unsigned long addr,
> 
> Please put this on a single line.
> 
> All the existing functions in this file (and the ones you add above)
> have the return type on the same line as the name, and since this
> portion of the prototype doesn't encroach 80 columns there's no reason
> to flow it.

Fixed.

> 
>> +			unsigned long end, bool sparse_vmap)
>> +{
>> +	struct page *page;
>> +	pte_t *ptep, pte;
>> +	unsigned long start = addr;
>> +
>> +	for (; addr < end; addr += PAGE_SIZE) {
>> +		ptep = pte_offset_kernel(pmdp, addr);
>> +		pte = READ_ONCE(*ptep);
>> +
>> +		if (pte_none(pte))
>> +			continue;
>> +
>> +		WARN_ON(!pte_present(pte));
>> +		if (sparse_vmap) {
>> +			page = pte_page(pte);
>> +			free_hotplug_page_range(page, PAGE_SIZE);
>> +		}
>> +		pte_clear(&init_mm, addr, ptep);
>> +	}
>> +	flush_tlb_kernel_range(start, end);
>> +}
> 
> For consistency we should use a do { ... } while (..., addr != end) loop
> to iterate over the page tables. All the other code in our mmu.c does
> that, and diverging from that doesn't save use anything here but does
> make review and maintenance harder.

Done.

> 
>> +
>> +static void
>> +remove_pmd_table(pud_t *pudp, unsigned long addr,
> 
> Same line please.
> 
>> +			unsigned long end, bool sparse_vmap)
>> +{
>> +	unsigned long next;
>> +	struct page *page;
>> +	pte_t *ptep_base;
>> +	pmd_t *pmdp, pmd;
>> +
>> +	for (; addr < end; addr = next) {
>> +		next = pmd_addr_end(addr, end);
>> +		pmdp = pmd_offset(pudp, addr);
>> +		pmd = READ_ONCE(*pmdp);
>> +
>> +		if (pmd_none(pmd))
>> +			continue;
>> +
>> +		WARN_ON(!pmd_present(pmd));
>> +		if (pmd_sect(pmd)) {
>> +			if (sparse_vmap) {
>> +				page = pmd_page(pmd);
>> +				free_hotplug_page_range(page, PMD_SIZE);
>> +			}
>> +			pmd_clear(pmdp);
> 
> As mentioned above, this has no corresponding TLB maintenance, and I'm
> concerned that we free the page before clearing the entry. If the page
> gets re-allocated elsewhere, whoever received the page may not be
> expecting a cacheable alias to exist other than the linear map.

Fixed.

> 
>> +			continue;
>> +		}
>> +		ptep_base = pte_offset_kernel(pmdp, 0UL);
>> +		remove_pte_table(pmdp, addr, next, sparse_vmap);
>> +		free_pte_table(ptep_base, pmdp, addr);
>> +	}
>> +}
>> +
>> +static void
>> +remove_pud_table(pgd_t *pgdp, unsigned long addr,
> 
> Same line please

Fixed.

> 
>> +			unsigned long end, bool sparse_vmap)
>> +{
>> +	unsigned long next;
>> +	struct page *page;
>> +	pmd_t *pmdp_base;
>> +	pud_t *pudp, pud;
>> +
>> +	for (; addr < end; addr = next) {
>> +		next = pud_addr_end(addr, end);
>> +		pudp = pud_offset(pgdp, addr);
>> +		pud = READ_ONCE(*pudp);
>> +
>> +		if (pud_none(pud))
>> +			continue;
>> +
>> +		WARN_ON(!pud_present(pud));
>> +		if (pud_sect(pud)) {
>> +			if (sparse_vmap) {
>> +				page = pud_page(pud);
>> +				free_hotplug_page_range(page, PUD_SIZE);
>> +			}
>> +			pud_clear(pudp);
> 
> Same issue as in remove_pmd_table().

Fixed.


> 
>> +			continue;
>> +		}
>> +		pmdp_base = pmd_offset(pudp, 0UL);
>> +		remove_pmd_table(pudp, addr, next, sparse_vmap);
>> +		free_pmd_table(pmdp_base, pudp, addr);
>> +	}
>> +}
>> +
>> +static void
>> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
> 
> Same line please (with the sparse_vmap argument flowed on to the next
> line as that will encroach 80 characters).

Done.

> 
>> +{
>> +	unsigned long addr, next;
>> +	pud_t *pudp_base;
>> +	pgd_t *pgdp, pgd;
>> +
>> +	spin_lock(&init_mm.page_table_lock);
> 
> Please add a comment above this to explain why we need to take the
> init_mm ptl. Per the cover letter, this should be something like:
> 
> 	/*
> 	 * We may share tables with the vmalloc region, so we must take
> 	 * the init_mm ptl so that we can safely free any
> 	 * potentially-shared tables that we have emptied.
> 	 */

This might not be required any more (see below comments)

> 
> The vmalloc code doesn't hold the init_mm ptl when walking a table; it

Right.

> only takes the init_mm ptl when populating a none entry in
> __p??_alloc(), to avoid a race where two threads need to populate the
> entry.

Right.

> 
> So AFAICT, taking the init_mm ptl here is not sufficient to make this
> safe.

I understand that there can be potential conflicts here if vmalloc and
vmemap mappings share kernel intermediate level page table pages.

For example.

- vmalloc takes an intermediate page table page pointer during walk
  (without init_mm lock) and proceeds further to create leaf level
  entries

- memory hot-remove walks the page table, (clear-->--invalidate-->free)
  leaf level entries and then removes (clear-->--invalidate-->free) an
  intermediate level page table pages (already emptied) while holding
  init_mm lock

- vmalloc now holds an invalid page table entry pointer derived from a
  freed page (potentially being used else where) and proceeds to create
  an entry on it !

The primary cause which creates this problematic situation is

- vmalloc does not take init_mm.page_table_lock for it's entire duration.
  Kernel page table walk, page table page insert, creation of leaf level
  entries etc. This should have prevented memory hot-remove from deleting
  intermediate page table pages while vmalloc was at it.

So how to solve this problem ?

Broadly there are three options (unless I have missed some more)

Option 1:

Take init_mm ptl for the entire duration of vmalloc() but it will then
have significant impact on it's performance. vmalloc() works on mutually
exclusive ranges which can proceed concurrently for their allocation
except the page table pages which are currently protected. Multiple
threads doing vmalloc() dont need init_mm ptl for it's entire duration.
Hence doing so can affect performance.

Option 2:

Take mem_hotplug_lock in read mode through [get|put]_online_mems() for
the entire duration of vmalloc(). It protects vmalloc() from concurrent
memory hot remove operation but does not add significant overhead to
other concurrent vmalloc() threads.

Option 3:

Dont not free page table pages for vmemmap mappings after unmapping the
hotplug range. The only downside is that some page table pages might
remain empty and unused till the next hot add operation for the same
memory range, which should be fine.

IMHO

- Option 1 does not seem to be viable for it's performance impact
- Option 2 seems to solve the problem in the right way unless we dont
           want to further the usage of mem_hotplug_lock in core MM

- Option 3 seems like an easy and quick solution on the platform side
           which avoids the problem for now

Please let me know your thoughts.

- Anshuman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-11 11:26         ` [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
@ 2019-06-11 22:19           ` Andrew Morton
  2019-06-12  4:02             ` Anshuman Khandual
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2019-06-11 22:19 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, ard.biesheuvel, osalvador, david, mhocko,
	mark.rutland

On Tue, 11 Jun 2019 16:56:13 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:

> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.
> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------
> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc

This seems like a serious problem.  Once which should be fixed in 5.2
and perhaps the various -stable kernels as well.

> Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is safe because there is no other memblock (bootmem) allocator user that
> late. So nobody is going to allocate from the removed range just to blow
> up later. Also nobody should be using the bootmem allocated range else
> we wouldn't allow to remove it. So reordering is indeed safe.
> 
> ...
>
> 
> - Rebased on linux-next (next-20190611)

Yet the patch you've prepared is designed for 5.3.  Was that
deliberate, or should we be targeting earlier kernels?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-11 22:19           ` Andrew Morton
@ 2019-06-12  4:02             ` Anshuman Khandual
  2019-06-12  6:53               ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: Anshuman Khandual @ 2019-06-12  4:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, ard.biesheuvel, osalvador, david, mhocko,
	mark.rutland



On 06/12/2019 03:49 AM, Andrew Morton wrote:
> On Tue, 11 Jun 2019 16:56:13 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>> of existing sysfs entries.
>>
>> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
>> [   62.052517] ------------[ cut here ]------------
>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [   62.054589] Modules linked in:
>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>> [   62.059842] sp : ffff0000168b3ce0
>> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
>> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
>> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
>> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
>> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
>> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
>> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
>> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
>> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
>> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
>> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
>> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
>> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
>> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
>> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
>> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
>> [   62.076930] Call trace:
>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>> [   62.078227]  __add_memory+0x70/0xa8
>> [   62.078901]  probe_store+0xa4/0xc8
>> [   62.079561]  dev_attr_store+0x18/0x28
>> [   62.080270]  sysfs_kf_write+0x40/0x58
>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>> [   62.081744]  __vfs_write+0x18/0x40
>> [   62.082400]  vfs_write+0xa4/0x1b0
>> [   62.083037]  ksys_write+0x5c/0xc0
>> [   62.083681]  __arm64_sys_write+0x18/0x20
>> [   62.084432]  el0_svc_handler+0x88/0x100
>> [   62.085177]  el0_svc+0x8/0xc
> 
> This seems like a serious problem.  Once which should be fixed in 5.2
> and perhaps the various -stable kernels as well.

But the problem does not exist in the current kernel as yet till the reworked
versions of the other two patches in this series get merged. This patch was
after arm64 hot-remove enablement in V1 (https://lkml.org/lkml/2019/4/3/28)
but after some discussions it was decided to be moved before hot-remove from
V2 (https://lkml.org/lkml/2019/4/14/5) onwards as a prerequisite patch instead.

> 
>> Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
>> problem on arm64 as pfn_valid() behaves correctly and returns positive
>> as memblock for the address range still exists. arch_remove_memory()
>> removes applicable memory sections from zone with __remove_pages() and
>> tears down kernel linear mapping. Removing memblock regions afterwards
>> is safe because there is no other memblock (bootmem) allocator user that
>> late. So nobody is going to allocate from the removed range just to blow
>> up later. Also nobody should be using the bootmem allocated range else
>> we wouldn't allow to remove it. So reordering is indeed safe.
>>
>> ...
>>
>>
>> - Rebased on linux-next (next-20190611)
> 
> Yet the patch you've prepared is designed for 5.3.  Was that
> deliberate, or should we be targeting earlier kernels?

It was deliberate for 5.3 as a preparation for upcoming reworked arm64 hot-remove.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-12  4:02             ` Anshuman Khandual
@ 2019-06-12  6:53               ` David Hildenbrand
  2019-06-13  1:54                 ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2019-06-12  6:53 UTC (permalink / raw)
  To: Anshuman Khandual, Andrew Morton
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, ard.biesheuvel, osalvador, mhocko, mark.rutland

On 12.06.19 06:02, Anshuman Khandual wrote:
> 
> 
> On 06/12/2019 03:49 AM, Andrew Morton wrote:
>> On Tue, 11 Jun 2019 16:56:13 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>
>>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>>> entries between memory block and node. It first checks pfn validity with
>>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>>
>>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>>> creates a problem in memory hot remove path which has already removed given
>>> memory range from memory block with memblock_[remove|free] before arriving
>>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>>> of existing sysfs entries.
>>>
>>> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
>>> [   62.052517] ------------[ cut here ]------------
>>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
>>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>>> [   62.054589] Modules linked in:
>>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
>>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>>> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
>>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>>> [   62.059842] sp : ffff0000168b3ce0
>>> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
>>> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
>>> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
>>> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
>>> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
>>> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
>>> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
>>> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
>>> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
>>> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
>>> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
>>> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
>>> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
>>> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
>>> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
>>> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
>>> [   62.076930] Call trace:
>>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>>> [   62.078227]  __add_memory+0x70/0xa8
>>> [   62.078901]  probe_store+0xa4/0xc8
>>> [   62.079561]  dev_attr_store+0x18/0x28
>>> [   62.080270]  sysfs_kf_write+0x40/0x58
>>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>>> [   62.081744]  __vfs_write+0x18/0x40
>>> [   62.082400]  vfs_write+0xa4/0x1b0
>>> [   62.083037]  ksys_write+0x5c/0xc0
>>> [   62.083681]  __arm64_sys_write+0x18/0x20
>>> [   62.084432]  el0_svc_handler+0x88/0x100
>>> [   62.085177]  el0_svc+0x8/0xc
>>
>> This seems like a serious problem.  Once which should be fixed in 5.2
>> and perhaps the various -stable kernels as well.
> 
> But the problem does not exist in the current kernel as yet till the reworked
> versions of the other two patches in this series get merged. This patch was
> after arm64 hot-remove enablement in V1 (https://lkml.org/lkml/2019/4/3/28)
> but after some discussions it was decided to be moved before hot-remove from
> V2 (https://lkml.org/lkml/2019/4/14/5) onwards as a prerequisite patch instead.
> 
>>
>>> Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
>>> problem on arm64 as pfn_valid() behaves correctly and returns positive
>>> as memblock for the address range still exists. arch_remove_memory()
>>> removes applicable memory sections from zone with __remove_pages() and
>>> tears down kernel linear mapping. Removing memblock regions afterwards
>>> is safe because there is no other memblock (bootmem) allocator user that
>>> late. So nobody is going to allocate from the removed range just to blow
>>> up later. Also nobody should be using the bootmem allocated range else
>>> we wouldn't allow to remove it. So reordering is indeed safe.
>>>
>>> ...
>>>
>>>
>>> - Rebased on linux-next (next-20190611)
>>
>> Yet the patch you've prepared is designed for 5.3.  Was that
>> deliberate, or should we be targeting earlier kernels?
> 
> It was deliberate for 5.3 as a preparation for upcoming reworked arm64 hot-remove.
> 

We should probably add to the patch description something like "This is
a preparation for arm64 memory hotremove. The described issue is not
relevant on other architectures."

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-12  6:53               ` David Hildenbrand
@ 2019-06-13  1:54                 ` Andrew Morton
  2019-06-13  5:37                   ` Anshuman Khandual
  2019-06-13  6:14                   ` David Hildenbrand
  0 siblings, 2 replies; 20+ messages in thread
From: Andrew Morton @ 2019-06-13  1:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Anshuman Khandual, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, ard.biesheuvel, osalvador, mhocko,
	mark.rutland

On Wed, 12 Jun 2019 08:53:33 +0200 David Hildenbrand <david@redhat.com> wrote:

> >>> ...
> >>>
> >>>
> >>> - Rebased on linux-next (next-20190611)
> >>
> >> Yet the patch you've prepared is designed for 5.3.  Was that
> >> deliberate, or should we be targeting earlier kernels?
> > 
> > It was deliberate for 5.3 as a preparation for upcoming reworked arm64 hot-remove.
> > 
> 
> We should probably add to the patch description something like "This is
> a preparation for arm64 memory hotremove. The described issue is not
> relevant on other architectures."

Please.  And is there any reason to merge it separately?  Can it be
[patch 1/3] in the "arm64/mm: Enable memory hot remove" series?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-13  1:54                 ` Andrew Morton
@ 2019-06-13  5:37                   ` Anshuman Khandual
  2019-06-13  6:14                   ` David Hildenbrand
  1 sibling, 0 replies; 20+ messages in thread
From: Anshuman Khandual @ 2019-06-13  5:37 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand
  Cc: linux-mm, linux-kernel, linux-arm-kernel, catalin.marinas,
	will.deacon, ard.biesheuvel, osalvador, mhocko, mark.rutland



On 06/13/2019 07:24 AM, Andrew Morton wrote:
> On Wed, 12 Jun 2019 08:53:33 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>>>>> ...
>>>>>
>>>>>
>>>>> - Rebased on linux-next (next-20190611)
>>>>
>>>> Yet the patch you've prepared is designed for 5.3.  Was that
>>>> deliberate, or should we be targeting earlier kernels?
>>>
>>> It was deliberate for 5.3 as a preparation for upcoming reworked arm64 hot-remove.
>>>
>>
>> We should probably add to the patch description something like "This is
>> a preparation for arm64 memory hotremove. The described issue is not
>> relevant on other architectures."
> 
> Please.  And is there any reason to merge it separately?  Can it be
> [patch 1/3] in the "arm64/mm: Enable memory hot remove" series?

Sure it can be. I will make this [patch 1/3] in the next version for
"arm64/mm: Enable memory hot remove". Apologies for the noise here.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-06-13  1:54                 ` Andrew Morton
  2019-06-13  5:37                   ` Anshuman Khandual
@ 2019-06-13  6:14                   ` David Hildenbrand
  1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2019-06-13  6:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anshuman Khandual, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, ard.biesheuvel, osalvador, mhocko,
	mark.rutland

On 13.06.19 03:54, Andrew Morton wrote:
> On Wed, 12 Jun 2019 08:53:33 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>>>>> ...
>>>>>
>>>>>
>>>>> - Rebased on linux-next (next-20190611)
>>>>
>>>> Yet the patch you've prepared is designed for 5.3.  Was that
>>>> deliberate, or should we be targeting earlier kernels?
>>>
>>> It was deliberate for 5.3 as a preparation for upcoming reworked arm64 hot-remove.
>>>
>>
>> We should probably add to the patch description something like "This is
>> a preparation for arm64 memory hotremove. The described issue is not
>> relevant on other architectures."
> 
> Please.  And is there any reason to merge it separately?  Can it be
> [patch 1/3] in the "arm64/mm: Enable memory hot remove" series?
> 

Nothing that the patch can be considered a cleanup:

"
mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

In add_memory_resource() we have:

	memblock_add_node(start, size, nid)
	...
	arch_add_memory(nid, start, size, &restrictions);
	...
	create_memory_block_devices(start, size);

While in try_remove_memory() we have:

	memblock_free(start, size);
	memblock_remove(start, size);
	...
	remove_memory_block_devices(start, size);
	arch_remove_memory(nid, start, size, NULL);

Let's restore the correct order by removing the memblock after
arch_remove_memory().
"

I think with such a description, we can include it now. Andrew?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-06-13 16:56 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-29  9:16 [PATCH V5 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-05-29  9:16 ` [PATCH V5 1/3] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
2019-05-30 10:37   ` Mark Rutland
2019-06-07  2:28     ` Anshuman Khandual
2019-06-07  7:43       ` David Hildenbrand
2019-06-11 11:26         ` [PATCH V5 - Rebased] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
2019-06-11 22:19           ` Andrew Morton
2019-06-12  4:02             ` Anshuman Khandual
2019-06-12  6:53               ` David Hildenbrand
2019-06-13  1:54                 ` Andrew Morton
2019-06-13  5:37                   ` Anshuman Khandual
2019-06-13  6:14                   ` David Hildenbrand
2019-05-29  9:16 ` [PATCH V5 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
2019-05-30 10:42   ` Mark Rutland
2019-05-29  9:16 ` [PATCH V5 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-05-30 15:12   ` Mark Rutland
2019-06-11 14:43     ` Anshuman Khandual
2019-05-29 22:06 ` [PATCH V5 0/3] " Andrew Morton
2019-05-30  4:23   ` Anshuman Khandual
2019-05-30  7:23     ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).