LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH V3 0/4] arm64/mm: Enable memory hot remove
@ 2019-05-14  9:00 Anshuman Khandual
  2019-05-14  9:00 ` [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-14  9:00 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

This series enables memory hot remove on arm64 after fixing a memblock
removal ordering problem in generic __remove_memory() and kernel page
table race conditions on arm64. This is based on the following arm64
working tree.

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core

David had pointed out that the following patch is already in next/master
(58b11e136dcc14358) and will conflict with the last patch here. Will fix
the conflict once this series gets reviewed and agreed upon.

Author: David Hildenbrand <david@redhat.com>
Date:   Wed Apr 10 11:02:27 2019 +1000

    mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail

    All callers of arch_remove_memory() ignore errors.  And we should really
    try to remove any errors from the memory removal path.  No more errors are
    reported from __remove_pages().  BUG() in s390x code in case
    arch_remove_memory() is triggered.  We may implement that properly later.
    WARN in case powerpc code failed to remove the section mapping, which is
    better than ignoring the error completely right now.

Testing:

Tested memory hot remove on arm64 for 4K, 16K, 64K page config options with
all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS combinations. But
build tested on non arm64 platforms.

Changes in V3:
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore

- Added a new patch to protect kernel page table race condtion for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

Anshuman Khandual (3):
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

Mark Rutland (1):
  arm64/mm: Inhibit huge-vmap with ptdump

 arch/arm64/Kconfig             |   3 +
 arch/arm64/mm/mmu.c            | 215 ++++++++++++++++++++++++++++++++++++++++-
 arch/arm64/mm/ptdump_debugfs.c |   3 +
 mm/memory_hotplug.c            |   3 +-
 4 files changed, 217 insertions(+), 7 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-05-14  9:00 [PATCH V3 0/4] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-05-14  9:00 ` Anshuman Khandual
  2019-05-14  9:05   ` David Hildenbrand
  2019-05-14  9:00 ` [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-14  9:00 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
entries between memory block and node. It first checks pfn validity with
pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
(arm64 has this enabled) pfn_valid_within() calls pfn_valid().

pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
which scans all mapped memblock regions with memblock_is_map_memory(). This
creates a problem in memory hot remove path which has already removed given
memory range from memory block with memblock_[remove|free] before arriving
at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
of existing sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
[   62.052517] ------------[ cut here ]------------
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : ffff0000168b3ce0
[   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
[   62.061501] x27: 0000000000000000 x26: 0000000000000000
[   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
[   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
[   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
[   62.065558] x19: 0000000000680000 x18: 0000000000000024
[   62.066566] x17: 0000000000000000 x16: 0000000000000000
[   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
[   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
[   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
[   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
[   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
[   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
[   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
[   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
[   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc

Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
problem on arm64 as pfn_valid() behaves correctly and returns positive
as memblock for the address range still exists. arch_remove_memory()
removes applicable memory sections from zone with __remove_pages() and
tears down kernel linear mapping. Removing memblock regions afterwards
is safe because there is no other memblock (bootmem) allocator user that
late. So nobody is going to allocate from the removed range just to blow
up later. Also nobody should be using the bootmem allocated range else
we wouldn't allow to remove it. So reordering is indeed safe.

Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 mm/memory_hotplug.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0082d69..71d0d79 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1872,11 +1872,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
+	arch_remove_memory(nid, start, size, NULL);
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(nid, start, size, NULL);
-
 	try_offline_node(nid);
 
 	mem_hotplug_done();
-- 
2.7.4


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-14  9:00 [PATCH V3 0/4] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-14  9:00 ` [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
@ 2019-05-14  9:00 ` Anshuman Khandual
  2019-05-14  9:16   ` David Hildenbrand
                     ` (2 more replies)
  2019-05-14  9:00 ` [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump Anshuman Khandual
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-14  9:00 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

The arm64 pagetable dump code can race with concurrent modification of the
kernel page tables. When a leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addressses,
leading to a number of potential problems.

Intermediate levels of table may by freed during memory hot-remove, or when
installing a huge mapping in the vmalloc region. To avoid racing with these
cases, take the memory hotplug lock when walking the kernel page table.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/ptdump_debugfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..80171d1 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
+
+	get_online_mems();
 	ptdump_walk_pgd(m, info);
+	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump
  2019-05-14  9:00 [PATCH V3 0/4] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-14  9:00 ` [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
  2019-05-14  9:00 ` [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-05-14  9:00 ` Anshuman Khandual
  2019-05-14  9:20   ` David Hildenbrand
  2019-05-16  8:38   ` Ard Biesheuvel
  2019-05-14  9:00 ` [PATCH V3 4/4] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-14  9:10 ` [PATCH V3 0/4] " David Hildenbrand
  4 siblings, 2 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-14  9:00 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

From: Mark Rutland <mark.rutland@arm.com>

The arm64 ptdump code can race with concurrent modification of the
kernel page tables. At the time this was added, this was sound as:

* Modifications to leaf entries could result in stale information being
  logged, but would not result in a functional problem.

* Boot time modifications to non-leaf entries (e.g. freeing of initmem)
  were performed when the ptdump code cannot be invoked.

* At runtime, modifications to non-leaf entries only occurred in the
  vmalloc region, and these were strictly additive, as intermediate
  entries were never freed.

However, since commit:

  commit 324420bf91f6 ("arm64: add support for ioremap() block mappings")

... it has been possible to create huge mappings in the vmalloc area at
runtime, and as part of this existing intermediate levels of table my be
removed and freed.

It's possible for the ptdump code to race with this, and continue to
walk tables which have been freed (and potentially poisoned or
reallocated). As a result of this, the ptdump code may dereference bogus
addresses, which could be fatal.

Since huge-vmap is a TLB and memory optimization, we can disable it when
the runtime ptdump code is in use to avoid this problem.

Fixes: 324420bf91f60582 ("arm64: add support for ioremap() block mappings")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/mm/mmu.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ef82312..37a902c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -955,13 +955,18 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys)
 
 int __init arch_ioremap_pud_supported(void)
 {
-	/* only 4k granule supports level 1 block mappings */
-	return IS_ENABLED(CONFIG_ARM64_4K_PAGES);
+	/*
+	 * Only 4k granule supports level 1 block mappings.
+	 * SW table walks can't handle removal of intermediate entries.
+	 */
+	return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+	       !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
 }
 
 int __init arch_ioremap_pmd_supported(void)
 {
-	return 1;
+	/* See arch_ioremap_pud_supported() */
+	return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
 }
 
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
-- 
2.7.4


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH V3 4/4] arm64/mm: Enable memory hot remove
  2019-05-14  9:00 [PATCH V3 0/4] arm64/mm: Enable memory hot remove Anshuman Khandual
                   ` (2 preceding siblings ...)
  2019-05-14  9:00 ` [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump Anshuman Khandual
@ 2019-05-14  9:00 ` Anshuman Khandual
  2019-05-14  9:08   ` David Hildenbrand
  2019-05-15 11:49   ` Mark Rutland
  2019-05-14  9:10 ` [PATCH V3 0/4] " David Hildenbrand
  4 siblings, 2 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-14  9:00 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

Memory removal from an arch perspective involves tearing down two different
kernel based mappings i.e vmemmap and linear while releasing related page
table and any mapped pages allocated for given physical memory range to be
removed.

Define a common kernel page table tear down helper remove_pagetable() which
can be used to unmap given kernel virtual address range. In effect it can
tear down both vmemap or kernel linear mappings. This new helper is called
from both vmemamp_free() and ___remove_pgd_mapping() during memory removal.

For linear mapping there are no actual allocated pages which are mapped to
create the translation. Any pfn on a given entry is derived from physical
address (__va(PA) --> PA) whose linear translation is to be created. They
need not be freed as they were never allocated in the first place. But for
vmemmap which is a real virtual mapping (like vmalloc) physical pages are
allocated either from buddy or memblock which get mapped in the kernel page
table. These allocated and mapped pages need to be freed during translation
tear down. But page table pages need to be freed in both these cases.

These mappings need to be differentiated while deciding if a mapped page at
any level i.e [pte|pmd|pud]_page() should be freed or not. Callers for the
mapping tear down process should pass on 'sparse_vmap' variable identifying
kernel vmemmap mappings.

While here update arch_add_mempory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired from kernel page table tear down
procedure on X86 architecture.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/mm/mmu.c | 204 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1c0cb51..bb4e571 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -268,6 +268,9 @@ config HAVE_GENERIC_GUP
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 37a902c..bd2d003 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -733,6 +733,177 @@ int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, ssize_t size)
+{
+	WARN_ON(PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(ptep[i]))
+			return;
+	}
+
+	page = pmd_page(*pmdp);
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+#if (CONFIG_PGTABLE_LEVELS > 2)
+static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(pmdp[i]))
+			return;
+	}
+
+	page = pud_page(*pudp);
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+#else
+static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr) { }
+#endif
+
+#if (CONFIG_PGTABLE_LEVELS > 3)
+static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr)
+{
+	struct page *page;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(pudp[i]))
+			return;
+	}
+
+	page = pgd_page(*pgdp);
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+#else
+static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr) { }
+#endif
+
+static void
+remove_pte_table(pmd_t *pmdp, unsigned long addr,
+			unsigned long end, bool sparse_vmap)
+{
+	struct page *page;
+	pte_t *ptep;
+	unsigned long start = addr;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (!pte_present(*ptep))
+			continue;
+
+		if (sparse_vmap) {
+			page = pte_page(READ_ONCE(*ptep));
+			free_hotplug_page_range(page, PAGE_SIZE);
+		}
+		pte_clear(&init_mm, addr, ptep);
+	}
+	flush_tlb_kernel_range(start, end);
+}
+
+static void
+remove_pmd_table(pud_t *pudp, unsigned long addr,
+			unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pte_t *ptep_base;
+	pmd_t *pmdp;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		if (!pmd_present(*pmdp))
+			continue;
+
+		if (pmd_sect(*pmdp)) {
+			if (sparse_vmap) {
+				page = pmd_page(READ_ONCE(*pmdp));
+				free_hotplug_page_range(page, PMD_SIZE);
+			}
+			pmd_clear(pmdp);
+			continue;
+		}
+		ptep_base = pte_offset_kernel(pmdp, 0UL);
+		remove_pte_table(pmdp, addr, next, sparse_vmap);
+		free_pte_table(ptep_base, pmdp, addr);
+	}
+}
+
+static void
+remove_pud_table(pgd_t *pgdp, unsigned long addr,
+			unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pmd_t *pmdp_base;
+	pud_t *pudp;
+
+	for (; addr < end; addr = next) {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		if (!pud_present(*pudp))
+			continue;
+
+		if (pud_sect(*pudp)) {
+			if (sparse_vmap) {
+				page = pud_page(READ_ONCE(*pudp));
+				free_hotplug_page_range(page, PUD_SIZE);
+			}
+			pud_clear(pudp);
+			continue;
+		}
+		pmdp_base = pmd_offset(pudp, 0UL);
+		remove_pmd_table(pudp, addr, next, sparse_vmap);
+		free_pmd_table(pmdp_base, pudp, addr);
+	}
+}
+
+static void
+remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
+{
+	unsigned long addr, next;
+	pud_t *pudp_base;
+	pgd_t *pgdp;
+
+	spin_lock(&init_mm.page_table_lock);
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		if (!pgd_present(*pgdp))
+			continue;
+
+		pudp_base = pud_offset(pgdp, 0UL);
+		remove_pud_table(pgdp, addr, next, sparse_vmap);
+		free_pud_table(pudp_base, pgdp, addr);
+	}
+	spin_unlock(&init_mm.page_table_lock);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -780,6 +951,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	remove_pagetable(start, end, true);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1070,10 +1244,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	WARN_ON(pgdir != init_mm.pgd);
+	remove_pagetable(start, start + size, false);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 		    bool want_memblock)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1081,7 +1261,27 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   altmap, want_memblock);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+					__phys_to_virt(start), size);
+	return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone = page_zone(pfn_to_page(start_pfn));
+	int ret = 0;
+
+	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	if (!ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+					__phys_to_virt(start), size);
+	return ret;
+}
+#endif
 #endif
-- 
2.7.4


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-05-14  9:00 ` [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
@ 2019-05-14  9:05   ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2019-05-14  9:05 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 14.05.19 11:00, Anshuman Khandual wrote:
> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.
> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------
> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc
> 
> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is safe because there is no other memblock (bootmem) allocator user that
> late. So nobody is going to allocate from the removed range just to blow
> up later. Also nobody should be using the bootmem allocated range else
> we wouldn't allow to remove it. So reordering is indeed safe.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  mm/memory_hotplug.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0082d69..71d0d79 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1872,11 +1872,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> +	arch_remove_memory(nid, start, size, NULL);
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(nid, start, size, NULL);
> -
>  	try_offline_node(nid);
>  
>  	mem_hotplug_done();
> 

I think you have to rebase this patch to -next (and soon to linus master).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 4/4] arm64/mm: Enable memory hot remove
  2019-05-14  9:00 ` [PATCH V3 4/4] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-05-14  9:08   ` David Hildenbrand
  2019-05-15 11:49   ` Mark Rutland
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2019-05-14  9:08 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 14.05.19 11:00, Anshuman Khandual wrote:
> Memory removal from an arch perspective involves tearing down two different
> kernel based mappings i.e vmemmap and linear while releasing related page
> table and any mapped pages allocated for given physical memory range to be
> removed.
> 
> Define a common kernel page table tear down helper remove_pagetable() which
> can be used to unmap given kernel virtual address range. In effect it can
> tear down both vmemap or kernel linear mappings. This new helper is called
> from both vmemamp_free() and ___remove_pgd_mapping() during memory removal.
> 
> For linear mapping there are no actual allocated pages which are mapped to
> create the translation. Any pfn on a given entry is derived from physical
> address (__va(PA) --> PA) whose linear translation is to be created. They
> need not be freed as they were never allocated in the first place. But for
> vmemmap which is a real virtual mapping (like vmalloc) physical pages are
> allocated either from buddy or memblock which get mapped in the kernel page
> table. These allocated and mapped pages need to be freed during translation
> tear down. But page table pages need to be freed in both these cases.
> 
> These mappings need to be differentiated while deciding if a mapped page at
> any level i.e [pte|pmd|pud]_page() should be freed or not. Callers for the
> mapping tear down process should pass on 'sparse_vmap' variable identifying
> kernel vmemmap mappings.
> 
> While here update arch_add_mempory() to handle __add_pages() failures by
> just unmapping recently added kernel linear mapping. Now enable memory hot
> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> 
> This implementation is overall inspired from kernel page table tear down
> procedure on X86 architecture.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/Kconfig  |   3 +
>  arch/arm64/mm/mmu.c | 204 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 205 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1c0cb51..bb4e571 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -268,6 +268,9 @@ config HAVE_GENERIC_GUP
>  config ARCH_ENABLE_MEMORY_HOTPLUG
>  	def_bool y
>  
> +config ARCH_ENABLE_MEMORY_HOTREMOVE
> +	def_bool y
> +
>  config SMP
>  	def_bool y
>  
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 37a902c..bd2d003 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -733,6 +733,177 @@ int kern_addr_valid(unsigned long addr)
>  
>  	return pfn_valid(pte_pfn(pte));
>  }
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void free_hotplug_page_range(struct page *page, ssize_t size)
> +{
> +	WARN_ON(PageReserved(page));
> +	free_pages((unsigned long)page_address(page), get_order(size));
> +}
> +
> +static void free_hotplug_pgtable_page(struct page *page)
> +{
> +	free_hotplug_page_range(page, PAGE_SIZE);
> +}
> +
> +static void free_pte_table(pte_t *ptep, pmd_t *pmdp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		if (!pte_none(ptep[i]))
> +			return;
> +	}
> +
> +	page = pmd_page(*pmdp);
> +	pmd_clear(pmdp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +
> +#if (CONFIG_PGTABLE_LEVELS > 2)
> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		if (!pmd_none(pmdp[i]))
> +			return;
> +	}
> +
> +	page = pud_page(*pudp);
> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +#else
> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr) { }
> +#endif
> +
> +#if (CONFIG_PGTABLE_LEVELS > 3)
> +static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		if (!pud_none(pudp[i]))
> +			return;
> +	}
> +
> +	page = pgd_page(*pgdp);
> +	pgd_clear(pgdp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +#else
> +static void free_pud_table(pud_t *pudp, pgd_t *pgdp, unsigned long addr) { }
> +#endif
> +
> +static void
> +remove_pte_table(pmd_t *pmdp, unsigned long addr,
> +			unsigned long end, bool sparse_vmap)
> +{
> +	struct page *page;
> +	pte_t *ptep;
> +	unsigned long start = addr;
> +
> +	for (; addr < end; addr += PAGE_SIZE) {
> +		ptep = pte_offset_kernel(pmdp, addr);
> +		if (!pte_present(*ptep))
> +			continue;
> +
> +		if (sparse_vmap) {
> +			page = pte_page(READ_ONCE(*ptep));
> +			free_hotplug_page_range(page, PAGE_SIZE);
> +		}
> +		pte_clear(&init_mm, addr, ptep);
> +	}
> +	flush_tlb_kernel_range(start, end);
> +}
> +
> +static void
> +remove_pmd_table(pud_t *pudp, unsigned long addr,
> +			unsigned long end, bool sparse_vmap)
> +{
> +	unsigned long next;
> +	struct page *page;
> +	pte_t *ptep_base;
> +	pmd_t *pmdp;
> +
> +	for (; addr < end; addr = next) {
> +		next = pmd_addr_end(addr, end);
> +		pmdp = pmd_offset(pudp, addr);
> +		if (!pmd_present(*pmdp))
> +			continue;
> +
> +		if (pmd_sect(*pmdp)) {
> +			if (sparse_vmap) {
> +				page = pmd_page(READ_ONCE(*pmdp));
> +				free_hotplug_page_range(page, PMD_SIZE);
> +			}
> +			pmd_clear(pmdp);
> +			continue;
> +		}
> +		ptep_base = pte_offset_kernel(pmdp, 0UL);
> +		remove_pte_table(pmdp, addr, next, sparse_vmap);
> +		free_pte_table(ptep_base, pmdp, addr);
> +	}
> +}
> +
> +static void
> +remove_pud_table(pgd_t *pgdp, unsigned long addr,
> +			unsigned long end, bool sparse_vmap)
> +{
> +	unsigned long next;
> +	struct page *page;
> +	pmd_t *pmdp_base;
> +	pud_t *pudp;
> +
> +	for (; addr < end; addr = next) {
> +		next = pud_addr_end(addr, end);
> +		pudp = pud_offset(pgdp, addr);
> +		if (!pud_present(*pudp))
> +			continue;
> +
> +		if (pud_sect(*pudp)) {
> +			if (sparse_vmap) {
> +				page = pud_page(READ_ONCE(*pudp));
> +				free_hotplug_page_range(page, PUD_SIZE);
> +			}
> +			pud_clear(pudp);
> +			continue;
> +		}
> +		pmdp_base = pmd_offset(pudp, 0UL);
> +		remove_pmd_table(pudp, addr, next, sparse_vmap);
> +		free_pmd_table(pmdp_base, pudp, addr);
> +	}
> +}
> +
> +static void
> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
> +{
> +	unsigned long addr, next;
> +	pud_t *pudp_base;
> +	pgd_t *pgdp;
> +
> +	spin_lock(&init_mm.page_table_lock);
> +	for (addr = start; addr < end; addr = next) {
> +		next = pgd_addr_end(addr, end);
> +		pgdp = pgd_offset_k(addr);
> +		if (!pgd_present(*pgdp))
> +			continue;
> +
> +		pudp_base = pud_offset(pgdp, 0UL);
> +		remove_pud_table(pgdp, addr, next, sparse_vmap);
> +		free_pud_table(pudp_base, pgdp, addr);
> +	}
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +#endif
> +
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  #if !ARM64_SWAPPER_USES_SECTION_MAPS
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> @@ -780,6 +951,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap)
>  {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	remove_pagetable(start, end, true);
> +#endif
>  }
>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
>  
> @@ -1070,10 +1244,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
> +{
> +	WARN_ON(pgdir != init_mm.pgd);
> +	remove_pagetable(start, start + size, false);
> +}
> +
>  int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>  		    bool want_memblock)
>  {
> -	int flags = 0;
> +	int ret, flags = 0;
>  
>  	if (rodata_full || debug_pagealloc_enabled())
>  		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> @@ -1081,7 +1261,27 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>  	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>  			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
>  
> -	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
> +	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
>  			   altmap, want_memblock);
> +	if (ret)
> +		__remove_pgd_mapping(swapper_pg_dir,
> +					__phys_to_virt(start), size);
> +	return ret;
>  }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)

You need to rebase to -next (linus master soon). This function is now a
void ...

> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone = page_zone(pfn_to_page(start_pfn));
> +	int ret = 0;
> +
> +	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);

.. and this call can no longer fail :) Which simplifies this patch.

> +	if (!ret)
> +		__remove_pgd_mapping(swapper_pg_dir,
> +					__phys_to_virt(start), size);
> +	return ret;
> +}
> +#endif
>  #endif
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 0/4] arm64/mm: Enable memory hot remove
  2019-05-14  9:00 [PATCH V3 0/4] arm64/mm: Enable memory hot remove Anshuman Khandual
                   ` (3 preceding siblings ...)
  2019-05-14  9:00 ` [PATCH V3 4/4] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-05-14  9:10 ` David Hildenbrand
  4 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2019-05-14  9:10 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 14.05.19 11:00, Anshuman Khandual wrote:
> This series enables memory hot remove on arm64 after fixing a memblock
> removal ordering problem in generic __remove_memory() and kernel page
> table race conditions on arm64. This is based on the following arm64
> working tree.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
> 
> David had pointed out that the following patch is already in next/master
> (58b11e136dcc14358) and will conflict with the last patch here. Will fix
> the conflict once this series gets reviewed and agreed upon.

I should read the cover letter first, so ignore my comments :)

> 
> Author: David Hildenbrand <david@redhat.com>
> Date:   Wed Apr 10 11:02:27 2019 +1000
> 
>     mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail
> 
>     All callers of arch_remove_memory() ignore errors.  And we should really
>     try to remove any errors from the memory removal path.  No more errors are
>     reported from __remove_pages().  BUG() in s390x code in case
>     arch_remove_memory() is triggered.  We may implement that properly later.
>     WARN in case powerpc code failed to remove the section mapping, which is
>     better than ignoring the error completely right now.
> 
> Testing:
> 
> Tested memory hot remove on arm64 for 4K, 16K, 64K page config options with
> all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS combinations. But
> build tested on non arm64 platforms.
> 
> Changes in V3:
>  
> - Implemented most of the suggestions from Mark Rutland for remove_pagetable()
> - Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
> - Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
> - Changed pointer names ('p' at end) and removed tmp from iterations
> - Perform intermediate TLB invalidation while clearing pgtable entries
> - Dropped flush_tlb_kernel_range() in remove_pagetable()
> - Added flush_tlb_kernel_range() in remove_pte_table() instead
> - Renamed page freeing functions for pgtable page and mapped pages
> - Used page range size instead of order while freeing mapped or pgtable pages
> - Removed all PageReserved() handling while freeing mapped or pgtable pages
> - Replaced XXX_index() with XXX_offset() while walking the kernel page table
> - Used READ_ONCE() while fetching individual pgtable entries
> - Taken overall init_mm.page_table_lock instead of just while changing an entry
> - Dropped previously added [pmd|pud]_index() which are not required anymore
> 
> - Added a new patch to protect kernel page table race condtion for ptdump
> - Added a new patch from Mark Rutland to prevent huge-vmap with ptdump
> 
> Changes in V2: (https://lkml.org/lkml/2019/4/14/5)
> 
> - Added all received review and ack tags
> - Split the series from ZONE_DEVICE enablement for better review
> - Moved memblock re-order patch to the front as per Robin Murphy
> - Updated commit message on memblock re-order patch per Michal Hocko
> - Dropped [pmd|pud]_large() definitions
> - Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
> - Removed __meminit and __ref tags as per Oscar Salvador
> - Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
> - Skipped calling into pgtable_page_dtor() for linear mapping page table
>   pages and updated all relevant functions
> 
> Changes in V1: (https://lkml.org/lkml/2019/4/3/28)
> 
> Anshuman Khandual (3):
>   mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
>   arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
>   arm64/mm: Enable memory hot remove
> 
> Mark Rutland (1):
>   arm64/mm: Inhibit huge-vmap with ptdump
> 
>  arch/arm64/Kconfig             |   3 +
>  arch/arm64/mm/mmu.c            | 215 ++++++++++++++++++++++++++++++++++++++++-
>  arch/arm64/mm/ptdump_debugfs.c |   3 +
>  mm/memory_hotplug.c            |   3 +-
>  4 files changed, 217 insertions(+), 7 deletions(-)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-14  9:00 ` [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-05-14  9:16   ` David Hildenbrand
  2019-05-14 15:40   ` Mark Rutland
  2019-05-15 16:58   ` Michal Hocko
  2 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2019-05-14  9:16 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 14.05.19 11:00, Anshuman Khandual wrote:
> The arm64 pagetable dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addressses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove, or when
> installing a huge mapping in the vmalloc region. To avoid racing with these
> cases, take the memory hotplug lock when walking the kernel page table.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/mm/ptdump_debugfs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 064163f..80171d1 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -7,7 +7,10 @@
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
>  	struct ptdump_info *info = m->private;
> +
> +	get_online_mems();

The name of that function is somewhat stale now, it used to refer to
online/offlining of pages only. The underlying lock is the
mem_hotplug_lock. Maybe we should rename that function some day ...

>  	ptdump_walk_pgd(m, info);
> +	put_online_mems();
>  	return 0;

Acked-by: David Hildenbrand <david@redhat.com>

>  }
>  DEFINE_SHOW_ATTRIBUTE(ptdump);
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump
  2019-05-14  9:00 ` [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump Anshuman Khandual
@ 2019-05-14  9:20   ` David Hildenbrand
  2019-05-16  8:38   ` Ard Biesheuvel
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2019-05-14  9:20 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 14.05.19 11:00, Anshuman Khandual wrote:
> From: Mark Rutland <mark.rutland@arm.com>
> 
> The arm64 ptdump code can race with concurrent modification of the
> kernel page tables. At the time this was added, this was sound as:
> 
> * Modifications to leaf entries could result in stale information being
>   logged, but would not result in a functional problem.
> 
> * Boot time modifications to non-leaf entries (e.g. freeing of initmem)
>   were performed when the ptdump code cannot be invoked.
> 
> * At runtime, modifications to non-leaf entries only occurred in the
>   vmalloc region, and these were strictly additive, as intermediate
>   entries were never freed.
> 
> However, since commit:
> 
>   commit 324420bf91f6 ("arm64: add support for ioremap() block mappings")
> 
> ... it has been possible to create huge mappings in the vmalloc area at
> runtime, and as part of this existing intermediate levels of table my be
> removed and freed.
> 
> It's possible for the ptdump code to race with this, and continue to
> walk tables which have been freed (and potentially poisoned or
> reallocated). As a result of this, the ptdump code may dereference bogus
> addresses, which could be fatal.
> 
> Since huge-vmap is a TLB and memory optimization, we can disable it when
> the runtime ptdump code is in use to avoid this problem.

I wonder if there is another way to protect from such a race happening.
(IOW, a lock). But as you say, it's a debug feature, so this is an easy fix.

Looks good to me (with limited arm64 code insight :) )

> 
> Fixes: 324420bf91f60582 ("arm64: add support for ioremap() block mappings")
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ef82312..37a902c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -955,13 +955,18 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys)
>  
>  int __init arch_ioremap_pud_supported(void)
>  {
> -	/* only 4k granule supports level 1 block mappings */
> -	return IS_ENABLED(CONFIG_ARM64_4K_PAGES);
> +	/*
> +	 * Only 4k granule supports level 1 block mappings.
> +	 * SW table walks can't handle removal of intermediate entries.
> +	 */
> +	return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
> +	       !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
>  }
>  
>  int __init arch_ioremap_pmd_supported(void)
>  {
> -	return 1;
> +	/* See arch_ioremap_pud_supported() */
> +	return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
>  }
>  
>  int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-14  9:00 ` [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
  2019-05-14  9:16   ` David Hildenbrand
@ 2019-05-14 15:40   ` Mark Rutland
  2019-05-15  1:56     ` Anshuman Khandual
  2019-05-15 16:58   ` Michal Hocko
  2 siblings, 1 reply; 28+ messages in thread
From: Mark Rutland @ 2019-05-14 15:40 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, mgorman, james.morse, robin.murphy, cpandya,
	arunks, dan.j.williams, osalvador, david, cai, logang, ira.weiny

On Tue, May 14, 2019 at 02:30:05PM +0530, Anshuman Khandual wrote:
> The arm64 pagetable dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addressses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove, or when
> installing a huge mapping in the vmalloc region. To avoid racing with these
> cases, take the memory hotplug lock when walking the kernel page table.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Can we please move this after the next patch (which addresses the huge
vmap case), and change the last paragraph to:

  Intermediate levels of table may by freed during memory hot-remove,
  which will be enabled by a subsequent patch. To avoid racing with
  this, take the memory hotplug lock when walking the kernel page table.

With that, this looks good to me.

Thanks,
Mark.

> ---
>  arch/arm64/mm/ptdump_debugfs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 064163f..80171d1 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -7,7 +7,10 @@
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
>  	struct ptdump_info *info = m->private;
> +
> +	get_online_mems();
>  	ptdump_walk_pgd(m, info);
> +	put_online_mems();
>  	return 0;
>  }
>  DEFINE_SHOW_ATTRIBUTE(ptdump);
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-14 15:40   ` Mark Rutland
@ 2019-05-15  1:56     ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-15  1:56 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, mgorman, james.morse, robin.murphy, cpandya,
	arunks, dan.j.williams, osalvador, david, cai, logang, ira.weiny



On 05/14/2019 09:10 PM, Mark Rutland wrote:
> On Tue, May 14, 2019 at 02:30:05PM +0530, Anshuman Khandual wrote:
>> The arm64 pagetable dump code can race with concurrent modification of the
>> kernel page tables. When a leaf entries are modified concurrently, the dump
>> code may log stale or inconsistent information for a VA range, but this is
>> otherwise not harmful.
>>
>> When intermediate levels of table are freed, the dump code will continue to
>> use memory which has been freed and potentially reallocated for another
>> purpose. In such cases, the dump code may dereference bogus addressses,
>> leading to a number of potential problems.
>>
>> Intermediate levels of table may by freed during memory hot-remove, or when
>> installing a huge mapping in the vmalloc region. To avoid racing with these
>> cases, take the memory hotplug lock when walking the kernel page table.
>>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Can we please move this after the next patch (which addresses the huge
> vmap case), and change the last paragraph to:
> 
>   Intermediate levels of table may by freed during memory hot-remove,
>   which will be enabled by a subsequent patch. To avoid racing with
>   this, take the memory hotplug lock when walking the kernel page table.
> 
> With that, this looks good to me.

Sure will do.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 4/4] arm64/mm: Enable memory hot remove
  2019-05-14  9:00 ` [PATCH V3 4/4] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-05-14  9:08   ` David Hildenbrand
@ 2019-05-15 11:49   ` Mark Rutland
  2019-05-16  5:34     ` Anshuman Khandual
  1 sibling, 1 reply; 28+ messages in thread
From: Mark Rutland @ 2019-05-15 11:49 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, mgorman, james.morse, robin.murphy, cpandya,
	arunks, dan.j.williams, osalvador, david, cai, logang, ira.weiny

Hi Anshuman,

On Tue, May 14, 2019 at 02:30:07PM +0530, Anshuman Khandual wrote:
> Memory removal from an arch perspective involves tearing down two different
> kernel based mappings i.e vmemmap and linear while releasing related page
> table and any mapped pages allocated for given physical memory range to be
> removed.
> 
> Define a common kernel page table tear down helper remove_pagetable() which
> can be used to unmap given kernel virtual address range. In effect it can
> tear down both vmemap or kernel linear mappings. This new helper is called
> from both vmemamp_free() and ___remove_pgd_mapping() during memory removal.
> 
> For linear mapping there are no actual allocated pages which are mapped to
> create the translation. Any pfn on a given entry is derived from physical
> address (__va(PA) --> PA) whose linear translation is to be created. They
> need not be freed as they were never allocated in the first place. But for
> vmemmap which is a real virtual mapping (like vmalloc) physical pages are
> allocated either from buddy or memblock which get mapped in the kernel page
> table. These allocated and mapped pages need to be freed during translation
> tear down. But page table pages need to be freed in both these cases.

As previously discussed, we should only hot-remove memory which was
hot-added, so we shouldn't encounter memory allocated from memblock.

> These mappings need to be differentiated while deciding if a mapped page at
> any level i.e [pte|pmd|pud]_page() should be freed or not. Callers for the
> mapping tear down process should pass on 'sparse_vmap' variable identifying
> kernel vmemmap mappings.

I think that you can simplify the paragraphs above down to:

  The arch code for hot-remove must tear down portions of the linear map
  and vmemmap corresponding to memory being removed. In both cases the
  page tables mapping these regions must be freed, and when sparse
  vmemmap is in use the memory backing the vmemmap must also be freed.

  This patch adds a new remove_pagetable() helper which can be used to
  tear down either region, and calls it from vmemmap_free() and
  ___remove_pgd_mapping(). The sparse_vmap argument determines whether
  the backing memory will be freed.

Could you add a paragraph describing when we can encounter partial
tables (for which we need the p??_none() checks? IIUC that's not just
for cleaning up a failed hot-add, and it would be good to call that out.

> While here update arch_add_mempory() to handle __add_pages() failures by
> just unmapping recently added kernel linear mapping. Now enable memory hot
> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

Nit: s/arch_add_mempory/arch_add_memory/.

[...]

> +#if (CONFIG_PGTABLE_LEVELS > 2)
> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
> +{
> +	struct page *page;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		if (!pmd_none(pmdp[i]))
> +			return;
> +	}
> +
> +	page = pud_page(*pudp);
> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	free_hotplug_pgtable_page(page);
> +}
> +#else
> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr) { }
> +#endif

Can we fold the check in and remove the ifdeferry? e.g.

static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
{
	struct page *page;
	int i;

	if (CONFIG_PGTABLE_LEVELS <= 2)
		return;
	
	...
}

... that would ensure that we always got build coverage here, and
minimize duplication. We do similar in map_kernel() and
early_fixmap_init() today.

Likewise for the other levels.

For arm64, the general policy is to use READ_ONCE() when reading a page
table entry (even if not strictly necessary), so please do so
consistently.

[...]

> +static void
> +remove_pte_table(pmd_t *pmdp, unsigned long addr,
> +			unsigned long end, bool sparse_vmap)
> +{
> +	struct page *page;
> +	pte_t *ptep;
> +	unsigned long start = addr;
> +
> +	for (; addr < end; addr += PAGE_SIZE) {
> +		ptep = pte_offset_kernel(pmdp, addr);
> +		if (!pte_present(*ptep))
> +			continue;
> +
> +		if (sparse_vmap) {
> +			page = pte_page(READ_ONCE(*ptep));
> +			free_hotplug_page_range(page, PAGE_SIZE);
> +		}
> +		pte_clear(&init_mm, addr, ptep);
> +	}
> +	flush_tlb_kernel_range(start, end);
> +}

Please use a temporary pte variable here, e.g.

static void remove_pte_table(pmd_t *pmdp, unsigned long addr,
			     unsigned long end, bool sparse_vmap)
{
	unsigned long start = addr;
	struct page *page;
	pte_t *ptep, pte;

	for (; addr < end; addr += PAGE_SIZE) {
		ptep = pte_offset_kernel(pmdp, addr);
		pte = READ_ONCE(*ptep);

		if (!pte_present(pte))
			continue;
		
		if (sparse_vmap) {
			page = pte_page(pte);
			free_hotplug_page_range(page, PAGE_SIZE);
		}

		pte_clear(&init_mm, addr, ptep);
	}

	flush_tlb_kernel_range(start, end);
}

Likewise for the other levels.

[...]

> +static void
> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
> +{
> +	unsigned long addr, next;
> +	pud_t *pudp_base;
> +	pgd_t *pgdp;
> +
> +	spin_lock(&init_mm.page_table_lock);

It would be good to explain why we need to take the ptl here.

IIUC that shouldn't be necessary for the linear map. Am I mistaken?

Is there a specific race when tearing down the vmemmap?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-14  9:00 ` [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
  2019-05-14  9:16   ` David Hildenbrand
  2019-05-14 15:40   ` Mark Rutland
@ 2019-05-15 16:58   ` Michal Hocko
  2019-05-16 10:23     ` Mark Rutland
  2 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-05-15 16:58 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> The arm64 pagetable dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addressses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove, or when
> installing a huge mapping in the vmalloc region. To avoid racing with these
> cases, take the memory hotplug lock when walking the kernel page table.

Why is this a problem only on arm64 and why do we even care for debugfs?
Does anybody rely on this thing to be reliable? Do we even need it? Who
is using the file?

I am asking because I would really love to make mem hotplug locking less
scattered outside of the core MM than more. Most users simply shouldn't
care. Pfn walkers should rely on pfn_to_online_page.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 4/4] arm64/mm: Enable memory hot remove
  2019-05-15 11:49   ` Mark Rutland
@ 2019-05-16  5:34     ` Anshuman Khandual
  2019-05-16 10:57       ` Mark Rutland
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-16  5:34 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, mgorman, james.morse, robin.murphy, cpandya,
	arunks, dan.j.williams, osalvador, david, cai, logang, ira.weiny

On 05/15/2019 05:19 PM, Mark Rutland wrote:
> Hi Anshuman,
> 
> On Tue, May 14, 2019 at 02:30:07PM +0530, Anshuman Khandual wrote:
>> Memory removal from an arch perspective involves tearing down two different
>> kernel based mappings i.e vmemmap and linear while releasing related page
>> table and any mapped pages allocated for given physical memory range to be
>> removed.
>>
>> Define a common kernel page table tear down helper remove_pagetable() which
>> can be used to unmap given kernel virtual address range. In effect it can
>> tear down both vmemap or kernel linear mappings. This new helper is called
>> from both vmemamp_free() and ___remove_pgd_mapping() during memory removal.
>>
>> For linear mapping there are no actual allocated pages which are mapped to
>> create the translation. Any pfn on a given entry is derived from physical
>> address (__va(PA) --> PA) whose linear translation is to be created. They
>> need not be freed as they were never allocated in the first place. But for
>> vmemmap which is a real virtual mapping (like vmalloc) physical pages are
>> allocated either from buddy or memblock which get mapped in the kernel page
>> table. These allocated and mapped pages need to be freed during translation
>> tear down. But page table pages need to be freed in both these cases.
> 
> As previously discussed, we should only hot-remove memory which was
> hot-added, so we shouldn't encounter memory allocated from memblock.

Right, not applicable any more. Will drop this word.

> 
>> These mappings need to be differentiated while deciding if a mapped page at
>> any level i.e [pte|pmd|pud]_page() should be freed or not. Callers for the
>> mapping tear down process should pass on 'sparse_vmap' variable identifying
>> kernel vmemmap mappings.
> 
> I think that you can simplify the paragraphs above down to:
> 
>   The arch code for hot-remove must tear down portions of the linear map
>   and vmemmap corresponding to memory being removed. In both cases the
>   page tables mapping these regions must be freed, and when sparse
>   vmemmap is in use the memory backing the vmemmap must also be freed.
> 
>   This patch adds a new remove_pagetable() helper which can be used to
>   tear down either region, and calls it from vmemmap_free() and
>   ___remove_pgd_mapping(). The sparse_vmap argument determines whether
>   the backing memory will be freed.

The current one is bit more descriptive on detail. Anyways will replace with
the above writeup if that is preferred.

> 
> Could you add a paragraph describing when we can encounter partial
> tables (for which we need the p??_none() checks? IIUC that's not just> for cleaning up a failed hot-add, and it would be good to call that out.

Sure, will do.

> 
>> While here update arch_add_mempory() to handle __add_pages() failures by
>> just unmapping recently added kernel linear mapping. Now enable memory hot
>> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> 
> Nit: s/arch_add_mempory/arch_add_memory/.

Oops, will do.

> 
> [...]
> 
>> +#if (CONFIG_PGTABLE_LEVELS > 2)
>> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
>> +{
>> +	struct page *page;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>> +		if (!pmd_none(pmdp[i]))
>> +			return;
>> +	}
>> +
>> +	page = pud_page(*pudp);
>> +	pud_clear(pudp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	free_hotplug_pgtable_page(page);
>> +}
>> +#else
>> +static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr) { }
>> +#endif
> 
> Can we fold the check in and remove the ifdeferry? e.g.
> 
> static void free_pmd_table(pmd_t *pmdp, pud_t *pudp, unsigned long addr)
> {
> 	struct page *page;
> 	int i;
> 
> 	if (CONFIG_PGTABLE_LEVELS <= 2)
> 		return;
> 	
> 	...
> }
> 
> ... that would ensure that we always got build coverage here, and

Thats true. This will get compiled for all combinations.

> minimize duplication. We do similar in map_kernel() and
> early_fixmap_init() today.
> 
> Likewise for the other levels.

Sure, will do.

> 
> For arm64, the general policy is to use READ_ONCE() when reading a page
> table entry (even if not strictly necessary), so please do so
> consistently.

For the likes "page = p???_page(*p???p)" which got missed ? Will fix it.

> 
> [...]
> 
>> +static void
>> +remove_pte_table(pmd_t *pmdp, unsigned long addr,
>> +			unsigned long end, bool sparse_vmap)
>> +{
>> +	struct page *page;
>> +	pte_t *ptep;
>> +	unsigned long start = addr;
>> +
>> +	for (; addr < end; addr += PAGE_SIZE) {
>> +		ptep = pte_offset_kernel(pmdp, addr);
>> +		if (!pte_present(*ptep))
>> +			continue;
>> +
>> +		if (sparse_vmap) {
>> +			page = pte_page(READ_ONCE(*ptep));
>> +			free_hotplug_page_range(page, PAGE_SIZE);
>> +		}
>> +		pte_clear(&init_mm, addr, ptep);
>> +	}
>> +	flush_tlb_kernel_range(start, end);
>> +}
> 
> Please use a temporary pte variable here, e.g.
> 
> static void remove_pte_table(pmd_t *pmdp, unsigned long addr,
> 			     unsigned long end, bool sparse_vmap)
> {
> 	unsigned long start = addr;
> 	struct page *page;
> 	pte_t *ptep, pte;
> 
> 	for (; addr < end; addr += PAGE_SIZE) {
> 		ptep = pte_offset_kernel(pmdp, addr);
> 		pte = READ_ONCE(*ptep);
> 
> 		if (!pte_present(pte))
> 			continue;
> 		
> 		if (sparse_vmap) {
> 			page = pte_page(pte);
> 			free_hotplug_page_range(page, PAGE_SIZE);
> 		}
> 
> 		pte_clear(&init_mm, addr, ptep);
> 	}
> 
> 	flush_tlb_kernel_range(start, end);
> }
> 
> Likewise for the other levels.

Makes sense. Will do.

> 
> [...]
> 
>> +static void
>> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
>> +{
>> +	unsigned long addr, next;
>> +	pud_t *pudp_base;
>> +	pgd_t *pgdp;
>> +
>> +	spin_lock(&init_mm.page_table_lock);
> 
> It would be good to explain why we need to take the ptl here.

Will update both commit message and add an in-code comment here.

> 
> IIUC that shouldn't be necessary for the linear map. Am I mistaken?

Its not absolutely necessary for linear map right now because both memory hot
plug & ptdump which modifies or walks the page table ranges respectively take
memory hotplug lock. That apart, no other callers creates or destroys linear
mapping at runtime.

> 
> Is there a specific race when tearing down the vmemmap?

This is trickier than linear map. vmemmap additions would be protected with
memory hotplug lock but this can potential collide with vmalloc/IO regions.
Even if they dont right now that will be because they dont share intermediate
page table levels.

Memory hot-remove is not a very performance critical path. Not even as critical
as memory hot add. Hence its not worth relying on current non-overlapping kernel
virtual address range placement and reason it for not taking this critical lock
which might be problematic if things change later. Lets be on the safer side and
keep this lock.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump
  2019-05-14  9:00 ` [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump Anshuman Khandual
  2019-05-14  9:20   ` David Hildenbrand
@ 2019-05-16  8:38   ` Ard Biesheuvel
  1 sibling, 0 replies; 28+ messages in thread
From: Ard Biesheuvel @ 2019-05-16  8:38 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux Kernel Mailing List, linux-arm-kernel, Andrew Morton,
	Catalin Marinas, Will Deacon, Mark Rutland, Michal Hocko,
	ira.weiny, david, Robin Murphy, Qian Cai, Logan Gunthorpe,
	James Morse, cpandya, arunks, Williams, Dan J, Mel Gorman,
	osalvador

On Tue, 14 May 2019 at 11:00, Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> From: Mark Rutland <mark.rutland@arm.com>
>
> The arm64 ptdump code can race with concurrent modification of the
> kernel page tables. At the time this was added, this was sound as:
>
> * Modifications to leaf entries could result in stale information being
>   logged, but would not result in a functional problem.
>
> * Boot time modifications to non-leaf entries (e.g. freeing of initmem)
>   were performed when the ptdump code cannot be invoked.
>
> * At runtime, modifications to non-leaf entries only occurred in the
>   vmalloc region, and these were strictly additive, as intermediate
>   entries were never freed.
>
> However, since commit:
>
>   commit 324420bf91f6 ("arm64: add support for ioremap() block mappings")
>
> ... it has been possible to create huge mappings in the vmalloc area at
> runtime, and as part of this existing intermediate levels of table my be
> removed and freed.
>
> It's possible for the ptdump code to race with this, and continue to
> walk tables which have been freed (and potentially poisoned or
> reallocated). As a result of this, the ptdump code may dereference bogus
> addresses, which could be fatal.
>
> Since huge-vmap is a TLB and memory optimization, we can disable it when
> the runtime ptdump code is in use to avoid this problem.
>

The reason I added this originally is so that we don't trigger a
warning when unmapping vmappings that use 2 MB block mappings, which
happens when we free the kernel's .init segment. The ability to create
such mappings is indeed an optimization that the kernel mapping code
does not rely on.


> Fixes: 324420bf91f60582 ("arm64: add support for ioremap() block mappings")
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>

Acked-by: Ard Biesheuvel <ard.biesheuvel@arm.com>

> ---
>  arch/arm64/mm/mmu.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ef82312..37a902c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -955,13 +955,18 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys)
>
>  int __init arch_ioremap_pud_supported(void)
>  {
> -       /* only 4k granule supports level 1 block mappings */
> -       return IS_ENABLED(CONFIG_ARM64_4K_PAGES);
> +       /*
> +        * Only 4k granule supports level 1 block mappings.
> +        * SW table walks can't handle removal of intermediate entries.
> +        */
> +       return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
> +              !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
>  }
>
>  int __init arch_ioremap_pmd_supported(void)
>  {
> -       return 1;
> +       /* See arch_ioremap_pud_supported() */
> +       return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
>  }
>
>  int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
> --
> 2.7.4
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-15 16:58   ` Michal Hocko
@ 2019-05-16 10:23     ` Mark Rutland
  2019-05-16 11:05       ` Michal Hocko
  2019-05-16 11:06       ` Anshuman Khandual
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Rutland @ 2019-05-16 10:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

Hi Michal,

On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > The arm64 pagetable dump code can race with concurrent modification of the
> > kernel page tables. When a leaf entries are modified concurrently, the dump
> > code may log stale or inconsistent information for a VA range, but this is
> > otherwise not harmful.
> > 
> > When intermediate levels of table are freed, the dump code will continue to
> > use memory which has been freed and potentially reallocated for another
> > purpose. In such cases, the dump code may dereference bogus addressses,
> > leading to a number of potential problems.
> > 
> > Intermediate levels of table may by freed during memory hot-remove, or when
> > installing a huge mapping in the vmalloc region. To avoid racing with these
> > cases, take the memory hotplug lock when walking the kernel page table.
> 
> Why is this a problem only on arm64 

It looks like it's not -- I think we're just the first to realise this.

AFAICT x86's debugfs ptdump has the same issue if run conccurently with
memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
would have the same issue.

> and why do we even care for debugfs? Does anybody rely on this thing
> to be reliable? Do we even need it? Who is using the file?

The debugfs part is used intermittently by a few people working on the
arm64 kernel page tables. We use that both to sanity-check that kernel
page tables are created/updated correctly after changes to the arm64 mmu
code, and also to debug issues if/when we encounter issues that appear
to be the result of kernel page table corruption.

So while it's rare to need it, it's really useful to have when we do
need it, and I'd rather not remove it. I'd also rather that it didn't
have latent issues where we can accidentally crash the kernel when using
it, which is what this patch is addressing.

> I am asking because I would really love to make mem hotplug locking less
> scattered outside of the core MM than more. Most users simply shouldn't
> care. Pfn walkers should rely on pfn_to_online_page.

I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
doesn't ensure that the page remains online. Is there a way to achieve
that other than get_online_mems()?

The big problem for the ptdump code is when tables are freed, since the
pages can be reused elsewhere (or hot-removed), causing the ptdump code
to explode.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 4/4] arm64/mm: Enable memory hot remove
  2019-05-16  5:34     ` Anshuman Khandual
@ 2019-05-16 10:57       ` Mark Rutland
  2019-05-17  3:15         ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Rutland @ 2019-05-16 10:57 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, mgorman, james.morse, robin.murphy, cpandya,
	arunks, dan.j.williams, osalvador, david, cai, logang, ira.weiny

On Thu, May 16, 2019 at 11:04:48AM +0530, Anshuman Khandual wrote:
> On 05/15/2019 05:19 PM, Mark Rutland wrote:
> > On Tue, May 14, 2019 at 02:30:07PM +0530, Anshuman Khandual wrote:
> >> Memory removal from an arch perspective involves tearing down two different
> >> kernel based mappings i.e vmemmap and linear while releasing related page
> >> table and any mapped pages allocated for given physical memory range to be
> >> removed.
> >>
> >> Define a common kernel page table tear down helper remove_pagetable() which
> >> can be used to unmap given kernel virtual address range. In effect it can
> >> tear down both vmemap or kernel linear mappings. This new helper is called
> >> from both vmemamp_free() and ___remove_pgd_mapping() during memory removal.
> >>
> >> For linear mapping there are no actual allocated pages which are mapped to
> >> create the translation. Any pfn on a given entry is derived from physical
> >> address (__va(PA) --> PA) whose linear translation is to be created. They
> >> need not be freed as they were never allocated in the first place. But for
> >> vmemmap which is a real virtual mapping (like vmalloc) physical pages are
> >> allocated either from buddy or memblock which get mapped in the kernel page
> >> table. These allocated and mapped pages need to be freed during translation
> >> tear down. But page table pages need to be freed in both these cases.
> > 
> > As previously discussed, we should only hot-remove memory which was
> > hot-added, so we shouldn't encounter memory allocated from memblock.
> 
> Right, not applicable any more. Will drop this word.
> 
> >> These mappings need to be differentiated while deciding if a mapped page at
> >> any level i.e [pte|pmd|pud]_page() should be freed or not. Callers for the
> >> mapping tear down process should pass on 'sparse_vmap' variable identifying
> >> kernel vmemmap mappings.
> > 
> > I think that you can simplify the paragraphs above down to:
> > 
> >   The arch code for hot-remove must tear down portions of the linear map
> >   and vmemmap corresponding to memory being removed. In both cases the
> >   page tables mapping these regions must be freed, and when sparse
> >   vmemmap is in use the memory backing the vmemmap must also be freed.
> > 
> >   This patch adds a new remove_pagetable() helper which can be used to
> >   tear down either region, and calls it from vmemmap_free() and
> >   ___remove_pgd_mapping(). The sparse_vmap argument determines whether
> >   the backing memory will be freed.
> 
> The current one is bit more descriptive on detail. Anyways will replace with
> the above writeup if that is preferred.

I would prefer the suggested form above, as it's easier to extract the
necessary details from it.

[...]

> >> +static void
> >> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
> >> +{
> >> +	unsigned long addr, next;
> >> +	pud_t *pudp_base;
> >> +	pgd_t *pgdp;
> >> +
> >> +	spin_lock(&init_mm.page_table_lock);
> > 
> > It would be good to explain why we need to take the ptl here.
> 
> Will update both commit message and add an in-code comment here.
> 
> > 
> > IIUC that shouldn't be necessary for the linear map. Am I mistaken?
> 
> Its not absolutely necessary for linear map right now because both memory hot
> plug & ptdump which modifies or walks the page table ranges respectively take
> memory hotplug lock. That apart, no other callers creates or destroys linear
> mapping at runtime.
> 
> > 
> > Is there a specific race when tearing down the vmemmap?
> 
> This is trickier than linear map. vmemmap additions would be protected with
> memory hotplug lock but this can potential collide with vmalloc/IO regions.
> Even if they dont right now that will be because they dont share intermediate
> page table levels.

Sure; if we could just state something like:

  The vmemmap region may share levels of table with the vmalloc region.
  Take the ptl so that we can safely free potentially-sahred tables.

... I think that would be sufficient.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-16 10:23     ` Mark Rutland
@ 2019-05-16 11:05       ` Michal Hocko
  2019-05-22 16:42         ` Mark Rutland
  2019-05-16 11:06       ` Anshuman Khandual
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-05-16 11:05 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> Hi Michal,
> 
> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > > The arm64 pagetable dump code can race with concurrent modification of the
> > > kernel page tables. When a leaf entries are modified concurrently, the dump
> > > code may log stale or inconsistent information for a VA range, but this is
> > > otherwise not harmful.
> > > 
> > > When intermediate levels of table are freed, the dump code will continue to
> > > use memory which has been freed and potentially reallocated for another
> > > purpose. In such cases, the dump code may dereference bogus addressses,
> > > leading to a number of potential problems.
> > > 
> > > Intermediate levels of table may by freed during memory hot-remove, or when
> > > installing a huge mapping in the vmalloc region. To avoid racing with these
> > > cases, take the memory hotplug lock when walking the kernel page table.
> > 
> > Why is this a problem only on arm64 
> 
> It looks like it's not -- I think we're just the first to realise this.
> 
> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> would have the same issue.
> 
> > and why do we even care for debugfs? Does anybody rely on this thing
> > to be reliable? Do we even need it? Who is using the file?
> 
> The debugfs part is used intermittently by a few people working on the
> arm64 kernel page tables. We use that both to sanity-check that kernel
> page tables are created/updated correctly after changes to the arm64 mmu
> code, and also to debug issues if/when we encounter issues that appear
> to be the result of kernel page table corruption.

OK, I see. Thanks for the clarification.

> So while it's rare to need it, it's really useful to have when we do
> need it, and I'd rather not remove it. I'd also rather that it didn't
> have latent issues where we can accidentally crash the kernel when using
> it, which is what this patch is addressing.

While I agree, do we rather want to document that you shouldn't be using
the debugging tool while the hotplug is ongoing because you might get a
garbage or crash the kernel in the worst case? In other words is the
absolute correctness worth the additional maint. burden wrt. to future
hotplug changes?

> > I am asking because I would really love to make mem hotplug locking less
> > scattered outside of the core MM than more. Most users simply shouldn't
> > care. Pfn walkers should rely on pfn_to_online_page.
> 
> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> doesn't ensure that the page remains online. Is there a way to achieve
> that other than get_online_mems()?

You have to pin the page to make sure the hotplug is not going to
offline it.

> The big problem for the ptdump code is when tables are freed, since the
> pages can be reused elsewhere (or hot-removed), causing the ptdump code
> to explode.

Yes, I see the danger. I am just wondering whether living with that is
reasonable considering this is a debugfs code.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-16 10:23     ` Mark Rutland
  2019-05-16 11:05       ` Michal Hocko
@ 2019-05-16 11:06       ` Anshuman Khandual
  2019-05-16 11:16         ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-16 11:06 UTC (permalink / raw)
  To: Mark Rutland, Michal Hocko
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mgorman, james.morse, robin.murphy, cpandya, arunks,
	dan.j.williams, osalvador, david, cai, logang, ira.weiny

On 05/16/2019 03:53 PM, Mark Rutland wrote:
> Hi Michal,
> 
> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>> The arm64 pagetable dump code can race with concurrent modification of the
>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>> code may log stale or inconsistent information for a VA range, but this is
>>> otherwise not harmful.
>>>
>>> When intermediate levels of table are freed, the dump code will continue to
>>> use memory which has been freed and potentially reallocated for another
>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>> leading to a number of potential problems.
>>>
>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>> cases, take the memory hotplug lock when walking the kernel page table.
>>
>> Why is this a problem only on arm64 
> 
> It looks like it's not -- I think we're just the first to realise this.
> 
> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> would have the same issue.
> 
>> and why do we even care for debugfs? Does anybody rely on this thing
>> to be reliable? Do we even need it? Who is using the file?
> 
> The debugfs part is used intermittently by a few people working on the
> arm64 kernel page tables. We use that both to sanity-check that kernel
> page tables are created/updated correctly after changes to the arm64 mmu
> code, and also to debug issues if/when we encounter issues that appear
> to be the result of kernel page table corruption.
> 
> So while it's rare to need it, it's really useful to have when we do
> need it, and I'd rather not remove it. I'd also rather that it didn't
> have latent issues where we can accidentally crash the kernel when using
> it, which is what this patch is addressing.
> 
>> I am asking because I would really love to make mem hotplug locking less
>> scattered outside of the core MM than more. Most users simply shouldn't
>> care. Pfn walkers should rely on pfn_to_online_page.
> 
> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> doesn't ensure that the page remains online. Is there a way to achieve
> that other than get_online_mems()?

Still wondering how pfn_to_online_page() is applicable here. It validates
a given PFN and whether its online from sparse section mapping perspective
before giving it's struct page. IIUC it is used during a linear scanning
of a physical address range not for a page table walk. So how it can solve
the problem when a struct page which was used as an intermediate level page
table page gets released back to the buddy from another concurrent thread ?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-16 11:06       ` Anshuman Khandual
@ 2019-05-16 11:16         ` Michal Hocko
  2019-05-23  8:40           ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-05-16 11:16 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Mark Rutland, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

On Thu 16-05-19 16:36:12, Anshuman Khandual wrote:
> On 05/16/2019 03:53 PM, Mark Rutland wrote:
> > Hi Michal,
> > 
> > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> >> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> >>> The arm64 pagetable dump code can race with concurrent modification of the
> >>> kernel page tables. When a leaf entries are modified concurrently, the dump
> >>> code may log stale or inconsistent information for a VA range, but this is
> >>> otherwise not harmful.
> >>>
> >>> When intermediate levels of table are freed, the dump code will continue to
> >>> use memory which has been freed and potentially reallocated for another
> >>> purpose. In such cases, the dump code may dereference bogus addressses,
> >>> leading to a number of potential problems.
> >>>
> >>> Intermediate levels of table may by freed during memory hot-remove, or when
> >>> installing a huge mapping in the vmalloc region. To avoid racing with these
> >>> cases, take the memory hotplug lock when walking the kernel page table.
> >>
> >> Why is this a problem only on arm64 
> > 
> > It looks like it's not -- I think we're just the first to realise this.
> > 
> > AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> > memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> > would have the same issue.
> > 
> >> and why do we even care for debugfs? Does anybody rely on this thing
> >> to be reliable? Do we even need it? Who is using the file?
> > 
> > The debugfs part is used intermittently by a few people working on the
> > arm64 kernel page tables. We use that both to sanity-check that kernel
> > page tables are created/updated correctly after changes to the arm64 mmu
> > code, and also to debug issues if/when we encounter issues that appear
> > to be the result of kernel page table corruption.
> > 
> > So while it's rare to need it, it's really useful to have when we do
> > need it, and I'd rather not remove it. I'd also rather that it didn't
> > have latent issues where we can accidentally crash the kernel when using
> > it, which is what this patch is addressing.
> > 
> >> I am asking because I would really love to make mem hotplug locking less
> >> scattered outside of the core MM than more. Most users simply shouldn't
> >> care. Pfn walkers should rely on pfn_to_online_page.
> > 
> > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > doesn't ensure that the page remains online. Is there a way to achieve
> > that other than get_online_mems()?
> 
> Still wondering how pfn_to_online_page() is applicable here. It validates
> a given PFN and whether its online from sparse section mapping perspective
> before giving it's struct page. IIUC it is used during a linear scanning
> of a physical address range not for a page table walk. So how it can solve
> the problem when a struct page which was used as an intermediate level page
> table page gets released back to the buddy from another concurrent thread ?

Well, my comment about pfn_to_online_page was more generic and it might
not apply to this specific case. I meant to say that the code outside of
the core MM shouldn't really care about the hotplug locking.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 4/4] arm64/mm: Enable memory hot remove
  2019-05-16 10:57       ` Mark Rutland
@ 2019-05-17  3:15         ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-17  3:15 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, mgorman, james.morse, robin.murphy, cpandya,
	arunks, dan.j.williams, osalvador, david, cai, logang, ira.weiny



On 05/16/2019 04:27 PM, Mark Rutland wrote:
> On Thu, May 16, 2019 at 11:04:48AM +0530, Anshuman Khandual wrote:
>> On 05/15/2019 05:19 PM, Mark Rutland wrote:
>>> On Tue, May 14, 2019 at 02:30:07PM +0530, Anshuman Khandual wrote:
>>>> Memory removal from an arch perspective involves tearing down two different
>>>> kernel based mappings i.e vmemmap and linear while releasing related page
>>>> table and any mapped pages allocated for given physical memory range to be
>>>> removed.
>>>>
>>>> Define a common kernel page table tear down helper remove_pagetable() which
>>>> can be used to unmap given kernel virtual address range. In effect it can
>>>> tear down both vmemap or kernel linear mappings. This new helper is called
>>>> from both vmemamp_free() and ___remove_pgd_mapping() during memory removal.
>>>>
>>>> For linear mapping there are no actual allocated pages which are mapped to
>>>> create the translation. Any pfn on a given entry is derived from physical
>>>> address (__va(PA) --> PA) whose linear translation is to be created. They
>>>> need not be freed as they were never allocated in the first place. But for
>>>> vmemmap which is a real virtual mapping (like vmalloc) physical pages are
>>>> allocated either from buddy or memblock which get mapped in the kernel page
>>>> table. These allocated and mapped pages need to be freed during translation
>>>> tear down. But page table pages need to be freed in both these cases.
>>>
>>> As previously discussed, we should only hot-remove memory which was
>>> hot-added, so we shouldn't encounter memory allocated from memblock.
>>
>> Right, not applicable any more. Will drop this word.
>>
>>>> These mappings need to be differentiated while deciding if a mapped page at
>>>> any level i.e [pte|pmd|pud]_page() should be freed or not. Callers for the
>>>> mapping tear down process should pass on 'sparse_vmap' variable identifying
>>>> kernel vmemmap mappings.
>>>
>>> I think that you can simplify the paragraphs above down to:
>>>
>>>   The arch code for hot-remove must tear down portions of the linear map
>>>   and vmemmap corresponding to memory being removed. In both cases the
>>>   page tables mapping these regions must be freed, and when sparse
>>>   vmemmap is in use the memory backing the vmemmap must also be freed.
>>>
>>>   This patch adds a new remove_pagetable() helper which can be used to
>>>   tear down either region, and calls it from vmemmap_free() and
>>>   ___remove_pgd_mapping(). The sparse_vmap argument determines whether
>>>   the backing memory will be freed.
>>
>> The current one is bit more descriptive on detail. Anyways will replace with
>> the above writeup if that is preferred.
> 
> I would prefer the suggested form above, as it's easier to extract the
> necessary details from it.

Fair enough.

> 
> [...]
> 
>>>> +static void
>>>> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
>>>> +{
>>>> +	unsigned long addr, next;
>>>> +	pud_t *pudp_base;
>>>> +	pgd_t *pgdp;
>>>> +
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>
>>> It would be good to explain why we need to take the ptl here.
>>
>> Will update both commit message and add an in-code comment here.
>>
>>>
>>> IIUC that shouldn't be necessary for the linear map. Am I mistaken?
>>
>> Its not absolutely necessary for linear map right now because both memory hot
>> plug & ptdump which modifies or walks the page table ranges respectively take
>> memory hotplug lock. That apart, no other callers creates or destroys linear
>> mapping at runtime.
>>
>>>
>>> Is there a specific race when tearing down the vmemmap?
>>
>> This is trickier than linear map. vmemmap additions would be protected with
>> memory hotplug lock but this can potential collide with vmalloc/IO regions.
>> Even if they dont right now that will be because they dont share intermediate
>> page table levels.
> 
> Sure; if we could just state something like:
> 
>   The vmemmap region may share levels of table with the vmalloc region.
>   Take the ptl so that we can safely free potentially-sahred tables.
> 
> ... I think that would be sufficient.

Will do.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-16 11:05       ` Michal Hocko
@ 2019-05-22 16:42         ` Mark Rutland
  2019-05-24  6:06           ` Anshuman Khandual
  2019-05-27  7:20           ` Michal Hocko
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Rutland @ 2019-05-22 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> > Hi Michal,
> > 
> > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > > > The arm64 pagetable dump code can race with concurrent modification of the
> > > > kernel page tables. When a leaf entries are modified concurrently, the dump
> > > > code may log stale or inconsistent information for a VA range, but this is
> > > > otherwise not harmful.
> > > > 
> > > > When intermediate levels of table are freed, the dump code will continue to
> > > > use memory which has been freed and potentially reallocated for another
> > > > purpose. In such cases, the dump code may dereference bogus addressses,
> > > > leading to a number of potential problems.
> > > > 
> > > > Intermediate levels of table may by freed during memory hot-remove, or when
> > > > installing a huge mapping in the vmalloc region. To avoid racing with these
> > > > cases, take the memory hotplug lock when walking the kernel page table.
> > > 
> > > Why is this a problem only on arm64 
> > 
> > It looks like it's not -- I think we're just the first to realise this.
> > 
> > AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> > memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> > would have the same issue.
> > 
> > > and why do we even care for debugfs? Does anybody rely on this thing
> > > to be reliable? Do we even need it? Who is using the file?
> > 
> > The debugfs part is used intermittently by a few people working on the
> > arm64 kernel page tables. We use that both to sanity-check that kernel
> > page tables are created/updated correctly after changes to the arm64 mmu
> > code, and also to debug issues if/when we encounter issues that appear
> > to be the result of kernel page table corruption.
> 
> OK, I see. Thanks for the clarification.
> 
> > So while it's rare to need it, it's really useful to have when we do
> > need it, and I'd rather not remove it. I'd also rather that it didn't
> > have latent issues where we can accidentally crash the kernel when using
> > it, which is what this patch is addressing.
> 
> While I agree, do we rather want to document that you shouldn't be using
> the debugging tool while the hotplug is ongoing because you might get a
> garbage or crash the kernel in the worst case? In other words is the
> absolute correctness worth the additional maint. burden wrt. to future
> hotplug changes?

I don't think that it's reasonable for this code to bring down the
kernel unless the kernel page tables are already corrupt. I agree we
should minimize the impact on other code, and I'm happy to penalize
ptdump so long as it's functional and safe.

I would like it to be possible to use the ptdump code to debug
hot-remove, so I'd rather not make the two mutually exclusive. I'd also
like it to be possible to use this in-the-field, and for that asking an
admin to potentially crash their system isn't likely to fly.

> > > I am asking because I would really love to make mem hotplug locking less
> > > scattered outside of the core MM than more. Most users simply shouldn't
> > > care. Pfn walkers should rely on pfn_to_online_page.

Jut to check, is your plan to limit access to the hotplug lock, or to
redesign the locking scheme?

> > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > doesn't ensure that the page remains online. Is there a way to achieve
> > that other than get_online_mems()?
> 
> You have to pin the page to make sure the hotplug is not going to
> offline it.

I'm not exactly sure how pinning works -- is there a particular set of
functions I should look at for that?

I guess that if/when we allocate the vmemmap from hotpluggable memory
that will require the pinning code to take the hotplug lock internally
to ensure that the struct page is accessible while we attempt to pin it?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-16 11:16         ` Michal Hocko
@ 2019-05-23  8:40           ` David Hildenbrand
  2019-05-24  5:43             ` Anshuman Khandual
  0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2019-05-23  8:40 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Mark Rutland, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 16.05.19 13:16, Michal Hocko wrote:
> On Thu 16-05-19 16:36:12, Anshuman Khandual wrote:
>> On 05/16/2019 03:53 PM, Mark Rutland wrote:
>>> Hi Michal,
>>>
>>> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>>>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>>>> The arm64 pagetable dump code can race with concurrent modification of the
>>>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>>>> code may log stale or inconsistent information for a VA range, but this is
>>>>> otherwise not harmful.
>>>>>
>>>>> When intermediate levels of table are freed, the dump code will continue to
>>>>> use memory which has been freed and potentially reallocated for another
>>>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>>>> leading to a number of potential problems.
>>>>>
>>>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>>>> cases, take the memory hotplug lock when walking the kernel page table.
>>>>
>>>> Why is this a problem only on arm64 
>>>
>>> It looks like it's not -- I think we're just the first to realise this.
>>>
>>> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
>>> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
>>> would have the same issue.
>>>
>>>> and why do we even care for debugfs? Does anybody rely on this thing
>>>> to be reliable? Do we even need it? Who is using the file?
>>>
>>> The debugfs part is used intermittently by a few people working on the
>>> arm64 kernel page tables. We use that both to sanity-check that kernel
>>> page tables are created/updated correctly after changes to the arm64 mmu
>>> code, and also to debug issues if/when we encounter issues that appear
>>> to be the result of kernel page table corruption.
>>>
>>> So while it's rare to need it, it's really useful to have when we do
>>> need it, and I'd rather not remove it. I'd also rather that it didn't
>>> have latent issues where we can accidentally crash the kernel when using
>>> it, which is what this patch is addressing.
>>>
>>>> I am asking because I would really love to make mem hotplug locking less
>>>> scattered outside of the core MM than more. Most users simply shouldn't
>>>> care. Pfn walkers should rely on pfn_to_online_page.
>>>
>>> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
>>> doesn't ensure that the page remains online. Is there a way to achieve
>>> that other than get_online_mems()?
>>
>> Still wondering how pfn_to_online_page() is applicable here. It validates
>> a given PFN and whether its online from sparse section mapping perspective
>> before giving it's struct page. IIUC it is used during a linear scanning
>> of a physical address range not for a page table walk. So how it can solve
>> the problem when a struct page which was used as an intermediate level page
>> table page gets released back to the buddy from another concurrent thread ?
> 
> Well, my comment about pfn_to_online_page was more generic and it might
> not apply to this specific case. I meant to say that the code outside of
> the core MM shouldn't really care about the hotplug locking.
> 

What am I missing, how is it guaranteed that a page doesn't get
offlined/removed without holding a lock here?

We would at least need some RCU mechnism or similar to sync against
pages vanishing.

pfn_to_online_page() assumes that somebody touches a page he doesn't
own. There has to be some way for core-mm to realize this and defer
offlining/removinf.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-23  8:40           ` David Hildenbrand
@ 2019-05-24  5:43             ` Anshuman Khandual
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-24  5:43 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Mark Rutland, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, cai, logang,
	ira.weiny

On 05/23/2019 02:10 PM, David Hildenbrand wrote:
> On 16.05.19 13:16, Michal Hocko wrote:
>> On Thu 16-05-19 16:36:12, Anshuman Khandual wrote:
>>> On 05/16/2019 03:53 PM, Mark Rutland wrote:
>>>> Hi Michal,
>>>>
>>>> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>>>>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>>>>> The arm64 pagetable dump code can race with concurrent modification of the
>>>>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>>>>> code may log stale or inconsistent information for a VA range, but this is
>>>>>> otherwise not harmful.
>>>>>>
>>>>>> When intermediate levels of table are freed, the dump code will continue to
>>>>>> use memory which has been freed and potentially reallocated for another
>>>>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>>>>> leading to a number of potential problems.
>>>>>>
>>>>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>>>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>>>>> cases, take the memory hotplug lock when walking the kernel page table.
>>>>>
>>>>> Why is this a problem only on arm64 
>>>>
>>>> It looks like it's not -- I think we're just the first to realise this.
>>>>
>>>> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
>>>> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
>>>> would have the same issue.
>>>>
>>>>> and why do we even care for debugfs? Does anybody rely on this thing
>>>>> to be reliable? Do we even need it? Who is using the file?
>>>>
>>>> The debugfs part is used intermittently by a few people working on the
>>>> arm64 kernel page tables. We use that both to sanity-check that kernel
>>>> page tables are created/updated correctly after changes to the arm64 mmu
>>>> code, and also to debug issues if/when we encounter issues that appear
>>>> to be the result of kernel page table corruption.
>>>>
>>>> So while it's rare to need it, it's really useful to have when we do
>>>> need it, and I'd rather not remove it. I'd also rather that it didn't
>>>> have latent issues where we can accidentally crash the kernel when using
>>>> it, which is what this patch is addressing.
>>>>
>>>>> I am asking because I would really love to make mem hotplug locking less
>>>>> scattered outside of the core MM than more. Most users simply shouldn't
>>>>> care. Pfn walkers should rely on pfn_to_online_page.
>>>>
>>>> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
>>>> doesn't ensure that the page remains online. Is there a way to achieve
>>>> that other than get_online_mems()?
>>>
>>> Still wondering how pfn_to_online_page() is applicable here. It validates
>>> a given PFN and whether its online from sparse section mapping perspective
>>> before giving it's struct page. IIUC it is used during a linear scanning
>>> of a physical address range not for a page table walk. So how it can solve
>>> the problem when a struct page which was used as an intermediate level page
>>> table page gets released back to the buddy from another concurrent thread ?
>>
>> Well, my comment about pfn_to_online_page was more generic and it might
>> not apply to this specific case. I meant to say that the code outside of
>> the core MM shouldn't really care about the hotplug locking.
>>
> 
> What am I missing, how is it guaranteed that a page doesn't get
> offlined/removed without holding a lock here?

It is not guaranteed.

> 
> We would at least need some RCU mechnism or similar to sync against
> pages vanishing.

Yes, if we dont take memory_hotplug_lock preventing memory hot remove.

> 
> pfn_to_online_page() assumes that somebody touches a page he doesn't
> own. There has to be some way for core-mm to realize this and defer
> offlining/removinf.

First of all I am not sure yet if Michal really meant that reference
should be taken on all struct pages (while dumping kernel page table)
for each range (minimum hot remove granularity) to prevent them from
being hot removed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-22 16:42         ` Mark Rutland
@ 2019-05-24  6:06           ` Anshuman Khandual
  2019-05-27  7:20           ` Michal Hocko
  1 sibling, 0 replies; 28+ messages in thread
From: Anshuman Khandual @ 2019-05-24  6:06 UTC (permalink / raw)
  To: Mark Rutland, Michal Hocko
  Cc: linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mgorman, james.morse, robin.murphy, cpandya, arunks,
	dan.j.williams, osalvador, david, cai, logang, ira.weiny



On 05/22/2019 10:12 PM, Mark Rutland wrote:
> On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
>> On Thu 16-05-19 11:23:54, Mark Rutland wrote:
>>> Hi Michal,
>>>
>>> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>>>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>>>> The arm64 pagetable dump code can race with concurrent modification of the
>>>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>>>> code may log stale or inconsistent information for a VA range, but this is
>>>>> otherwise not harmful.
>>>>>
>>>>> When intermediate levels of table are freed, the dump code will continue to
>>>>> use memory which has been freed and potentially reallocated for another
>>>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>>>> leading to a number of potential problems.
>>>>>
>>>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>>>> cases, take the memory hotplug lock when walking the kernel page table.
>>>>
>>>> Why is this a problem only on arm64 
>>>
>>> It looks like it's not -- I think we're just the first to realise this.
>>>
>>> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
>>> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
>>> would have the same issue.
>>>
>>>> and why do we even care for debugfs? Does anybody rely on this thing
>>>> to be reliable? Do we even need it? Who is using the file?
>>>
>>> The debugfs part is used intermittently by a few people working on the
>>> arm64 kernel page tables. We use that both to sanity-check that kernel
>>> page tables are created/updated correctly after changes to the arm64 mmu
>>> code, and also to debug issues if/when we encounter issues that appear
>>> to be the result of kernel page table corruption.
>>
>> OK, I see. Thanks for the clarification.
>>
>>> So while it's rare to need it, it's really useful to have when we do
>>> need it, and I'd rather not remove it. I'd also rather that it didn't
>>> have latent issues where we can accidentally crash the kernel when using
>>> it, which is what this patch is addressing.
>>
>> While I agree, do we rather want to document that you shouldn't be using
>> the debugging tool while the hotplug is ongoing because you might get a
>> garbage or crash the kernel in the worst case? In other words is the
>> absolute correctness worth the additional maint. burden wrt. to future
>> hotplug changes?
> 
> I don't think that it's reasonable for this code to bring down the
> kernel unless the kernel page tables are already corrupt. I agree we
> should minimize the impact on other code, and I'm happy to penalize
> ptdump so long as it's functional and safe.
> 
> I would like it to be possible to use the ptdump code to debug
> hot-remove, so I'd rather not make the two mutually exclusive. I'd also
> like it to be possible to use this in-the-field, and for that asking an
> admin to potentially crash their system isn't likely to fly.
> 
>>>> I am asking because I would really love to make mem hotplug locking less
>>>> scattered outside of the core MM than more. Most users simply shouldn't
>>>> care. Pfn walkers should rely on pfn_to_online_page.
> 
> Jut to check, is your plan to limit access to the hotplug lock, or to
> redesign the locking scheme?
> 
>>> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
>>> doesn't ensure that the page remains online. Is there a way to achieve
>>> that other than get_online_mems()?
>>
>> You have to pin the page to make sure the hotplug is not going to
>> offline it.
> 
> I'm not exactly sure how pinning works -- is there a particular set of
> functions I should look at for that?
> 
> I guess that if/when we allocate the vmemmap from hotpluggable memory
> that will require the pinning code to take the hotplug lock internally
> to ensure that the struct page is accessible while we attempt to pin it?

I am bit confused here.

Which pages are we trying to pin ?

1) init_mm page table pages (vmemmap + linear) for the range to be hot-removed
2) struct pages for the PFN range to be hot-removed

We need hot-remove process to be blocked enough not to free the intermediate
level page table pages which will ensure kernel does not crash during ptdump.

AFAICT

1) Holding reference on (1) prevent freeing of pgtable pages during hot-remove
2) Holding reference on (2) prevent range PFN from being hot removed which in
   turn can prevent forward progress for hot-remove process and hence possibly
   prevent freeing of intermediate level pgtable pages.

But both the above solutions are bit complex and will consume more cycles as
compared to just take a memory_hotplug_lock. In case of (1) it is bit tricker
as ptdump code has to first walk init_mm to get to all the pgtable pages for
taking a reference/lock on them. Just wondering if it is worth the trouble.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-22 16:42         ` Mark Rutland
  2019-05-24  6:06           ` Anshuman Khandual
@ 2019-05-27  7:20           ` Michal Hocko
  2019-05-28 14:09             ` Mark Rutland
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-05-27  7:20 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

On Wed 22-05-19 17:42:13, Mark Rutland wrote:
> On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
> > On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> > > Hi Michal,
> > > 
> > > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > > > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > > > > The arm64 pagetable dump code can race with concurrent modification of the
> > > > > kernel page tables. When a leaf entries are modified concurrently, the dump
> > > > > code may log stale or inconsistent information for a VA range, but this is
> > > > > otherwise not harmful.
> > > > > 
> > > > > When intermediate levels of table are freed, the dump code will continue to
> > > > > use memory which has been freed and potentially reallocated for another
> > > > > purpose. In such cases, the dump code may dereference bogus addressses,
> > > > > leading to a number of potential problems.
> > > > > 
> > > > > Intermediate levels of table may by freed during memory hot-remove, or when
> > > > > installing a huge mapping in the vmalloc region. To avoid racing with these
> > > > > cases, take the memory hotplug lock when walking the kernel page table.
> > > > 
> > > > Why is this a problem only on arm64 
> > > 
> > > It looks like it's not -- I think we're just the first to realise this.
> > > 
> > > AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> > > memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> > > would have the same issue.
> > > 
> > > > and why do we even care for debugfs? Does anybody rely on this thing
> > > > to be reliable? Do we even need it? Who is using the file?
> > > 
> > > The debugfs part is used intermittently by a few people working on the
> > > arm64 kernel page tables. We use that both to sanity-check that kernel
> > > page tables are created/updated correctly after changes to the arm64 mmu
> > > code, and also to debug issues if/when we encounter issues that appear
> > > to be the result of kernel page table corruption.
> > 
> > OK, I see. Thanks for the clarification.
> > 
> > > So while it's rare to need it, it's really useful to have when we do
> > > need it, and I'd rather not remove it. I'd also rather that it didn't
> > > have latent issues where we can accidentally crash the kernel when using
> > > it, which is what this patch is addressing.
> > 
> > While I agree, do we rather want to document that you shouldn't be using
> > the debugging tool while the hotplug is ongoing because you might get a
> > garbage or crash the kernel in the worst case? In other words is the
> > absolute correctness worth the additional maint. burden wrt. to future
> > hotplug changes?
> 
> I don't think that it's reasonable for this code to bring down the
> kernel unless the kernel page tables are already corrupt. I agree we
> should minimize the impact on other code, and I'm happy to penalize
> ptdump so long as it's functional and safe.
> 
> I would like it to be possible to use the ptdump code to debug
> hot-remove, so I'd rather not make the two mutually exclusive. I'd also
> like it to be possible to use this in-the-field, and for that asking an
> admin to potentially crash their system isn't likely to fly.

OK, fair enough.

> > > > I am asking because I would really love to make mem hotplug locking less
> > > > scattered outside of the core MM than more. Most users simply shouldn't
> > > > care. Pfn walkers should rely on pfn_to_online_page.
> 
> Jut to check, is your plan to limit access to the hotplug lock, or to
> redesign the locking scheme?

To change the locking to lock hotpluged ranges rather than having a
global lock as the operation is inherently pfn range scoped.

> > > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > > doesn't ensure that the page remains online. Is there a way to achieve
> > > that other than get_online_mems()?
> > 
> > You have to pin the page to make sure the hotplug is not going to
> > offline it.
> 
> I'm not exactly sure how pinning works -- is there a particular set of
> functions I should look at for that?

Pinning (get_page) on any page of the range will deffer the hotremove
operation and therefore the page tables cannot go away as well.

That being said, I thought the API is mostly for debugging and "you
should better know what you are doing" kinda thing (based on debugfs
being used here). If this is really useful in its current form and
should be used also while the hotremove is in progress then ok.
Once we actually get to rework the locking then we will have another
spot to handle but that's the life.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-05-27  7:20           ` Michal Hocko
@ 2019-05-28 14:09             ` Mark Rutland
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Rutland @ 2019-05-28 14:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anshuman Khandual, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will.deacon, mgorman, james.morse, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, david, cai, logang,
	ira.weiny

On Mon, May 27, 2019 at 09:20:01AM +0200, Michal Hocko wrote:
> On Wed 22-05-19 17:42:13, Mark Rutland wrote:
> > On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
> > > On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> > > > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > > > > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:

> > I don't think that it's reasonable for this code to bring down the
> > kernel unless the kernel page tables are already corrupt. I agree we
> > should minimize the impact on other code, and I'm happy to penalize
> > ptdump so long as it's functional and safe.
> > 
> > I would like it to be possible to use the ptdump code to debug
> > hot-remove, so I'd rather not make the two mutually exclusive. I'd also
> > like it to be possible to use this in-the-field, and for that asking an
> > admin to potentially crash their system isn't likely to fly.
> 
> OK, fair enough.
> 
> > > > > I am asking because I would really love to make mem hotplug locking less
> > > > > scattered outside of the core MM than more. Most users simply shouldn't
> > > > > care. Pfn walkers should rely on pfn_to_online_page.
> > 
> > Jut to check, is your plan to limit access to the hotplug lock, or to
> > redesign the locking scheme?
> 
> To change the locking to lock hotpluged ranges rather than having a
> global lock as the operation is inherently pfn range scoped.

Ok. That sounds like something we could adapt the ptdump code to handle
without too much pain (modulo how much of that you want to expose
outside of the core mm code).

> > > > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > > > doesn't ensure that the page remains online. Is there a way to achieve
> > > > that other than get_online_mems()?
> > > 
> > > You have to pin the page to make sure the hotplug is not going to
> > > offline it.
> > 
> > I'm not exactly sure how pinning works -- is there a particular set of
> > functions I should look at for that?
> 
> Pinning (get_page) on any page of the range will deffer the hotremove
> operation and therefore the page tables cannot go away as well.
> 
> That being said, I thought the API is mostly for debugging and "you
> should better know what you are doing" kinda thing (based on debugfs
> being used here). If this is really useful in its current form and
> should be used also while the hotremove is in progress then ok.
> Once we actually get to rework the locking then we will have another
> spot to handle but that's the life.

Great.

FWIW, I'm happy to rework the ptdump code to help with that.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-05-28 14:09 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-14  9:00 [PATCH V3 0/4] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-05-14  9:00 ` [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
2019-05-14  9:05   ` David Hildenbrand
2019-05-14  9:00 ` [PATCH V3 2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
2019-05-14  9:16   ` David Hildenbrand
2019-05-14 15:40   ` Mark Rutland
2019-05-15  1:56     ` Anshuman Khandual
2019-05-15 16:58   ` Michal Hocko
2019-05-16 10:23     ` Mark Rutland
2019-05-16 11:05       ` Michal Hocko
2019-05-22 16:42         ` Mark Rutland
2019-05-24  6:06           ` Anshuman Khandual
2019-05-27  7:20           ` Michal Hocko
2019-05-28 14:09             ` Mark Rutland
2019-05-16 11:06       ` Anshuman Khandual
2019-05-16 11:16         ` Michal Hocko
2019-05-23  8:40           ` David Hildenbrand
2019-05-24  5:43             ` Anshuman Khandual
2019-05-14  9:00 ` [PATCH V3 3/4] arm64/mm: Inhibit huge-vmap with ptdump Anshuman Khandual
2019-05-14  9:20   ` David Hildenbrand
2019-05-16  8:38   ` Ard Biesheuvel
2019-05-14  9:00 ` [PATCH V3 4/4] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-05-14  9:08   ` David Hildenbrand
2019-05-15 11:49   ` Mark Rutland
2019-05-16  5:34     ` Anshuman Khandual
2019-05-16 10:57       ` Mark Rutland
2019-05-17  3:15         ` Anshuman Khandual
2019-05-14  9:10 ` [PATCH V3 0/4] " David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).