LKML Archive on lore.kernel.org
* [PATCH] mm: fix panic in __alloc_pages
@ 2021-11-01 20:13 Alexey Makhalov
  2021-11-01 20:38 ` Matthew Wilcox
  2021-11-02  7:47 ` Michal Hocko
  0 siblings, 2 replies; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-01 20:13 UTC (permalink / raw)
  To: linux-mm; +Cc: Alexey Makhalov, Andrew Morton, linux-kernel, stable

There is a kernel panic caused by __alloc_pages() accessing
uninitialized NODE_DATA(nid). Node data is uninitialized while
a CPU on a memoryless node has been added but not yet onlined.
The panic is easily reproduced by disabling the udev rule that
automatically onlines hot-added CPUs and then hot-adding a CPU
on a memoryless node.

This is a panic caused by percpu code doing allocations for
all possible CPUs and hitting this issue:

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

The alloc_pages_{bulk_array_}node() functions only check that nid is
valid; they do not check that the node is online. The enhanced check
allows page allocation to be handled when the node is in state 2.

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 include/linux/gfp.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f9..34a5a7def 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
 static inline unsigned long
 alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
 {
-	if (nid == NUMA_NO_NODE)
+	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
+					!(gfp & __GFP_THISNODE)))
 		nid = numa_mem_id();
 
 	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
@@ -578,7 +579,8 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	if (nid == NUMA_NO_NODE)
+	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
+					!(gfp_mask & __GFP_THISNODE)))
 		nid = numa_mem_id();
 
 	return __alloc_pages_node(nid, gfp_mask, order);
-- 
2.30.0
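
The three node states and the fallback this patch proposes can be modelled
in userspace. This is a minimal sketch, not kernel code: every model_* name
and MODEL_* constant is a hypothetical stand-in (for NODE_DATA(),
node_online(), numa_mem_id(), NUMA_NO_NODE and __GFP_THISNODE), and plain
arrays replace the real per-node bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: names mirror the kernel's but are stand-ins. */
#define MODEL_MAX_NODES    4
#define MODEL_NO_NODE      (-1)   /* plays the role of NUMA_NO_NODE */
#define MODEL_GFP_THISNODE 0x1u   /* plays the role of __GFP_THISNODE */

struct model_pgdat {
    int node_id;
};

/* State 2 (present but offline) is modelled as: online == false, data == NULL. */
static struct model_pgdat *model_node_data[MODEL_MAX_NODES]; /* NODE_DATA()   */
static bool model_node_online[MODEL_MAX_NODES];              /* node_online() */
static int model_local_node;                                 /* numa_mem_id() */

/*
 * Patched nid selection: fall back to the local node when nid is absent
 * or offline, unless the caller insisted on exactly this node.
 */
static int model_pick_nid(int nid, unsigned int gfp)
{
    if (nid == MODEL_NO_NODE)
        return model_local_node;

    assert(nid >= 0 && nid < MODEL_MAX_NODES);

    if (!model_node_online[nid] && !(gfp & MODEL_GFP_THISNODE))
        return model_local_node;

    return nid;
}

/*
 * Returns the node data an allocation would end up using. A NULL result
 * marks the case where the real allocator would dereference a missing pgdat.
 */
static struct model_pgdat *model_alloc_target(int nid, unsigned int gfp)
{
    return model_node_data[model_pick_nid(nid, gfp)];
}
```

In this model a request for offline node 2 without the THISNODE flag lands
on the local node, while the same request with the flag set still reaches a
NULL pgdat, so the sketch suggests that case is left to the caller.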


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-01 20:13 [PATCH] mm: fix panic in __alloc_pages Alexey Makhalov
@ 2021-11-01 20:38 ` Matthew Wilcox
  2021-11-02  7:47 ` Michal Hocko
  1 sibling, 0 replies; 98+ messages in thread
From: Matthew Wilcox @ 2021-11-01 20:38 UTC (permalink / raw)
  To: Alexey Makhalov; +Cc: linux-mm, Andrew Morton, linux-kernel, stable

On Mon, Nov 01, 2021 at 01:13:12PM -0700, Alexey Makhalov wrote:
> +++ b/include/linux/gfp.h
> @@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
>  static inline unsigned long
>  alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +					!(gfp & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>  
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);

I don't think it's a great idea to push node_online() and the gfp check
into the caller.  Can't we put this check in __alloc_pages_bulk() instead?



* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-01 20:13 [PATCH] mm: fix panic in __alloc_pages Alexey Makhalov
  2021-11-01 20:38 ` Matthew Wilcox
@ 2021-11-02  7:47 ` Michal Hocko
  2021-11-02  8:12   ` David Hildenbrand
  1 sibling, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-02  7:47 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: linux-mm, Andrew Morton, linux-kernel, stable, Oscar Salvador,
	David Hildenbrand

[CC Oscar and David]

On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
> There is a kernel panic caused by __alloc_pages() accessing
> uninitialized NODE_DATA(nid). Node data is uninitialized while
> a CPU on a memoryless node has been added but not yet onlined.
> The panic is easily reproduced by disabling the udev rule that
> automatically onlines hot-added CPUs and then hot-adding a CPU
> on a memoryless node.
> 
> This is a panic caused by percpu code doing allocations for
> all possible CPUs and hitting this issue:
> 
>  CPU2 has been hot-added
>  BUG: unable to handle page fault for address: 0000000000001608
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] SMP PTI
>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> 
>  RIP: 0010:__alloc_pages+0x127/0x290

Could you resolve this into a specific line of the source code please?

>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>  Call Trace:
>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>   pcpu_populate_chunk+0x33/0xb0
>   pcpu_alloc+0x4d3/0x6f0
>   __alloc_percpu_gfp+0xd/0x10
>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>   mem_cgroup_alloc+0xed/0x2f0
>   mem_cgroup_css_alloc+0x33/0x2f0
>   css_create+0x3a/0x1f0
>   cgroup_apply_control_enable+0x12b/0x150
>   cgroup_mkdir+0xdd/0x110
>   kernfs_iop_mkdir+0x4f/0x80
>   vfs_mkdir+0x178/0x230
>   do_mkdirat+0xfd/0x120
>   __x64_sys_mkdir+0x47/0x70
>   ? syscall_exit_to_user_mode+0x21/0x50
>   do_syscall_64+0x43/0x90
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Node can be in one of the following states:
> 1. not present (nid == NUMA_NO_NODE)
> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> 				NODE_DATA(nid) == NULL)
> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> 				NODE_DATA(nid) != NULL)
> 
> The alloc_pages_{bulk_array_}node() functions only check that nid is
> valid; they do not check that the node is online. The enhanced check
> allows page allocation to be handled when the node is in state 2.

I do not think this is a correct approach. We should make sure that the
proper fallback node is used instead. This means that the zone list is
initialized properly. IIRC this has been a problem in the past and it
has been fixed. The initialization code is quite subtle though so it is
possible that this got broken again.

> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  include/linux/gfp.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f9..34a5a7def 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
>  static inline unsigned long
>  alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +					!(gfp & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>  
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
> @@ -578,7 +579,8 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +					!(gfp_mask & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>  
>  	return __alloc_pages_node(nid, gfp_mask, order);
> -- 
> 2.30.0

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  7:47 ` Michal Hocko
@ 2021-11-02  8:12   ` David Hildenbrand
  2021-11-02  8:48     ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02  8:12 UTC (permalink / raw)
  To: Michal Hocko, Alexey Makhalov
  Cc: linux-mm, Andrew Morton, linux-kernel, stable, Oscar Salvador

On 02.11.21 08:47, Michal Hocko wrote:
> [CC Oscar and David]
> 
> On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
>> There is a kernel panic caused by __alloc_pages() accessing
>> uninitialized NODE_DATA(nid). Node data is uninitialized while
>> a CPU on a memoryless node has been added but not yet onlined.
>> The panic is easily reproduced by disabling the udev rule that
>> automatically onlines hot-added CPUs and then hot-adding a CPU
>> on a memoryless node.
>>
>> This is a panic caused by percpu code doing allocations for
>> all possible CPUs and hitting this issue:
>>
>>  CPU2 has been hot-added
>>  BUG: unable to handle page fault for address: 0000000000001608
>>  #PF: supervisor read access in kernel mode
>>  #PF: error_code(0x0000) - not-present page
>>  PGD 0 P4D 0
>>  Oops: 0000 [#1] SMP PTI
>>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>
>>  RIP: 0010:__alloc_pages+0x127/0x290
> 
> Could you resolve this into a specific line of the source code please?
> 
>>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>>  Call Trace:
>>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>>   pcpu_populate_chunk+0x33/0xb0
>>   pcpu_alloc+0x4d3/0x6f0
>>   __alloc_percpu_gfp+0xd/0x10
>>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>>   mem_cgroup_alloc+0xed/0x2f0
>>   mem_cgroup_css_alloc+0x33/0x2f0
>>   css_create+0x3a/0x1f0
>>   cgroup_apply_control_enable+0x12b/0x150
>>   cgroup_mkdir+0xdd/0x110
>>   kernfs_iop_mkdir+0x4f/0x80
>>   vfs_mkdir+0x178/0x230
>>   do_mkdirat+0xfd/0x120
>>   __x64_sys_mkdir+0x47/0x70
>>   ? syscall_exit_to_user_mode+0x21/0x50
>>   do_syscall_64+0x43/0x90
>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> Node can be in one of the following states:
>> 1. not present (nid == NUMA_NO_NODE)
>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>> 				NODE_DATA(nid) == NULL)
>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>> 				NODE_DATA(nid) != NULL)
>>
>> The alloc_pages_{bulk_array_}node() functions only check that nid is
>> valid; they do not check that the node is online. The enhanced check
>> allows page allocation to be handled when the node is in state 2.
> 
> I do not think this is a correct approach. We should make sure that the
> proper fallback node is used instead. This means that the zone list is
> initialized properly. IIRC this has been a problem in the past and it
> has been fixed. The initialization code is quite subtle though so it is
> possible that this got broken again.

I'm a little confused:

In add_memory_resource() we hotplug the new node if required and set it
online. Memory might get onlined later, via online_pages().

So after add_memory_resource()->__try_online_node() succeeded, we have
an online pgdat -- essentially 3.

This patch detects if we're past 3. but says that it is reproduced by
disabling *memory* onlining.

Before we online memory for a hotplugged node, all zones are !populated.
So once we online memory for a !populated zone in online_pages(), we
trigger setup_zone_pageset().


The confusing part is that this patch checks for 3. but says it can be
reproduced by not onlining *memory*. There seems to be something missing.

Do we maybe need a proper populated_zone() check before accessing zone data?
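
A populated_zone()-style check can only help once a pgdat exists. The
sketch below is a userspace model with hypothetical model_* names; the only
thing taken from the kernel is the idea that a zone counts as populated
when it has present pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ZONES 3

struct model_zone {
    unsigned long present_pages;
};

struct model_pgdat {
    struct model_zone zones[MODEL_MAX_ZONES];
};

/* A zone is treated as populated once it has at least one present page. */
static bool model_populated_zone(const struct model_zone *zone)
{
    assert(zone != NULL);
    return zone->present_pages != 0;
}

/*
 * Count the zones an allocation could use. The pgdat pointer is tested
 * first: with no pgdat there are no zones to inspect at all.
 */
static int model_count_usable_zones(const struct model_pgdat *pgdat)
{
    int usable = 0;

    if (pgdat == NULL)
        return 0;

    for (int i = 0; i < MODEL_MAX_ZONES; i++) {
        if (model_populated_zone(&pgdat->zones[i]))
            usable++;
    }
    return usable;
}
```

A hot-added node whose memory has not been onlined has zero usable zones in
this model, but the function only gets that far because it tests the pgdat
pointer first. If NODE_DATA(nid) is NULL, as the patch description states
for an offline node, a zone-level check would come too late.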

-- 
Thanks,

David / dhildenb



* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  8:12   ` David Hildenbrand
@ 2021-11-02  8:48     ` Alexey Makhalov
  2021-11-02  9:04       ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-02  8:48 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: linux-mm, Andrew Morton, linux-kernel, stable, Oscar Salvador



On 11/2/21, 1:12 AM, "David Hildenbrand" <david@redhat.com> wrote:

Thanks for reviews,

    On 02.11.21 08:47, Michal Hocko wrote:
    > [CC Oscar and David]
    > 
    > On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
    >> There is a kernel panic caused by __alloc_pages() accessing
    >> uninitialized NODE_DATA(nid). Node data is uninitialized while
    >> a CPU on a memoryless node has been added but not yet onlined.
    >> The panic is easily reproduced by disabling the udev rule that
    >> automatically onlines hot-added CPUs and then hot-adding a CPU
    >> on a memoryless node.
    >>
    >> This is a panic caused by percpu code doing allocations for
    >> all possible CPUs and hitting this issue:
    >>
    >>  CPU2 has been hot-added
    >>  BUG: unable to handle page fault for address: 0000000000001608
    >>  #PF: supervisor read access in kernel mode
    >>  #PF: error_code(0x0000) - not-present page
    >>  PGD 0 P4D 0
    >>  Oops: 0000 [#1] SMP PTI
    >>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
    >>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
    >>
    >>  RIP: 0010:__alloc_pages+0x127/0x290
    > 
    > Could you resolve this into a specific line of the source code please?
    > 
    >>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
    >>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
    >>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
    >>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
    >>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
    >>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
    >>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
    >>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
    >>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    >>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
    >>  Call Trace:
    >>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
    >>   pcpu_populate_chunk+0x33/0xb0
    >>   pcpu_alloc+0x4d3/0x6f0
    >>   __alloc_percpu_gfp+0xd/0x10
    >>   alloc_mem_cgroup_per_node_info+0x54/0xb0
    >>   mem_cgroup_alloc+0xed/0x2f0
    >>   mem_cgroup_css_alloc+0x33/0x2f0
    >>   css_create+0x3a/0x1f0
    >>   cgroup_apply_control_enable+0x12b/0x150
    >>   cgroup_mkdir+0xdd/0x110
    >>   kernfs_iop_mkdir+0x4f/0x80
    >>   vfs_mkdir+0x178/0x230
    >>   do_mkdirat+0xfd/0x120
    >>   __x64_sys_mkdir+0x47/0x70
    >>   ? syscall_exit_to_user_mode+0x21/0x50
    >>   do_syscall_64+0x43/0x90
    >>   entry_SYSCALL_64_after_hwframe+0x44/0xae
    >>
    >> Node can be in one of the following states:
    >> 1. not present (nid == NUMA_NO_NODE)
    >> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
    >> 				NODE_DATA(nid) == NULL)
    >> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
    >> 				NODE_DATA(nid) != NULL)
    >>
    >> The alloc_pages_{bulk_array_}node() functions only check that nid is
    >> valid; they do not check that the node is online. The enhanced check
    >> allows page allocation to be handled when the node is in state 2.
    > 
    > I do not think this is a correct approach. We should make sure that the
    > proper fallback node is used instead. This means that the zone list is
    > initialized properly. IIRC this has been a problem in the past and it
    > has been fixed. The initialization code is quite subtle though so it is
    > possible that this got broken again.
This approach behaves the same way as if the CPU had not yet been added (state #1).
So we can think of state #2 as state #1, where the CPU is not present.

    I'm a little confused:

    In add_memory_resource() we hotplug the new node if required and set it
    online. Memory might get onlined later, via online_pages().
You are correct; for memory hot add that is true. But when adding a CPU
on a memoryless node, try_online_node() is called only during CPU
onlining, see cpu_up().

Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
I think it would be correct to online the node during CPU hot add, to align with
memory hot add.

    So after add_memory_resource()->__try_online_node() succeeded, we have
    an online pgdat -- essentially 3.

    This patch detects if we're past 3. but says that it reproduced by
    disabling *memory* onlining.
This is a hot add of both a new CPU and a new _memoryless_ node (with CPUs only).
Onlining the CPU brings its node online; disabling CPU onlining leaves the new
node in state #2, which leads to the repro.

    Before we online memory for a hotplugged node, all zones are !populated.
    So once we online memory for a !populated zone in online_pages(), we
    trigger setup_zone_pageset().


    The confusing part is that this patch checks for 3. but says it can be
    reproduced by not onlining *memory*. There seems to be something missing.

    Do we maybe need a proper populated_zone() check before accessing zone data?

Thanks,
--Alexey




* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  8:48     ` Alexey Makhalov
@ 2021-11-02  9:04       ` Michal Hocko
  2021-11-02  9:24         ` David Hildenbrand
                           ` (2 more replies)
  0 siblings, 3 replies; 98+ messages in thread
From: Michal Hocko @ 2021-11-02  9:04 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: David Hildenbrand, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

It is hard to follow your reply as your email client is not quoting
properly. Let me try to reconstruct

On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
> On 02.11.21 08:47, Michal Hocko wrote:
[...]
>>>>  CPU2 has been hot-added
>>>>  BUG: unable to handle page fault for address: 0000000000001608
>>>>  #PF: supervisor read access in kernel mode
>>>>  #PF: error_code(0x0000) - not-present page
>>>>  PGD 0 P4D 0
>>>>  Oops: 0000 [#1] SMP PTI
>>>>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>>>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>>>
>>>>  RIP: 0010:__alloc_pages+0x127/0x290
>>> 
>>> Could you resolve this into a specific line of the source code please?

This probably went unnoticed. I would be really curious whether this is
a broken zonelist or something else.
 
>>>> Node can be in one of the following states:
>>>> 1. not present (nid == NUMA_NO_NODE)
>>>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>>> 				NODE_DATA(nid) == NULL)
>>>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>>> 				NODE_DATA(nid) != NULL)
>>>>
>>>> The alloc_pages_{bulk_array_}node() functions only check that nid is
>>>> valid; they do not check that the node is online. The enhanced check
>>>> allows page allocation to be handled when the node is in state 2.
>>> 
>>> I do not think this is a correct approach. We should make sure that the
>>> proper fallback node is used instead. This means that the zone list is
>>> initialized properly. IIRC this has been a problem in the past and it
>>> has been fixed. The initialization code is quite subtle though so it is
>>> possible that this got broken again.

> This approach behaves in the same way as CPU was not yet added. (state #1).
> So, we can think of state #2 as state #1 when CPU is not present.

>> I'm a little confused:
>> 
>> In add_memory_resource() we hotplug the new node if required and set it
>> online. Memory might get onlined later, via online_pages().
>
> You are correct; for memory hot add that is true. But when adding a CPU
> on a memoryless node, try_online_node() is called only during CPU
> onlining, see cpu_up().
> 
> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
> I think it would be correct to online node during the CPU hot add to align with
> memory hot add.

I am not familiar with cpu hotplug, but this doesn't seem to be anything
new so how come this became a problem only now?

>> So after add_memory_resource()->__try_online_node() succeeded, we have
>> an online pgdat -- essentially 3.
>> 
> This patch detects if we're past 3. but says that it reproduced by
> disabling *memory* onlining.
> This is the hot adding of both new CPU and new _memoryless_ node (with CPU only)
> And onlining CPU makes its node online. Disabling CPU onlining puts new node
> into state #2, which leads to repro.    
> 
>> Before we online memory for a hotplugged node, all zones are !populated.
>> So once we online memory for a !populated zone in online_pages(), we
>> trigger setup_zone_pageset().
>> 
>> 
>> The confusing part is that this patch checks for 3. but says it can be
>> reproduced by not onlining *memory*. There seems to be something missing.
> 
> Do we maybe need a proper populated_zone() check before accessing zone data?

No, we need them to be initialized properly.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  9:04       ` Michal Hocko
@ 2021-11-02  9:24         ` David Hildenbrand
  2021-11-02 10:34           ` Alexey Makhalov
  2021-11-02  9:40         ` [PATCH] " Alexey Makhalov
  2021-11-02  9:40         ` Michal Hocko
  2 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02  9:24 UTC (permalink / raw)
  To: Michal Hocko, Alexey Makhalov
  Cc: linux-mm, Andrew Morton, linux-kernel, stable, Oscar Salvador

>>> In add_memory_resource() we hotplug the new node if required and set it
>>> online. Memory might get onlined later, via online_pages().
>>
>> You are correct; for memory hot add that is true. But when adding a CPU
>> on a memoryless node, try_online_node() is called only during CPU
>> onlining, see cpu_up().
>>
>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>> I think it would be correct to online node during the CPU hot add to align with
>> memory hot add.
> 
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became problem only now?

So IIUC, the issue is that we have a node

a) That has no memory
b) That is offline

This node will get onlined when onlining the CPU as Alexey says. Yet we
have some code that stumbles over the node and goes ahead trying to use
the pgdat -- that code is broken.


If we take a look at build_zonelists() we indeed skip any
!node_online(node). Any other code should do the same. If the node is
not online, it shall be ignored because we might not even have a pgdat
yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
stale or nonexistent.
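
The skip-offline-nodes rule can be illustrated with a small userspace
model. The model_* names are hypothetical, and the round-robin ordering is
an assumption made for brevity; the kernel's actual node ordering is not
modelled here.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4

/* Stand-in for node_online(); false means the pgdat may not exist yet. */
static bool model_node_online[MODEL_MAX_NODES];

/*
 * Build a fallback order for allocations issued on behalf of node `local`.
 * Offline nodes are skipped entirely, including `local` itself, so the
 * resulting list never names a node whose data might be missing.
 * Returns the number of entries written to order[].
 */
static int model_build_fallback(int local, int order[MODEL_MAX_NODES])
{
    int count = 0;

    assert(local >= 0 && local < MODEL_MAX_NODES);

    for (int i = 0; i < MODEL_MAX_NODES; i++) {
        int nid = (local + i) % MODEL_MAX_NODES;

        if (!model_node_online[nid])
            continue; /* offline: ignore, do not touch its node data */

        order[count++] = nid;
    }
    return count;
}
```

With nodes 0 and 3 online and the request coming from offline node 2, the
model yields the order 3, 0 and never visits node 2's data.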


The node onlining logic when onlining a CPU sounds bogus as well: Let's
take a look at try_offline_node(). It checks that:
1) no memory is *present*
2) no CPU is *present*

We should online the node when adding the CPU ("present"), not when
onlining the CPU.

-- 
Thanks,

David / dhildenb



* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  9:04       ` Michal Hocko
  2021-11-02  9:24         ` David Hildenbrand
@ 2021-11-02  9:40         ` Alexey Makhalov
  2021-11-02  9:40         ` Michal Hocko
  2 siblings, 0 replies; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-02  9:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador


> 
> It is hard to follow your reply as your email client is not quoting
> properly. Let me try to reconstruct
> 
> On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
>> On 02.11.21 08:47, Michal Hocko wrote:
> [...]
>>>>> CPU2 has been hot-added
>>>>> BUG: unable to handle page fault for address: 0000000000001608
>>>>> #PF: supervisor read access in kernel mode
>>>>> #PF: error_code(0x0000) - not-present page
>>>>> PGD 0 P4D 0
>>>>> Oops: 0000 [#1] SMP PTI
>>>>> CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>>>> Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>>>> 
>>>>> RIP: 0010:__alloc_pages+0x127/0x290
>>>> 
>>>> Could you resolve this into a specific line of the source code please?
> 
> This probably went unnoticed. I would be really curious whether this is
> a broken zonelist or something else.

Backtrace (including inlined functions):
pcpu_alloc_pages()
  alloc_pages_node()
    __alloc_pages_node()
      __alloc_pages()
        prepare_alloc_pages()
          node_zonelist()

The panic happens in node_zonelist(), which dereferences the NULL NODE_DATA(nid)
pointer at include/linux/gfp.h:514:
512 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
513 {
514         return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
515 }
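
The addresses in the oops are consistent with this: RAX and R9 hold 0x1600
and CR2 is 0x1608, which is what a NULL pgdat base plus field offsets would
give. The sketch below reproduces that arithmetic under two assumptions
inferred from the register dump rather than checked against the headers of
that build: node_zonelists sits at offset 0x1600 in the pgdat, and the
faulting read is the zone index of the first zoneref, 8 bytes into the
zonelist. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout stand-ins: sizes are chosen to match the oops, not kernel headers. */
struct model_zoneref {
    uint64_t zone;     /* stands in for an 8-byte zone pointer */
    int32_t zone_idx;  /* assumed to be the field read at +8   */
};

struct model_zonelist {
    struct model_zoneref zonerefs[1];
};

struct model_pgdat {
    unsigned char before_zonelists[0x1600]; /* assumed size of preceding fields */
    struct model_zonelist node_zonelists[2];
};

/*
 * Compute the address touched when the first zoneref's index is read
 * through a pgdat located at pgdat_base. Pure arithmetic: nothing is
 * dereferenced, so passing 0 for a NULL pgdat is safe here.
 */
static uintptr_t model_fault_address(uintptr_t pgdat_base)
{
    uintptr_t zonelist = pgdat_base +
                         offsetof(struct model_pgdat, node_zonelists);

    return zonelist + offsetof(struct model_zoneref, zone_idx);
}
```

Under these assumptions model_fault_address(0) evaluates to 0x1608, the
address reported in the BUG line.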


> 
>>>>> Node can be in one of the following states:
>>>>> 1. not present (nid == NUMA_NO_NODE)
>>>>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>>>> 				NODE_DATA(nid) == NULL)
>>>>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>>>> 				NODE_DATA(nid) != NULL)
>>>>> 
>>>>> The alloc_pages_{bulk_array_}node() functions only check that nid is
>>>>> valid; they do not check that the node is online. The enhanced check
>>>>> allows page allocation to be handled when the node is in state 2.
>>>> 
>>>> I do not think this is a correct approach. We should make sure that the
>>>> proper fallback node is used instead. This means that the zone list is
>>>> initialized properly. IIRC this has been a problem in the past and it
>>>> has been fixed. The initialization code is quite subtle though so it is
>>>> possible that this got broken again.
> 
>> This approach behaves in the same way as CPU was not yet added. (state #1).
>> So, we can think of state #2 as state #1 when CPU is not present.
> 
>>> I'm a little confused:
>>> 
>>> In add_memory_resource() we hotplug the new node if required and set it
>>> online. Memory might get onlined later, via online_pages().
>> 
>> You are correct; for memory hot add that is true. But when adding a CPU
>> on a memoryless node, try_online_node() is called only during CPU
>> onlining, see cpu_up().
>> 
>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>> I think it would be correct to online node during the CPU hot add to align with
>> memory hot add.
> 
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became problem only now?

This is not CPU-only hotplug but CPU + NUMA node hotplug, and the new node has no memory.
We found it accidentally by not onlining the CPU immediately.
> 
>>> So after add_memory_resource()->__try_online_node() succeeded, we have
>>> an online pgdat -- essentially 3.
>>> 
>> This patch detects if we're past 3. but says that it reproduced by
>> disabling *memory* onlining.
>> This is the hot adding of both new CPU and new _memoryless_ node (with CPU only)
>> And onlining CPU makes its node online. Disabling CPU onlining puts new node
>> into state #2, which leads to repro.
>> 
>>> Before we online memory for a hotplugged node, all zones are !populated.
>>> So once we online memory for a !populated zone in online_pages(), we
>>> trigger setup_zone_pageset().
>>> 
>>> 
>>> The confusing part is that this patch checks for 3. but says it can be
>>> reproduced by not onlining *memory*. There seems to be something missing.
>> 
>> Do we maybe need a proper populated_zone() check before accessing zone data?
> 
> No, we need them to be initialized properly.
> 

Thanks,
—Alexey




* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  9:04       ` Michal Hocko
  2021-11-02  9:24         ` David Hildenbrand
  2021-11-02  9:40         ` [PATCH] " Alexey Makhalov
@ 2021-11-02  9:40         ` Michal Hocko
  2 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-11-02  9:40 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: David Hildenbrand, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On Tue 02-11-21 10:04:23, Michal Hocko wrote:
[...]
> > Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
> > I think it would be correct to online node during the CPU hot add to align with
> > memory hot add.
> 
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became problem only now?

Just looked at the cpu hotplug part. I do not see add_cpu adding much
here. Here is what I can see in the current Linus tree:
add_cpu
  device_online() # cpu device - cpu_sys_devices with cpu_subsys bus
    dev->bus->online -> cpu_subsys_online
      cpu_device_up
        cpu_up
          try_online_node

So we should be bringing up the node during add_cpu. Unless something
fails on the way - e.g. cpu_possible check or something similar.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02  9:24         ` David Hildenbrand
@ 2021-11-02 10:34           ` Alexey Makhalov
  2021-11-02 11:00             ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-02 10:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador


>>>> In add_memory_resource() we hotplug the new node if required and set it
>>>> online. Memory might get onlined later, via online_pages().
>>> 
>>> You are correct. In case of memory hot add, it is true. But in case of adding
>>> CPU with memoryless node, try_node_online() will be called only during CPU
>>> onlining, see cpu_up().
>>> 
>>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>>> I think it would be correct to online node during the CPU hot add to align with
>>> memory hot add.
>> 
>> I am not familiar with cpu hotplug, but this doesn't seem to be anything
>> new so how come this became problem only now?
> 
> So IIUC, the issue is that we have a node
> 
> a) That has no memory
> b) That is offline
> 
> This node will get onlined when onlining the CPU as Alexey says. Yet we
> have some code that stumbles over the node and goes ahead trying to use
> the pgdat -- that code is broken.

You are correct.

> 
> 
> If we take a look at build_zonelists() we indeed skip any
> !node_online(node). Any other code should do the same. If the node is
> not online, it shall be ignored because we might not even have a pgdat
> yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
> stale or non-existant.

Agreed, alloc_pages_node() should do the same. Not exactly skip the node,
but fall back to another node if !node_online(node).
alloc_pages_node() can also be hit while onlining the node, creating a
chicken-and-egg problem, see below.

> 
> 
> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> take a look at try_offline_node(). It checks that:
> 1) That no memory is *present*
> 2) That no CPU is *present*
> 
> We should online the node when adding the CPU ("present"), not when
> onlining the CPU.

Possible.
Assuming try_online_node() was moved under add_cpu(), let’s
take a look at this call stack:
add_cpu()
  try_online_node()
    __try_online_node()
      hotadd_new_pgdat()
At line 1190 we'll have a problem:
1183         pgdat = NODE_DATA(nid);
1184         if (!pgdat) {
1185                 pgdat = arch_alloc_nodedata(nid);
1186                 if (!pgdat)
1187                         return NULL;
1188
1189                 pgdat->per_cpu_nodestats =
1190                         alloc_percpu(struct per_cpu_nodestat);
1191                 arch_refresh_nodedata(nid, pgdat);

alloc_percpu() will iterate over all possible CPUs and will eventually end up
calling alloc_pages_node() with the nid in question for the corresponding CPU,
hitting the same state #2 problem, as NODE_DATA(nid) is still NULL and nid
is not yet online.
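
The ordering problem described above can be modelled in userspace. The following is only a sketch: every mock_* name, the CPU-to-node table and the two-node topology are assumptions made for illustration, not kernel code. It counts how many possible CPUs map to a node that has no pgdat at the moment the per-CPU allocation runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_NR_CPUS  4
#define MOCK_NR_NODES 2

/* Hypothetical stand-in for pg_data_t; only its presence matters here. */
struct mock_pgdat { int node_id; };

static struct mock_pgdat *mock_node_data[MOCK_NR_NODES]; /* models NODE_DATA() */
static bool mock_node_online[MOCK_NR_NODES];

/* Assumed topology: CPUs 0-1 on node 0, possible CPUs 2-3 on the new node 1. */
static const int mock_cpu_to_node[MOCK_NR_CPUS] = { 0, 0, 1, 1 };

/*
 * Models alloc_percpu(): walks every possible CPU and counts the CPUs whose
 * node is offline or has no pgdat, i.e. where the real allocator would fault.
 */
static int mock_alloc_percpu_bad_cpus(void)
{
	int cpu, bad = 0;

	for (cpu = 0; cpu < MOCK_NR_CPUS; cpu++) {
		int nid = mock_cpu_to_node[cpu];

		if (!mock_node_online[nid] || !mock_node_data[nid])
			bad++;
	}
	return bad;
}

/*
 * Models the ordering in hotadd_new_pgdat(): the per-CPU allocation runs
 * before the new pgdat is published and before the node is set online.
 * Returns the number of CPUs that pointed at an unusable node at that time.
 */
static int mock_hotadd_new_pgdat(int nid, struct mock_pgdat *storage)
{
	int bad = mock_alloc_percpu_bad_cpus(); /* line 1190 in the listing */

	storage->node_id = nid;
	mock_node_data[nid] = storage;          /* models arch_refresh_nodedata() */
	mock_node_online[nid] = true;
	return bad;
}
```

With node 0 already set up, calling mock_hotadd_new_pgdat() for node 1 reports the two CPUs of the new node as affected; once the node is published the count drops to zero.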

I like the idea of onlining the node when adding the CPU rather than when
the CPU goes online. It will require the current patch or another solution to
resolve the chicken-and-egg problem described above.

PS, earlier this year I initiated a discussion about redesigning the per_cpu
allocator so that it does not allocate/waste memory chunks for non-present
CPUs, but that has other complications.

Thanks,
—Alexey


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 10:34           ` Alexey Makhalov
@ 2021-11-02 11:00             ` David Hildenbrand
  2021-11-02 11:44               ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02 11:00 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Michal Hocko, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On 02.11.21 11:34, Alexey Makhalov wrote:
> 
>>>>> In add_memory_resource() we hotplug the new node if required and set it
>>>>> online. Memory might get onlined later, via online_pages().
>>>>
>>>> You are correct. In case of memory hot add, it is true. But in case of adding
>>>> CPU with memoryless node, try_node_online() will be called only during CPU
>>>> onlining, see cpu_up().
>>>>
>>>> Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()?
>>>> I think it would be correct to online node during the CPU hot add to align with
>>>> memory hot add.
>>>
>>> I am not familiar with cpu hotplug, but this doesn't seem to be anything
>>> new so how come this became problem only now?
>>
>> So IIUC, the issue is that we have a node
>>
>> a) That has no memory
>> b) That is offline
>>
>> This node will get onlined when onlining the CPU as Alexey says. Yet we
>> have some code that stumbles over the node and goes ahead trying to use
>> the pgdat -- that code is broken.
> 
> You are correct.
> 
>>
>>
>> If we take a look at build_zonelists() we indeed skip any
>> !node_online(node). Any other code should do the same. If the node is
>> not online, it shall be ignored because we might not even have a pgdat
>> yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
>> stale or non-existant.
> 
> Agree, alloc_pages_node() should also do the same. Not exactly to skip the node,
> but to fallback to another node if !node_online(node).
> alloc_pages_node() can also be hit while onlining the node, creating chicken-egg
> problem, see below

Right, the issue is also a bit involved when calling alloc_pages_node()
on an offline NID. See below.

> 
>>
>>
>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
>> take a look at try_offline_node(). It checks that:
>> 1) That no memory is *present*
>> 2) That no CPU is *present*
>>
>> We should online the node when adding the CPU ("present"), not when
>> onlining the CPU.
> 
> Possible.
> Assuming try_online_node was moved under add_cpu(), let’s
> take look on this call stack:
> add_cpu()
>   try_online_node()
>     __try_online_node()
>       hotadd_new_pgdat()
> At line 1190 we'll have a problem:
> 1183         pgdat = NODE_DATA(nid);
> 1184         if (!pgdat) {
> 1185                 pgdat = arch_alloc_nodedata(nid);
> 1186                 if (!pgdat)
> 1187                         return NULL;
> 1188
> 1189                 pgdat->per_cpu_nodestats =
> 1190                         alloc_percpu(struct per_cpu_nodestat);
> 1191                 arch_refresh_nodedata(nid, pgdat);
> 
> alloc_percpu() will go for all possible CPUs and will eventually end up
> calling alloc_pages_node() trying to use subject nid for corresponding CPU
> hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
> is not yet online.

Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
each possible CPU. We use cpu_to_node() to come up with the NID.

I can only assume that we usually don't get an offline NID for an
offline CPU, but instead either NODE=0 or NODE=NUMA_NO_NODE, because ...


alloc_pages_node()->__alloc_pages_node() will:

VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

BUT: prepare_alloc_pages()

ac->zonelist = node_zonelist(preferred_nid, gfp_mask);

should similarly fail when dereferencing NULL.
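
To illustrate why a NULL pgdat tends to produce a small, non-zero faulting address rather than exactly 0, here is a userspace sketch. The struct layouts and sizes are invented for illustration and do not match the kernel's struct pglist_data; the point is only that node_zonelist() boils down to "pgdat base plus a fixed field offset". That would be consistent with the low CR2 value (0000000000001608) in the original report, though the real offsets are not derived here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: field sizes are made up and differ from the kernel. */
struct mock_zonelist { void *zonerefs[4]; };

struct mock_pgdat {
	char node_zones[256];                   /* stand-in for the zone array */
	struct mock_zonelist node_zonelists[2];
};

/*
 * Models node_zonelist(): the result is the pgdat base address plus the
 * offset of the selected zonelist. Computed on integers so that the sketch
 * never dereferences an invalid pointer.
 */
static uintptr_t mock_node_zonelist_addr(uintptr_t pgdat_base, int idx)
{
	return pgdat_base
	       + offsetof(struct mock_pgdat, node_zonelists)
	       + (uintptr_t)idx * sizeof(struct mock_zonelist);
}
```

With a base of 0 (a NULL pgdat), the computed zonelist address equals the field offset, so the first read through it faults at a low address instead of at 0.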

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 11:00             ` David Hildenbrand
@ 2021-11-02 11:44               ` Michal Hocko
  2021-11-02 12:06                 ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-02 11:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
> On 02.11.21 11:34, Alexey Makhalov wrote:
[...]
> >> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> >> take a look at try_offline_node(). It checks that:
> >> 1) That no memory is *present*
> >> 2) That no CPU is *present*
> >>
> >> We should online the node when adding the CPU ("present"), not when
> >> onlining the CPU.
> > 
> > Possible.
> > Assuming try_online_node was moved under add_cpu(), let’s
> > take look on this call stack:
> > add_cpu()
> >   try_online_node()
> >     __try_online_node()
> >       hotadd_new_pgdat()
> > At line 1190 we'll have a problem:
> > 1183         pgdat = NODE_DATA(nid);
> > 1184         if (!pgdat) {
> > 1185                 pgdat = arch_alloc_nodedata(nid);
> > 1186                 if (!pgdat)
> > 1187                         return NULL;
> > 1188
> > 1189                 pgdat->per_cpu_nodestats =
> > 1190                         alloc_percpu(struct per_cpu_nodestat);
> > 1191                 arch_refresh_nodedata(nid, pgdat);
> > 
> > alloc_percpu() will go for all possible CPUs and will eventually end up
> > calling alloc_pages_node() trying to use subject nid for corresponding CPU
> > hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
> > is not yet online.
> 
> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
> each possible CPU. We use cpu_to_node() to come up with the NID.

Shouldn't this be numa_mem_id instead? Memoryless nodes are odd little
critters crafted into the MM code without wider considerations. From
time to time we struggle with some fallout, but the primary thing
is that zonelists should be valid for all memoryless nodes. If that is
not the case then there is a problem with the initialization code. If
somebody is providing a bogus node to allocate from then this should be
fixed. It is still not clear to me which case we are hitting here.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 11:44               ` Michal Hocko
@ 2021-11-02 12:06                 ` David Hildenbrand
  2021-11-02 12:27                   ` Michal Hocko
  2021-11-08  6:12                   ` Alexey Makhalov
  0 siblings, 2 replies; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02 12:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On 02.11.21 12:44, Michal Hocko wrote:
> On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
>> On 02.11.21 11:34, Alexey Makhalov wrote:
> [...]
>>>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
>>>> take a look at try_offline_node(). It checks that:
>>>> 1) That no memory is *present*
>>>> 2) That no CPU is *present*
>>>>
>>>> We should online the node when adding the CPU ("present"), not when
>>>> onlining the CPU.
>>>
>>> Possible.
>>> Assuming try_online_node was moved under add_cpu(), let’s
>>> take look on this call stack:
>>> add_cpu()
>>>   try_online_node()
>>>     __try_online_node()
>>>       hotadd_new_pgdat()
>>> At line 1190 we'll have a problem:
>>> 1183         pgdat = NODE_DATA(nid);
>>> 1184         if (!pgdat) {
>>> 1185                 pgdat = arch_alloc_nodedata(nid);
>>> 1186                 if (!pgdat)
>>> 1187                         return NULL;
>>> 1188
>>> 1189                 pgdat->per_cpu_nodestats =
>>> 1190                         alloc_percpu(struct per_cpu_nodestat);
>>> 1191                 arch_refresh_nodedata(nid, pgdat);
>>>
>>> alloc_percpu() will go for all possible CPUs and will eventually end up
>>> calling alloc_pages_node() trying to use subject nid for corresponding CPU
>>> hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
>>> is not yet online.
>>
>> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
>> each possible CPU. We use cpu_to_node() to come up with the NID.
> 
> Shouldn't this be numa_mem_id instead? Memory less nodes are odd little

Hm, good question. Most probably yes for offline nodes.

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 2054c9213c43..c21ff5bb91dc 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
                            gfp_t gfp)
 {
        unsigned int cpu, tcpu;
-       int i;
+       int i, nid;
 
        gfp |= __GFP_HIGHMEM;
 
        for_each_possible_cpu(cpu) {
+               nid = cpu_to_node(cpu);
+
+               if (nid == NUMA_NO_NODE || !node_online(nid))
+                       nid = numa_mem_id();
                for (i = page_start; i < page_end; i++) {
                        struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
-                       *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+                       *pagep = alloc_pages_node(nid, gfp, 0);
                        if (!*pagep)
                                goto err;
                }
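
The node selection in the diff above can be exercised in isolation. This is a userspace sketch only: the mock_* names, the online table and the local_mem_nid parameter (standing in for numa_mem_id()) are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NUMA_NO_NODE (-1)
#define MOCK_NR_NODES     4

/* Assumed state: only node 0 is online. */
static bool mock_node_online[MOCK_NR_NODES] = { true, false, false, false };

/*
 * Mirrors the check in the diff: fall back to a known-good node when the
 * CPU's node is unset or offline. local_mem_nid models numa_mem_id().
 */
static int mock_pick_alloc_nid(int cpu_nid, int local_mem_nid)
{
	if (cpu_nid == MOCK_NUMA_NO_NODE || !mock_node_online[cpu_nid])
		return local_mem_nid;
	return cpu_nid;
}
```

An offline or unset node falls back to the caller's local memory node, while an online node is passed through unchanged.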


> critters crafted into the MM code without wider considerations. From
> time to time we are struggling with some fallouts but the primary thing
> is that zonelists should be valid for all memory less nodes.

Yes, but a zonelist cannot be correct for an offline node, where we might
not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the
zonelists have to be correct. And I can spot a build_all_zonelists() call
in hotadd_new_pgdat().

I agree that someone passing an offline NID into an allocator function
should be fixed.

Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
(VM_BUG()) in case we're given an offline node, possibly with no/stale pgdat,
as preferred nid.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 12:06                 ` David Hildenbrand
@ 2021-11-02 12:27                   ` Michal Hocko
  2021-11-02 12:39                     ` David Hildenbrand
  2021-11-08  6:12                   ` Alexey Makhalov
  1 sibling, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-02 12:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On Tue 02-11-21 13:06:06, David Hildenbrand wrote:
> On 02.11.21 12:44, Michal Hocko wrote:
> > On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
> >> On 02.11.21 11:34, Alexey Makhalov wrote:
> > [...]
> >>>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> >>>> take a look at try_offline_node(). It checks that:
> >>>> 1) That no memory is *present*
> >>>> 2) That no CPU is *present*
> >>>>
> >>>> We should online the node when adding the CPU ("present"), not when
> >>>> onlining the CPU.
> >>>
> >>> Possible.
> >>> Assuming try_online_node was moved under add_cpu(), let’s
> >>> take look on this call stack:
> >>> add_cpu()
> >>>   try_online_node()
> >>>     __try_online_node()
> >>>       hotadd_new_pgdat()
> >>> At line 1190 we'll have a problem:
> >>> 1183         pgdat = NODE_DATA(nid);
> >>> 1184         if (!pgdat) {
> >>> 1185                 pgdat = arch_alloc_nodedata(nid);
> >>> 1186                 if (!pgdat)
> >>> 1187                         return NULL;
> >>> 1188
> >>> 1189                 pgdat->per_cpu_nodestats =
> >>> 1190                         alloc_percpu(struct per_cpu_nodestat);
> >>> 1191                 arch_refresh_nodedata(nid, pgdat);
> >>>
> >>> alloc_percpu() will go for all possible CPUs and will eventually end up
> >>> calling alloc_pages_node() trying to use subject nid for corresponding CPU
> >>> hitting the same state #2 problem as NODE_DATA(nid) is still NULL and nid
> >>> is not yet online.
> >>
> >> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
> >> each possible CPU. We use cpu_to_node() to come up with the NID.
> > 
> > Shouldn't this be numa_mem_id instead? Memory less nodes are odd little
> 
> Hm, good question. Most probably yes for offline nodes.
> 
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213c43..c21ff5bb91dc 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>                             gfp_t gfp)
>  {
>         unsigned int cpu, tcpu;
> -       int i;
> +       int i, nid;
>  
>         gfp |= __GFP_HIGHMEM;
>  
>         for_each_possible_cpu(cpu) {
> +               nid = cpu_to_node(cpu);
> +
> +               if (nid == NUMA_NO_NODE || !node_online(nid))
> +                       nid = numa_mem_id();

or simply nid = cpu_to_mem(cpu)

>                 for (i = page_start; i < page_end; i++) {
>                         struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>  
> -                       *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +                       *pagep = alloc_pages_node(nid, gfp, 0);
>                         if (!*pagep)
>                                 goto err;
>                 }
> 
> 
> > critters crafted into the MM code without wider considerations. From
> > time to time we are struggling with some fallouts but the primary thing
> > is that zonelists should be valid for all memory less nodes.
> 
> Yes, but a zonelist cannot be correct for an offline node, where we might
> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().

Yes, that is what I had in mind. We are talking about two things here:
memoryless nodes and offline nodes. The latter sounds like a bug to me.

> I agree that someone passing an offline NID into an allocator function
> should be fixed.

Right

> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
> preferred nid.

Historically, those allocation interfaces were not trying to be robust
against wrong inputs because that adds cpu cycles for everybody for
"what if buggy" code. This has worked (surprisingly) well. Memoryless
nodes have brought in some confusion but this is still something that we
can address on a higher level. Nobody gives arbitrary nodes as an input.
cpu_to_node might be tricky because it can point to a memoryless node,
which along with __GFP_THISNODE is very likely not something anybody
wants. Hence cpu_to_mem should be used for allocations. I hate that we have
two very similar APIs...
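
The difference between the two lookups can be sketched in userspace. Everything below is an assumption for illustration: the mock_* names, the two tables and the topology (node 1 online but memoryless, with node 0 as its nearest memory) are invented, and the __GFP_THISNODE behaviour is reduced to "succeeds only if the requested node has memory".

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NR_CPUS  4
#define MOCK_NR_NODES 2

/* Assumed topology: node 1 is online but memoryless. */
static const bool mock_node_has_memory[MOCK_NR_NODES] = { true, false };
static const int mock_numa_node[MOCK_NR_CPUS] = { 0, 0, 1, 1 }; /* CPU's own node */
static const int mock_numa_mem[MOCK_NR_CPUS]  = { 0, 0, 0, 0 }; /* nearest node with memory */

/* Models cpu_to_node(): may name a memoryless node. */
static int mock_cpu_to_node(int cpu) { return mock_numa_node[cpu]; }

/* Models cpu_to_mem(): always names a node that has memory. */
static int mock_cpu_to_mem(int cpu) { return mock_numa_mem[cpu]; }

/*
 * Models a __GFP_THISNODE request: no fallback to other nodes is allowed,
 * so it can only succeed if the requested node itself has memory.
 */
static bool mock_thisnode_alloc_ok(int nid)
{
	return mock_node_has_memory[nid];
}
```

For a CPU on the memoryless node, the cpu_to_node-style lookup yields a node that a strict this-node request cannot satisfy, while the cpu_to_mem-style lookup yields one that can.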

But something seems wrong in this case. cpu_to_node shouldn't return
offline nodes. That is just a land mine. It is not clear to me how the
cpu has been brought up so that the numa node allocation was left
behind. As pointed out in the other email, add_cpu (resp. cpu_up) is not it.
Is it possible that the cpu bring-up only got halfway?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 12:27                   ` Michal Hocko
@ 2021-11-02 12:39                     ` David Hildenbrand
  2021-11-02 13:25                       ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

>> Yes, but a zonelist cannot be correct for an offline node, where we might
>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
> 
> Yes, that is what I had in mind. We are talking about two things here.
> Memoryless nodes and offline nodes. The later sounds like a bug to me.

Agreed. memoryless nodes should just have proper zonelists -- which
seems to be the case.

>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
>> preferred nid.
> 
> Historically, those allocation interfaces were not trying to be robust
> against wrong inputs because that adds cpu cycles for everybody for
> "what if buggy" code. This has worked (surprisingly) well. Memory less
> nodes have brought in some confusion but this is still something that we
> can address on a higher level. Nobody give arbitrary nodes as an input.
> cpu_to_node might be tricky because it can point to a memory less node
> which along with __GFP_THISNODE is very likely not something anybody
> wants. Hence cpu_to_mem should be used for allocations. I hate we have
> two very similar APIs...

To be precise, I'm wondering if we should do:

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..8c49b88336ee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,7 +565,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
        VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-       VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+       VM_WARN_ON(!node_online(nid));

        return __alloc_pages(gfp_mask, order, nid, NULL);
 }

(Or maybe VM_BUG_ON)

Because it cannot possibly work and we'll dereference NULL later.

> 
> But something seems wrong in this case. cpu_to_node shouldn't return
> offline nodes. That is just a land mine. It is not clear to me how the
> cpu has been brought up so that the numa node allocation was left
> behind. As pointed in other email add_cpu resp. cpu_up is not it.
> Is it possible that the cpu bring up was only half way?

I tried to follow the code (what sets a CPU present, what sets a CPU
online, when do we update cpu_to_node() mapping) and IMHO it's all a big
mess. Maybe it's clearer to people familiar with that code, but CPU
hotplug in general seems to be a confusing piece of (arch-specific) code.

Also, I have no clue if cpu_to_node() mapping will get invalidated after
unplugging that CPU, or if the mapping will simply stay around for all
eternity ...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 12:39                     ` David Hildenbrand
@ 2021-11-02 13:25                       ` Michal Hocko
  2021-11-02 13:41                         ` David Hildenbrand
  2021-11-02 13:52                         ` Oscar Salvador
  0 siblings, 2 replies; 98+ messages in thread
From: Michal Hocko @ 2021-11-02 13:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
> >> Yes, but a zonelist cannot be correct for an offline node, where we might
> >> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
> >> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
> > 
> > Yes, that is what I had in mind. We are talking about two things here.
> > Memoryless nodes and offline nodes. The later sounds like a bug to me.
> 
> Agreed. memoryless nodes should just have proper zonelists -- which
> seems to be the case.
> 
> >> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
> >> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
> >> preferred nid.
> > 
> > Historically, those allocation interfaces were not trying to be robust
> > against wrong inputs because that adds cpu cycles for everybody for
> > "what if buggy" code. This has worked (surprisingly) well. Memory less
> > nodes have brought in some confusion but this is still something that we
> > can address on a higher level. Nobody give arbitrary nodes as an input.
> > cpu_to_node might be tricky because it can point to a memory less node
> > which along with __GFP_THISNODE is very likely not something anybody
> > wants. Hence cpu_to_mem should be used for allocations. I hate we have
> > two very similar APIs...
> 
> To be precise, I'm wondering if we should do:
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..8c49b88336ee 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -565,7 +565,7 @@ static inline struct page *
>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  {
>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -       VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
> +       VM_WARN_ON(!node_online(nid));
> 
>         return __alloc_pages(gfp_mask, order, nid, NULL);
>  }
> 
> (Or maybe VM_BUG_ON)
> 
> Because it cannot possibly work and we'll dereference NULL later.

VM_BUG_ON would be silent for most configurations and the crash would happen
even without it, so I am not sure about the additional value. VM_WARN_ON
doesn't really add much on top - except that it would crash in some
configurations. If we really care to catch this case then we would have
to do a reasonable fallback with a printk note and a stack dump.
Something like
	if (unlikely(!node_online(nid))) {
		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n", nid);
		dump_stack();
		nid = numa_mem_id();
	}

But again this is adding quite some cycles to a hotpath of the page
allocator. Is this worth it?
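
The "silent for most configurations" point can be shown with a userspace sketch. MOCK_DEBUG_VM is a hypothetical switch standing in for CONFIG_DEBUG_VM, and all mock_* names are assumptions for illustration; the macro only mimics the idea that the check disappears when the debug option is off.

```c
#include <assert.h>
#include <stdbool.h>

static int mock_warn_count;	/* how many times the check fired */

/* MOCK_DEBUG_VM is a hypothetical stand-in for CONFIG_DEBUG_VM. */
#ifdef MOCK_DEBUG_VM
#define MOCK_VM_WARN_ON(cond) do { if (cond) mock_warn_count++; } while (0)
#else
/* Without the debug option the condition is type-checked but never evaluated. */
#define MOCK_VM_WARN_ON(cond) ((void)sizeof(!(cond)))
#endif

/* Assumed state: node 0 online, node 1 offline. */
static bool mock_node_online[2] = { true, false };

/*
 * Returns the nid the allocation would proceed with. When the check is
 * compiled out, an offline nid passes straight through without any report.
 */
static int mock_alloc_pages_node(int nid)
{
	MOCK_VM_WARN_ON(!mock_node_online[nid]);
	return nid;
}
```

Built without MOCK_DEBUG_VM, passing the offline node 1 leaves mock_warn_count at 0 and the offline nid is used as-is, which is the situation a fallback with an unconditional check would be meant to catch.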

> > But something seems wrong in this case. cpu_to_node shouldn't return
> > offline nodes. That is just a land mine. It is not clear to me how the
> > cpu has been brought up so that the numa node allocation was left
> > behind. As pointed in other email add_cpu resp. cpu_up is not it.
> > Is it possible that the cpu bring up was only half way?
> 
> I tried to follow the code (what sets a CPU present, what sets a CPU
> online, when do we update cpu_to_node() mapping) and IMHO it's all a big
> mess. Maybe it's clearer to people familiar with that code, but CPU
> hotplug in general seems to be a confusing piece of (arch-specific) code.

Yes there are different arch specific parts that make this quite hard to
follow.

I think we want to learn how exactly Alexey brought that cpu up. Because
his initial thought on add_cpu resp. cpu_up doesn't seem to be correct.
Or I am just not following the code properly. Once we know all those
details we can get in touch with the cpu hotplug maintainers and see what
we can do.

Btw. do you plan to send a patch for the pcp allocator to use cpu_to_mem?
One last thing, there were some mentions of __GFP_THISNODE but I fail to
see the connection with the pcp allocator...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 13:25                       ` Michal Hocko
@ 2021-11-02 13:41                         ` David Hildenbrand
  2021-11-02 14:12                           ` Michal Hocko
  2021-11-02 13:52                         ` Oscar Salvador
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02 13:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On 02.11.21 14:25, Michal Hocko wrote:
> On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
>>>> Yes, but a zonelist cannot be correct for an offline node, where we might
>>>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>>>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the zone lists have to be correct. And I can spot an build_all_zonelists() in hotadd_new_pgdat().
>>>
>>> Yes, that is what I had in mind. We are talking about two things here.
>>> Memoryless nodes and offline nodes. The later sounds like a bug to me.
>>
>> Agreed. memoryless nodes should just have proper zonelists -- which
>> seems to be the case.
>>
>>>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>>>> (VM_BUG()) in case we're providing an offline node with eventually no/stale pgdat as
>>>> preferred nid.
>>>
>>> Historically, those allocation interfaces were not trying to be robust
>>> against wrong inputs because that adds cpu cycles for everybody for
>>> "what if buggy" code. This has worked (surprisingly) well. Memory less
>>> nodes have brought in some confusion but this is still something that we
>>> can address on a higher level. Nobody give arbitrary nodes as an input.
>>> cpu_to_node might be tricky because it can point to a memory less node
>>> which along with __GFP_THISNODE is very likely not something anybody
>>> wants. Hence cpu_to_mem should be used for allocations. I hate we have
>>> two very similar APIs...
>>
>> To be precise, I'm wondering if we should do:
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 55b2ec1f965a..8c49b88336ee 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -565,7 +565,7 @@ static inline struct page *
>>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>>  {
>>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> -       VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> +       VM_WARN_ON(!node_online(nid));
>>
>>         return __alloc_pages(gfp_mask, order, nid, NULL);
>>  }
>>
>> (Or maybe VM_BUG_ON)
>>
>> Because it cannot possibly work and we'll dereference NULL later.
> 
> VM_BUG_ON would be silent for most configurations and crash would happen
> even without it so I am not sure about the additional value. VM_WARN_ON
> doesn't really add much on top - except it would crash in some
> configurations. If we really care to catch this case then we would have
> to do a reasonable fallback with a printk note and a dumps stack.

As I learned, VM_BUG_ON and friends are active for e.g., Fedora, which
can catch quite some issues early, before they end up in enterprise
distro kernels. I think it has value.

> Something like
> 	if (unlikely(!node_online(nid))) {
> 		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n");
> 		dump_stack();
> 		nid = numa_mem_id();
> 	}
> 
> But again this is adding quite some cycles to a hotpath of the page
> allocator. Is this worth it?

Don't think a fallback makes sense.

> 
>>> But something seems wrong in this case. cpu_to_node shouldn't return
>>> offline nodes. That is just a land mine. It is not clear to me how the
>>> cpu has been brought up so that the numa node allocation was left
>>> behind. As pointed in other email add_cpu resp. cpu_up is not it.
>>> Is it possible that the cpu bring up was only half way?
>>
>> I tried to follow the code (what sets a CPU present, what sets a CPU
>> online, when do we update cpu_to_node() mapping) and IMHO it's all a big
>> mess. Maybe it's clearer to people familiar with that code, but CPU
>> hotplug in general seems to be a confusing piece of (arch-specific) code.
> 
> Yes there are different arch specific parts that make this quite hard to
> follow.
> 
> I think we want to learn how exactly Alexey brought that cpu up. Because
> his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with cpu hotplug maintainers and see what
> we can do.

Yes.

> 
> Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?

You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?

cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
won't help for this very report.

> One last thing, there were some mentions of __GFP_THISNODE but I fail to
> see connection with the pcp allocator...

Me too. If pcpu were using __GFP_THISNODE, we'd be hitting the
VM_WARN_ON but still crash.
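
A small userspace sketch of the two warning conditions discussed above,
assuming the percpu path does not set __GFP_THISNODE. MODEL_GFP_THISNODE
and the model_warn_* helpers are hypothetical names for illustration, and
the flag value is not the real bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value only; the real __GFP_THISNODE bit differs. */
#define MODEL_GFP_THISNODE 0x1u

/* Condition currently checked in __alloc_pages_node(). */
static bool model_warn_current(unsigned int gfp_mask, bool node_is_online)
{
	return (gfp_mask & MODEL_GFP_THISNODE) && !node_is_online;
}

/* Condition proposed earlier in this thread. */
static bool model_warn_proposed(unsigned int gfp_mask, bool node_is_online)
{
	(void)gfp_mask; /* the proposed check ignores the gfp flags */
	return !node_is_online;
}
```

For an offline node and a gfp mask without the THISNODE bit, only the
proposed condition is true. Either way it is just a warning, so the
allocation would still go on to dereference NODE_DATA(nid).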

-- 
Thanks,

David / dhildenb



* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 13:25                       ` Michal Hocko
  2021-11-02 13:41                         ` David Hildenbrand
@ 2021-11-02 13:52                         ` Oscar Salvador
  2021-11-02 14:35                           ` Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: Oscar Salvador @ 2021-11-02 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Alexey Makhalov, linux-mm, Andrew Morton,
	linux-kernel, stable, Oscar Salvador

On Tue, Nov 02, 2021 at 02:25:03PM +0100, Michal Hocko wrote:
> I think we want to learn how exactly Alexey brought that cpu up. Because
> his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with cpu hotplug maintainers and see what
> we can do.

I am not really familiar with CPU hot-onlining, but I have been taking a look.
As with memory, there are two different stages, hot-adding and onlining (and the
counterparts).

Part of the hot-adding being:

acpi_processor_get_info
 acpi_processor_hotadd_init
  arch_register_cpu
   register_cpu

One of the things that register_cpu() does is to set cpu->dev.bus pointing to
&cpu_subsys, which is:

struct bus_type cpu_subsys = {
	.name = "cpu",
	.dev_name = "cpu",
	.match = cpu_subsys_match,
#ifdef CONFIG_HOTPLUG_CPU
	.online = cpu_subsys_online,
	.offline = cpu_subsys_offline,
#endif
};

Then, the onlining part (in case of a udev rule or someone onlining the device)
would be:

online_store
 device_online
  cpu_subsys_online
   cpu_device_up
    cpu_up
     ...
     online node

Since Alexey disabled the udev rule and no one onlined the CPU, online_store()->
device_online() wasn't really called.

The following only applies to x86_64:
I think we got confused because cpu_device_up() is also called from add_cpu(),
but that is an exported function and x86 does not call add_cpu() unless for
debugging purposes (check kernel/torture.c and arch/x86/kernel/topology.c).
It does the onlining through online_store()...
So we can take add_cpu() off the equation here.
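
To make the two stages concrete, here is a rough userspace model of the
flow described above. All model_* names and the array sizes are made up
for this sketch; it only captures the ordering (hot-add sets the cpu->node
mapping, onlining brings the node up) and none of the real hotplug code.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS  4
#define MODEL_NR_NODES 2

static bool model_cpu_present[MODEL_NR_CPUS];
static bool model_cpu_online[MODEL_NR_CPUS];
static int  model_cpu_node[MODEL_NR_CPUS];
static bool model_node_online[MODEL_NR_NODES];

/* Stage 1: hot-add. The CPU is registered and mapped to its node,
 * but neither the CPU nor a memoryless node is brought online. */
static void model_hot_add_cpu(int cpu, int nid)
{
	model_cpu_present[cpu] = true;
	model_cpu_node[cpu] = nid;
}

/* Stage 2: onlining (udev rule or a write to the sysfs "online" file).
 * In this model the node is onlined as part of bringing the CPU up. */
static void model_online_cpu(int cpu)
{
	model_node_online[model_cpu_node[cpu]] = true;
	model_cpu_online[cpu] = true;
}

/* True while a present CPU points at a node that is still offline. */
static bool model_in_problem_window(int cpu)
{
	return model_cpu_present[cpu] &&
	       !model_node_online[model_cpu_node[cpu]];
}
```

With the udev rule disabled, a CPU hot-added onto a new memoryless node
stays in the window reported by model_in_problem_window() until someone
onlines it by hand.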


-- 
Oscar Salvador
SUSE Labs


* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 13:41                         ` David Hildenbrand
@ 2021-11-02 14:12                           ` Michal Hocko
  2021-11-02 14:44                             ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-02 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On Tue 02-11-21 14:41:25, David Hildenbrand wrote:
> On 02.11.21 14:25, Michal Hocko wrote:
[...]
> > Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?
> 
> You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?

just cpu_to_mem

> cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
> won't help for this very report.

Weird, x86 allows memoryless nodes as well. But you are right,
there is nothing selecting HAVE_MEMORYLESS_NODES, nor do I see any
arch specific implementation. I have to say that I have forgotten all those
nasty details... Sigh
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 13:52                         ` Oscar Salvador
@ 2021-11-02 14:35                           ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-11-02 14:35 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, Alexey Makhalov, linux-mm, Andrew Morton,
	linux-kernel, stable, Oscar Salvador

On Tue 02-11-21 14:52:01, Oscar Salvador wrote:
> On Tue, Nov 02, 2021 at 02:25:03PM +0100, Michal Hocko wrote:
> > I think we want to learn how exactly Alexey brought that cpu up. Because
> > his initial thought on add_cpu resp cpu_up doesn't seem to be correct.
> > Or I am just not following the code properly. Once we know all those
> > details we can get in touch with cpu hotplug maintainers and see what
> > we can do.
> 
> I am not really familiar with CPU hot-onlining, but I have been taking a look.
> As with memory, there are two different stages, hot-adding and onlining (and the
> counterparts).
> 
> Part of the hot-adding being:
> 
> acpi_processor_get_info
>  acpi_processor_hotadd_init
>   arch_register_cpu
>    register_cpu
> 
> One of the things that register_cpu() does is to set cpu->dev.bus pointing to
> &cpu_subsys, which is:
> 
> struct bus_type cpu_subsys = {
> 	.name = "cpu",
> 	.dev_name = "cpu",
> 	.match = cpu_subsys_match,
> #ifdef CONFIG_HOTPLUG_CPU
> 	.online = cpu_subsys_online,
> 	.offline = cpu_subsys_offline,
> #endif
> };
> 
> Then, the onlining part (in case of a udev rule or someone onlining the device)
> would be:
> 
> online_store
>  device_online
>   cpu_subsys_online
>    cpu_device_up
>     cpu_up
>      ...
>      online node
> 
> Since Alexey disabled the udev rule and no one onlined the CPU, online_store()->
> device_online() wasn't really called.
> 
> The following only applies to x86_64:
> I think we got confused because cpu_device_up() is also called from add_cpu(),
> but that is an exported function and x86 does not call add_cpu() except for
> debugging purposes (check kernel/torture.c and arch/x86/kernel/topology.c).
> It does the onlining through online_store()...
> So we can take add_cpu() out of the equation here.

Yes, so the real problem is the following (thanks for pointing me to the
acpi code). The cpu->node association is done in acpi_map_cpu2node and I
suspect this expects that the node is already present, as it gets the
information from SRAT/PXM tables which are parsed during boot. But I might
be just confused, or maybe VMware just injects new entries here somehow.

Another interesting thing is that acpi_map_cpu2node skips the
association if there is no node found in SRAT, but that should only mean
it would use the default initialization, which should hopefully be 0.

Anyway, I have found in my notes
https://www.spinics.net/lists/kernel/msg3010886.html which is a slightly
different problem but it has some notes about how the initialization
mess works (that one was boot time though and hotplug might be different
actually).

I have run out of time for this today so hopefully somebody can re-learn
that from there...

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 14:12                           ` Michal Hocko
@ 2021-11-02 14:44                             ` David Hildenbrand
  0 siblings, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2021-11-02 14:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

On 02.11.21 15:12, Michal Hocko wrote:
> On Tue 02-11-21 14:41:25, David Hildenbrand wrote:
>> On 02.11.21 14:25, Michal Hocko wrote:
> [...]
>>> Btw. do you plan to send a patch for pcp allocator to use cpu_to_mem?
>>
>> You mean s/cpu_to_node/cpu_to_mem/ or also handling offline nids?
> 
> just cpu_to_mem
> 
>> cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
>> won't help for this very report.
> 
> Weird, x86 allows memoryless nodes as well. But you are right,
> there is nothing selecting HAVE_MEMORYLESS_NODES, nor do I see any
> arch specific implementation. I have to say that I have forgotten all those
> nasty details... Sigh
> 

I assume HAVE_MEMORYLESS_NODES is just an optimization to set a
preferred memory node for memoryless nodes. It doesn't imply that we
cannot have memoryless nodes otherwise.

I suspect that, as so often, the config option name doesn't express what
it really does.
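
As a sketch of what I mean, assuming cpu_to_mem() simply falls back to
cpu_to_node() when the option is not selected: the tables, the model_*
names and the runtime flag (standing in for the compile-time
CONFIG_HAVE_MEMORYLESS_NODES) are invented for this userspace example.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Per-CPU node ids; in this example CPU 3 sits on memoryless node 1. */
static const int model_cpu_node[MODEL_NR_CPUS]     = { 0, 0, 0, 1 };
/* Nearest node with memory, only tracked when the option is enabled. */
static const int model_cpu_mem_node[MODEL_NR_CPUS] = { 0, 0, 0, 0 };

static int model_cpu_to_node(int cpu)
{
	return model_cpu_node[cpu];
}

/* have_memoryless_nodes stands in for CONFIG_HAVE_MEMORYLESS_NODES. */
static int model_cpu_to_mem(int cpu, bool have_memoryless_nodes)
{
	if (have_memoryless_nodes)
		return model_cpu_mem_node[cpu];
	/* Option disabled: cpu_to_mem() is just cpu_to_node(). */
	return model_cpu_to_node(cpu);
}
```

So with the option disabled, the memoryless node id is handed back
unchanged. Memoryless nodes can still exist; the lookup just does not
redirect to a node with memory.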

-- 
Thanks,

David / dhildenb



* Re: [PATCH] mm: fix panic in __alloc_pages
  2021-11-02 12:06                 ` David Hildenbrand
  2021-11-02 12:27                   ` Michal Hocko
@ 2021-11-08  6:12                   ` Alexey Makhalov
  2021-11-08  6:36                     ` [PATCH v2] " Alexey Makhalov
  1 sibling, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-08  6:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, Andrew Morton, linux-kernel, stable,
	Oscar Salvador

I’m going to send patch v2, with the node_online check moved to the caller (pcpu_alloc_pages()),
as was suggested by David. It seems to be the only place that passes a present but offline
node to alloc_pages_node(). Moving the node online check to the caller keeps the hot path
(alloc_pages) simple and performant.

> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213c43..c21ff5bb91dc 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>                           gfp_t gfp)
> {
>       unsigned int cpu, tcpu;
> -       int i;
> +       int i, nid;
> 
>       gfp |= __GFP_HIGHMEM;
> 
>       for_each_possible_cpu(cpu) {
> +               nid = cpu_to_node(cpu);
> +
> +               if (nid == NUMA_NO_NODE || !node_online(nid))
> +                       nid = numa_mem_id();
>               for (i = page_start; i < page_end; i++) {
>                       struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
> 
> -                       *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +                       *pagep = alloc_pages_node(nid, gfp, 0);
>                       if (!*pagep)
>                               goto err;
>               }
> 

Thanks,
—Alexey




* [PATCH v2] mm: fix panic in __alloc_pages
  2021-11-08  6:12                   ` Alexey Makhalov
@ 2021-11-08  6:36                     ` Alexey Makhalov
  2021-11-08  8:32                       ` David Hildenbrand
  2021-11-08 10:37                       ` [PATCH v2] mm: fix panic in __alloc_pages Michal Hocko
  0 siblings, 2 replies; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-08  6:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Alexey Makhalov, Andrew Morton, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

There is a kernel panic caused by pcpu_alloc_pages() passing an
offline and uninitialized node to alloc_pages_node(), leading to a
panic when the uninitialized (NULL) NODE_DATA(nid) is dereferenced.

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

The panic can be easily reproduced by disabling the udev rule for
automatic onlining of hot-added CPUs and then hot-adding a CPU with
a memoryless node (a NUMA node with CPUs only).

Hot-adding a CPU and its memoryless node does not bring the node
online. A memoryless node is onlined only when its CPU is onlined.

A node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

Percpu code does allocations for all possible CPUs. The issue
happens when it serves a hot-added but not yet onlined CPU whose
node is in the 2nd state. Such a node is not ready for use, so
fall back to numa_mem_id().

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 mm/percpu-vm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 2054c9213..f58d73c92 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 			    gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
-	int i;
+	int i, nid;
 
 	gfp |= __GFP_HIGHMEM;
 
 	for_each_possible_cpu(cpu) {
+		nid = cpu_to_node(cpu);
+		if (nid == NUMA_NO_NODE || !node_online(nid))
+			nid = numa_mem_id();
+
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
-			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+			*pagep = alloc_pages_node(nid, gfp, 0);
 			if (!*pagep)
 				goto err;
 		}
-- 
2.30.0



* Re: [PATCH v2] mm: fix panic in __alloc_pages
  2021-11-08  6:36                     ` [PATCH v2] " Alexey Makhalov
@ 2021-11-08  8:32                       ` David Hildenbrand
  2021-11-08 20:23                         ` [PATCH v3] " Alexey Makhalov
  2021-11-08 10:37                       ` [PATCH v2] mm: fix panic in __alloc_pages Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-08  8:32 UTC (permalink / raw)
  To: Alexey Makhalov, linux-mm
  Cc: Andrew Morton, Michal Hocko, Oscar Salvador, Dennis Zhou,
	Tejun Heo, Christoph Lameter, linux-kernel, stable

On 08.11.21 07:36, Alexey Makhalov wrote:
> There is a kernel panic caused by pcpu_alloc_pages() passing an
> offline and uninitialized node to alloc_pages_node(), leading to a
> panic when the uninitialized (NULL) NODE_DATA(nid) is dereferenced.
> 
>  CPU2 has been hot-added
>  BUG: unable to handle page fault for address: 0000000000001608
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] SMP PTI
>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> 
>  RIP: 0010:__alloc_pages+0x127/0x290
>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>  Call Trace:
>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>   pcpu_populate_chunk+0x33/0xb0
>   pcpu_alloc+0x4d3/0x6f0
>   __alloc_percpu_gfp+0xd/0x10
>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>   mem_cgroup_alloc+0xed/0x2f0
>   mem_cgroup_css_alloc+0x33/0x2f0
>   css_create+0x3a/0x1f0
>   cgroup_apply_control_enable+0x12b/0x150
>   cgroup_mkdir+0xdd/0x110
>   kernfs_iop_mkdir+0x4f/0x80
>   vfs_mkdir+0x178/0x230
>   do_mkdirat+0xfd/0x120
>   __x64_sys_mkdir+0x47/0x70
>   ? syscall_exit_to_user_mode+0x21/0x50
>   do_syscall_64+0x43/0x90
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The panic can be easily reproduced by disabling the udev rule for
> automatic onlining of hot-added CPUs and then hot-adding a CPU with
> a memoryless node (a NUMA node with CPUs only).
> 
> Hot-adding a CPU and its memoryless node does not bring the node
> online. A memoryless node is onlined only when its CPU is onlined.
> 
> A node can be in one of the following states:
> 1. not present (nid == NUMA_NO_NODE)
> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> 				NODE_DATA(nid) == NULL)
> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> 				NODE_DATA(nid) != NULL)
> 
> Percpu code does allocations for all possible CPUs. The issue
> happens when it serves a hot-added but not yet onlined CPU whose
> node is in the 2nd state. Such a node is not ready for use, so
> fall back to numa_mem_id().
> 
> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  mm/percpu-vm.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213..f58d73c92 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>  			    gfp_t gfp)
>  {
>  	unsigned int cpu, tcpu;
> -	int i;
> +	int i, nid;
>  
>  	gfp |= __GFP_HIGHMEM;
>  
>  	for_each_possible_cpu(cpu) {
> +		nid = cpu_to_node(cpu);

As raised by Michal, we could use cpu_to_mem() here instead of
cpu_to_node(). However, AFAIU, it's a pure optimization to avoid the
fallback path:

Documentation/vm/numa.rst:

"If the architecture supports--does not hide--memoryless nodes, then
CPUs attached to memoryless nodes would always incur the fallback path
overhead  or some subsystems would fail to initialize if they attempted
to allocated memory exclusively from a node without memory.  To support
such architectures transparently, kernel subsystems can use the
numa_mem_id() or cpu_to_mem() function to locate the "local memory node"
for the calling or specified CPU.  Again, this is the same node from
which default, local page allocations will be attempted."


The root issue here is that we're iterating possible CPUs (not online or
present CPUs), belonging to nodes that might not be online yet. I agree
that this fix, although sub-optimal, might be the right thing to do for
now. It would be different if we were iterating online CPUs.
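
For reference, the selection logic the patch adds can be modelled in
userspace roughly as below. MODEL_NO_NODE stands in for NUMA_NO_NODE, and
the node and CPU tables plus the model_* helpers are invented for this
sketch; only the shape of the check follows the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NO_NODE  (-1)   /* stands in for NUMA_NO_NODE */
#define MODEL_NR_NODES 4

static const bool model_node_online[MODEL_NR_NODES] = {
	true, false, true, false
};

/* Node id recorded for each possible CPU (online or not). */
static const int model_possible_cpu_node[] = { 0, 2, 1, MODEL_NO_NODE };

/* Mirrors the check added by the patch: use nid only if it names an
 * online node, otherwise fall back to the caller's local node. */
static int model_pick_alloc_node(int nid, int local_nid)
{
	if (nid == MODEL_NO_NODE || !model_node_online[nid])
		return local_nid;
	return nid;
}

/* Count how many possible CPUs end up on the fallback node. */
static int model_count_fallbacks(int local_nid)
{
	int nr = (int)(sizeof(model_possible_cpu_node) /
		       sizeof(model_possible_cpu_node[0]));
	int cpu, n = 0;

	for (cpu = 0; cpu < nr; cpu++) {
		int nid = model_possible_cpu_node[cpu];

		if (model_pick_alloc_node(nid, local_nid) != nid)
			n++;
	}
	return n;
}
```

Iterating only online CPUs would never see the last two entries of the
table in this model, which is why the fallback is needed only on the
possible-CPU walk.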


Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2] mm: fix panic in __alloc_pages
  2021-11-08  6:36                     ` [PATCH v2] " Alexey Makhalov
  2021-11-08  8:32                       ` David Hildenbrand
@ 2021-11-08 10:37                       ` Michal Hocko
  1 sibling, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-11-08 10:37 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Oscar Salvador,
	Dennis Zhou, Tejun Heo, Christoph Lameter, linux-kernel, stable

On Sun 07-11-21 22:36:50, Alexey Makhalov wrote:
> There is a kernel panic caused by pcpu_alloc_pages() passing an
> offline and uninitialized node to alloc_pages_node(), leading to a
> panic when the uninitialized (NULL) NODE_DATA(nid) is dereferenced.
> 
>  CPU2 has been hot-added
>  BUG: unable to handle page fault for address: 0000000000001608
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] SMP PTI
>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> 
>  RIP: 0010:__alloc_pages+0x127/0x290
>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>  Call Trace:
>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>   pcpu_populate_chunk+0x33/0xb0
>   pcpu_alloc+0x4d3/0x6f0
>   __alloc_percpu_gfp+0xd/0x10
>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>   mem_cgroup_alloc+0xed/0x2f0
>   mem_cgroup_css_alloc+0x33/0x2f0
>   css_create+0x3a/0x1f0
>   cgroup_apply_control_enable+0x12b/0x150
>   cgroup_mkdir+0xdd/0x110
>   kernfs_iop_mkdir+0x4f/0x80
>   vfs_mkdir+0x178/0x230
>   do_mkdirat+0xfd/0x120
>   __x64_sys_mkdir+0x47/0x70
>   ? syscall_exit_to_user_mode+0x21/0x50
>   do_syscall_64+0x43/0x90
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The panic can be easily reproduced by disabling the udev rule for
> automatic onlining of hot-added CPUs and then hot-adding a CPU with
> a memoryless node (a NUMA node with CPUs only).
> 
> Hot-adding a CPU and its memoryless node does not bring the node
> online. A memoryless node is onlined only when its CPU is onlined.
> 
> A node can be in one of the following states:
> 1. not present (nid == NUMA_NO_NODE)
> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> 				NODE_DATA(nid) == NULL)
> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> 				NODE_DATA(nid) != NULL)
> 
> Percpu code does allocations for all possible CPUs. The issue
> happens when it serves a hot-added but not yet onlined CPU whose
> node is in the 2nd state. Such a node is not ready for use, so
> fall back to numa_mem_id().

I do agree that cpu_to_mem usage is better here. But I still think this
is papering over a deeper problem. We should never allow cpu_to_mem to
return an invalid numa node.

-- 
Michal Hocko
SUSE Labs


* [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-08  8:32                       ` David Hildenbrand
@ 2021-11-08 20:23                         ` Alexey Makhalov
  2021-11-09  2:08                           ` Eric Dumazet
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-08 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Alexey Makhalov, Andrew Morton, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

There is a kernel panic caused by pcpu_alloc_pages() passing an
offline and uninitialized node to alloc_pages_node(), leading to a
panic when the uninitialized (NULL) NODE_DATA(nid) is dereferenced.

 CPU2 has been hot-added
 BUG: unable to handle page fault for address: 0000000000001608
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP PTI
 CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

 RIP: 0010:__alloc_pages+0x127/0x290
 Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
 RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
 RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
 RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
 R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
 FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
 Call Trace:
  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
  pcpu_populate_chunk+0x33/0xb0
  pcpu_alloc+0x4d3/0x6f0
  __alloc_percpu_gfp+0xd/0x10
  alloc_mem_cgroup_per_node_info+0x54/0xb0
  mem_cgroup_alloc+0xed/0x2f0
  mem_cgroup_css_alloc+0x33/0x2f0
  css_create+0x3a/0x1f0
  cgroup_apply_control_enable+0x12b/0x150
  cgroup_mkdir+0xdd/0x110
  kernfs_iop_mkdir+0x4f/0x80
  vfs_mkdir+0x178/0x230
  do_mkdirat+0xfd/0x120
  __x64_sys_mkdir+0x47/0x70
  ? syscall_exit_to_user_mode+0x21/0x50
  do_syscall_64+0x43/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

The panic can be easily reproduced by disabling the udev rule for
automatic onlining of hot-added CPUs and then hot-adding a CPU with
a memoryless node (a NUMA node with CPUs only).

Hot-adding a CPU and its memoryless node does not bring the node
online. A memoryless node is onlined only when its CPU is onlined.

A node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
				NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
				NODE_DATA(nid) != NULL)

Percpu code does allocations for all possible CPUs. The issue
happens when it serves a hot-added but not yet onlined CPU whose
node is in the 2nd state. Such a node is not ready for use, so
fall back to numa_mem_id().

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 mm/percpu-vm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 2054c9213..f58d73c92 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 			    gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
-	int i;
+	int i, nid;
 
 	gfp |= __GFP_HIGHMEM;
 
 	for_each_possible_cpu(cpu) {
+		nid = cpu_to_node(cpu);
+		if (nid == NUMA_NO_NODE || !node_online(nid))
+			nid = numa_mem_id();
+
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
-			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+			*pagep = alloc_pages_node(nid, gfp, 0);
 			if (!*pagep)
 				goto err;
 		}
-- 
2.30.0



* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-08 20:23                         ` [PATCH v3] " Alexey Makhalov
@ 2021-11-09  2:08                           ` Eric Dumazet
  2021-11-09  7:03                             ` David Hildenbrand
  2021-11-09 17:15                             ` Michal Hocko
  0 siblings, 2 replies; 98+ messages in thread
From: Eric Dumazet @ 2021-11-09  2:08 UTC (permalink / raw)
  To: Alexey Makhalov, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Michal Hocko, Oscar Salvador,
	Dennis Zhou, Tejun Heo, Christoph Lameter, linux-kernel, stable



On 11/8/21 12:23 PM, Alexey Makhalov wrote:
> There is a kernel panic caused by pcpu_alloc_pages() passing an
> offline and uninitialized node to alloc_pages_node(), leading to a
> panic when the uninitialized (NULL) NODE_DATA(nid) is dereferenced.
> 
>  CPU2 has been hot-added
>  BUG: unable to handle page fault for address: 0000000000001608
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] SMP PTI
>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> 
>  RIP: 0010:__alloc_pages+0x127/0x290
>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>  Call Trace:
>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>   pcpu_populate_chunk+0x33/0xb0
>   pcpu_alloc+0x4d3/0x6f0
>   __alloc_percpu_gfp+0xd/0x10
>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>   mem_cgroup_alloc+0xed/0x2f0
>   mem_cgroup_css_alloc+0x33/0x2f0
>   css_create+0x3a/0x1f0
>   cgroup_apply_control_enable+0x12b/0x150
>   cgroup_mkdir+0xdd/0x110
>   kernfs_iop_mkdir+0x4f/0x80
>   vfs_mkdir+0x178/0x230
>   do_mkdirat+0xfd/0x120
>   __x64_sys_mkdir+0x47/0x70
>   ? syscall_exit_to_user_mode+0x21/0x50
>   do_syscall_64+0x43/0x90
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The panic can be easily reproduced by disabling the udev rule for
> automatic onlining of hot-added CPUs and then hot-adding a CPU with
> a memoryless node (a NUMA node with CPUs only).
> 
> Hot-adding a CPU and its memoryless node does not bring the node
> online. A memoryless node is onlined only when its CPU is onlined.
> 
> A node can be in one of the following states:
> 1. not present (nid == NUMA_NO_NODE)
> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> 				NODE_DATA(nid) == NULL)
> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> 				NODE_DATA(nid) != NULL)
> 
> Percpu code does allocations for all possible CPUs. The issue
> happens when it serves a hot-added but not yet onlined CPU whose
> node is in the 2nd state. Such a node is not ready for use, so
> fall back to numa_mem_id().
> 
> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  mm/percpu-vm.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213..f58d73c92 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>  			    gfp_t gfp)
>  {
>  	unsigned int cpu, tcpu;
> -	int i;
> +	int i, nid;
>  
>  	gfp |= __GFP_HIGHMEM;
>  
>  	for_each_possible_cpu(cpu) {
> +		nid = cpu_to_node(cpu);
> +		if (nid == NUMA_NO_NODE || !node_online(nid))
> +			nid = numa_mem_id();

Maybe we should fail this fallback if (gfp & __GFP_THISNODE) ?

Or maybe there is no support for this constraint in per-cpu allocator anyway.

I am a bit worried that we do not really know if pages are
allocated on the right node or not.

Some workloads could really be hurt if all per-cpu pages were
put on a single NUMA node.

> +
>  		for (i = page_start; i < page_end; i++) {
>  			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>  
> -			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +			*pagep = alloc_pages_node(nid, gfp, 0);
>  			if (!*pagep)
>  				goto err;
>  		}
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-09  2:08                           ` Eric Dumazet
@ 2021-11-09  7:03                             ` David Hildenbrand
  2021-11-09 16:55                               ` Eric Dumazet
  2021-11-09 17:15                             ` Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-11-09  7:03 UTC (permalink / raw)
  To: Eric Dumazet, Alexey Makhalov, linux-mm
  Cc: Andrew Morton, Michal Hocko, Oscar Salvador, Dennis Zhou,
	Tejun Heo, Christoph Lameter, linux-kernel, stable

On 09.11.21 03:08, Eric Dumazet wrote:
> 
> 
> On 11/8/21 12:23 PM, Alexey Makhalov wrote:
>> There is a kernel panic caused by pcpu_alloc_pages() passing
>> offlined and uninitialized node to alloc_pages_node() leading
>> to panic by NULL dereferencing uninitialized NODE_DATA(nid).
>>
>>  CPU2 has been hot-added
>>  BUG: unable to handle page fault for address: 0000000000001608
>>  #PF: supervisor read access in kernel mode
>>  #PF: error_code(0x0000) - not-present page
>>  PGD 0 P4D 0
>>  Oops: 0000 [#1] SMP PTI
>>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>
>>  RIP: 0010:__alloc_pages+0x127/0x290
>>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>>  Call Trace:
>>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>>   pcpu_populate_chunk+0x33/0xb0
>>   pcpu_alloc+0x4d3/0x6f0
>>   __alloc_percpu_gfp+0xd/0x10
>>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>>   mem_cgroup_alloc+0xed/0x2f0
>>   mem_cgroup_css_alloc+0x33/0x2f0
>>   css_create+0x3a/0x1f0
>>   cgroup_apply_control_enable+0x12b/0x150
>>   cgroup_mkdir+0xdd/0x110
>>   kernfs_iop_mkdir+0x4f/0x80
>>   vfs_mkdir+0x178/0x230
>>   do_mkdirat+0xfd/0x120
>>   __x64_sys_mkdir+0x47/0x70
>>   ? syscall_exit_to_user_mode+0x21/0x50
>>   do_syscall_64+0x43/0x90
>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> Panic can be easily reproduced by disabling udev rule for
>> automatic onlining hot added CPU followed by CPU with
>> memoryless node (NUMA node with CPU only) hot add.
>>
>> Hot adding CPU and memoryless node does not bring the node
>> to online state. Memoryless node will be onlined only during
>> the onlining its CPU.
>>
>> Node can be in one of the following states:
>> 1. not present.(nid == NUMA_NO_NODE)
>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>> 				NODE_DATA(nid) == NULL)
>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>> 				NODE_DATA(nid) != NULL)
>>
>> Percpu code is doing allocations for all possible CPUs. The
>> issue happens when it serves hot added but not yet onlined
>> CPU when its node is in 2nd state. This node is not ready
>> to use, fallback to numa_mem_id().
>>
>> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dennis Zhou <dennis@kernel.org>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: stable@vger.kernel.org
>> ---
>>  mm/percpu-vm.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
>> index 2054c9213..f58d73c92 100644
>> --- a/mm/percpu-vm.c
>> +++ b/mm/percpu-vm.c
>> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>>  			    gfp_t gfp)
>>  {
>>  	unsigned int cpu, tcpu;
>> -	int i;
>> +	int i, nid;
>>  
>>  	gfp |= __GFP_HIGHMEM;
>>  
>>  	for_each_possible_cpu(cpu) {
>> +		nid = cpu_to_node(cpu);
>> +		if (nid == NUMA_NO_NODE || !node_online(nid))
>> +			nid = numa_mem_id();
> 
> Maybe we should fail this fallback if (gfp & __GFP_THISNODE) ?

... and what to do then? Fail the allocation? We could do that, but ...

> 
> Or maybe there is no support for this constraint in per-cpu allocator anyway.
> 

... looking at mm/percpu.c, I don't think there are any users (IOW not
supported?).

> I am a bit worried that we do not really know if pages are
> allocated on the right node or not.

Even without __GFP_THISNODE it's sub-optimal. But if there is no memory
on that node, there is barely anything we can do other than fall back.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-09  7:03                             ` David Hildenbrand
@ 2021-11-09 16:55                               ` Eric Dumazet
  0 siblings, 0 replies; 98+ messages in thread
From: Eric Dumazet @ 2021-11-09 16:55 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Makhalov, linux-mm
  Cc: Andrew Morton, Michal Hocko, Oscar Salvador, Dennis Zhou,
	Tejun Heo, Christoph Lameter, linux-kernel, stable



On 11/8/21 11:03 PM, David Hildenbrand wrote:
> On 09.11.21 03:08, Eric Dumazet wrote:
>>
>>
>> On 11/8/21 12:23 PM, Alexey Makhalov wrote:
>>> There is a kernel panic caused by pcpu_alloc_pages() passing
>>> offlined and uninitialized node to alloc_pages_node() leading
>>> to panic by NULL dereferencing uninitialized NODE_DATA(nid).
>>>
>>>  CPU2 has been hot-added
>>>  BUG: unable to handle page fault for address: 0000000000001608
>>>  #PF: supervisor read access in kernel mode
>>>  #PF: error_code(0x0000) - not-present page
>>>  PGD 0 P4D 0
>>>  Oops: 0000 [#1] SMP PTI
>>>  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
>>>  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>>>
>>>  RIP: 0010:__alloc_pages+0x127/0x290
>>>  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
>>>  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
>>>  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
>>>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
>>>  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
>>>  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
>>>  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
>>>  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
>>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
>>>  Call Trace:
>>>   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>>>   pcpu_populate_chunk+0x33/0xb0
>>>   pcpu_alloc+0x4d3/0x6f0
>>>   __alloc_percpu_gfp+0xd/0x10
>>>   alloc_mem_cgroup_per_node_info+0x54/0xb0
>>>   mem_cgroup_alloc+0xed/0x2f0
>>>   mem_cgroup_css_alloc+0x33/0x2f0
>>>   css_create+0x3a/0x1f0
>>>   cgroup_apply_control_enable+0x12b/0x150
>>>   cgroup_mkdir+0xdd/0x110
>>>   kernfs_iop_mkdir+0x4f/0x80
>>>   vfs_mkdir+0x178/0x230
>>>   do_mkdirat+0xfd/0x120
>>>   __x64_sys_mkdir+0x47/0x70
>>>   ? syscall_exit_to_user_mode+0x21/0x50
>>>   do_syscall_64+0x43/0x90
>>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>
>>> Panic can be easily reproduced by disabling udev rule for
>>> automatic onlining hot added CPU followed by CPU with
>>> memoryless node (NUMA node with CPU only) hot add.
>>>
>>> Hot adding CPU and memoryless node does not bring the node
>>> to online state. Memoryless node will be onlined only during
>>> the onlining its CPU.
>>>
>>> Node can be in one of the following states:
>>> 1. not present.(nid == NUMA_NO_NODE)
>>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>> 				NODE_DATA(nid) == NULL)
>>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>> 				NODE_DATA(nid) != NULL)
>>>
>>> Percpu code is doing allocations for all possible CPUs. The
>>> issue happens when it serves hot added but not yet onlined
>>> CPU when its node is in 2nd state. This node is not ready
>>> to use, fallback to numa_mem_id().
>>>
>>> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Dennis Zhou <dennis@kernel.org>
>>> Cc: Tejun Heo <tj@kernel.org>
>>> Cc: Christoph Lameter <cl@linux.com>
>>> Cc: linux-mm@kvack.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: stable@vger.kernel.org
>>> ---
>>>  mm/percpu-vm.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
>>> index 2054c9213..f58d73c92 100644
>>> --- a/mm/percpu-vm.c
>>> +++ b/mm/percpu-vm.c
>>> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>>>  			    gfp_t gfp)
>>>  {
>>>  	unsigned int cpu, tcpu;
>>> -	int i;
>>> +	int i, nid;
>>>  
>>>  	gfp |= __GFP_HIGHMEM;
>>>  
>>>  	for_each_possible_cpu(cpu) {
>>> +		nid = cpu_to_node(cpu);
>>> +		if (nid == NUMA_NO_NODE || !node_online(nid))
>>> +			nid = numa_mem_id();
>>
>> Maybe we should fail this fallback if (gfp & __GFP_THISNODE) ?
> 
> ... and what to do then? Fail the allocation? We could do that, but ...
> 
>>
>> Or maybe there is no support for this constraint in per-cpu allocator anyway.
>>
> 
> ... looking at mm/percpu.c, I don't think there are any users (IOW not
> supported?).
> 
>> I am a bit worried that we do not really know if pages are
>> allocated on the right node or not.
> 
> Even without __GFP_THISNODE it's sub-optimal. But if there is no memory
> on that node, there is barely anything we can do than falling back.

Some users need fine-grained control.
They would prefer -ENOMEM instead of a fallback.

Usually, /proc/vmallocinfo tells us the number of allocated pages per node,
but it does not work (yet) with pcpu_get_vm_areas zones.

# grep alloc_large_system_hash /proc/vmallocinfo
0x00000000fb57af48-0x0000000084e058f0 134221824 alloc_large_system_hash+0x10f/0x2a0 pages=32768 vmalloc vpages N0=16384 N1=16384

# grep pcpu_get_vm_areas /proc/vmallocinfo 
0x000000009d7bd01f-0x000000002aa861cb 12582912 pcpu_get_vm_areas+0x0/0xa90 vmalloc
0x000000002aa861cb-0x0000000019fb1839 12582912 pcpu_get_vm_areas+0x0/0xa90 vmalloc
0x0000000019fb1839-0x00000000ba64fb09 12582912 pcpu_get_vm_areas+0x0/0xa90 vmalloc
0x00000000ba64fb09-0x00000000d688f04b 12582912 pcpu_get_vm_areas+0x0/0xa90 vmalloc
0x00000000d688f04b-0x0000000074e3854e 12582912 pcpu_get_vm_areas+0x0/0xa90 vmalloc
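
The per-node breakdown above can be read mechanically. What follows is a
minimal user-space sketch, assuming the " N<nid>=<pages>" token layout seen
in the alloc_large_system_hash line; parse_node_pages() and MAX_NODES are
hypothetical names made up for this illustration, not an existing tool or
kernel interface.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 64	/* arbitrary bound chosen for this sketch */

/*
 * Scan one /proc/vmallocinfo line for " N<nid>=<pages>" tokens and store
 * the page count of each node in counts[]. Returns the number of tokens
 * found, so 0 means the line carries no per-node information.
 */
static int parse_node_pages(const char *line, unsigned long counts[MAX_NODES])
{
	const char *p = line;
	int found = 0;

	memset(counts, 0, MAX_NODES * sizeof(counts[0]));

	while ((p = strstr(p, " N")) != NULL) {
		char *end;
		long nid;

		p += 2;			/* skip " N" */
		if (!isdigit((unsigned char)*p))
			continue;

		nid = strtol(p, &end, 10);
		if (*end != '=' || nid < 0 || nid >= MAX_NODES)
			continue;
		if (!isdigit((unsigned char)end[1]))
			continue;

		counts[nid] = strtoul(end + 1, &end, 10);
		found++;
		p = end;
	}

	return found;
}
```

Fed the alloc_large_system_hash line it finds two tokens (nodes 0 and 1),
while the pcpu_get_vm_areas lines yield none, which is the gap described
above.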

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-09  2:08                           ` Eric Dumazet
  2021-11-09  7:03                             ` David Hildenbrand
@ 2021-11-09 17:15                             ` Michal Hocko
  2021-11-09 19:06                               ` Dennis Zhou
  1 sibling, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-09 17:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Makhalov, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dennis Zhou, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Mon 08-11-21 18:08:52, Eric Dumazet wrote:
> 
> 
> On 11/8/21 12:23 PM, Alexey Makhalov wrote:
> > There is a kernel panic caused by pcpu_alloc_pages() passing
> > offlined and uninitialized node to alloc_pages_node() leading
> > to panic by NULL dereferencing uninitialized NODE_DATA(nid).
> > 
> >  CPU2 has been hot-added
> >  BUG: unable to handle page fault for address: 0000000000001608
> >  #PF: supervisor read access in kernel mode
> >  #PF: error_code(0x0000) - not-present page
> >  PGD 0 P4D 0
> >  Oops: 0000 [#1] SMP PTI
> >  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
> >  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> > 
> >  RIP: 0010:__alloc_pages+0x127/0x290
> >  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
> >  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
> >  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
> >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
> >  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
> >  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
> >  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
> >  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
> >  Call Trace:
> >   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
> >   pcpu_populate_chunk+0x33/0xb0
> >   pcpu_alloc+0x4d3/0x6f0
> >   __alloc_percpu_gfp+0xd/0x10
> >   alloc_mem_cgroup_per_node_info+0x54/0xb0
> >   mem_cgroup_alloc+0xed/0x2f0
> >   mem_cgroup_css_alloc+0x33/0x2f0
> >   css_create+0x3a/0x1f0
> >   cgroup_apply_control_enable+0x12b/0x150
> >   cgroup_mkdir+0xdd/0x110
> >   kernfs_iop_mkdir+0x4f/0x80
> >   vfs_mkdir+0x178/0x230
> >   do_mkdirat+0xfd/0x120
> >   __x64_sys_mkdir+0x47/0x70
> >   ? syscall_exit_to_user_mode+0x21/0x50
> >   do_syscall_64+0x43/0x90
> >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > Panic can be easily reproduced by disabling udev rule for
> > automatic onlining hot added CPU followed by CPU with
> > memoryless node (NUMA node with CPU only) hot add.
> > 
> > Hot adding CPU and memoryless node does not bring the node
> > to online state. Memoryless node will be onlined only during
> > the onlining its CPU.
> > 
> > Node can be in one of the following states:
> > 1. not present.(nid == NUMA_NO_NODE)
> > 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> > 				NODE_DATA(nid) == NULL)
> > 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> > 				NODE_DATA(nid) != NULL)
> > 
> > Percpu code is doing allocations for all possible CPUs. The
> > issue happens when it serves hot added but not yet onlined
> > CPU when its node is in 2nd state. This node is not ready
> > to use, fallback to numa_mem_id().
> > 
> > Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Dennis Zhou <dennis@kernel.org>
> > Cc: Tejun Heo <tj@kernel.org>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: stable@vger.kernel.org
> > ---
> >  mm/percpu-vm.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> > index 2054c9213..f58d73c92 100644
> > --- a/mm/percpu-vm.c
> > +++ b/mm/percpu-vm.c
> > @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> >  			    gfp_t gfp)
> >  {
> >  	unsigned int cpu, tcpu;
> > -	int i;
> > +	int i, nid;
> >  
> >  	gfp |= __GFP_HIGHMEM;
> >  
> >  	for_each_possible_cpu(cpu) {
> > +		nid = cpu_to_node(cpu);
> > +		if (nid == NUMA_NO_NODE || !node_online(nid))
> > +			nid = numa_mem_id();
> 
> Maybe we should fail this fallback if (gfp & __GFP_THISNODE) ?
> 
> Or maybe there is no support for this constraint in per-cpu allocator anyway.

I would be really curious about the use case. Not to mention that pcp
allocation would be effectively unusable on any setups with memoryless
nodes.

> I am a bit worried that we do not really know if pages are
> allocated on the right node or not.

There hasn't been any guarantee like that. The page allocator would fall
back to other nodes (in the node distance order) unless __GFP_THISNODE is
specified. This patch just papers over the fact that currently we can
end up having an invalid numa node associated with a cpu. This is a bug
in the initialization code. Even if that is fixed, the node fallback is
still a real thing that might happen.
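
The two policies discussed in this thread can be modeled in user space as
below. This is only a sketch under stated assumptions: every model_* and
MODEL_* name is hypothetical and merely stands in for the kernel symbol
named in its comment, and the allocator's own distance-ordered fallback is
not modeled at all.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NO_NODE	(-1)	/* stands in for NUMA_NO_NODE */
#define MODEL_THISNODE	0x1u	/* stands in for __GFP_THISNODE */
#define MODEL_MAX_NODES	4	/* arbitrary size for this sketch */

struct model_topology {
	bool online[MODEL_MAX_NODES];	/* stands in for node_online(nid) */
	int local_mem_node;		/* stands in for numa_mem_id() */
};

/*
 * Pick the node to allocate from for a CPU whose node is cpu_nid.
 * Returns a node id, or -ENOMEM when the caller insisted on the exact
 * node (MODEL_THISNODE) and that node is absent or offline.
 */
static int model_pick_node(const struct model_topology *t, int cpu_nid,
			   unsigned int flags)
{
	bool usable = cpu_nid != MODEL_NO_NODE &&
		      cpu_nid >= 0 && cpu_nid < MODEL_MAX_NODES &&
		      t->online[cpu_nid];

	if (usable)
		return cpu_nid;

	/* The stricter variant raised in review: fail instead of falling back. */
	if (flags & MODEL_THISNODE)
		return -ENOMEM;

	/* What the patch does: fall back to the caller's local memory node. */
	return t->local_mem_node;
}
```

With node 1 offline, a plain request for it is redirected to the local
memory node, while the same request with MODEL_THISNODE set fails with
-ENOMEM.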

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-09 17:15                             ` Michal Hocko
@ 2021-11-09 19:06                               ` Dennis Zhou
  2021-11-09 19:54                                 ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Dennis Zhou @ 2021-11-09 19:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Eric Dumazet, Alexey Makhalov, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Dennis Zhou, Tejun Heo,
	Christoph Lameter, linux-kernel, stable

Hello,

On Tue, Nov 09, 2021 at 06:15:33PM +0100, Michal Hocko wrote:
> On Mon 08-11-21 18:08:52, Eric Dumazet wrote:
> > 
> > 
> > On 11/8/21 12:23 PM, Alexey Makhalov wrote:
> > > There is a kernel panic caused by pcpu_alloc_pages() passing
> > > offlined and uninitialized node to alloc_pages_node() leading
> > > to panic by NULL dereferencing uninitialized NODE_DATA(nid).
> > > 
> > >  CPU2 has been hot-added
> > >  BUG: unable to handle page fault for address: 0000000000001608
> > >  #PF: supervisor read access in kernel mode
> > >  #PF: error_code(0x0000) - not-present page
> > >  PGD 0 P4D 0
> > >  Oops: 0000 [#1] SMP PTI
> > >  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
> > >  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> > > 
> > >  RIP: 0010:__alloc_pages+0x127/0x290
> > >  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
> > >  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
> > >  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
> > >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
> > >  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
> > >  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
> > >  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
> > >  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
> > >  Call Trace:
> > >   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
> > >   pcpu_populate_chunk+0x33/0xb0
> > >   pcpu_alloc+0x4d3/0x6f0
> > >   __alloc_percpu_gfp+0xd/0x10
> > >   alloc_mem_cgroup_per_node_info+0x54/0xb0
> > >   mem_cgroup_alloc+0xed/0x2f0
> > >   mem_cgroup_css_alloc+0x33/0x2f0
> > >   css_create+0x3a/0x1f0
> > >   cgroup_apply_control_enable+0x12b/0x150
> > >   cgroup_mkdir+0xdd/0x110
> > >   kernfs_iop_mkdir+0x4f/0x80
> > >   vfs_mkdir+0x178/0x230
> > >   do_mkdirat+0xfd/0x120
> > >   __x64_sys_mkdir+0x47/0x70
> > >   ? syscall_exit_to_user_mode+0x21/0x50
> > >   do_syscall_64+0x43/0x90
> > >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > 
> > > Panic can be easily reproduced by disabling udev rule for
> > > automatic onlining hot added CPU followed by CPU with
> > > memoryless node (NUMA node with CPU only) hot add.
> > > 
> > > Hot adding CPU and memoryless node does not bring the node
> > > to online state. Memoryless node will be onlined only during
> > > the onlining its CPU.
> > > 
> > > Node can be in one of the following states:
> > > 1. not present.(nid == NUMA_NO_NODE)
> > > 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> > > 				NODE_DATA(nid) == NULL)
> > > 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> > > 				NODE_DATA(nid) != NULL)
> > > 
> > > Percpu code is doing allocations for all possible CPUs. The
> > > issue happens when it serves hot added but not yet onlined
> > > CPU when its node is in 2nd state. This node is not ready
> > > to use, fallback to numa_mem_id().
> > > 
> > > Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> > > Reviewed-by: David Hildenbrand <david@redhat.com>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Oscar Salvador <osalvador@suse.de>
> > > Cc: Dennis Zhou <dennis@kernel.org>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: stable@vger.kernel.org
> > > ---
> > >  mm/percpu-vm.c | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> > > index 2054c9213..f58d73c92 100644
> > > --- a/mm/percpu-vm.c
> > > +++ b/mm/percpu-vm.c
> > > @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> > >  			    gfp_t gfp)
> > >  {
> > >  	unsigned int cpu, tcpu;
> > > -	int i;
> > > +	int i, nid;
> > >  
> > >  	gfp |= __GFP_HIGHMEM;
> > >  
> > >  	for_each_possible_cpu(cpu) {
> > > +		nid = cpu_to_node(cpu);
> > > +		if (nid == NUMA_NO_NODE || !node_online(nid))
> > > +			nid = numa_mem_id();
> > 
> > Maybe we should fail this fallback if (gfp & __GFP_THISNODE) ?
> > 
> > Or maybe there is no support for this constraint in per-cpu allocator anyway.
> 
> I would be really curious about the usecase. Not to mention that pcp
> allocation would be effectively unusable on any setups with memory less
> nodes.
> 

Sorry, I briefly saw this thread last week but was on jury duty and got
sequestered when the fix fell into percpu-vm.c.

I'm also not involved with any hotplug work, so forgive my limited
understanding.

I'm understanding this as a cpu/mem hotplug problem that we're papering
over with this fix. Given that, I should be looking to take this out
when the proper fix to the hotplug subsystem is added. Is that right?

> > I am a bit worried that we do not really know if pages are
> > allocated on the right node or not.
> 
> There hasn't been any guarantee like that. Page allocator would fallback
> to other nodes (in the node distance order) unless __GFP_THISNODE is
> specified. This patch just papers over the fact that currently we can
> end up having an invalid numa node associated with a cpu. This is a bug
> in the initialization code. Even if that is fixed the node fallback is
> still a real thing that might happen.
> 

Percpu has always allocated for_each_possible_cpu(). This means that even
before a cpu and its corresponding numa node are online, we're not
allocating on the right node anyway. But to me this just seems like a
straight up bug we're papering over as I said above for memoryless node
cpu hotplug.

To me, I don't see the importance of hotplug in situations where
performance is utmost. But it is not exactly ideal contract wise with
percpu. However, the trade off is really halting the system for a period
of time for any hotplug event to correctly add/free existing percpu
allocations and that doesn't seem great. I need to understand the
importance of hotplug and then we can figure out how we can fit that in
with the percpu allocator.

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-09 19:06                               ` Dennis Zhou
@ 2021-11-09 19:54                                 ` Michal Hocko
  2021-11-16  1:31                                   ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-09 19:54 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Eric Dumazet, Alexey Makhalov, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 09-11-21 14:06:14, Dennis Zhou wrote:
> Hello,
> 
> On Tue, Nov 09, 2021 at 06:15:33PM +0100, Michal Hocko wrote:
> > On Mon 08-11-21 18:08:52, Eric Dumazet wrote:
> > > 
> > > 
> > > On 11/8/21 12:23 PM, Alexey Makhalov wrote:
> > > > There is a kernel panic caused by pcpu_alloc_pages() passing
> > > > offlined and uninitialized node to alloc_pages_node() leading
> > > > to panic by NULL dereferencing uninitialized NODE_DATA(nid).
> > > > 
> > > >  CPU2 has been hot-added
> > > >  BUG: unable to handle page fault for address: 0000000000001608
> > > >  #PF: supervisor read access in kernel mode
> > > >  #PF: error_code(0x0000) - not-present page
> > > >  PGD 0 P4D 0
> > > >  Oops: 0000 [#1] SMP PTI
> > > >  CPU: 0 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7+ #11
> > > >  Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
> > > > 
> > > >  RIP: 0010:__alloc_pages+0x127/0x290
> > > >  Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
> > > >  RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
> > > >  RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
> > > >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
> > > >  RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
> > > >  R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
> > > >  R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
> > > >  FS:  00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
> > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >  CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
> > > >  Call Trace:
> > > >   pcpu_alloc_pages.constprop.0+0xe4/0x1c0
> > > >   pcpu_populate_chunk+0x33/0xb0
> > > >   pcpu_alloc+0x4d3/0x6f0
> > > >   __alloc_percpu_gfp+0xd/0x10
> > > >   alloc_mem_cgroup_per_node_info+0x54/0xb0
> > > >   mem_cgroup_alloc+0xed/0x2f0
> > > >   mem_cgroup_css_alloc+0x33/0x2f0
> > > >   css_create+0x3a/0x1f0
> > > >   cgroup_apply_control_enable+0x12b/0x150
> > > >   cgroup_mkdir+0xdd/0x110
> > > >   kernfs_iop_mkdir+0x4f/0x80
> > > >   vfs_mkdir+0x178/0x230
> > > >   do_mkdirat+0xfd/0x120
> > > >   __x64_sys_mkdir+0x47/0x70
> > > >   ? syscall_exit_to_user_mode+0x21/0x50
> > > >   do_syscall_64+0x43/0x90
> > > >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > 
> > > > Panic can be easily reproduced by disabling udev rule for
> > > > automatic onlining hot added CPU followed by CPU with
> > > > memoryless node (NUMA node with CPU only) hot add.
> > > > 
> > > > Hot adding CPU and memoryless node does not bring the node
> > > > to online state. Memoryless node will be onlined only during
> > > > the onlining its CPU.
> > > > 
> > > > Node can be in one of the following states:
> > > > 1. not present.(nid == NUMA_NO_NODE)
> > > > 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
> > > > 				NODE_DATA(nid) == NULL)
> > > > 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
> > > > 				NODE_DATA(nid) != NULL)
> > > > 
> > > > Percpu code is doing allocations for all possible CPUs. The
> > > > issue happens when it serves hot added but not yet onlined
> > > > CPU when its node is in 2nd state. This node is not ready
> > > > to use, fallback to numa_mem_id().
> > > > 
> > > > Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> > > > Reviewed-by: David Hildenbrand <david@redhat.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: David Hildenbrand <david@redhat.com>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > Cc: Dennis Zhou <dennis@kernel.org>
> > > > Cc: Tejun Heo <tj@kernel.org>
> > > > Cc: Christoph Lameter <cl@linux.com>
> > > > Cc: linux-mm@kvack.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Cc: stable@vger.kernel.org
> > > > ---
> > > >  mm/percpu-vm.c | 8 ++++++--
> > > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> > > > index 2054c9213..f58d73c92 100644
> > > > --- a/mm/percpu-vm.c
> > > > +++ b/mm/percpu-vm.c
> > > > @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
> > > >  			    gfp_t gfp)
> > > >  {
> > > >  	unsigned int cpu, tcpu;
> > > > -	int i;
> > > > +	int i, nid;
> > > >  
> > > >  	gfp |= __GFP_HIGHMEM;
> > > >  
> > > >  	for_each_possible_cpu(cpu) {
> > > > +		nid = cpu_to_node(cpu);
> > > > +		if (nid == NUMA_NO_NODE || !node_online(nid))
> > > > +			nid = numa_mem_id();
> > > 
> > > Maybe we should fail this fallback if (gfp & __GFP_THISNODE) ?
> > > 
> > > Or maybe there is no support for this constraint in per-cpu allocator anyway.
> > 
> > I would be really curious about the usecase. Not to mention that pcp
> > allocation would be effectively unusable on any setups with memory less
> > nodes.
> > 
> 
> Sorry, I briefly saw this thread last week but was on jury duty and got
> sequestered when the fix fell into percpu-vm.c.
> 
> I'm also not involved with any hotplug work, so forgive my limited
> understanding.
> 
> I'm understanding this as a cpu/mem hotplug problem that we're papering
> over with this fix. Given that, I should be looking to take this out
> when the proper fix to the hotplug subsystem is added. Is that right?

Yes.

> > > I am a bit worried that we do not really know if pages are
> > > allocated on the right node or not.
> > 
> > There hasn't been any guarantee like that. The page allocator would
> > fall back to other nodes (in node distance order) unless
> > __GFP_THISNODE is specified. This patch just papers over the fact that
> > currently we can end up having an invalid numa node associated with a
> > cpu. This is a bug in the initialization code. Even if that is fixed,
> > the node fallback is still a real thing that might happen.
> > 
> 
> Percpu has always allocated for_each_possible_cpu(). This means that even
> before a cpu and its corresponding numa node come online, we're not
> allocating on the right node anyway. But to me this just seems like a
> straight up bug we're papering over, as I said above, for memoryless node
> cpu hotplug.

Agreed. As mentioned elsewhere in the thread, cpu_to_node(), and likewise
cpu_to_mem(), shouldn't return garbage.
-- 
Michal Hocko
SUSE Labs
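
For readers following the quoted hunk, here is a minimal userspace sketch of
the node selection it adds to pcpu_alloc_pages(). It is not kernel code: the
names sketch_node_online() and sketch_pick_node(), the MAX_NODES bound, and
the plain bool array standing in for the kernel's node state bitmap are all
assumptions made for illustration. The local_nid argument stands in for the
value numa_mem_id() would return.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)   /* same value the kernel uses */
#define MAX_NODES    128    /* hypothetical bound for this sketch */

/* Hypothetical stand-in for node_online(): bounds-checked array lookup. */
bool sketch_node_online(const bool online[MAX_NODES], int nid)
{
	return nid >= 0 && nid < MAX_NODES && online[nid];
}

/*
 * Mirrors the condition in the quoted hunk: keep the CPU's own node
 * unless it is unset or not online, otherwise fall back to local_nid
 * (the stand-in for numa_mem_id()).
 */
int sketch_pick_node(const bool online[MAX_NODES], int cpu_nid, int local_nid)
{
	/* The fallback node itself must be usable. */
	assert(sketch_node_online(online, local_nid));

	if (cpu_nid == NUMA_NO_NODE || !sketch_node_online(online, cpu_nid))
		return local_nid;
	return cpu_nid;
}
```

With nodes 0-3 online, a CPU mapped to node 2 keeps node 2, while a
hot-added but not yet onlined CPU mapped to node 4 is served from the
caller's local node instead. As discussed above, this only papers over the
uninitialized node; it does not initialize it.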

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-09 19:54                                 ` Michal Hocko
@ 2021-11-16  1:31                                   ` Alexey Makhalov
  2021-11-16  9:17                                     ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-16  1:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

Hi everyone,

Thanks for your attention to this issue.
I’m providing more information as requested by Michal to reproduce the issue.

The issue was reproduced on ESXi 7.0.2 with Ubuntu 20.04 and Photon 3.0/4.0 guest OSes.
It was confirmed that master (5.15.0-rc7) and the 3 latest LTS kernels (4.19, 5.4, 5.10) are
affected.

To reproduce the issue, the VM should be configured to have:
- one CPU per socket (cpuid.corespersocket = "1")
- one CPU per NUMA node (numa.vcpu.maxPerVirtualNode = "1")
- vCPU hot add enabled (vcpu.hotadd = "TRUE")
- NUMA hot add enabled (numa.allowHotadd = "TRUE")
- Memory hot add disabled (mem.hotadd = "FALSE")

This configuration allows the next possible NUMA node to be hot added together with the CPU.
Keeping the newly added CPU offline leaves the new NUMA node in an uninitialized state (as
opposed to memory hot add).

VM was powered on with 4 vCPUs (4 NUMA nodes) and 4GB memory.
ACPI SRAT reports 128 possible CPUs and 128 possible NUMA nodes.
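
The count can be checked against the "SRAT: PXM n -> APIC 0x.. -> Node n"
lines in the dmesg output below. The following is a minimal sketch of a
parser for that line format, assuming the lines look exactly as they do in
this log. The helper names parse_srat_cpu_line() and count_srat_cpu_lines()
are hypothetical and not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one dmesg line such as
 *   "[    0.085249] SRAT: PXM 4 -> APIC 0x08 -> Node 4"
 * Returns 1 and fills pxm/apic/node on a match, 0 otherwise.
 * Memory affinity lines ("ACPI: SRAT: Node 0 PXM 0 [mem ...]") do not
 * match, because "SRAT: " is followed by "Node" there, not "PXM".
 */
int parse_srat_cpu_line(const char *line, int *pxm, unsigned int *apic, int *node)
{
	const char *p;

	assert(line && pxm && apic && node);

	p = strstr(line, "SRAT: PXM ");
	if (!p)
		return 0;
	return sscanf(p, "SRAT: PXM %d -> APIC 0x%x -> Node %d",
		      pxm, apic, node) == 3;
}

/* Count how many of the n given lines are CPU affinity lines. */
size_t count_srat_cpu_lines(const char *const *lines, size_t n)
{
	size_t i, count = 0;
	int pxm, node;
	unsigned int apic;

	for (i = 0; i < n; i++)
		if (parse_srat_cpu_line(lines[i], &pxm, &apic, &node))
			count++;
	return count;
}
```

Feeding the boot log line by line to count_srat_cpu_lines() should give the
number of CPU-to-node mappings announced by the firmware, which can then be
compared with the nr_cpu_ids and nr_node_ids values printed by setup_percpu.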

CPU hot add was performed from the ESXi web UI by increasing the number of vCPUs to 5.

The panic was triggered by restarting a systemd service such as sshd.

See the 5 attached outputs below, covering the state before and after the hot add event.
I’m happy to provide more data if needed.

Thanks,
—Alexey


1. dmesg output before CPU hot add.
# dmesg
[    0.000000] Linux version 5.15.0-rc7-panic+ (root@photon-576f8974caf.org) (gcc (GCC) 10.2.0, GNU ld (GNU Binutils) 2.35) #23 SMP Mon Nov 15 22:54:15 UTC 2021
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-panic root=PARTUUID=65c7e33b-adc7-466f-b763-fbc070600c9d init=/lib/systemd/systemd ro loglevel=3 quiet net.ifnames=0 plymouth.enable=0 systemd.legacy_systemd_cgroup_controller=yes nokaslr
[    0.000000] Disabled fast string operations
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] signal: max sigframe size: 1776
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000c0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000000efaafff] usable
[    0.000000] BIOS-e820: [mem 0x000000000efab000-0x000000000efaefff] reserved
[    0.000000] BIOS-e820: [mem 0x000000000efaf000-0x000000000efbcfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000efbd000-0x000000000efc1fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000000efc2000-0x000000000efc6fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000000efc7000-0x000000000fee5fff] usable
[    0.000000] BIOS-e820: [mem 0x000000000fee6000-0x000000000ff55fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000000ff56000-0x000000000ff71fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000000ff72000-0x000000000ff75fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000000ff76000-0x00000000bfffffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000ffc00000-0x00000000ffc29fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] e820: update [mem 0x0e562018-0x0e56a057] usable ==> usable
[    0.000000] e820: update [mem 0x0e562018-0x0e56a057] usable ==> usable
[    0.000000] e820: update [mem 0x0e55f018-0x0e561057] usable ==> usable
[    0.000000] e820: update [mem 0x0e55f018-0x0e561057] usable ==> usable
[    0.000000] e820: update [mem 0x0e55d018-0x0e55e857] usable ==> usable
[    0.000000] e820: update [mem 0x0e55d018-0x0e55e857] usable ==> usable
[    0.000000] extended physical RAM map:
[    0.000000] reserve setup_data: [mem 0x0000000000000000-0x0000000000000fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000000001000-0x000000000009ffff] usable
[    0.000000] reserve setup_data: [mem 0x00000000000c0000-0x00000000000fffff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000e55d017] usable
[    0.000000] reserve setup_data: [mem 0x000000000e55d018-0x000000000e55e857] usable
[    0.000000] reserve setup_data: [mem 0x000000000e55e858-0x000000000e55f017] usable
[    0.000000] reserve setup_data: [mem 0x000000000e55f018-0x000000000e561057] usable
[    0.000000] reserve setup_data: [mem 0x000000000e561058-0x000000000e562017] usable
[    0.000000] reserve setup_data: [mem 0x000000000e562018-0x000000000e56a057] usable
[    0.000000] reserve setup_data: [mem 0x000000000e56a058-0x000000000efaafff] usable
[    0.000000] reserve setup_data: [mem 0x000000000efab000-0x000000000efaefff] reserved
[    0.000000] reserve setup_data: [mem 0x000000000efaf000-0x000000000efbcfff] usable
[    0.000000] reserve setup_data: [mem 0x000000000efbd000-0x000000000efc1fff] reserved
[    0.000000] reserve setup_data: [mem 0x000000000efc2000-0x000000000efc6fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x000000000efc7000-0x000000000fee5fff] usable
[    0.000000] reserve setup_data: [mem 0x000000000fee6000-0x000000000ff55fff] reserved
[    0.000000] reserve setup_data: [mem 0x000000000ff56000-0x000000000ff71fff] ACPI data
[    0.000000] reserve setup_data: [mem 0x000000000ff72000-0x000000000ff75fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x000000000ff76000-0x00000000bfffffff] usable
[    0.000000] reserve setup_data: [mem 0x00000000ffc00000-0x00000000ffc29fff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000013fffffff] usable
[    0.000000] efi: EFI v2.40 by VMware, Inc.
[    0.000000] efi: SMBIOS=0xefc2000 ACPI 2.0=0xff5c000 MEMATTR=0xfcb6b98
[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18452719.B64.2108091906 08/09/2021
[    0.000000] vmware: hypercall mode: 0x02
[    0.000000] Hypervisor detected: VMware
[    0.000000] vmware: TSC freq read from hypervisor : 1899.999 MHz
[    0.000000] vmware: Host bus clock speed read from hypervisor : 66000000 Hz
[    0.000000] vmware: using clock offset of 7034265229 ns
[    0.000024] tsc: Detected 1899.999 MHz processor
[    0.063001] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.063007] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.063015] last_pfn = 0x140000 max_arch_pfn = 0x400000000
[    0.063113] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
[    0.063168] total RAM covered: 7168M
[    0.063360] Found optimal setting for mtrr clean up
[    0.063361]  gran_size: 64K 	chunk_size: 64K 	num_reg: 3  	lose cover RAM: 0G
[    0.063505] e820: update [mem 0xc0000000-0xffffffff] usable ==> reserved
[    0.063513] last_pfn = 0xc0000 max_arch_pfn = 0x400000000
[    0.084987] Secure boot disabled
[    0.084994] ACPI: Early table checksum verification disabled
[    0.085000] ACPI: RSDP 0x000000000FF5C000 000024 (v02 VMWARE)
[    0.085009] ACPI: XSDT 0x000000000FF5C064 00005C (v01 INTEL  440BX    06040000 VMW  01324272)
[    0.085022] ACPI: SRAT 0x000000000FF5C0C0 0008F8 (v03 VMWARE EFISRAT  06040001 VMW  000007CE)
[    0.085031] ACPI: FACP 0x000000000FF70DBB 0000F4 (v04 INTEL  440BX    06040000 PTL  000F4240)
[    0.085043] ACPI: DSDT 0x000000000FF5C9B8 014403 (v01 PTLTD  Custom   00000000 INTL 20130823)
[    0.085050] ACPI: FACS 0x000000000FF75000 000040
[    0.085057] ACPI: FACS 0x000000000FF75000 000040
[    0.085063] ACPI: APIC 0x000000000FF71000 000742 (v03 VMWARE EFIAPIC  06040001 VMW  000007CE)
[    0.085070] ACPI: MCFG 0x000000000FF71742 00003C (v01 VMWARE EFIMCFG  06040001 VMW  000007CE)
[    0.085077] ACPI: HPET 0x000000000FF7177E 000038 (v01 VMWARE VMW HPET 00000000 VMW  00000000)
[    0.085085] ACPI: WAET 0x000000000FF717B6 000028 (v01 VMWARE VMW WAET 06040001 VMW  00000001)
[    0.085092] ACPI: WSMT 0x000000000FF717DE 000028 (v01 VMWARE VMW WSMT 06040001 VMW  00000001)
[    0.085098] ACPI: Reserving SRAT table memory at [mem 0xff5c0c0-0xff5c9b7]
[    0.085102] ACPI: Reserving FACP table memory at [mem 0xff70dbb-0xff70eae]
[    0.085105] ACPI: Reserving DSDT table memory at [mem 0xff5c9b8-0xff70dba]
[    0.085107] ACPI: Reserving FACS table memory at [mem 0xff75000-0xff7503f]
[    0.085109] ACPI: Reserving FACS table memory at [mem 0xff75000-0xff7503f]
[    0.085111] ACPI: Reserving APIC table memory at [mem 0xff71000-0xff71741]
[    0.085114] ACPI: Reserving MCFG table memory at [mem 0xff71742-0xff7177d]
[    0.085116] ACPI: Reserving HPET table memory at [mem 0xff7177e-0xff717b5]
[    0.085118] ACPI: Reserving WAET table memory at [mem 0xff717b6-0xff717dd]
[    0.085121] ACPI: Reserving WSMT table memory at [mem 0xff717de-0xff71805]
[    0.085202] system APIC only can use physical flat
[    0.085204] Setting APIC routing to physical flat.
[    0.085239] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.085243] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[    0.085245] SRAT: PXM 2 -> APIC 0x04 -> Node 2
[    0.085247] SRAT: PXM 3 -> APIC 0x06 -> Node 3
[    0.085249] SRAT: PXM 4 -> APIC 0x08 -> Node 4
[    0.085251] SRAT: PXM 5 -> APIC 0x0a -> Node 5
[    0.085253] SRAT: PXM 6 -> APIC 0x0c -> Node 6
[    0.085255] SRAT: PXM 7 -> APIC 0x0e -> Node 7
[    0.085257] SRAT: PXM 8 -> APIC 0x10 -> Node 8
[    0.085259] SRAT: PXM 9 -> APIC 0x12 -> Node 9
[    0.085261] SRAT: PXM 10 -> APIC 0x14 -> Node 10
[    0.085264] SRAT: PXM 11 -> APIC 0x16 -> Node 11
[    0.085266] SRAT: PXM 12 -> APIC 0x18 -> Node 12
[    0.085268] SRAT: PXM 13 -> APIC 0x1a -> Node 13
[    0.085270] SRAT: PXM 14 -> APIC 0x1c -> Node 14
[    0.085272] SRAT: PXM 15 -> APIC 0x1e -> Node 15
[    0.085274] SRAT: PXM 16 -> APIC 0x20 -> Node 16
[    0.085276] SRAT: PXM 17 -> APIC 0x22 -> Node 17
[    0.085278] SRAT: PXM 18 -> APIC 0x24 -> Node 18
[    0.085280] SRAT: PXM 19 -> APIC 0x26 -> Node 19
[    0.085282] SRAT: PXM 20 -> APIC 0x28 -> Node 20
[    0.085284] SRAT: PXM 21 -> APIC 0x2a -> Node 21
[    0.085286] SRAT: PXM 22 -> APIC 0x2c -> Node 22
[    0.085288] SRAT: PXM 23 -> APIC 0x2e -> Node 23
[    0.085290] SRAT: PXM 24 -> APIC 0x30 -> Node 24
[    0.085292] SRAT: PXM 25 -> APIC 0x32 -> Node 25
[    0.085294] SRAT: PXM 26 -> APIC 0x34 -> Node 26
[    0.085296] SRAT: PXM 27 -> APIC 0x36 -> Node 27
[    0.085298] SRAT: PXM 28 -> APIC 0x38 -> Node 28
[    0.085300] SRAT: PXM 29 -> APIC 0x3a -> Node 29
[    0.085302] SRAT: PXM 30 -> APIC 0x3c -> Node 30
[    0.085304] SRAT: PXM 31 -> APIC 0x3e -> Node 31
[    0.085306] SRAT: PXM 32 -> APIC 0x40 -> Node 32
[    0.085308] SRAT: PXM 33 -> APIC 0x42 -> Node 33
[    0.085310] SRAT: PXM 34 -> APIC 0x44 -> Node 34
[    0.085312] SRAT: PXM 35 -> APIC 0x46 -> Node 35
[    0.085314] SRAT: PXM 36 -> APIC 0x48 -> Node 36
[    0.085316] SRAT: PXM 37 -> APIC 0x4a -> Node 37
[    0.085318] SRAT: PXM 38 -> APIC 0x4c -> Node 38
[    0.085320] SRAT: PXM 39 -> APIC 0x4e -> Node 39
[    0.085322] SRAT: PXM 40 -> APIC 0x50 -> Node 40
[    0.085324] SRAT: PXM 41 -> APIC 0x52 -> Node 41
[    0.085326] SRAT: PXM 42 -> APIC 0x54 -> Node 42
[    0.085328] SRAT: PXM 43 -> APIC 0x56 -> Node 43
[    0.085330] SRAT: PXM 44 -> APIC 0x58 -> Node 44
[    0.085332] SRAT: PXM 45 -> APIC 0x5a -> Node 45
[    0.085334] SRAT: PXM 46 -> APIC 0x5c -> Node 46
[    0.085336] SRAT: PXM 47 -> APIC 0x5e -> Node 47
[    0.085339] SRAT: PXM 48 -> APIC 0x60 -> Node 48
[    0.085341] SRAT: PXM 49 -> APIC 0x62 -> Node 49
[    0.085343] SRAT: PXM 50 -> APIC 0x64 -> Node 50
[    0.085345] SRAT: PXM 51 -> APIC 0x66 -> Node 51
[    0.085347] SRAT: PXM 52 -> APIC 0x68 -> Node 52
[    0.085349] SRAT: PXM 53 -> APIC 0x6a -> Node 53
[    0.085351] SRAT: PXM 54 -> APIC 0x6c -> Node 54
[    0.085353] SRAT: PXM 55 -> APIC 0x6e -> Node 55
[    0.085355] SRAT: PXM 56 -> APIC 0x70 -> Node 56
[    0.085357] SRAT: PXM 57 -> APIC 0x72 -> Node 57
[    0.085359] SRAT: PXM 58 -> APIC 0x74 -> Node 58
[    0.085361] SRAT: PXM 59 -> APIC 0x76 -> Node 59
[    0.085363] SRAT: PXM 60 -> APIC 0x78 -> Node 60
[    0.085365] SRAT: PXM 61 -> APIC 0x7a -> Node 61
[    0.085367] SRAT: PXM 62 -> APIC 0x7c -> Node 62
[    0.085369] SRAT: PXM 63 -> APIC 0x7e -> Node 63
[    0.085371] SRAT: PXM 64 -> APIC 0x80 -> Node 64
[    0.085373] SRAT: PXM 65 -> APIC 0x82 -> Node 65
[    0.085375] SRAT: PXM 66 -> APIC 0x84 -> Node 66
[    0.085377] SRAT: PXM 67 -> APIC 0x86 -> Node 67
[    0.085379] SRAT: PXM 68 -> APIC 0x88 -> Node 68
[    0.085381] SRAT: PXM 69 -> APIC 0x8a -> Node 69
[    0.085383] SRAT: PXM 70 -> APIC 0x8c -> Node 70
[    0.085385] SRAT: PXM 71 -> APIC 0x8e -> Node 71
[    0.085387] SRAT: PXM 72 -> APIC 0x90 -> Node 72
[    0.085389] SRAT: PXM 73 -> APIC 0x92 -> Node 73
[    0.085391] SRAT: PXM 74 -> APIC 0x94 -> Node 74
[    0.085393] SRAT: PXM 75 -> APIC 0x96 -> Node 75
[    0.085395] SRAT: PXM 76 -> APIC 0x98 -> Node 76
[    0.085397] SRAT: PXM 77 -> APIC 0x9a -> Node 77
[    0.085399] SRAT: PXM 78 -> APIC 0x9c -> Node 78
[    0.085401] SRAT: PXM 79 -> APIC 0x9e -> Node 79
[    0.085403] SRAT: PXM 80 -> APIC 0xa0 -> Node 80
[    0.085405] SRAT: PXM 81 -> APIC 0xa2 -> Node 81
[    0.085407] SRAT: PXM 82 -> APIC 0xa4 -> Node 82
[    0.085409] SRAT: PXM 83 -> APIC 0xa6 -> Node 83
[    0.085411] SRAT: PXM 84 -> APIC 0xa8 -> Node 84
[    0.085413] SRAT: PXM 85 -> APIC 0xaa -> Node 85
[    0.085415] SRAT: PXM 86 -> APIC 0xac -> Node 86
[    0.085417] SRAT: PXM 87 -> APIC 0xae -> Node 87
[    0.085419] SRAT: PXM 88 -> APIC 0xb0 -> Node 88
[    0.085421] SRAT: PXM 89 -> APIC 0xb2 -> Node 89
[    0.085423] SRAT: PXM 90 -> APIC 0xb4 -> Node 90
[    0.085425] SRAT: PXM 91 -> APIC 0xb6 -> Node 91
[    0.085427] SRAT: PXM 92 -> APIC 0xb8 -> Node 92
[    0.085429] SRAT: PXM 93 -> APIC 0xba -> Node 93
[    0.085431] SRAT: PXM 94 -> APIC 0xbc -> Node 94
[    0.085433] SRAT: PXM 95 -> APIC 0xbe -> Node 95
[    0.085435] SRAT: PXM 96 -> APIC 0xc0 -> Node 96
[    0.085437] SRAT: PXM 97 -> APIC 0xc2 -> Node 97
[    0.085439] SRAT: PXM 98 -> APIC 0xc4 -> Node 98
[    0.085441] SRAT: PXM 99 -> APIC 0xc6 -> Node 99
[    0.085443] SRAT: PXM 100 -> APIC 0xc8 -> Node 100
[    0.085446] SRAT: PXM 101 -> APIC 0xca -> Node 101
[    0.085448] SRAT: PXM 102 -> APIC 0xcc -> Node 102
[    0.085450] SRAT: PXM 103 -> APIC 0xce -> Node 103
[    0.085452] SRAT: PXM 104 -> APIC 0xd0 -> Node 104
[    0.085454] SRAT: PXM 105 -> APIC 0xd2 -> Node 105
[    0.085456] SRAT: PXM 106 -> APIC 0xd4 -> Node 106
[    0.085458] SRAT: PXM 107 -> APIC 0xd6 -> Node 107
[    0.085460] SRAT: PXM 108 -> APIC 0xd8 -> Node 108
[    0.085462] SRAT: PXM 109 -> APIC 0xda -> Node 109
[    0.085464] SRAT: PXM 110 -> APIC 0xdc -> Node 110
[    0.085466] SRAT: PXM 111 -> APIC 0xde -> Node 111
[    0.085468] SRAT: PXM 112 -> APIC 0xe0 -> Node 112
[    0.085471] SRAT: PXM 113 -> APIC 0xe2 -> Node 113
[    0.085475] SRAT: PXM 114 -> APIC 0xe4 -> Node 114
[    0.085479] SRAT: PXM 115 -> APIC 0xe6 -> Node 115
[    0.085483] SRAT: PXM 116 -> APIC 0xe8 -> Node 116
[    0.085486] SRAT: PXM 117 -> APIC 0xea -> Node 117
[    0.085490] SRAT: PXM 118 -> APIC 0xec -> Node 118
[    0.085494] SRAT: PXM 119 -> APIC 0xee -> Node 119
[    0.085498] SRAT: PXM 120 -> APIC 0xf0 -> Node 120
[    0.085502] SRAT: PXM 121 -> APIC 0xf2 -> Node 121
[    0.085505] SRAT: PXM 122 -> APIC 0xf4 -> Node 122
[    0.085509] SRAT: PXM 123 -> APIC 0xf6 -> Node 123
[    0.085513] SRAT: PXM 124 -> APIC 0xf8 -> Node 124
[    0.085517] SRAT: PXM 125 -> APIC 0xfa -> Node 125
[    0.085520] SRAT: PXM 126 -> APIC 0xfc -> Node 126
[    0.085524] SRAT: PXM 127 -> APIC 0xfe -> Node 127
[    0.085534] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.085542] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
[    0.085547] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
[    0.085552] ACPI: SRAT: Node 2 PXM 2 [mem 0x80000000-0xbfffffff]
[    0.085557] ACPI: SRAT: Node 3 PXM 3 [mem 0x100000000-0x13fffffff]
[    0.085564] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
[    0.085602] NODE_DATA(0) allocated [mem 0x3ffde000-0x3fffffff]
[    0.086104] NODE_DATA(1) allocated [mem 0x7ffde000-0x7fffffff]
[    0.086593] NODE_DATA(2) allocated [mem 0xbffde000-0xbfffffff]
[    0.086621] NODE_DATA(3) allocated [mem 0x13ffdd000-0x13fffefff]
[    0.086682] Zone ranges:
[    0.086683]   DMA32    [mem 0x0000000000001000-0x00000000ffffffff]
[    0.086688]   Normal   [mem 0x0000000100000000-0x000000013fffffff]
[    0.086692]   Device   empty
[    0.086694] Movable zone start for each node
[    0.086702] Early memory node ranges
[    0.086703]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.086707]   node   0: [mem 0x0000000000100000-0x000000000efaafff]
[    0.086710]   node   0: [mem 0x000000000efaf000-0x000000000efbcfff]
[    0.086712]   node   0: [mem 0x000000000efc7000-0x000000000fee5fff]
[    0.086714]   node   0: [mem 0x000000000ff76000-0x000000003fffffff]
[    0.086717]   node   1: [mem 0x0000000040000000-0x000000007fffffff]
[    0.086719]   node   2: [mem 0x0000000080000000-0x00000000bfffffff]
[    0.086721]   node   3: [mem 0x0000000100000000-0x000000013fffffff]
[    0.086725] Initmem setup node 0 [mem 0x0000000000001000-0x000000003fffffff]
[    0.086731] Initmem setup node 1 [mem 0x0000000040000000-0x000000007fffffff]
[    0.086735] Initmem setup node 2 [mem 0x0000000080000000-0x00000000bfffffff]
[    0.086739] Initmem setup node 3 [mem 0x0000000100000000-0x000000013fffffff]
[    0.087216] On node 0, zone DMA32: 1 pages in unavailable ranges
[    0.088836] On node 0, zone DMA32: 96 pages in unavailable ranges
[    0.088842] On node 0, zone DMA32: 4 pages in unavailable ranges
[    0.088919] On node 0, zone DMA32: 10 pages in unavailable ranges
[    0.095385] On node 0, zone DMA32: 144 pages in unavailable ranges
[    0.118432] ACPI: PM-Timer IO Port: 0x448
[    0.118460] system APIC only can use physical flat
[    0.118486] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
[    0.118491] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[    0.118493] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
[    0.118495] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
[    0.118497] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
[    0.118499] ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
[    0.118501] ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
[    0.118503] ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
[    0.118505] ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
[    0.118507] ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
[    0.118509] ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
[    0.118511] ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
[    0.118513] ACPI: LAPIC_NMI (acpi_id[0x0c] high edge lint[0x1])
[    0.118514] ACPI: LAPIC_NMI (acpi_id[0x0d] high edge lint[0x1])
[    0.118516] ACPI: LAPIC_NMI (acpi_id[0x0e] high edge lint[0x1])
[    0.118518] ACPI: LAPIC_NMI (acpi_id[0x0f] high edge lint[0x1])
[    0.118520] ACPI: LAPIC_NMI (acpi_id[0x10] high edge lint[0x1])
[    0.118523] ACPI: LAPIC_NMI (acpi_id[0x11] high edge lint[0x1])
[    0.118525] ACPI: LAPIC_NMI (acpi_id[0x12] high edge lint[0x1])
[    0.118526] ACPI: LAPIC_NMI (acpi_id[0x13] high edge lint[0x1])
[    0.118528] ACPI: LAPIC_NMI (acpi_id[0x14] high edge lint[0x1])
[    0.118530] ACPI: LAPIC_NMI (acpi_id[0x15] high edge lint[0x1])
[    0.118532] ACPI: LAPIC_NMI (acpi_id[0x16] high edge lint[0x1])
[    0.118534] ACPI: LAPIC_NMI (acpi_id[0x17] high edge lint[0x1])
[    0.118536] ACPI: LAPIC_NMI (acpi_id[0x18] high edge lint[0x1])
[    0.118538] ACPI: LAPIC_NMI (acpi_id[0x19] high edge lint[0x1])
[    0.118540] ACPI: LAPIC_NMI (acpi_id[0x1a] high edge lint[0x1])
[    0.118542] ACPI: LAPIC_NMI (acpi_id[0x1b] high edge lint[0x1])
[    0.118544] ACPI: LAPIC_NMI (acpi_id[0x1c] high edge lint[0x1])
[    0.118546] ACPI: LAPIC_NMI (acpi_id[0x1d] high edge lint[0x1])
[    0.118548] ACPI: LAPIC_NMI (acpi_id[0x1e] high edge lint[0x1])
[    0.118549] ACPI: LAPIC_NMI (acpi_id[0x1f] high edge lint[0x1])
[    0.118551] ACPI: LAPIC_NMI (acpi_id[0x20] high edge lint[0x1])
[    0.118553] ACPI: LAPIC_NMI (acpi_id[0x21] high edge lint[0x1])
[    0.118555] ACPI: LAPIC_NMI (acpi_id[0x22] high edge lint[0x1])
[    0.118557] ACPI: LAPIC_NMI (acpi_id[0x23] high edge lint[0x1])
[    0.118559] ACPI: LAPIC_NMI (acpi_id[0x24] high edge lint[0x1])
[    0.118561] ACPI: LAPIC_NMI (acpi_id[0x25] high edge lint[0x1])
[    0.118563] ACPI: LAPIC_NMI (acpi_id[0x26] high edge lint[0x1])
[    0.118565] ACPI: LAPIC_NMI (acpi_id[0x27] high edge lint[0x1])
[    0.118567] ACPI: LAPIC_NMI (acpi_id[0x28] high edge lint[0x1])
[    0.118568] ACPI: LAPIC_NMI (acpi_id[0x29] high edge lint[0x1])
[    0.118570] ACPI: LAPIC_NMI (acpi_id[0x2a] high edge lint[0x1])
[    0.118572] ACPI: LAPIC_NMI (acpi_id[0x2b] high edge lint[0x1])
[    0.118574] ACPI: LAPIC_NMI (acpi_id[0x2c] high edge lint[0x1])
[    0.118576] ACPI: LAPIC_NMI (acpi_id[0x2d] high edge lint[0x1])
[    0.118578] ACPI: LAPIC_NMI (acpi_id[0x2e] high edge lint[0x1])
[    0.118580] ACPI: LAPIC_NMI (acpi_id[0x2f] high edge lint[0x1])
[    0.118582] ACPI: LAPIC_NMI (acpi_id[0x30] high edge lint[0x1])
[    0.118584] ACPI: LAPIC_NMI (acpi_id[0x31] high edge lint[0x1])
[    0.118586] ACPI: LAPIC_NMI (acpi_id[0x32] high edge lint[0x1])
[    0.118588] ACPI: LAPIC_NMI (acpi_id[0x33] high edge lint[0x1])
[    0.118590] ACPI: LAPIC_NMI (acpi_id[0x34] high edge lint[0x1])
[    0.118592] ACPI: LAPIC_NMI (acpi_id[0x35] high edge lint[0x1])
[    0.118594] ACPI: LAPIC_NMI (acpi_id[0x36] high edge lint[0x1])
[    0.118596] ACPI: LAPIC_NMI (acpi_id[0x37] high edge lint[0x1])
[    0.118597] ACPI: LAPIC_NMI (acpi_id[0x38] high edge lint[0x1])
[    0.118599] ACPI: LAPIC_NMI (acpi_id[0x39] high edge lint[0x1])
[    0.118601] ACPI: LAPIC_NMI (acpi_id[0x3a] high edge lint[0x1])
[    0.118603] ACPI: LAPIC_NMI (acpi_id[0x3b] high edge lint[0x1])
[    0.118605] ACPI: LAPIC_NMI (acpi_id[0x3c] high edge lint[0x1])
[    0.118607] ACPI: LAPIC_NMI (acpi_id[0x3d] high edge lint[0x1])
[    0.118609] ACPI: LAPIC_NMI (acpi_id[0x3e] high edge lint[0x1])
[    0.118611] ACPI: LAPIC_NMI (acpi_id[0x3f] high edge lint[0x1])
[    0.118613] ACPI: LAPIC_NMI (acpi_id[0x40] high edge lint[0x1])
[    0.118615] ACPI: LAPIC_NMI (acpi_id[0x41] high edge lint[0x1])
[    0.118617] ACPI: LAPIC_NMI (acpi_id[0x42] high edge lint[0x1])
[    0.118619] ACPI: LAPIC_NMI (acpi_id[0x43] high edge lint[0x1])
[    0.118621] ACPI: LAPIC_NMI (acpi_id[0x44] high edge lint[0x1])
[    0.118622] ACPI: LAPIC_NMI (acpi_id[0x45] high edge lint[0x1])
[    0.118624] ACPI: LAPIC_NMI (acpi_id[0x46] high edge lint[0x1])
[    0.118626] ACPI: LAPIC_NMI (acpi_id[0x47] high edge lint[0x1])
[    0.118628] ACPI: LAPIC_NMI (acpi_id[0x48] high edge lint[0x1])
[    0.118630] ACPI: LAPIC_NMI (acpi_id[0x49] high edge lint[0x1])
[    0.118632] ACPI: LAPIC_NMI (acpi_id[0x4a] high edge lint[0x1])
[    0.118634] ACPI: LAPIC_NMI (acpi_id[0x4b] high edge lint[0x1])
[    0.118641] ACPI: LAPIC_NMI (acpi_id[0x4c] high edge lint[0x1])
[    0.118643] ACPI: LAPIC_NMI (acpi_id[0x4d] high edge lint[0x1])
[    0.118645] ACPI: LAPIC_NMI (acpi_id[0x4e] high edge lint[0x1])
[    0.118647] ACPI: LAPIC_NMI (acpi_id[0x4f] high edge lint[0x1])
[    0.118649] ACPI: LAPIC_NMI (acpi_id[0x50] high edge lint[0x1])
[    0.118650] ACPI: LAPIC_NMI (acpi_id[0x51] high edge lint[0x1])
[    0.118652] ACPI: LAPIC_NMI (acpi_id[0x52] high edge lint[0x1])
[    0.118654] ACPI: LAPIC_NMI (acpi_id[0x53] high edge lint[0x1])
[    0.118656] ACPI: LAPIC_NMI (acpi_id[0x54] high edge lint[0x1])
[    0.118658] ACPI: LAPIC_NMI (acpi_id[0x55] high edge lint[0x1])
[    0.118660] ACPI: LAPIC_NMI (acpi_id[0x56] high edge lint[0x1])
[    0.118662] ACPI: LAPIC_NMI (acpi_id[0x57] high edge lint[0x1])
[    0.118664] ACPI: LAPIC_NMI (acpi_id[0x58] high edge lint[0x1])
[    0.118666] ACPI: LAPIC_NMI (acpi_id[0x59] high edge lint[0x1])
[    0.118668] ACPI: LAPIC_NMI (acpi_id[0x5a] high edge lint[0x1])
[    0.118669] ACPI: LAPIC_NMI (acpi_id[0x5b] high edge lint[0x1])
[    0.118671] ACPI: LAPIC_NMI (acpi_id[0x5c] high edge lint[0x1])
[    0.118673] ACPI: LAPIC_NMI (acpi_id[0x5d] high edge lint[0x1])
[    0.118675] ACPI: LAPIC_NMI (acpi_id[0x5e] high edge lint[0x1])
[    0.118677] ACPI: LAPIC_NMI (acpi_id[0x5f] high edge lint[0x1])
[    0.118679] ACPI: LAPIC_NMI (acpi_id[0x60] high edge lint[0x1])
[    0.118681] ACPI: LAPIC_NMI (acpi_id[0x61] high edge lint[0x1])
[    0.118683] ACPI: LAPIC_NMI (acpi_id[0x62] high edge lint[0x1])
[    0.118685] ACPI: LAPIC_NMI (acpi_id[0x63] high edge lint[0x1])
[    0.118687] ACPI: LAPIC_NMI (acpi_id[0x64] high edge lint[0x1])
[    0.118688] ACPI: LAPIC_NMI (acpi_id[0x65] high edge lint[0x1])
[    0.118690] ACPI: LAPIC_NMI (acpi_id[0x66] high edge lint[0x1])
[    0.118692] ACPI: LAPIC_NMI (acpi_id[0x67] high edge lint[0x1])
[    0.118694] ACPI: LAPIC_NMI (acpi_id[0x68] high edge lint[0x1])
[    0.118696] ACPI: LAPIC_NMI (acpi_id[0x69] high edge lint[0x1])
[    0.118698] ACPI: LAPIC_NMI (acpi_id[0x6a] high edge lint[0x1])
[    0.118700] ACPI: LAPIC_NMI (acpi_id[0x6b] high edge lint[0x1])
[    0.118702] ACPI: LAPIC_NMI (acpi_id[0x6c] high edge lint[0x1])
[    0.118704] ACPI: LAPIC_NMI (acpi_id[0x6d] high edge lint[0x1])
[    0.118706] ACPI: LAPIC_NMI (acpi_id[0x6e] high edge lint[0x1])
[    0.118708] ACPI: LAPIC_NMI (acpi_id[0x6f] high edge lint[0x1])
[    0.118710] ACPI: LAPIC_NMI (acpi_id[0x70] high edge lint[0x1])
[    0.118712] ACPI: LAPIC_NMI (acpi_id[0x71] high edge lint[0x1])
[    0.118713] ACPI: LAPIC_NMI (acpi_id[0x72] high edge lint[0x1])
[    0.118715] ACPI: LAPIC_NMI (acpi_id[0x73] high edge lint[0x1])
[    0.118717] ACPI: LAPIC_NMI (acpi_id[0x74] high edge lint[0x1])
[    0.118719] ACPI: LAPIC_NMI (acpi_id[0x75] high edge lint[0x1])
[    0.118721] ACPI: LAPIC_NMI (acpi_id[0x76] high edge lint[0x1])
[    0.118723] ACPI: LAPIC_NMI (acpi_id[0x77] high edge lint[0x1])
[    0.118725] ACPI: LAPIC_NMI (acpi_id[0x78] high edge lint[0x1])
[    0.118727] ACPI: LAPIC_NMI (acpi_id[0x79] high edge lint[0x1])
[    0.118729] ACPI: LAPIC_NMI (acpi_id[0x7a] high edge lint[0x1])
[    0.118731] ACPI: LAPIC_NMI (acpi_id[0x7b] high edge lint[0x1])
[    0.118732] ACPI: LAPIC_NMI (acpi_id[0x7c] high edge lint[0x1])
[    0.118734] ACPI: LAPIC_NMI (acpi_id[0x7d] high edge lint[0x1])
[    0.118736] ACPI: LAPIC_NMI (acpi_id[0x7e] high edge lint[0x1])
[    0.118738] ACPI: LAPIC_NMI (acpi_id[0x7f] high edge lint[0x1])
[    0.118819] IOAPIC[0]: apic_id 128, version 32, address 0xfec00000, GSI 0-23
[    0.118832] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
[    0.118845] ACPI: Using ACPI (MADT) for SMP configuration information
[    0.118847] ACPI: HPET id: 0x8086af01 base: 0xfed00000
[    0.118856] TSC deadline timer available
[    0.118859] smpboot: Allowing 128 CPUs, 124 hotplug CPUs
[    0.118902] [mem 0xc0000000-0xffbfffff] available for PCI devices
[    0.118907] Booting paravirtualized kernel on VMware hypervisor
[    0.118911] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.118928] setup_percpu: NR_CPUS:128 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:128
[    0.134179] percpu: Embedded 49 pages/cpu s163840 r8192 d28672 u262144
[    0.134213] pcpu-alloc: s163840 r8192 d28672 u262144 alloc=1*2097152
[    0.134220] pcpu-alloc: [0] 000 004 008 012 016 020 024 028
[    0.134234] pcpu-alloc: [0] 032 036 040 044 048 052 056 060
[    0.134245] pcpu-alloc: [0] 064 068 072 076 080 084 088 092
[    0.134256] pcpu-alloc: [0] 096 100 104 108 112 116 120 124
[    0.134268] pcpu-alloc: [1] 001 005 009 013 017 021 025 029
[    0.134279] pcpu-alloc: [1] 033 037 041 045 049 053 057 061
[    0.134290] pcpu-alloc: [1] 065 069 073 077 081 085 089 093
[    0.134301] pcpu-alloc: [1] 097 101 105 109 113 117 121 125
[    0.134312] pcpu-alloc: [2] 002 006 010 014 018 022 026 030
[    0.134323] pcpu-alloc: [2] 034 038 042 046 050 054 058 062
[    0.134334] pcpu-alloc: [2] 066 070 074 078 082 086 090 094
[    0.134345] pcpu-alloc: [2] 098 102 106 110 114 118 122 126
[    0.134357] pcpu-alloc: [3] 003 007 011 015 019 023 027 031
[    0.134368] pcpu-alloc: [3] 035 039 043 047 051 055 059 063
[    0.134379] pcpu-alloc: [3] 067 071 075 079 083 087 091 095
[    0.134389] pcpu-alloc: [3] 099 103 107 111 115 119 123 127
[    0.134529] vmware: vmware-stealtime: cpu 0, pa 3e627000
[    0.134568] Built 4 zonelists, mobility grouping on.  Total pages: 1031460
[    0.134572] Policy zone: Normal
[    0.134576] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-panic root=PARTUUID=65c7e33b-adc7-466f-b763-fbc070600c9d init=/lib/systemd/systemd ro loglevel=3 quiet net.ifnames=0 plymouth.enable=0 systemd.legacy_systemd_cgroup_controller=yes nokaslr
[    0.134810] Unknown command line parameters: nokaslr BOOT_IMAGE=/boot/vmlinuz-panic
[    0.134814] printk: log_buf_len individual max cpu contribution: 16384 bytes
[    0.134816] printk: log_buf_len total cpu_extra contributions: 2080768 bytes
[    0.134818] printk: log_buf_len min size: 524288 bytes
[    0.142833] printk: log_buf_len: 4194304 bytes
[    0.142837] printk: early log buf free: 498152(95%)
[    0.147483] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.199605] Memory: 3966240K/4193284K available (10248K kernel code, 2922K rwdata, 2532K rodata, 2012K init, 5196K bss, 226788K reserved, 0K cma-reserved)
[    0.201745] Kernel/User page tables isolation: enabled
[    0.202228] ftrace: allocating 32675 entries in 128 pages
[    0.227876] ftrace: allocated 128 pages with 1 groups
[    0.229677] rcu: Hierarchical RCU implementation.
[    0.229681] 	Rude variant of Tasks RCU enabled.
[    0.229683] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.231209] NR_IRQS: 8448, nr_irqs: 1448, preallocated irqs: 16
[    0.232109] random: get_random_bytes called from start_kernel+0x2c3/0x469 with crng_init=0
[    0.236430] Console: colour dummy device 80x25
[    0.236491] printk: console [tty0] enabled
[    0.237953] ACPI: Core revision 20210730
[    0.238854] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.239045] APIC: Switch to symmetric I/O mode setup
[    0.239311] x2apic enabled
[    0.239946] Switched APIC routing to physical x2apic.
[    0.243133] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.243209] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x36c6563e7a6, max_idle_ns: 881590602333 ns
[    0.243216] Calibrating delay loop (skipped) preset value.. 3799.99 BogoMIPS (lpj=7599998)
[    0.243223] pid_max: default: 131072 minimum: 1024
[    0.246177] LSM: Security Framework initializing
[    0.246743] Yama: becoming mindful.
[    0.247686] AppArmor: AppArmor initialized
[    0.249350] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[    0.249758] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, vmalloc)
[    0.249946] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, vmalloc)
[    0.249967] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, vmalloc)
[    0.258245] Disabled fast string operations
[    0.258434] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
[    0.258439] Last level dTLB entries: 4KB 512, 2MB 0, 4MB 0, 1GB 4
[    0.258454] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.258460] Spectre V2 : Mitigation: Full generic retpoline
[    0.258462] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.258463] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    0.258466] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.258470] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.258473] MDS: Mitigation: Clear CPU buffers
[    0.260348] Freeing SMP alternatives memory: 32K
[    0.260736] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz (family: 0x6, model: 0x3e, stepping: 0x4)
[    0.261142] Performance Events: IvyBridge events, core PMU driver.
[    0.261290] core: CPUID marked event: 'cpu cycles' unavailable
[    0.261292] core: CPUID marked event: 'instructions' unavailable
[    0.261294] core: CPUID marked event: 'bus cycles' unavailable
[    0.261295] core: CPUID marked event: 'cache references' unavailable
[    0.261297] core: CPUID marked event: 'cache misses' unavailable
[    0.261298] core: CPUID marked event: 'branch instructions' unavailable
[    0.261299] core: CPUID marked event: 'branch misses' unavailable
[    0.261304] ... version:                1
[    0.261306] ... bit width:              48
[    0.261308] ... generic registers:      4
[    0.261309] ... value mask:             0000ffffffffffff
[    0.261312] ... max period:             000000007fffffff
[    0.261313] ... fixed-purpose events:   0
[    0.261315] ... event mask:             000000000000000f
[    0.261777] rcu: Hierarchical SRCU implementation.
[    0.263216] smp: Bringing up secondary CPUs ...
[    0.263619] x86: Booting SMP configuration:
[    0.263622] .... node    #1, CPUs:          #1
[    0.011176] Disabled fast string operations
[    0.011176] smpboot: CPU 1 Converting physical 2 to logical package 1
[    0.011176] smpboot: CPU 1 Converting physical 0 to logical die 1
[    0.264639] vmware: vmware-stealtime: cpu 1, pa 7e627000

[    0.264639] .... node    #2, CPUs:     #2
[    0.011176] Disabled fast string operations
[    0.011176] smpboot: CPU 2 Converting physical 4 to logical package 2
[    0.011176] smpboot: CPU 2 Converting physical 0 to logical die 2
[    0.267442] vmware: vmware-stealtime: cpu 2, pa be627000

[    0.268159] .... node    #3, CPUs:     #3
[    0.011176] Disabled fast string operations
[    0.011176] smpboot: CPU 3 Converting physical 6 to logical package 3
[    0.011176] smpboot: CPU 3 Converting physical 0 to logical die 3
[    0.269147] vmware: vmware-stealtime: cpu 3, pa 13e627000
[    0.269147] smp: Brought up 4 nodes, 4 CPUs
[    0.269147] smpboot: Max logical packages: 128
[    0.269147] smpboot: Total of 4 processors activated (15199.99 BogoMIPS)
[    0.272439] devtmpfs: initialized
[    0.272439] x86/mm: Memory block size: 128MB
[    0.272439] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.272439] futex hash table entries: 32768 (order: 9, 2097152 bytes, vmalloc)
[    0.275604] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.275875] DMA: preallocated 512 KiB GFP_KERNEL pool for atomic allocations
[    0.275885] DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[    0.275905] audit: initializing netlink subsys (disabled)
[    0.275986] audit: type=2000 audit(1637024186.036:1): state=initialized audit_enabled=0 res=1
[    0.275986] thermal_sys: Registered thermal governor 'step_wise'
[    0.275986] thermal_sys: Registered thermal governor 'user_space'
[    0.275986] cpuidle: using governor ladder
[    0.275986] cpuidle: using governor menu
[    0.275986] ACPI: bus type PCI registered
[    0.275986] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.275986] PCI: Using configuration type 1 for base access
[    0.275986] core: PMU erratum BJ122, BV98, HSD29 workaround disabled, HT off
[    0.281790] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.391707] ACPI: Added _OSI(Module Device)
[    0.391707] ACPI: Added _OSI(Processor Device)
[    0.391707] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.391707] ACPI: Added _OSI(Processor Aggregator Device)
[    0.391707] ACPI: Added _OSI(Linux-Dell-Video)
[    0.391707] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    0.391707] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.425087] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    0.429320] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[    0.429344] ACPI: BIOS _OSI(Darwin) query ignored
[    0.494230] ACPI: Interpreter enabled
[    0.494245] ACPI: PM: (supports S0 S5)
[    0.494248] ACPI: Using IOAPIC for interrupt routing
[    0.494301] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.496080] ACPI: Enabled 4 GPEs in block 00 to 0F
[    0.604612] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-01])
[    0.604631] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.604910] acpi PNP0A03:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.605728] PCI host bridge to bus 0000:00
[    0.605734] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[    0.605738] pci_bus 0000:00: root bus resource [io  0x1f00-0xffff window]
[    0.605743] pci_bus 0000:00: root bus resource [io  0x0c00-0x0cf7 window]
[    0.605746] pci_bus 0000:00: root bus resource [io  0x0000-0x09ff window]
[    0.605750] pci_bus 0000:00: root bus resource [mem 0xffbf0000-0xffdfffff window]
[    0.605754] pci_bus 0000:00: root bus resource [mem 0xfef00000-0xffbdffff window]
[    0.605757] pci_bus 0000:00: root bus resource [mem 0xfed45000-0xfedfffff window]
[    0.605760] pci_bus 0000:00: root bus resource [mem 0xfec10000-0xfed3ffff window]
[    0.605764] pci_bus 0000:00: root bus resource [mem 0xf8000000-0xf82fffff window]
[    0.605767] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xf7ffffff pref window]
[    0.605771] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xed2fffff window]
[    0.605774] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.605777] pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000c7fff window]
[    0.605780] pci_bus 0000:00: root bus resource [mem 0x000c8000-0x000cbfff window]
[    0.605783] pci_bus 0000:00: root bus resource [mem 0x000d4000-0x000d7fff window]
[    0.605786] pci_bus 0000:00: root bus resource [mem 0x000d8000-0x000dbfff window]
[    0.605789] pci_bus 0000:00: root bus resource [mem 0x000e4000-0x000e7fff window]
[    0.605792] pci_bus 0000:00: root bus resource [mem 0x000e8000-0x000ebfff window]
[    0.605796] pci_bus 0000:00: root bus resource [mem 0x000ec000-0x000effff window]
[    0.605799] pci_bus 0000:00: root bus resource [bus 00-01]
[    0.605931] pci 0000:00:00.0: [8086:7190] type 00 class 0x060000
[    0.607146] pci 0000:00:01.0: [8086:7191] type 01 class 0x060400
[    0.608397] pci 0000:00:07.0: [8086:7110] type 00 class 0x060100
[    0.610065] pci 0000:00:07.1: [8086:7111] type 00 class 0x01018a
[    0.612120] pci 0000:00:07.1: reg 0x20: [io  0x0850-0x085f]
[    0.613062] pci 0000:00:07.1: legacy IDE quirk: reg 0x10: [io  0x01f0-0x01f7]
[    0.613085] pci 0000:00:07.1: legacy IDE quirk: reg 0x14: [io  0x03f6]
[    0.613089] pci 0000:00:07.1: legacy IDE quirk: reg 0x18: [io  0x0170-0x0177]
[    0.613092] pci 0000:00:07.1: legacy IDE quirk: reg 0x1c: [io  0x0376]
[    0.613626] pci 0000:00:07.3: [8086:7113] type 00 class 0x068000
[    0.614606] pci 0000:00:07.3: quirk: [io  0x0440-0x047f] claimed by PIIX4 ACPI
[    0.615366] pci 0000:00:07.7: [15ad:0740] type 00 class 0x088000
[    0.615615] pci 0000:00:07.7: reg 0x10: [io  0x0800-0x083f]
[    0.615818] pci 0000:00:07.7: reg 0x14: [mem 0xffbf0000-0xffbf1fff 64bit]
[    0.617604] pci 0000:00:0f.0: [15ad:0405] type 00 class 0x030000
[    0.619230] pci 0000:00:0f.0: reg 0x10: [io  0x0840-0x084f]
[    0.620726] pci 0000:00:0f.0: reg 0x14: [mem 0xf0000000-0xf7ffffff pref]
[    0.622137] pci 0000:00:0f.0: reg 0x18: [mem 0xff000000-0xff7fffff]
[    0.627228] pci 0000:00:0f.0: reg 0x30: [mem 0xffff8000-0xffffffff pref]
[    0.627365] pci 0000:00:0f.0: BAR 1: assigned to efifb
[    0.628725] pci_bus 0000:01: extended config space not accessible
[    0.628870] pci 0000:00:01.0: PCI bridge to [bus 01]
[    0.629579] ACPI: PCI: Interrupt link LNKA configured for IRQ 0
[    0.629586] ACPI: PCI: Interrupt link LNKA disabled
[    0.629826] ACPI: PCI: Interrupt link LNKB configured for IRQ 0
[    0.629830] ACPI: PCI: Interrupt link LNKB disabled
[    0.630058] ACPI: PCI: Interrupt link LNKC configured for IRQ 0
[    0.630062] ACPI: PCI: Interrupt link LNKC disabled
[    0.630315] ACPI: PCI: Interrupt link LNKD configured for IRQ 0
[    0.630319] ACPI: PCI: Interrupt link LNKD disabled
[    0.656788] ACPI: PCI Root Bridge [PC08] (domain 0000 [bus 02])
[    0.656803] acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.656928] acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.657510] acpiphp: Slot [32] registered
[    0.657577] acpiphp: Slot [33] registered
[    0.657627] acpiphp: Slot [34] registered
[    0.657676] acpiphp: Slot [35] registered
[    0.657724] acpiphp: Slot [36] registered
[    0.657772] acpiphp: Slot [37] registered
[    0.657821] acpiphp: Slot [38] registered
[    0.657869] acpiphp: Slot [39] registered
[    0.657917] acpiphp: Slot [40] registered
[    0.657964] acpiphp: Slot [41] registered
[    0.658011] acpiphp: Slot [42] registered
[    0.658058] acpiphp: Slot [43] registered
[    0.658106] acpiphp: Slot [44] registered
[    0.658154] acpiphp: Slot [45] registered
[    0.658206] acpiphp: Slot [46] registered
[    0.658255] acpiphp: Slot [47] registered
[    0.658304] acpiphp: Slot [48] registered
[    0.658354] acpiphp: Slot [49] registered
[    0.658403] acpiphp: Slot [50] registered
[    0.658451] acpiphp: Slot [51] registered
[    0.658500] acpiphp: Slot [52] registered
[    0.658548] acpiphp: Slot [53] registered
[    0.658596] acpiphp: Slot [54] registered
[    0.658646] acpiphp: Slot [55] registered
[    0.658695] acpiphp: Slot [56] registered
[    0.658742] acpiphp: Slot [57] registered
[    0.658790] acpiphp: Slot [58] registered
[    0.658839] acpiphp: Slot [59] registered
[    0.658886] acpiphp: Slot [60] registered
[    0.658935] acpiphp: Slot [61] registered
[    0.658983] acpiphp: Slot [62] registered
[    0.659030] acpiphp: Slot [63] registered
[    0.659064] PCI host bridge to bus 0000:02
[    0.659068] pci_bus 0000:02: Unknown NUMA node; performance will be reduced
[    0.659073] pci_bus 0000:02: root bus resource [mem 0xffbe0000-0xffbeffff window]
[    0.659077] pci_bus 0000:02: root bus resource [mem 0xfdd00000-0xfe1fffff pref window]
[    0.659081] pci_bus 0000:02: root bus resource [mem 0xfe200000-0xfebfffff window]
[    0.659084] pci_bus 0000:02: root bus resource [io  0x0a00-0x0bff window]
[    0.659088] pci_bus 0000:02: root bus resource [bus 02]
[    0.659447] pci 0000:02:00.0: [15ad:07c0] type 00 class 0x010700
[    0.660373] pci 0000:02:00.0: reg 0x10: [io  0x0a10-0x0a17]
[    0.661131] pci 0000:02:00.0: reg 0x14: [mem 0xffbe0000-0xffbe7fff 64bit]
[    0.663587] pci 0000:02:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
[    0.664534] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    0.666210] pci 0000:02:02.0: [15ad:07b0] type 00 class 0x020000
[    0.666464] pci 0000:02:02.0: reg 0x10: [mem 0xfe223000-0xfe223fff]
[    0.666579] pci 0000:02:02.0: reg 0x14: [mem 0xfe222000-0xfe222fff]
[    0.666693] pci 0000:02:02.0: reg 0x18: [mem 0xfe220000-0xfe221fff]
[    0.666807] pci 0000:02:02.0: reg 0x1c: [io  0x0a00-0x0a0f]
[    0.667143] pci 0000:02:02.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
[    0.668113] pci 0000:02:02.0: supports D1 D2
[    0.668117] pci 0000:02:02.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.671836] ACPI: PCI Root Bridge [PC0G] (domain 0000 [bus 03])
[    0.671849] acpi PNP0A08:01: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.671973] acpi PNP0A08:01: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.672572] acpiphp: Slot [64] registered
[    0.672607] acpiphp: Slot [65] registered
[    0.672635] acpiphp: Slot [66] registered
[    0.672661] acpiphp: Slot [67] registered
[    0.672691] acpiphp: Slot [68] registered
[    0.672719] acpiphp: Slot [69] registered
[    0.672746] acpiphp: Slot [70] registered
[    0.672773] acpiphp: Slot [71] registered
[    0.672800] acpiphp: Slot [72] registered
[    0.672827] acpiphp: Slot [73] registered
[    0.672853] acpiphp: Slot [74] registered
[    0.672879] acpiphp: Slot [75] registered
[    0.672906] acpiphp: Slot [76] registered
[    0.672934] acpiphp: Slot [77] registered
[    0.672963] acpiphp: Slot [78] registered
[    0.672989] acpiphp: Slot [79] registered
[    0.673016] acpiphp: Slot [80] registered
[    0.673044] acpiphp: Slot [81] registered
[    0.673073] acpiphp: Slot [82] registered
[    0.673101] acpiphp: Slot [83] registered
[    0.673129] acpiphp: Slot [84] registered
[    0.673157] acpiphp: Slot [85] registered
[    0.673185] acpiphp: Slot [86] registered
[    0.673212] acpiphp: Slot [87] registered
[    0.673240] acpiphp: Slot [88] registered
[    0.673267] acpiphp: Slot [89] registered
[    0.673294] acpiphp: Slot [90] registered
[    0.673322] acpiphp: Slot [91] registered
[    0.673349] acpiphp: Slot [92] registered
[    0.673375] acpiphp: Slot [93] registered
[    0.673403] acpiphp: Slot [94] registered
[    0.673431] acpiphp: Slot [95] registered
[    0.673444] PCI host bridge to bus 0000:03
[    0.673448] pci_bus 0000:03: root bus resource [mem 0xfce00000-0xfd2fffff pref window]
[    0.673454] pci_bus 0000:03: root bus resource [mem 0xfd300000-0xfdcfffff window]
[    0.673457] pci_bus 0000:03: root bus resource [io  0x0d00-0x0eff window]
[    0.673461] pci_bus 0000:03: root bus resource [bus 03]
[    0.673582] pci_bus 0000:03: on NUMA node 0
[    0.674896] ACPI: PCI Root Bridge [PC0H] (domain 0000 [bus 04])
[    0.674908] acpi PNP0A08:02: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.675035] acpi PNP0A08:02: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.675644] acpiphp: Slot [576] registered
[    0.675675] acpiphp: Slot [577] registered
[    0.675705] acpiphp: Slot [578] registered
[    0.675732] acpiphp: Slot [579] registered
[    0.675757] acpiphp: Slot [580] registered
[    0.675783] acpiphp: Slot [581] registered
[    0.675809] acpiphp: Slot [582] registered
[    0.675836] acpiphp: Slot [583] registered
[    0.675862] acpiphp: Slot [584] registered
[    0.675890] acpiphp: Slot [585] registered
[    0.675916] acpiphp: Slot [586] registered
[    0.675943] acpiphp: Slot [587] registered
[    0.675969] acpiphp: Slot [588] registered
[    0.675995] acpiphp: Slot [589] registered
[    0.676022] acpiphp: Slot [590] registered
[    0.676048] acpiphp: Slot [591] registered
[    0.676074] acpiphp: Slot [592] registered
[    0.676101] acpiphp: Slot [593] registered
[    0.676127] acpiphp: Slot [594] registered
[    0.676154] acpiphp: Slot [595] registered
[    0.676181] acpiphp: Slot [596] registered
[    0.676208] acpiphp: Slot [597] registered
[    0.676235] acpiphp: Slot [598] registered
[    0.676261] acpiphp: Slot [599] registered
[    0.676287] acpiphp: Slot [600] registered
[    0.676313] acpiphp: Slot [601] registered
[    0.676340] acpiphp: Slot [602] registered
[    0.676366] acpiphp: Slot [603] registered
[    0.676394] acpiphp: Slot [604] registered
[    0.676423] acpiphp: Slot [605] registered
[    0.676449] acpiphp: Slot [606] registered
[    0.676476] acpiphp: Slot [607] registered
[    0.676490] PCI host bridge to bus 0000:04
[    0.676494] pci_bus 0000:04: root bus resource [mem 0xfbf00000-0xfc3fffff pref window]
[    0.676499] pci_bus 0000:04: root bus resource [mem 0xfc400000-0xfcdfffff window]
[    0.676503] pci_bus 0000:04: root bus resource [io  0x0f00-0x10ff window]
[    0.676507] pci_bus 0000:04: root bus resource [bus 04]
[    0.676625] pci_bus 0000:04: on NUMA node 1
[    0.677927] ACPI: PCI Root Bridge [PC0I] (domain 0000 [bus 05])
[    0.677939] acpi PNP0A08:03: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.678064] acpi PNP0A08:03: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.678693] acpiphp: Slot [1088] registered
[    0.678727] acpiphp: Slot [1089] registered
[    0.678756] acpiphp: Slot [1090] registered
[    0.678784] acpiphp: Slot [1091] registered
[    0.678812] acpiphp: Slot [1092] registered
[    0.678838] acpiphp: Slot [1093] registered
[    0.678866] acpiphp: Slot [1094] registered
[    0.678893] acpiphp: Slot [1095] registered
[    0.678920] acpiphp: Slot [1096] registered
[    0.678948] acpiphp: Slot [1097] registered
[    0.678977] acpiphp: Slot [1098] registered
[    0.679004] acpiphp: Slot [1099] registered
[    0.679031] acpiphp: Slot [1100] registered
[    0.679058] acpiphp: Slot [1101] registered
[    0.679084] acpiphp: Slot [1102] registered
[    0.679111] acpiphp: Slot [1103] registered
[    0.679139] acpiphp: Slot [1104] registered
[    0.679166] acpiphp: Slot [1105] registered
[    0.679194] acpiphp: Slot [1106] registered
[    0.679223] acpiphp: Slot [1107] registered
[    0.679254] acpiphp: Slot [1108] registered
[    0.679281] acpiphp: Slot [1109] registered
[    0.679309] acpiphp: Slot [1110] registered
[    0.679336] acpiphp: Slot [1111] registered
[    0.679364] acpiphp: Slot [1112] registered
[    0.679392] acpiphp: Slot [1113] registered
[    0.679418] acpiphp: Slot [1114] registered
[    0.679444] acpiphp: Slot [1115] registered
[    0.679472] acpiphp: Slot [1116] registered
[    0.679498] acpiphp: Slot [1117] registered
[    0.679524] acpiphp: Slot [1118] registered
[    0.679550] acpiphp: Slot [1119] registered
[    0.679565] PCI host bridge to bus 0000:05
[    0.679568] pci_bus 0000:05: root bus resource [mem 0xfb000000-0xfb4fffff pref window]
[    0.679573] pci_bus 0000:05: root bus resource [mem 0xfb500000-0xfbefffff window]
[    0.679577] pci_bus 0000:05: root bus resource [io  0x1100-0x12ff window]
[    0.679580] pci_bus 0000:05: root bus resource [bus 05]
[    0.679698] pci_bus 0000:05: on NUMA node 2
[    0.681022] ACPI: PCI Root Bridge [PC0J] (domain 0000 [bus 06])
[    0.681033] acpi PNP0A08:04: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.681155] acpi PNP0A08:04: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.681808] acpiphp: Slot [1600] registered
[    0.681841] acpiphp: Slot [1601] registered
[    0.681868] acpiphp: Slot [1602] registered
[    0.681895] acpiphp: Slot [1603] registered
[    0.681921] acpiphp: Slot [1604] registered
[    0.681947] acpiphp: Slot [1605] registered
[    0.681973] acpiphp: Slot [1606] registered
[    0.682003] acpiphp: Slot [1607] registered
[    0.682030] acpiphp: Slot [1608] registered
[    0.682057] acpiphp: Slot [1609] registered
[    0.682083] acpiphp: Slot [1610] registered
[    0.682110] acpiphp: Slot [1611] registered
[    0.682136] acpiphp: Slot [1612] registered
[    0.682163] acpiphp: Slot [1613] registered
[    0.682190] acpiphp: Slot [1614] registered
[    0.682216] acpiphp: Slot [1615] registered
[    0.682243] acpiphp: Slot [1616] registered
[    0.682269] acpiphp: Slot [1617] registered
[    0.682296] acpiphp: Slot [1618] registered
[    0.682327] acpiphp: Slot [1619] registered
[    0.682355] acpiphp: Slot [1620] registered
[    0.682382] acpiphp: Slot [1621] registered
[    0.682409] acpiphp: Slot [1622] registered
[    0.682436] acpiphp: Slot [1623] registered
[    0.682462] acpiphp: Slot [1624] registered
[    0.682489] acpiphp: Slot [1625] registered
[    0.682516] acpiphp: Slot [1626] registered
[    0.682543] acpiphp: Slot [1627] registered
[    0.682570] acpiphp: Slot [1628] registered
[    0.682597] acpiphp: Slot [1629] registered
[    0.682624] acpiphp: Slot [1630] registered
[    0.682651] acpiphp: Slot [1631] registered
[    0.682664] PCI host bridge to bus 0000:06
[    0.682668] pci_bus 0000:06: root bus resource [mem 0xfa100000-0xfa5fffff pref window]
[    0.682673] pci_bus 0000:06: root bus resource [mem 0xfa600000-0xfaffffff window]
[    0.682676] pci_bus 0000:06: root bus resource [io  0x1300-0x14ff window]
[    0.682680] pci_bus 0000:06: root bus resource [bus 06]
[    0.682799] pci_bus 0000:06: on NUMA node 2
[    0.684117] ACPI: PCI Root Bridge [PC0K] (domain 0000 [bus 07])
[    0.684129] acpi PNP0A08:05: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.684250] acpi PNP0A08:05: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.684932] acpiphp: Slot [2112] registered
[    0.684966] acpiphp: Slot [2113] registered
[    0.684994] acpiphp: Slot [2114] registered
[    0.685021] acpiphp: Slot [2115] registered
[    0.685047] acpiphp: Slot [2116] registered
[    0.685074] acpiphp: Slot [2117] registered
[    0.685101] acpiphp: Slot [2118] registered
[    0.685128] acpiphp: Slot [2119] registered
[    0.685154] acpiphp: Slot [2120] registered
[    0.685180] acpiphp: Slot [2121] registered
[    0.685206] acpiphp: Slot [2122] registered
[    0.685232] acpiphp: Slot [2123] registered
[    0.685260] acpiphp: Slot [2124] registered
[    0.685287] acpiphp: Slot [2125] registered
[    0.685315] acpiphp: Slot [2126] registered
[    0.685342] acpiphp: Slot [2127] registered
[    0.685368] acpiphp: Slot [2128] registered
[    0.685394] acpiphp: Slot [2129] registered
[    0.685420] acpiphp: Slot [2130] registered
[    0.685447] acpiphp: Slot [2131] registered
[    0.685473] acpiphp: Slot [2132] registered
[    0.685500] acpiphp: Slot [2133] registered
[    0.685527] acpiphp: Slot [2134] registered
[    0.685553] acpiphp: Slot [2135] registered
[    0.685579] acpiphp: Slot [2136] registered
[    0.685605] acpiphp: Slot [2137] registered
[    0.685632] acpiphp: Slot [2138] registered
[    0.685659] acpiphp: Slot [2139] registered
[    0.685685] acpiphp: Slot [2140] registered
[    0.685711] acpiphp: Slot [2141] registered
[    0.685737] acpiphp: Slot [2142] registered
[    0.685765] acpiphp: Slot [2143] registered
[    0.685779] PCI host bridge to bus 0000:07
[    0.685783] pci_bus 0000:07: root bus resource [mem 0xf9200000-0xf96fffff pref window]
[    0.685788] pci_bus 0000:07: root bus resource [mem 0xf9700000-0xfa0fffff window]
[    0.685791] pci_bus 0000:07: root bus resource [io  0x1500-0x16ff window]
[    0.685795] pci_bus 0000:07: root bus resource [bus 07]
[    0.685929] pci_bus 0000:07: on NUMA node 2
[    0.687249] ACPI: PCI Root Bridge [PC0L] (domain 0000 [bus 08])
[    0.687260] acpi PNP0A08:06: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.687386] acpi PNP0A08:06: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.688084] acpiphp: Slot [2624] registered
[    0.688117] acpiphp: Slot [2625] registered
[    0.688144] acpiphp: Slot [2626] registered
[    0.688170] acpiphp: Slot [2627] registered
[    0.688195] acpiphp: Slot [2628] registered
[    0.688221] acpiphp: Slot [2629] registered
[    0.688247] acpiphp: Slot [2630] registered
[    0.688274] acpiphp: Slot [2631] registered
[    0.688300] acpiphp: Slot [2632] registered
[    0.688328] acpiphp: Slot [2633] registered
[    0.688355] acpiphp: Slot [2634] registered
[    0.688381] acpiphp: Slot [2635] registered
[    0.688408] acpiphp: Slot [2636] registered
[    0.688434] acpiphp: Slot [2637] registered
[    0.688460] acpiphp: Slot [2638] registered
[    0.688485] acpiphp: Slot [2639] registered
[    0.688512] acpiphp: Slot [2640] registered
[    0.688538] acpiphp: Slot [2641] registered
[    0.688564] acpiphp: Slot [2642] registered
[    0.688590] acpiphp: Slot [2643] registered
[    0.688616] acpiphp: Slot [2644] registered
[    0.688643] acpiphp: Slot [2645] registered
[    0.688669] acpiphp: Slot [2646] registered
[    0.688698] acpiphp: Slot [2647] registered
[    0.688724] acpiphp: Slot [2648] registered
[    0.688751] acpiphp: Slot [2649] registered
[    0.688778] acpiphp: Slot [2650] registered
[    0.688824] acpiphp: Slot [2651] registered
[    0.688852] acpiphp: Slot [2652] registered
[    0.688878] acpiphp: Slot [2653] registered
[    0.688904] acpiphp: Slot [2654] registered
[    0.688932] acpiphp: Slot [2655] registered
[    0.688947] PCI host bridge to bus 0000:08
[    0.688950] pci_bus 0000:08: root bus resource [mem 0xf8300000-0xf87fffff pref window]
[    0.688955] pci_bus 0000:08: root bus resource [mem 0xf8800000-0xf91fffff window]
[    0.688958] pci_bus 0000:08: root bus resource [io  0x1700-0x18ff window]
[    0.688962] pci_bus 0000:08: root bus resource [bus 08]
[    0.689081] pci_bus 0000:08: on NUMA node 2
[    0.690378] ACPI: PCI Root Bridge [PC0M] (domain 0000 [bus 09])
[    0.690389] acpi PNP0A08:07: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.690512] acpi PNP0A08:07: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.691268] acpiphp: Slot [3136] registered
[    0.691299] acpiphp: Slot [3137] registered
[    0.691327] acpiphp: Slot [3138] registered
[    0.691354] acpiphp: Slot [3139] registered
[    0.691381] acpiphp: Slot [3140] registered
[    0.691408] acpiphp: Slot [3141] registered
[    0.691436] acpiphp: Slot [3142] registered
[    0.691463] acpiphp: Slot [3143] registered
[    0.691490] acpiphp: Slot [3144] registered
[    0.691518] acpiphp: Slot [3145] registered
[    0.691545] acpiphp: Slot [3146] registered
[    0.691571] acpiphp: Slot [3147] registered
[    0.691597] acpiphp: Slot [3148] registered
[    0.691624] acpiphp: Slot [3149] registered
[    0.691651] acpiphp: Slot [3150] registered
[    0.691680] acpiphp: Slot [3151] registered
[    0.691706] acpiphp: Slot [3152] registered
[    0.691734] acpiphp: Slot [3153] registered
[    0.691761] acpiphp: Slot [3154] registered
[    0.691807] acpiphp: Slot [3155] registered
[    0.691837] acpiphp: Slot [3156] registered
[    0.691865] acpiphp: Slot [3157] registered
[    0.691892] acpiphp: Slot [3158] registered
[    0.691919] acpiphp: Slot [3159] registered
[    0.691945] acpiphp: Slot [3160] registered
[    0.691971] acpiphp: Slot [3161] registered
[    0.691998] acpiphp: Slot [3162] registered
[    0.692027] acpiphp: Slot [3163] registered
[    0.692054] acpiphp: Slot [3164] registered
[    0.692081] acpiphp: Slot [3165] registered
[    0.692108] acpiphp: Slot [3166] registered
[    0.692136] acpiphp: Slot [3167] registered
[    0.692150] PCI host bridge to bus 0000:09
[    0.692153] pci_bus 0000:09: root bus resource [mem 0xef100000-0xef5fffff pref window]
[    0.692158] pci_bus 0000:09: root bus resource [mem 0xef600000-0xefffffff window]
[    0.692161] pci_bus 0000:09: root bus resource [io  0x1900-0x1aff window]
[    0.692165] pci_bus 0000:09: root bus resource [bus 09]
[    0.692287] pci_bus 0000:09: on NUMA node 2
[    0.693581] ACPI: PCI Root Bridge [PC0N] (domain 0000 [bus 0a])
[    0.693592] acpi PNP0A08:08: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.693716] acpi PNP0A08:08: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.694488] acpiphp: Slot [3648] registered
[    0.694520] acpiphp: Slot [3649] registered
[    0.694546] acpiphp: Slot [3650] registered
[    0.694573] acpiphp: Slot [3651] registered
[    0.694599] acpiphp: Slot [3652] registered
[    0.694625] acpiphp: Slot [3653] registered
[    0.694652] acpiphp: Slot [3654] registered
[    0.694678] acpiphp: Slot [3655] registered
[    0.694704] acpiphp: Slot [3656] registered
[    0.694731] acpiphp: Slot [3657] registered
[    0.694757] acpiphp: Slot [3658] registered
[    0.694800] acpiphp: Slot [3659] registered
[    0.694829] acpiphp: Slot [3660] registered
[    0.694855] acpiphp: Slot [3661] registered
[    0.694882] acpiphp: Slot [3662] registered
[    0.694908] acpiphp: Slot [3663] registered
[    0.694933] acpiphp: Slot [3664] registered
[    0.694961] acpiphp: Slot [3665] registered
[    0.694988] acpiphp: Slot [3666] registered
[    0.695015] acpiphp: Slot [3667] registered
[    0.695044] acpiphp: Slot [3668] registered
[    0.695071] acpiphp: Slot [3669] registered
[    0.695098] acpiphp: Slot [3670] registered
[    0.695125] acpiphp: Slot [3671] registered
[    0.695153] acpiphp: Slot [3672] registered
[    0.695179] acpiphp: Slot [3673] registered
[    0.695206] acpiphp: Slot [3674] registered
[    0.695238] acpiphp: Slot [3675] registered
[    0.695266] acpiphp: Slot [3676] registered
[    0.695297] acpiphp: Slot [3677] registered
[    0.695325] acpiphp: Slot [3678] registered
[    0.695352] acpiphp: Slot [3679] registered
[    0.695367] PCI host bridge to bus 0000:0a
[    0.695370] pci_bus 0000:0a: root bus resource [mem 0xee200000-0xee6fffff pref window]
[    0.695375] pci_bus 0000:0a: root bus resource [mem 0xee700000-0xef0fffff window]
[    0.695379] pci_bus 0000:0a: root bus resource [io  0x1b00-0x1cff window]
[    0.695382] pci_bus 0000:0a: root bus resource [bus 0a]
[    0.695500] pci_bus 0000:0a: on NUMA node 2
[    0.696799] ACPI: PCI Root Bridge [PC0O] (domain 0000 [bus 0b])
[    0.696810] acpi PNP0A08:09: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    0.696933] acpi PNP0A08:09: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    0.697783] acpiphp: Slot [96] registered
[    0.697815] acpiphp: Slot [97] registered
[    0.697842] acpiphp: Slot [98] registered
[    0.697869] acpiphp: Slot [99] registered
[    0.697897] acpiphp: Slot [100] registered
[    0.697924] acpiphp: Slot [101] registered
[    0.697950] acpiphp: Slot [102] registered
[    0.697976] acpiphp: Slot [103] registered
[    0.698003] acpiphp: Slot [104] registered
[    0.698030] acpiphp: Slot [105] registered
[    0.698058] acpiphp: Slot [106] registered
[    0.698084] acpiphp: Slot [107] registered
[    0.698111] acpiphp: Slot [108] registered
[    0.698139] acpiphp: Slot [109] registered
[    0.698167] acpiphp: Slot [110] registered
[    0.698194] acpiphp: Slot [111] registered
[    0.698221] acpiphp: Slot [112] registered
[    0.698248] acpiphp: Slot [113] registered
[    0.698276] acpiphp: Slot [114] registered
[    0.698303] acpiphp: Slot [115] registered
[    0.698330] acpiphp: Slot [116] registered
[    0.698359] acpiphp: Slot [117] registered
[    0.698387] acpiphp: Slot [118] registered
[    0.698414] acpiphp: Slot [119] registered
[    0.698441] acpiphp: Slot [120] registered
[    0.698467] acpiphp: Slot [121] registered
[    0.698494] acpiphp: Slot [122] registered
[    0.698521] acpiphp: Slot [123] registered
[    0.698547] acpiphp: Slot [124] registered
[    0.698574] acpiphp: Slot [125] registered
[    0.698600] acpiphp: Slot [126] registered
[    0.698626] acpiphp: Slot [127] registered
[    0.698640] PCI host bridge to bus 0000:0b
[    0.698643] pci_bus 0000:0b: root bus resource [mem 0xed300000-0xed7fffff pref window]
[    0.698648] pci_bus 0000:0b: root bus resource [mem 0xed800000-0xee1fffff window]
[    0.698652] pci_bus 0000:0b: root bus resource [io  0x1d00-0x1eff window]
[    0.698656] pci_bus 0000:0b: root bus resource [bus 0b]
[    0.698774] pci_bus 0000:0b: on NUMA node 3
[    0.700574] pci 0000:00:0f.0: vgaarb: setting as boot VGA device
[    0.700574] pci 0000:00:0f.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    0.700574] pci 0000:00:0f.0: vgaarb: bridge control possible
[    0.700574] vgaarb: loaded
[    0.700574] SCSI subsystem initialized
[    0.700574] libata version 3.00 loaded.
[    0.700574] pps_core: LinuxPPS API ver. 1 registered
[    0.700574] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    0.700574] Registered efivars operations
[    0.700574] PCI: Using ACPI for IRQ routing
[    0.700574] PCI: pci_cache_line_size set to 64 bytes
[    0.700574] e820: reserve RAM buffer [mem 0x0e55d018-0x0fffffff]
[    0.700574] e820: reserve RAM buffer [mem 0x0e55f018-0x0fffffff]
[    0.700574] e820: reserve RAM buffer [mem 0x0e562018-0x0fffffff]
[    0.700574] e820: reserve RAM buffer [mem 0x0efab000-0x0fffffff]
[    0.700574] e820: reserve RAM buffer [mem 0x0efbd000-0x0fffffff]
[    0.700574] e820: reserve RAM buffer [mem 0x0fee6000-0x0fffffff]
[    0.700574] clocksource: Switched to clocksource tsc-early
[    0.709154] VFS: Disk quotas dquot_6.6.0
[    0.709286] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.709646] AppArmor: AppArmor Filesystem Enabled
[    0.709685] pnp: PnP ACPI init
[    0.709994] system 00:00: [io  0x0440-0x047f] has been reserved
[    0.710003] system 00:00: [io  0x5658-0x5659] has been reserved
[    0.710008] system 00:00: [io  0x5670] has been reserved
[    0.710012] system 00:00: [io  0x0cf0-0x0cf1] has been reserved
[    0.710615] system 00:04: [mem 0xfed00000-0xfed003ff] has been reserved
[    0.715510] system 00:05: [io  0x0400-0x041f] has been reserved
[    0.715519] system 00:05: [mem 0xe0000000-0xe7ffffff] has been reserved
[    0.715526] system 00:05: [mem 0xffc00000-0xffdfffff] could not be reserved
[    0.729373] pnp: PnP ACPI: found 6 devices
[    0.741494] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.741828] NET: Registered PF_INET protocol family
[    0.742012] IP idents hash table entries: 65536 (order: 7, 524288 bytes, vmalloc)
[    0.744011] tcp_listen_portaddr_hash hash table entries: 2048 (order: 3, 32768 bytes, vmalloc)
[    0.744036] TCP established hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[    0.744142] TCP bind hash table entries: 32768 (order: 7, 524288 bytes, vmalloc)
[    0.744246] TCP: Hash tables configured (established 32768 bind 32768)
[    0.744634] UDP hash table entries: 2048 (order: 4, 65536 bytes, vmalloc)
[    0.744661] UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes, vmalloc)
[    0.745162] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.745178] pci 0000:00:0f.0: can't claim BAR 6 [mem 0xffff8000-0xffffffff pref]: no compatible bridge window
[    0.745187] pci 0000:02:00.0: can't claim BAR 6 [mem 0xffff0000-0xffffffff pref]: no compatible bridge window
[    0.745190] pci 0000:02:02.0: can't claim BAR 6 [mem 0xffff0000-0xffffffff pref]: no compatible bridge window
[    0.745206] pci 0000:00:0f.0: BAR 6: assigned [mem 0xffbf8000-0xffbfffff pref]
[    0.745212] pci 0000:00:01.0: PCI bridge to [bus 01]
[    0.745511] pci_bus 0000:00: resource 4 [io  0x1f00-0xffff window]
[    0.745516] pci_bus 0000:00: resource 5 [io  0x0c00-0x0cf7 window]
[    0.745520] pci_bus 0000:00: resource 6 [io  0x0000-0x09ff window]
[    0.745523] pci_bus 0000:00: resource 7 [mem 0xffbf0000-0xffdfffff window]
[    0.745526] pci_bus 0000:00: resource 8 [mem 0xfef00000-0xffbdffff window]
[    0.745530] pci_bus 0000:00: resource 9 [mem 0xfed45000-0xfedfffff window]
[    0.745533] pci_bus 0000:00: resource 10 [mem 0xfec10000-0xfed3ffff window]
[    0.745536] pci_bus 0000:00: resource 11 [mem 0xf8000000-0xf82fffff window]
[    0.745539] pci_bus 0000:00: resource 12 [mem 0xf0000000-0xf7ffffff pref window]
[    0.745542] pci_bus 0000:00: resource 13 [mem 0xc0000000-0xed2fffff window]
[    0.745546] pci_bus 0000:00: resource 14 [mem 0x000a0000-0x000bffff window]
[    0.745549] pci_bus 0000:00: resource 15 [mem 0x000c4000-0x000c7fff window]
[    0.745552] pci_bus 0000:00: resource 16 [mem 0x000c8000-0x000cbfff window]
[    0.745555] pci_bus 0000:00: resource 17 [mem 0x000d4000-0x000d7fff window]
[    0.745558] pci_bus 0000:00: resource 18 [mem 0x000d8000-0x000dbfff window]
[    0.745561] pci_bus 0000:00: resource 19 [mem 0x000e4000-0x000e7fff window]
[    0.745564] pci_bus 0000:00: resource 20 [mem 0x000e8000-0x000ebfff window]
[    0.745567] pci_bus 0000:00: resource 21 [mem 0x000ec000-0x000effff window]
[    0.745657] pci 0000:02:00.0: BAR 6: assigned [mem 0xfdd00000-0xfdd0ffff pref]
[    0.745662] pci 0000:02:02.0: BAR 6: assigned [mem 0xfdd10000-0xfdd1ffff pref]
[    0.745667] pci_bus 0000:02: resource 4 [mem 0xffbe0000-0xffbeffff window]
[    0.745670] pci_bus 0000:02: resource 5 [mem 0xfdd00000-0xfe1fffff pref window]
[    0.745673] pci_bus 0000:02: resource 6 [mem 0xfe200000-0xfebfffff window]
[    0.745676] pci_bus 0000:02: resource 7 [io  0x0a00-0x0bff window]
[    0.745716] pci_bus 0000:03: resource 4 [mem 0xfce00000-0xfd2fffff pref window]
[    0.745720] pci_bus 0000:03: resource 5 [mem 0xfd300000-0xfdcfffff window]
[    0.745723] pci_bus 0000:03: resource 6 [io  0x0d00-0x0eff window]
[    0.745762] pci_bus 0000:04: resource 4 [mem 0xfbf00000-0xfc3fffff pref window]
[    0.745766] pci_bus 0000:04: resource 5 [mem 0xfc400000-0xfcdfffff window]
[    0.745769] pci_bus 0000:04: resource 6 [io  0x0f00-0x10ff window]
[    0.745808] pci_bus 0000:05: resource 4 [mem 0xfb000000-0xfb4fffff pref window]
[    0.745812] pci_bus 0000:05: resource 5 [mem 0xfb500000-0xfbefffff window]
[    0.745815] pci_bus 0000:05: resource 6 [io  0x1100-0x12ff window]
[    0.745853] pci_bus 0000:06: resource 4 [mem 0xfa100000-0xfa5fffff pref window]
[    0.745856] pci_bus 0000:06: resource 5 [mem 0xfa600000-0xfaffffff window]
[    0.745859] pci_bus 0000:06: resource 6 [io  0x1300-0x14ff window]
[    0.745898] pci_bus 0000:07: resource 4 [mem 0xf9200000-0xf96fffff pref window]
[    0.745902] pci_bus 0000:07: resource 5 [mem 0xf9700000-0xfa0fffff window]
[    0.745905] pci_bus 0000:07: resource 6 [io  0x1500-0x16ff window]
[    0.745944] pci_bus 0000:08: resource 4 [mem 0xf8300000-0xf87fffff pref window]
[    0.745947] pci_bus 0000:08: resource 5 [mem 0xf8800000-0xf91fffff window]
[    0.745950] pci_bus 0000:08: resource 6 [io  0x1700-0x18ff window]
[    0.745988] pci_bus 0000:09: resource 4 [mem 0xef100000-0xef5fffff pref window]
[    0.745992] pci_bus 0000:09: resource 5 [mem 0xef600000-0xefffffff window]
[    0.745995] pci_bus 0000:09: resource 6 [io  0x1900-0x1aff window]
[    0.746033] pci_bus 0000:0a: resource 4 [mem 0xee200000-0xee6fffff pref window]
[    0.746036] pci_bus 0000:0a: resource 5 [mem 0xee700000-0xef0fffff window]
[    0.746039] pci_bus 0000:0a: resource 6 [io  0x1b00-0x1cff window]
[    0.746078] pci_bus 0000:0b: resource 4 [mem 0xed300000-0xed7fffff pref window]
[    0.746082] pci_bus 0000:0b: resource 5 [mem 0xed800000-0xee1fffff window]
[    0.746085] pci_bus 0000:0b: resource 6 [io  0x1d00-0x1eff window]
[    0.746127] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.746441] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    0.746519] PCI: CLS 32 bytes, default 64
[    0.746550] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.746552] software IO TLB: mapped [mem 0x00000000ba600000-0x00000000be600000] (64MB)
[    0.746690] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x36c6563e7a6, max_idle_ns: 881590602333 ns
[    0.746879] clocksource: Switched to clocksource tsc
[    0.749913] Initialise system trusted keyrings
[    0.750514] workingset: timestamp_bits=36 max_order=20 bucket_order=0
[    0.751994] NET: Registered PF_ALG protocol family
[    0.752000] Key type asymmetric registered
[    0.752002] Asymmetric key parser 'x509' registered
[    0.752037] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 249)
[    0.753154] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.754542] vmw_vmci 0000:00:07.7: enabling device (0000 -> 0003)
[    0.756581] vmw_vmci 0000:00:07.7: Found VMCI PCI device at 0x10800, irq 16
[    0.756736] vmw_vmci 0000:00:07.7: Using capabilities 0x1c
[    0.758045] Guest personality initialized and is active
[    0.758144] VMCI host device registered (name=vmci, major=10, minor=126)
[    0.758148] Initialized host personality
[    0.758198] mpt3sas version 39.100.00.00 loaded
[    0.758390] VMware PVSCSI driver - version 1.0.7.0-k
[    0.759957] vmw_pvscsi: using 64bit dma
[    0.760190] vmw_pvscsi: max_id: 65
[    0.760195] vmw_pvscsi: setting ring_pages to 32
[    0.762686] vmw_pvscsi: enabling reqCallThreshold
[    0.763062] vmw_pvscsi: driver-based request coalescing enabled
[    0.763066] vmw_pvscsi: using MSI-X
[    0.763288] scsi host0: VMware PVSCSI storage adapter rev 2, req/cmp/msg rings: 32/32/1 pages, cmd_per_lun=254
[    0.764168] vmw_pvscsi 0000:02:00.0: VMware PVSCSI rev 2 host #0
[    0.764762] scsi 0:0:0:0: Direct-Access     VMware   Virtual disk     2.0  PQ: 0 ANSI: 6
[    0.765449] ata_piix 0000:00:07.1: version 2.13
[    0.766977] scsi host1: ata_piix
[    0.768474] scsi host2: ata_piix
[    0.768875] ata1: PATA max UDMA/33 cmd 0x1f0 ctl 0x3f6 bmdma 0x850 irq 14
[    0.768881] ata2: PATA max UDMA/33 cmd 0x170 ctl 0x376 bmdma 0x858 irq 15
[    0.768944] VMware vmxnet3 virtual NIC driver - version 1.6.0.0-k-NAPI
[    0.769062] vmxnet3 0000:02:02.0: enabling device (0000 -> 0003)
[    0.771674] vmxnet3 0000:02:02.0: # of Tx queues : 4, # of Rx queues : 4
[    0.773997] vmxnet3 0000:02:02.0 eth0: NIC Link is Up 10000 Mbps
[    0.774044] Fusion MPT base driver 3.04.20
[    0.774046] Copyright (c) 1999-2008 LSI Corporation
[    0.774054] Fusion MPT SPI Host driver 3.04.20
[    0.774076] Fusion MPT SAS Host driver 3.04.20
[    0.774148] i8042: PNP: PS/2 Controller [PNP0303:KBC,PNP0f13:MOUS] at 0x60,0x64 irq 1,12
[    0.777955] serio: i8042 KBD port at 0x60,0x64 irq 1
[    0.777968] serio: i8042 AUX port at 0x60,0x64 irq 12
[    0.782950] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[    0.784554] rtc_cmos 00:01: registered as rtc0
[    0.784561] rtc_cmos 00:01: alarms up to one month, y3k, 242 bytes nvram, hpet irqs
[    0.784585] intel_pstate: CPU model not supported
[    0.784614] efifb: probing for efifb
[    0.784635] efifb: framebuffer at 0xf0000000, using 1200k, total 1200k
[    0.784639] efifb: mode is 640x480x32, linelength=2560, pages=1
[    0.784642] efifb: scrolling: redraw
[    0.784643] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.785595] Console: switching to colour frame buffer device 80x30
[    0.789891] fb0: EFI VGA frame buffer device
[    0.803545] scsi 0:0:0:0: Attached scsi generic sg0 type 0
[    0.804168] sd 0:0:0:0: [sda] 33554432 512-byte logical blocks: (17.2 GB/16.0 GiB)
[    0.804235] sd 0:0:0:0: [sda] Write Protect is off
[    0.804240] sd 0:0:0:0: [sda] Mode Sense: 3b 00 00 00
[    0.804343] sd 0:0:0:0: [sda] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[    0.807196] pstore: Registered efi as persistent store backend
[    0.807457] Initializing XFRM netlink socket
[    0.807467] NET: Registered PF_PACKET protocol family
[    0.807857] NET: Registered PF_VSOCK protocol family
[    0.808043] IPI shorthand broadcast: enabled
[    0.808071] sched_clock: Marking stable (800446062, 7176662)->(817554884, -9932160)
[    0.808365] registered taskstats version 1
[    0.808369] Loading compiled-in X.509 certificates
[    0.810934] Loaded X.509 cert 'Build time autogenerated kernel key: 9600dd9b26c16a08c9372856781d4ae9e1d1f2b2'
[    0.811747] pstore: Using crash dump compression: deflate
[    0.811787] AppArmor: AppArmor sha1 policy hashing enabled
[    0.811798] ima: No TPM chip found, activating TPM-bypass!
[    0.811801] ima: Allocated hash algorithm: sha256
[    0.811823] ima: No architecture policies found
[    0.824363]  sda: sda1 sda2 sda3
[    0.825535] sd 0:0:0:0: [sda] Attached SCSI disk
[    0.978010] EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[    0.978052] VFS: Mounted root (ext4 filesystem) readonly on device 8:3.
[    0.983240] devtmpfs: mounted
[    0.987453] Freeing unused decrypted memory: 2044K
[    0.988209] Freeing unused kernel image (initmem) memory: 2012K
[    0.988303] Write protecting the kernel read-only data: 16384k
[    0.989357] Freeing unused kernel image (text/rodata gap) memory: 2036K
[    0.989964] Freeing unused kernel image (rodata/data gap) memory: 1564K
[    0.990073] Run /lib/systemd/systemd as init process
[    0.990076]   with arguments:
[    0.990079]     /lib/systemd/systemd
[    0.990081]     nokaslr
[    0.990082]   with environment:
[    0.990084]     HOME=/
[    0.990086]     TERM=linux
[    0.990087]     BOOT_IMAGE=/boot/vmlinuz-panic
[    1.307781] random: fast init done
[    1.784307] ipv6: module verification failed: signature and/or required key missing - tainting kernel
[    1.788970] NET: Registered PF_INET6 protocol family
[    1.790213] Segment Routing with IPv6
[    1.790225] In-situ OAM (IOAM) with IPv6
[    1.887084] systemd[1]: systemd v247.3-1.ph4 running in system mode. (+PAM -AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 -IDN -PCRE2 default-hierarchy=hybrid)
[    1.887282] systemd[1]: Detected virtualization vmware.
[    1.887294] systemd[1]: Detected architecture x86-64.
[    1.920933] systemd[1]: Set hostname to <photon-576f8974caf.org>.
[    2.315320] random: lvmconfig: uninitialized urandom read (4 bytes read)
[    2.806135] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-udev.service:27: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.806161] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-udev.service:28: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.808108] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-trigger.service:23: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.808131] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-trigger.service:24: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.809818] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-pivot.service:30: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.809840] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-pivot.service:31: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.811348] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-mount.service:22: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.811370] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-pre-mount.service:23: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.813895] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-mount.service:22: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.813917] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-mount.service:23: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.815272] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-initqueue.service:24: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.815295] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-initqueue.service:25: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.822061] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-cmdline.service:26: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
[    2.822084] systemd[1]: /usr/lib/dracut/modules.d/98dracut-systemd/dracut-cmdline.service:27: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
[    2.889763] systemd[1]: /usr/lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update the unit file accordingly.
[    3.000241] systemd[1]: Queued start job for default target Multi-User System.
[    3.001785] systemd[1]: system-getty.slice: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
[    3.001796] systemd[1]: (This warning is only shown for the first unit using IP firewalling.)
[    3.033876] systemd[1]: Created slice system-getty.slice.
[    3.039991] systemd[1]: Created slice system-modprobe.slice.
[    3.046641] systemd[1]: Created slice User and Session Slice.
[    3.046890] systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[    3.047074] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[    3.047705] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[    3.048535] systemd[1]: Reached target Paths.
[    3.048647] systemd[1]: Reached target Remote File Systems.
[    3.048701] systemd[1]: Reached target Slices.
[    3.048909] systemd[1]: Reached target Swap.
[    3.049307] systemd[1]: Listening on Device-mapper event daemon FIFOs.
[    3.054703] systemd[1]: Listening on Process Core Dump Socket.
[    3.054948] systemd[1]: Listening on initctl Compatibility Named Pipe.
[    3.055550] systemd[1]: Listening on Journal Audit Socket.
[    3.055904] systemd[1]: Listening on Journal Socket (/dev/log).
[    3.056285] systemd[1]: Listening on Journal Socket.
[    3.056828] systemd[1]: Listening on Network Service Netlink Socket.
[    3.057242] systemd[1]: Listening on udev Control Socket.
[    3.057549] systemd[1]: Listening on udev Kernel Socket.
[    3.057870] systemd[1]: Listening on User Database Manager Socket.
[    3.070249] systemd[1]: Mounting Huge Pages File System...
[    3.077559] systemd[1]: Mounting POSIX Message Queue File System...
[    3.085770] systemd[1]: Mounting Kernel Debug File System...
[    3.094682] systemd[1]: Mounting Kernel Trace File System...
[    3.104339] systemd[1]: Mounting Temporary Directory (/tmp)...
[    3.113358] systemd[1]: Starting Create list of static device nodes for the current kernel...
[    3.132848] systemd[1]: Starting Load Kernel Module configfs...
[    3.146627] systemd[1]: Starting Load Kernel Module drm...
[    3.157594] systemd[1]: Starting Load Kernel Module fuse...
[    3.158676] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
[    3.170207] systemd[1]: Starting File System Check on Root Device...
[    3.180668] systemd[1]: Starting Journal Service...
[    3.198001] systemd[1]: Starting Load Kernel Modules...
[    3.198453] systemd[1]: Condition check resulted in Repartition Root Disk being skipped.
[    3.207343] systemd[1]: Starting Coldplug All udev Devices...
[    3.213098] systemd[1]: Mounted Huge Pages File System.
[    3.213366] systemd[1]: Mounted POSIX Message Queue File System.
[    3.213614] systemd[1]: Mounted Kernel Debug File System.
[    3.213856] systemd[1]: Mounted Kernel Trace File System.
[    3.214135] systemd[1]: Mounted Temporary Directory (/tmp).
[    3.215082] systemd[1]: Finished Create list of static device nodes for the current kernel.
[    3.215806] systemd[1]: modprobe@configfs.service: Succeeded.
[    3.216454] systemd[1]: Finished Load Kernel Module configfs.
[    3.229800] systemd[1]: Mounting Kernel Configuration File System...
[    3.246706] systemd[1]: Starting Create Static Device Nodes in /dev...
[    3.250135] systemd[1]: Mounted Kernel Configuration File System.
[    3.266559] systemd[1]: Finished Load Kernel Modules.
[    3.277903] systemd[1]: Starting Apply Kernel Variables...
[    3.383345] systemd[1]: Finished Apply Kernel Variables.
[    3.388925] systemd[1]: Finished File System Check on Root Device.
[    3.402430] systemd[1]: Starting Remount Root and Kernel File Systems...
[    3.412880] fuse: init (API version 7.34)
[    3.414490] systemd[1]: modprobe@fuse.service: Succeeded.
[    3.415403] systemd[1]: Finished Load Kernel Module fuse.
[    3.424760] systemd[1]: Mounting FUSE Control File System...
[    3.432541] systemd[1]: Mounted FUSE Control File System.
[    3.455386] systemd[1]: Started Journal Service.
[    3.623418] EXT4-fs (sda3): Mount option "noacl" will be removed by 3.5
               Contact linux-ext4@vger.kernel.org if you think we should keep it.

[    3.711991] EXT4-fs (sda3): re-mounted. Opts: barrier,noacl,data=ordered. Quota mode: none.
[    3.762482] systemd-journald[157]: Received client request to flush runtime journal.
[    4.523734] cryptd: max_cpu_qlen set to 1000
[    4.694398] input: ImPS/2 Generic Wheel Mouse as /devices/platform/i8042/serio1/input/input3
[    4.741837] AVX version of gcm_enc/dec engaged.
[    4.741942] AES CTR mode by8 optimization enabled
[    4.800672] mousedev: PS/2 mouse device common for all mice
[    4.941170] random: crng init done
[    5.010282] FAT-fs (sda2): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[    5.140605] Linux agpgart interface v0.103
[    5.427932] checking generic (f0000000 12c000) vs hw (f0000000 8000000)
[    5.427944] fb0: switching to vmwgfx from EFI VGA
[    5.428175] Console: switching to colour dummy device 80x25
[    5.428243] vmwgfx 0000:00:0f.0: vgaarb: deactivate vga console
[    5.430039] [TTM] Zone  kernel: Available graphics memory: 1998782 KiB
[    5.430110] vmwgfx 0000:00:0f.0: [drm] FIFO at 0x00000000ff000000 size is 8192 kiB
[    5.430237] vmwgfx 0000:00:0f.0: [drm] VRAM at 0x00000000f0000000 size is 131072 kiB
[    5.430296] vmwgfx 0000:00:0f.0: [drm] Running on SVGA version 2.
[    5.430325] vmwgfx 0000:00:0f.0: [drm] DMA map mode: Caching DMA mappings.
[    5.430606] vmwgfx 0000:00:0f.0: [drm] Legacy memory limits: VRAM = 4096 kB, FIFO = 256 kB, surface = 0 kB
[    5.430612] vmwgfx 0000:00:0f.0: [drm] MOB limits: max mob size = 16384 kB, max mob pages = 12288
[    5.430619] vmwgfx 0000:00:0f.0: [drm] Capabilities: rect copy, cursor, cursor bypass, cursor bypass 2, 8bit emulation, alpha cursor, extended fifo, multimon, pitchlock, irq mask, display topology, gmr, traces, gmr2, screen object 2, command buffers, command buffers 2, gbobject, dx, hp cmd queue, no bb restriction, cap2 register,
[    5.430626] vmwgfx 0000:00:0f.0: [drm] Capabilities2: grow otable, intra surface copy, dx2, gb memsize 2, screendma reg, otable ptdepth2, non ms to ms stretchblt, cursor mob, mshint, cb max size 4mb, dx3, frame type, cotable copy, trace full fb, extra regs, lo staging,
[    5.430630] vmwgfx 0000:00:0f.0: [drm] Max GMR ids is 64
[    5.430632] vmwgfx 0000:00:0f.0: [drm] Max number of GMR pages is 65536
[    5.430635] vmwgfx 0000:00:0f.0: [drm] Maximum display memory size is 16384 kiB
[    5.443919] vmwgfx 0000:00:0f.0: [drm] Screen Target display unit initialized
[    5.445835] vmwgfx 0000:00:0f.0: [drm] Fifo max 0x00040000 min 0x00001000 cap 0x0000077f
[    5.447717] vmwgfx 0000:00:0f.0: [drm] Using command buffers with DMA pool.
[    5.447733] vmwgfx 0000:00:0f.0: [drm] Available shader model: Legacy.
[    5.450011] fbcon: svgadrmfb (fb0) is primary device
[    5.451210] Console: switching to colour frame buffer device 100x37
[    5.458015] [drm] Initialized vmwgfx 2.19.0 20210722 for 0000:00:0f.0 on minor 0
[    8.851698] vmxnet3 0000:02:02.0 eth0: intr type 3, mode 0, 5 vectors allocated
[    8.853628] vmxnet3 0000:02:02.0 eth0: NIC Link is Up 10000 Mbps

2. lscpu output before CPU hot add
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       4
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           62
Model name:                      Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
Stepping:                        4
CPU MHz:                         1899.999
BogoMIPS:                        3799.99
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        80 MiB
NUMA node0 CPU(s):               0
NUMA node1 CPU(s):               1
NUMA node2 CPU(s):               2
NUMA node3 CPU(s):               3
Vulnerability Itlb multihit:     Processor vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
                                 rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq
                                 ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm
                                 cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep arat md_clear flush_l1d arch_capabilities


3. Last 12 lines of dmesg after CPU hot add
# dmesg | tail -n 12
[    5.451210] Console: switching to colour frame buffer device 100x37
[    5.458015] [drm] Initialized vmwgfx 2.19.0 20210722 for 0000:00:0f.0 on minor 0
[    8.851698] vmxnet3 0000:02:02.0 eth0: intr type 3, mode 0, 5 vectors allocated
[    8.853628] vmxnet3 0000:02:02.0 eth0: NIC Link is Up 10000 Mbps
[  152.680387] audit: type=1006 audit(1637024339.355:2): pid=415 uid=0 subj==unconfined old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
[  152.680403] audit: type=1300 audit(1637024339.355:2): arch=c000003e syscall=1 success=yes exit=1 a0=8 a1=7ffeaf68d5c0 a2=1 a3=0 items=0 ppid=1 pid=415 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=1 comm="(systemd)" exe="/usr/lib/systemd/systemd" subj==unconfined key=(null)
[  152.680410] audit: type=1327 audit(1637024339.355:2): proctitle="(systemd)"
[  152.909780] audit: type=1006 audit(1637024339.583:3): pid=405 uid=0 subj==unconfined old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=2 res=1
[  152.909807] audit: type=1300 audit(1637024339.583:3): arch=c000003e syscall=1 success=yes exit=1 a0=7 a1=7ffd04812df0 a2=1 a3=0 items=0 ppid=1 pid=405 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=2 comm="sshd" exe="/usr/sbin/sshd" subj==unconfined key=(null)
[  152.909819] audit: type=1327 audit(1637024339.583:3): proctitle=737368643A20726F6F74205B707269765D
[  720.093493] CPU4 has been hot-added
[  720.093651] acpi_processor_hotadd_init:205 cpu 4, node 4, online 0, ndata 0000000000000000

Comment.
The last message line is printed by the following debug patch, which was added to get more information about the newly added node:
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 6737b1cbf..bbc1a70d5 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -200,6 +200,10 @@ static int acpi_processor_hotadd_init(struct acpi_processor *pr)
         * gets online for the first time.
         */
        pr_info("CPU%d has been hot-added\n", pr->id);
+       {
+               int nid = cpu_to_node(pr->id);
+               pr_info("%s:%d cpu %d, node %d, online %d, ndata %p\n", __func__, __LINE__, pr->id, nid, node_online(nid), NODE_DATA(nid));
+       }
        pr->flags.need_hotplug_init = 1;

 out:
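
The debug line above reports ndata 0000000000000000 for node 4, i.e. NODE_DATA(nid) is NULL, and the panic reported in this thread faults at address 0000000000001608. A small fault address like this is what you would expect from reading a field through a NULL base pointer: the address is just the sum of the field offsets. The following userspace sketch only illustrates that arithmetic. The struct names, the field names, and the 0x1600 + 8 split are assumptions made for illustration (0x1600 is the RAX value in the oops from the original report); they are not the kernel's real definitions, whose layout depends on the configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a zonelist entry. The 8-byte slot mimics a
 * 64-bit pointer, followed by the index that is read when the list is walked.
 */
struct zoneref_sketch {
	uint64_t zone;   /* stands in for an 8-byte pointer on x86-64 */
	int zone_idx;    /* field assumed to be read at fault time */
};

/*
 * Hypothetical stand-in for per-node data. The padding size is chosen so
 * that the zonelist starts at 0x1600, matching RAX in the reported oops;
 * the real offset is configuration dependent.
 */
struct pgdat_sketch {
	unsigned char before_zonelist[0x1600];
	struct zoneref_sketch zonelist[1];
};

/* Sanity check on the illustrative layout: the index follows the pointer slot. */
static_assert(offsetof(struct zoneref_sketch, zone_idx) == 8,
	      "zone_idx is expected right after the 8-byte pointer slot");

/* Address touched by a read of zonelist[0].zone_idx for a given node base. */
uintptr_t zone_idx_addr(uintptr_t pgdat_base)
{
	return pgdat_base
		+ offsetof(struct pgdat_sketch, zonelist)
		+ offsetof(struct zoneref_sketch, zone_idx);
}
```

With a NULL base, zone_idx_addr(0) evaluates to 0x1608 under these assumed offsets, which is consistent with the reported fault address; with valid node data the same read would land inside an allocated structure.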


4. lscpu output after CPU hot add
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          5
On-line CPU(s) list:             0-3
Off-line CPU(s) list:            4
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       4
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           62
Model name:                      Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
Stepping:                        4
CPU MHz:                         1899.999
BogoMIPS:                        3799.99
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        80 MiB
NUMA node0 CPU(s):               0
NUMA node1 CPU(s):               1
NUMA node2 CPU(s):               2
NUMA node3 CPU(s):               3
Vulnerability Itlb multihit:     Processor vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx r
                                 dtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq sss
                                 e3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fa
                                 ult pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep arat md_clear flush_l1d arch_capabilities

5. Panic message.
[  152.909819] audit: type=1327 audit(1637024339.583:3): proctitle=737368643A20726F6F74205B707269765D
[  720.093493] CPU4 has been hot-added
[  720.093651] acpi_processor_hotadd_init:205 cpu 4, node 4, online 0, ndata 0000000000000000
[ 1173.844361] BUG: unable to handle page fault for address: 0000000000001608
[ 1173.844443] #PF: supervisor read access in kernel mode
[ 1173.844483] #PF: error_code(0x0000) - not-present page
[ 1173.844522] PGD 0 P4D 0
[ 1173.844546] Oops: 0000 [#1] SMP PTI
[ 1173.844576] CPU: 2 PID: 1 Comm: systemd Tainted: G            E     5.15.0-rc7-panic+ #23
[ 1173.844637] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18452719.B64.2108091906 08/09/2021
[ 1173.844721] RIP: 0010:__alloc_pages+0x135/0x2f0
[ 1173.844763] Code: 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 4c 8b 45 a0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c0 48 85 d2 0f 85 4c 01 00 00 <45> 3b 68 08 0f 82 42 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
[ 1173.844893] RSP: 0018:ffffc900006efba0 EFLAGS: 00010246
[ 1173.844934] RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
[ 1173.845006] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
[ 1173.845061] RBP: ffffc900006efc00 R08: 0000000000001600 R09: 0000000000000000
[ 1173.845112] R10: 0000000000000004 R11: ffffe8ff7f601050 R12: 0000000000000cc2
[ 1173.845163] R13: 0000000000000001 R14: 0000000000000cc2 R15: 0000000000000000
[ 1173.845214] FS:  00007f76e2a4f500(0000) GS:ffff8880be600000(0000) knlGS:0000000000000000
[ 1173.845272] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1173.845314] CR2: 0000000000001608 CR3: 0000000080adc001 CR4: 00000000001706a0
[ 1173.845420] Call Trace:
[ 1173.845449]  pcpu_alloc_pages.constprop.0+0xd3/0x1b0
[ 1173.845494]  pcpu_populate_chunk+0x38/0xb0
[ 1173.845581]  pcpu_alloc+0x57b/0x830
[ 1173.845760]  __alloc_percpu_gfp+0x12/0x20
[ 1173.845948]  alloc_mem_cgroup_per_node_info+0x5e/0xc0
[ 1173.846154]  mem_cgroup_alloc+0xf2/0x2f0
[ 1173.846342]  mem_cgroup_css_alloc+0x38/0x300
[ 1173.846543]  css_create+0x3f/0x200
[ 1173.846730]  cgroup_apply_control_enable+0x13b/0x160
[ 1173.846938]  cgroup_mkdir+0xe6/0x190
[ 1173.847135]  kernfs_iop_mkdir+0x5c/0x90
[ 1173.847324]  vfs_mkdir+0x17d/0x230
[ 1173.848489]  do_mkdirat+0x102/0x120
[ 1173.849673]  __x64_sys_mkdir+0x4c/0x70
[ 1173.850741]  ? syscall_exit_to_user_mode+0x21/0x50
[ 1173.851785]  do_syscall_64+0x43/0x90
[ 1173.852835]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1173.854074] RIP: 0033:0x7f76e3013937
[ 1173.855963] Code: 00 00 00 48 8b 05 31 05 0d 00 41 bc ff ff ff ff 64 c7 00 16 00 00 00 e9 37 ff ff ff e8 32 e7 01 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 01 05 0d 00 f7 d8 64 89 01 48
[ 1173.859296] RSP: 002b:00007ffeaf68dd18 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
[ 1173.860399] RAX: ffffffffffffffda RBX: 00005640f268ba10 RCX: 00007f76e3013937
[ 1173.861502] RDX: 00000000000001ed RSI: 00000000000001ed RDI: 00005640f2760e80
[ 1173.862612] RBP: 00007ffeaf68dd30 R08: 0000000000000001 R09: 0000000000000000
[ 1173.863743] R10: 00005640f2760e80 R11: 0000000000000202 R12: 00005640f0d908e0
[ 1173.865090] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1173.866109] Modules linked in: vmwgfx(E) ttm(E) agpgart(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) mousedev(E) sysimgblt(E) vfat(E) aesni_intel(E) fat(E) fb_sys_fops(E) crypto_simd(E) cryptd(E) psmouse(E) evdev(E) drm(E) fuse(E) i2c_core(E) configfs(E) efivarfs(E) ipv6(E)
[ 1173.869624] CR2: 0000000000001608
[ 1173.870794] ---[ end trace 2f96517946d847e9 ]---
[ 1174.236250] RIP: 0010:__alloc_pages+0x135/0x2f0
[ 1174.237390] Code: 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 4c 8b 45 a0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c0 48 85 d2 0f 85 4c 01 00 00 <45> 3b 68 08 0f 82 42 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
[ 1174.241007] RSP: 0018:ffffc900006efba0 EFLAGS: 00010246
[ 1174.242175] RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
[ 1174.243362] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
[ 1174.244672] RBP: ffffc900006efc00 R08: 0000000000001600 R09: 0000000000000000
[ 1174.245859] R10: 0000000000000004 R11: ffffe8ff7f601050 R12: 0000000000000cc2
[ 1174.247037] R13: 0000000000000001 R14: 0000000000000cc2 R15: 0000000000000000
[ 1174.248333] FS:  00007f76e2a4f500(0000) GS:ffff8880be600000(0000) knlGS:0000000000000000
[ 1174.249599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1174.251186] CR2: 0000000000001608 CR3: 0000000080adc001 CR4: 00000000001706a0
[ 1174.252570] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 1174.253994] Kernel Offset: disabled


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-16  1:31                                   ` Alexey Makhalov
@ 2021-11-16  9:17                                     ` Michal Hocko
  2021-11-16 20:22                                       ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-16  9:17 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 16-11-21 01:31:44, Alexey Makhalov wrote:
[...]
> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> index 6737b1cbf..bbc1a70d5 100644
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -200,6 +200,10 @@ static int acpi_processor_hotadd_init(struct acpi_processor *pr)
>          * gets online for the first time.
>          */
>         pr_info("CPU%d has been hot-added\n", pr->id);
> +       {
> +               int nid = cpu_to_node(pr->id);
> +               printk("%s:%d cpu %d, node %d, online %d, ndata %p\n", __FUNCTION__, __LINE__, pr->id, nid, node_online(nid), NODE_DATA(nid));
> +       }
>         pr->flags.need_hotplug_init = 1;

OK, IIUC you are adding a processor which is outside of
possible_cpu_mask and that means that the node is not allocated for such
a future to be hotplugged cpu and its memory node. init_cpu_to_node
would have done that initialization otherwise. I think you want to talk
to x86 maintainers and people who have introduced support for
memoryless nodes on x86.

To me it seems like you are trying to use a functionality that has
never been properly implemented. I do not remember how other acpi based
architectures handle this. Maybe we need a generic solution that
would bring up the node as soon as a new cpu is hot added.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-16  9:17                                     ` Michal Hocko
@ 2021-11-16 20:22                                       ` Alexey Makhalov
  2021-11-18  8:35                                         ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-11-16 20:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable


> On Nov 16, 2021, at 1:17 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Tue 16-11-21 01:31:44, Alexey Makhalov wrote:
> [...]
>> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
>> index 6737b1cbf..bbc1a70d5 100644
>> --- a/drivers/acpi/acpi_processor.c
>> +++ b/drivers/acpi/acpi_processor.c
>> @@ -200,6 +200,10 @@ static int acpi_processor_hotadd_init(struct acpi_processor *pr)
>>        * gets online for the first time.
>>        */
>>       pr_info("CPU%d has been hot-added\n", pr->id);
>> +       {
>> +               int nid = cpu_to_node(pr->id);
>> +               printk("%s:%d cpu %d, node %d, online %d, ndata %p\n", __FUNCTION__, __LINE__, pr->id, nid, node_online(nid), NODE_DATA(nid));
>> +       }
>>       pr->flags.need_hotplug_init = 1;
> 
> OK, IIUC you are adding a processor which is outside of
> possible_cpu_mask and that means that the node is not allocated for such
> a future to be hotplugged cpu and its memory node. init_cpu_to_node
> would have done that initialization otherwise.
It is not correct.

possible_cpus is 128 for this VM. Look at SRAT and percpu output for proof.
[    0.085524] SRAT: PXM 127 -> APIC 0xfe -> Node 127
[    0.118928] setup_percpu: NR_CPUS:128 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:128

It is impossible to add a processor outside of possible_cpu_mask. possible_cpus is the absolute maximum
that the system can support. See Documentation/core-api/cpu_hotplug.rst

The number of present and onlined CPUs (and nodes) is 4. The other 124 CPUs (and nodes) are not present, but can
potentially be hot added.
The number of initialized nodes is 4, as init_cpu_to_node() will skip nodes that are not yet present,
see arch/x86/mm/numa.c:798 (numa_cpu_node(CPU #4) == NUMA_NO_NODE)
788 void __init init_cpu_to_node(void)
789 {
790         int cpu;
791         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
792
793         BUG_ON(cpu_to_apicid == NULL);
794
795         for_each_possible_cpu(cpu) {
796                 int node = numa_cpu_node(cpu);
797
798                 if (node == NUMA_NO_NODE)
799                         continue;
800

After CPU (and node) hot plug:
- CPU 4 is marked as present, but not yet online
- New node got ID 4. numa_cpu_node(CPU #4) returns 4
- node_online(4) == 0 and NODE_DATA(4) == NULL, but it will be accessed inside
for_each_possible_cpu loop in percpu allocation.
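
The state listed above can be modelled in userspace. This is only a
minimal sketch: every toy_* name is hypothetical, the structure is a
stand-in for pg_data_t, and the guard shown is just one way to avoid
touching a NULL node, not the fix proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 8
#define MAX_CPUS  8

/* Toy stand-in for pg_data_t; the real structure is far larger. */
struct toy_pgdat { int node_id; };

static struct toy_pgdat *toy_node_data[MAX_NODES]; /* NULL = no pgdat */
static int toy_cpu_to_node[MAX_CPUS];

/* Hypothetical helper: node to allocate from for a CPU, or
 * NUMA_NO_NODE when the CPU has no node or the node has no pgdat. */
static int toy_alloc_node_for_cpu(int cpu)
{
	int nid;

	assert(cpu >= 0 && cpu < MAX_CPUS);
	nid = toy_cpu_to_node[cpu];
	if (nid == NUMA_NO_NODE || toy_node_data[nid] == NULL)
		return NUMA_NO_NODE;
	return nid;
}

/* Recreate the reported state: nodes 0-3 initialized at boot, CPU 4
 * bound to node 4 after hot add, but node 4 still without a pgdat. */
static void toy_setup_hotadd_state(void)
{
	static struct toy_pgdat boot_nodes[4];
	int i;

	for (i = 0; i < MAX_CPUS; i++)
		toy_cpu_to_node[i] = NUMA_NO_NODE;
	for (i = 0; i < MAX_NODES; i++)
		toy_node_data[i] = NULL;
	for (i = 0; i < 4; i++) {
		boot_nodes[i].node_id = i;
		toy_node_data[i] = &boot_nodes[i];
		toy_cpu_to_node[i] = i;
	}
	toy_cpu_to_node[4] = 4; /* hot-added CPU, node not initialized */
}
```

After toy_setup_hotadd_state(), a loop over all possible CPUs that
calls toy_alloc_node_for_cpu() gets NUMA_NO_NODE for CPU 4 instead of
dereferencing a NULL node.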

Digging further.
Even if the x86/CPU hot add maintainers decide to clean up the memoryless node hot add code so that the node is initialized
at the time it is attached (to align with how mm handles nodes on memory hot add), this percpu fix is still needed, as it is used during
node onlining. See the chicken and egg problem that I described above.
Or, as a second option, numa_cpu_node(4) should return NUMA_NO_NODE until node 4 is fully initialized.

Regards,
—Alexey




* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-16 20:22                                       ` Alexey Makhalov
@ 2021-11-18  8:35                                         ` Michal Hocko
  2021-12-07 10:54                                           ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-11-18  8:35 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 16-11-21 20:22:49, Alexey Makhalov wrote:
> 
> 
> > On Nov 16, 2021, at 1:17 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Tue 16-11-21 01:31:44, Alexey Makhalov wrote:
> > [...]
> >> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> >> index 6737b1cbf..bbc1a70d5 100644
> >> --- a/drivers/acpi/acpi_processor.c
> >> +++ b/drivers/acpi/acpi_processor.c
> >> @@ -200,6 +200,10 @@ static int acpi_processor_hotadd_init(struct acpi_processor *pr)
> >>        * gets online for the first time.
> >>        */
> >>       pr_info("CPU%d has been hot-added\n", pr->id);
> >> +       {
> >> +               int nid = cpu_to_node(pr->id);
> >> +               printk("%s:%d cpu %d, node %d, online %d, ndata %p\n", __FUNCTION__, __LINE__, pr->id, nid, node_online(nid), NODE_DATA(nid));
> >> +       }
> >>       pr->flags.need_hotplug_init = 1;
> > 
> > OK, IIUC you are adding a processor which is outside of
> > possible_cpu_mask and that means that the node is not allocated for such
> > a future to be hotplugged cpu and its memory node. init_cpu_to_node
> > would have done that initialization otherwise.
> It is not correct.
> 
> possible_cpus is 128 for this VM. Look at SRAT and percpu output for proof.
> [    0.085524] SRAT: PXM 127 -> APIC 0xfe -> Node 127
> [    0.118928] setup_percpu: NR_CPUS:128 nr_cpumask_bits:128 nr_cpu_ids:128 nr_node_ids:128

OK, I see. I have missed that when looking at the boot log you have
sent.

> It is impossible to add a processor outside of possible_cpu_mask. possible_cpus is the absolute maximum
> that the system can support. See Documentation/core-api/cpu_hotplug.rst

That was my understanding hence the suspicion you might be doing
something that is not really supported.

> The number of present and onlined CPUs (and nodes) is 4. The other 124 CPUs (and nodes) are not present, but can
> potentially be hot added.

Yes this is a configuration I have already seen. The cpu->node binding
was configured during the boot time though IIRC.

> The number of initialized nodes is 4, as init_cpu_to_node() will skip nodes that are not yet present,
> see arch/x86/mm/numa.c:798 (numa_cpu_node(CPU #4) == NUMA_NO_NODE)

Isn't this the problem? Why is the cpu->node association missing here? 

> 788 void __init init_cpu_to_node(void)
> 789 {
> 790         int cpu;
> 791         u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> 792
> 793         BUG_ON(cpu_to_apicid == NULL);
> 794
> 795         for_each_possible_cpu(cpu) {
> 796                 int node = numa_cpu_node(cpu);
> 797
> 798                 if (node == NUMA_NO_NODE)
> 799                         continue;
> 800
> 
> After CPU (and node) hot plug:
> - CPU 4 is marked as present, but not yet online
> - New node got ID 4. numa_cpu_node(CPU #4) returns 4
> - node_online(4) == 0 and NODE_DATA(4) == NULL, but it will be accessed inside
> for_each_possible_cpu loop in percpu allocation.
> 
> Digging further.
> Even if the x86/CPU hot add maintainers decide to clean up the memoryless node hot add code so that the node is initialized
> at the time it is attached (to align with how mm handles nodes on memory hot add), this percpu fix is still needed, as it is used during
> node onlining. See the chicken and egg problem that I described above.

I have to say I do not see the chicken and egg problem. As long as
init_cpu_to_node initializes the memoryless node for the cpu properly
then the pcp allocator doesn't really have to care as the page allocator
falls back to the first populated node in distance order. So I believe
the whole issue boils down to addressing why init_cpu_to_node doesn't
see a proper cpu->node association.
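
The fallback mentioned above can be illustrated with a small userspace
sketch. It only shows the idea of choosing the closest node that has
memory. The function name, the table layout and the linear search are
assumptions made for illustration and are not how the page allocator
is implemented.

```c
#include <assert.h>
#include <limits.h>

#define NUMA_NO_NODE (-1)
#define TOY_NODES 4

/* Pick the populated node closest to 'from'. distance[][] plays the
 * role of node_distance() and populated[] marks nodes with memory.
 * Returns NUMA_NO_NODE when no node is populated. */
static int toy_nearest_populated_node(int from,
				      const int distance[TOY_NODES][TOY_NODES],
				      const int populated[TOY_NODES])
{
	int best = NUMA_NO_NODE, best_dist = INT_MAX, nid;

	assert(from >= 0 && from < TOY_NODES);
	for (nid = 0; nid < TOY_NODES; nid++) {
		if (!populated[nid])
			continue;
		if (distance[from][nid] < best_dist) {
			best_dist = distance[from][nid];
			best = nid;
		}
	}
	return best;
}
```

With a properly initialized memoryless node, an allocation aimed at
that node is served from whichever populated node the distance table
ranks first, so the caller does not need to know the node is empty.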

> Or, as a second option, numa_cpu_node(4) should return NUMA_NO_NODE until node 4 is fully initialized.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-11-18  8:35                                         ` Michal Hocko
@ 2021-12-07 10:54                                           ` Michal Hocko
  2021-12-07 11:08                                             ` David Hildenbrand
                                                               ` (2 more replies)
  0 siblings, 3 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 10:54 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

Hi,
I didn't have much time to dive into this deeper and I have hit some
problems handling this in an arch specific code so I have tried to play
with this instead:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..4d71759d0d9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8032,8 +8032,16 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
+	for_each_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
+			alloc_node_data(nid);
+			free_area_init_memoryless_node(nid);
+			continue;
+		}
+
 		free_area_init_node(nid);
 
 		/* Any memory on that node */

Could you give it a try? I do not have any machine which would exhibit
the problem so I cannot really test this out. I hope build_zone_info
will not choke on this. I assume the node distance table is
uninitialized for these nodes and IIUC this should lead to an assumption
that all other nodes are close. But who knows that can blow up there.

Btw. does this make any sense at all to others?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 10:54                                           ` Michal Hocko
@ 2021-12-07 11:08                                             ` David Hildenbrand
  2021-12-07 12:13                                               ` Michal Hocko
  2021-12-08  8:54                                             ` Michal Hocko
  2021-12-09 10:48                                             ` Michal Hocko
  2 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 11:08 UTC (permalink / raw)
  To: Michal Hocko, Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	Oscar Salvador, Tejun Heo, Christoph Lameter, linux-kernel,
	stable

On 07.12.21 11:54, Michal Hocko wrote:
> Hi,
> I didn't have much time to dive into this deeper and I have hit some
> problems handling this in an arch specific code so I have tried to play
> with this instead:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..4d71759d0d9b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8032,8 +8032,16 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> +	for_each_node(nid) {
>  		pg_data_t *pgdat = NODE_DATA(nid);
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
> +			alloc_node_data(nid);

That's x86 specific and not exposed to generic code -- at least in my
code base. I think we'd want an arch_alloc_nodedata() variant that
allocates via memblock -- and initializes all fields to 0. So
essentially a generic alloc_node_data().

> +			free_area_init_memoryless_node(nid);

That's really just free_area_init_node() below, I do wonder what value
free_area_init_memoryless_node() has as of today.

> +			continue;
> +		}
> +
>  		free_area_init_node(nid);
>  
>  		/* Any memory on that node */
> 
> Could you give it a try? I do not have any machine which would exhibit
> the problem so I cannot really test this out. I hope build_zone_info
> will not choke on this. I assume the node distance table is
> uninitialized for these nodes and IIUC this should lead to an assumption
> that all other nodes are close. But who knows that can blow up there.
> 
> Btw. does this make any sense at all to others?
> 

__build_all_zonelists() has to update the zonelists of all nodes I think.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 11:08                                             ` David Hildenbrand
@ 2021-12-07 12:13                                               ` Michal Hocko
  2021-12-07 12:28                                                 ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 12:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 12:08:57, David Hildenbrand wrote:
> On 07.12.21 11:54, Michal Hocko wrote:
> > Hi,
> > I didn't have much time to dive into this deeper and I have hit some
> > problems handling this in an arch specific code so I have tried to play
> > with this instead:
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c5952749ad40..4d71759d0d9b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8032,8 +8032,16 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >  	/* Initialise every node */
> >  	mminit_verify_pageflags_layout();
> >  	setup_nr_node_ids();
> > -	for_each_online_node(nid) {
> > +	for_each_node(nid) {
> >  		pg_data_t *pgdat = NODE_DATA(nid);
> > +
> > +		if (!node_online(nid)) {
> > +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
> > +			alloc_node_data(nid);
> 
> That's x86 specific and not exposed to generic code -- at least in my
> code base. I think we'd want an arch_alloc_nodedata() variant that
> allocates via memblock -- and initializes all fields to 0. So
> essentially a generic alloc_node_data().

you are right

> 
> > +			free_area_init_memoryless_node(nid);
> 
> That's really just free_area_init_node() below, I do wonder what value
> free_area_init_memoryless_node() has as of today.

I am not sure there is any real value in having this special name for
this but I have kept it in sync with what x86 does currently. If we want to
remove the wrapper then just do it everywhere. I can do that on top.

> > +			continue;
> > +		}
> > +
> >  		free_area_init_node(nid);
> >  
> >  		/* Any memory on that node */
> > 
> > Could you give it a try? I do not have any machine which would exhibit
> > the problem so I cannot really test this out. I hope build_zone_info
> > will not choke on this. I assume the node distance table is
> > uninitialized for these nodes and IIUC this should lead to an assumption
> > that all other nodes are close. But who knows that can blow up there.
> > 
> > Btw. does this make any sense at all to others?
> > 
> 
> __build_all_zonelists() has to update the zonelists of all nodes I think.

I am not sure what you mean. This should be achieved by this patch
because the boot time build_all_zonelists will go over all online nodes
(i.e. with pgdat). free_area_init happens before that. I am just worried
that the arch specific node_distance() will generate a complete garbage
or blow up for some reason.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 12:13                                               ` Michal Hocko
@ 2021-12-07 12:28                                                 ` David Hildenbrand
  2021-12-07 13:23                                                   ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 12:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

>>> +			free_area_init_memoryless_node(nid);
>>
>> That's really just free_area_init_node() below, I do wonder what value
>> free_area_init_memoryless_node() has as of today.
> 
> I am not sure there is any real value in having this special name for
> this but I have kept it in sync with what x86 does currently. If we want to
> remove the wrapper then just do it everywhere. I can do that on top.
> 

Sure, just a general comment.

>>> +			continue;
>>> +		}
>>> +
>>>  		free_area_init_node(nid);
>>>  
>>>  		/* Any memory on that node */
>>>
>>> Could you give it a try? I do not have any machine which would exhibit
>>> the problem so I cannot really test this out. I hope build_zone_info
>>> will not choke on this. I assume the node distance table is
>>> uninitialized for these nodes and IIUC this should lead to an assumption
>>> that all other nodes are close. But who knows that can blow up there.
>>>
>>> Btw. does this make any sense at all to others?
>>>
>>
>> __build_all_zonelists() has to update the zonelists of all nodes I think.
> 
> I am not sure what you mean. This should be achieved by this patch
> because the boot time build_all_zonelists will go over all online nodes

"Over all possible nodes", including online and offline ones, to make
sure any possible node has a valid pgdat. IIUC, you're not changing
anything about online vs offline nodes, only that we have a pgdat also
for offline nodes.

> (i.e. with pgdat). free_area_init happens before that. I am just worried
> that the arch specific node_distance() will generate a complete garbage
> or blow up for some reason. 

Assume you online a new zone and then call __build_all_zonelists() to
include the zone in all zonelists (via online_pages()).
__build_all_zonelists() will not include offline nodes (that still have
a pgdat with a valid zonelist now).

Similarly, assume you offline a zone and then call
__build_all_zonelists() to exclude the zone from all zonelists (via
offline_pages()). __build_all_zonelists() will not include offline nodes
(that still have a pgdat with a valid zonelist now).

Essentially, IIRC, even after your change
start_kernel()->build_all_zonelists(NULL)->build_all_zonelists_init()->__build_all_zonelists(NULL)
won't initialize the zonelist of the new pgdat, because the nodes are
offline.

I'd assume we'd need

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..e5d958abc7cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6382,7 +6382,7 @@ static void __build_all_zonelists(void *data)
        if (self && !node_online(self->node_id)) {
                build_zonelists(self);
        } else {
-               for_each_online_node(nid) {
+               for_each_node(nid) {
                        pg_data_t *pgdat = NODE_DATA(nid);

                        build_zonelists(pgdat);

to properly initialize the zonelist also for the offline nodes with a
valid pgdat.
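
The difference between the two loops can be shown with a toy model.
This is a sketch under stated assumptions: struct toy_pgdat, its
fields and toy_build_all_zonelists() are hypothetical, and a
generation counter stands in for the real zonelist contents.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NODES 4

struct toy_pgdat {
	bool allocated;    /* a pgdat exists for this possible node */
	bool online;       /* node is set in the online mask */
	int  zonelist_gen; /* generation of the last rebuild, 0 = never */
};

static struct toy_pgdat toy_nodes[TOY_NODES];

/* Rebuild "zonelists" by stamping a generation number. With
 * online_only set this mimics a for_each_online_node() walk;
 * otherwise it visits every node that has a pgdat, as a
 * for_each_node() walk would once all possible nodes have one. */
static void toy_build_all_zonelists(int gen, bool online_only)
{
	int nid;

	assert(gen > 0);
	for (nid = 0; nid < TOY_NODES; nid++) {
		struct toy_pgdat *pgdat = &toy_nodes[nid];

		if (!pgdat->allocated)
			continue;
		if (online_only && !pgdat->online)
			continue;
		pgdat->zonelist_gen = gen;
	}
}
```

With one online node and one offline node that both have a pgdat, a
rebuild with online_only set leaves the offline node at its old
generation, which corresponds to the stale zonelist described above.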

But maybe I am missing something important regarding online vs. offline
nodes that your patch changes?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 12:28                                                 ` David Hildenbrand
@ 2021-12-07 13:23                                                   ` Michal Hocko
  2021-12-07 15:09                                                     ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 13:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
[...]
> But maybe I am missing something important regarding online vs. offline
> nodes that your patch changes?

I am relying on alloc_node_data setting the node online. But if we are
to change the call to arch_alloc_nodedata then the patch needs to be
more involved. Here is what I have right now. If this happens to be the
right way then there is some additional work to sync up with the hotplug
code.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..a296e934ad2f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
+			pgdat = arch_alloc_nodedata(nid);
+			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
+			arch_refresh_nodedata(nid, pgdat);
+			node_set_online(nid);
+			/* TODO do we need register_one_node here or postpone to
+			 * when any memory is onlined there
+			 */
+			free_area_init_memoryless_node(nid);
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 13:23                                                   ` Michal Hocko
@ 2021-12-07 15:09                                                     ` David Hildenbrand
  2021-12-07 15:29                                                       ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On 07.12.21 14:23, Michal Hocko wrote:
> On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
> [...]
>> But maybe I am missing something important regarding online vs. offline
>> nodes that your patch changes?
> 
> I am relying on alloc_node_data setting the node online. But if we are
> to change the call to arch_alloc_nodedata then the patch needs to be
> more involved. Here is what I have right now. If this happens to be the
> right way then there is some additional work to sync up with the hotplug
> code.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..a296e934ad2f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
> +			pgdat = arch_alloc_nodedata(nid);
> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +			arch_refresh_nodedata(nid, pgdat);
> +			node_set_online(nid);

Setting all possible nodes online might result in quite some QE noise,
because all these nodes will then be visible in the sysfs and
try_offline_nodes() is essentially for the trash.

I agree to prealloc the pgdat, I don't think we should actually set the
nodes online. Node onlining/offlining should be done when we do have
actual CPUs/memory populated.
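
A rough userspace sketch of that split, keeping "has a pgdat" separate
from "is online". All toy_* names are hypothetical, plain bitmasks
stand in for the node masks, and calloc() stands in for a zeroing
early allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_NODES 8

struct toy_pgdat { int node_id; };

static struct toy_pgdat *toy_node_data[TOY_NODES];
static unsigned int toy_online_mask; /* bit n set = node n online */

/* Allocate a zeroed pgdat for every possible node that lacks one.
 * The online mask is deliberately left untouched. Returns the number
 * of nodes preallocated, or -1 on allocation failure. */
static int toy_prealloc_possible_nodes(unsigned int possible_mask)
{
	int nid, count = 0;

	assert(possible_mask < (1u << TOY_NODES));
	for (nid = 0; nid < TOY_NODES; nid++) {
		if (!(possible_mask & (1u << nid)) || toy_node_data[nid])
			continue;
		toy_node_data[nid] = calloc(1, sizeof(*toy_node_data[nid]));
		if (!toy_node_data[nid])
			return -1;
		toy_node_data[nid]->node_id = nid;
		count++;
	}
	return count;
}
```

In this model a later hot add only has to flip the bit in
toy_online_mask, because the pgdat for every possible node already
exists.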

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 15:09                                                     ` David Hildenbrand
@ 2021-12-07 15:29                                                       ` Michal Hocko
  2021-12-07 15:34                                                         ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 15:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 16:09:39, David Hildenbrand wrote:
> On 07.12.21 14:23, Michal Hocko wrote:
> > On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
> > [...]
> >> But maybe I am missing something important regarding online vs. offline
> >> nodes that your patch changes?
> > 
> > I am relying on alloc_node_data setting the node online. But if we are
> > to change the call to arch_alloc_node_data then the patch needs to be
> > more involved. Here is what I have right now. If this happens to be the
> > right way then there is some additional work to sync up with the hotplug
> > code.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c5952749ad40..a296e934ad2f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >  	/* Initialise every node */
> >  	mminit_verify_pageflags_layout();
> >  	setup_nr_node_ids();
> > -	for_each_online_node(nid) {
> > -		pg_data_t *pgdat = NODE_DATA(nid);
> > +	for_each_node(nid) {
> > +		pg_data_t *pgdat;
> > +
> > +		if (!node_online(nid)) {
> > +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
> > +			pgdat = arch_alloc_nodedata(nid);
> > +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> > +			arch_refresh_nodedata(nid, pgdat);
> > +			node_set_online(nid);
> 
> Setting all possible nodes online might result in quite some QE noice,
> because all these nodes will then be visible in the sysfs and
> try_offline_nodes() is essentially for the trash.

I am not sure I follow. I believe sysfs will not get populated because I
do not call register_one_node.

You are right that try_offline_nodes will be reduced, which is good imho.
More changes will be possible (hopefully to drop some ugly code) on top
of this change (or any other that ensures there are no NULL pgdats
for possible nodes).
 
> I agree to prealloc the pgdat, I don't think we should actually set the
> nodes online. Node onlining/offlining should be done when we do have
> actual CPUs/memory populated.

If we keep the offline/online node state notion we are not solving an
important aspect of the problem - the confusing API.

Node states do not really correspond to logical states, and that makes
them really hard to wrap one's head around. I think we should completely
drop for_each_online_node because it just doesn't mean anything without
synchronization with hotplug. People who really need to iterate over all
NUMA nodes should be using for_each_node and should not have to expect the
surprise that a node doesn't exist. It is much easier to think in terms of
a completely depleted NUMA node (and get ENOMEM when strictly requiring
local node resources - e.g. via __GFP_THISNODE) than of some special node
without any memory that needs special treatment.
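
As a rough illustration of the "depleted node" view, here is a user-space
toy model, not kernel code. All toy_* names are made up for this sketch,
and the thisnode flag only imitates the strictness of __GFP_THISNODE. Every
possible node has an entry, so a walk over all nodes never meets a missing
one; a node without memory simply fails a strict local request.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_NODES 4

/* Toy per-node state: just a free-page counter (hypothetical, not pg_data_t). */
struct toy_node {
	long free_pages;
};

/* Every possible node has an entry, whether or not it has memory. */
static struct toy_node toy_nodes[TOY_MAX_NODES] = {
	{ .free_pages = 8 },	/* node 0: populated */
	{ .free_pages = 0 },	/* node 1: no memory, i.e. "depleted" */
	{ .free_pages = 2 },	/* node 2: populated */
	{ .free_pages = 0 },	/* node 3: no memory */
};

/*
 * Take one page, preferring node nid. Returns the node that served the
 * request, or -ENOMEM. With thisnode set (modelled on __GFP_THISNODE)
 * there is no fallback to other nodes.
 */
static int toy_alloc_page(int nid, bool thisnode)
{
	int i;

	assert(nid >= 0 && nid < TOY_MAX_NODES);

	if (toy_nodes[nid].free_pages > 0) {
		toy_nodes[nid].free_pages--;
		return nid;
	}
	if (thisnode)
		return -ENOMEM;

	/* Fallback: plain walk over all possible nodes, no NULL checks needed. */
	for (i = 0; i < TOY_MAX_NODES; i++) {
		if (toy_nodes[i].free_pages > 0) {
			toy_nodes[i].free_pages--;
			return i;
		}
	}
	return -ENOMEM;
}
```

In this model a strict request on node 1 returns -ENOMEM, while a
non-strict request on node 1 is served from node 0. The caller never has to
ask whether the node "exists".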
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 15:29                                                       ` Michal Hocko
@ 2021-12-07 15:34                                                         ` David Hildenbrand
  2021-12-07 15:56                                                           ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 15:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On 07.12.21 16:29, Michal Hocko wrote:
> On Tue 07-12-21 16:09:39, David Hildenbrand wrote:
>> On 07.12.21 14:23, Michal Hocko wrote:
>>> On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
>>> [...]
>>>> But maybe I am missing something important regarding online vs. offline
>>>> nodes that your patch changes?
>>>
>>> I am relying on alloc_node_data setting the node online. But if we are
>>> to change the call to arch_alloc_node_data then the patch needs to be
>>> more involved. Here is what I have right now. If this happens to be the
>>> right way then there is some additional work to sync up with the hotplug
>>> code.
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c5952749ad40..a296e934ad2f 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>>>  	/* Initialise every node */
>>>  	mminit_verify_pageflags_layout();
>>>  	setup_nr_node_ids();
>>> -	for_each_online_node(nid) {
>>> -		pg_data_t *pgdat = NODE_DATA(nid);
>>> +	for_each_node(nid) {
>>> +		pg_data_t *pgdat;
>>> +
>>> +		if (!node_online(nid)) {
>>> +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
>>> +			pgdat = arch_alloc_nodedata(nid);
>>> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
>>> +			arch_refresh_nodedata(nid, pgdat);
>>> +			node_set_online(nid);
>>
>> Setting all possible nodes online might result in quite some QE noice,
>> because all these nodes will then be visible in the sysfs and
>> try_offline_nodes() is essentially for the trash.
> 
> I am not sure I follow. I believe sysfs will not get populate because I
> do not call register_one_node.

arch/x86/kernel/topology.c:topology_init()

for_each_online_node(i)
	register_one_node(i);



> 
> You are right that try_offline_nodes will be reduce which is good imho.
> More changes will be possible (hopefully to drop some ugly code) on top
> of this change (or any other that achieves that there are no NULL pgdat
> for possible nodes).
> 

No to exposing actually offline nodes to user space via sysfs.
Let's concentrate on preallocating the pgdat and fixing the issue at
hand. One step at a time please.
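
To make the sysfs concern concrete, here is a small user-space model, not
kernel code. The toy_* names and the two bool arrays are assumptions made
for this sketch; only the loop shape is taken from the topology_init()
snippet above. It shows that whatever is marked online at boot ends up
registered, populated or not.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_NODES 8

static bool toy_online[TOY_MAX_NODES];		/* stand-in for the online node mask */
static bool toy_registered[TOY_MAX_NODES];	/* stand-in for "has a sysfs node directory" */

/* Same shape as the topology_init() loop: register every online node. */
static int toy_topology_init(void)
{
	int nid, count = 0;

	for (nid = 0; nid < TOY_MAX_NODES; nid++) {
		if (!toy_online[nid])
			continue;
		toy_registered[nid] = true;
		count++;
	}
	return count;
}

/*
 * Simulate boot with nr_possible possible nodes, of which the first
 * nr_populated have memory/CPUs. If online_all is set, every possible
 * node is marked online, as node_set_online() in the draft patch would do.
 * Returns how many nodes end up registered.
 */
static int toy_boot(int nr_possible, int nr_populated, bool online_all)
{
	int nid;

	assert(nr_possible >= 0 && nr_possible <= TOY_MAX_NODES);
	assert(nr_populated >= 0 && nr_populated <= nr_possible);

	memset(toy_online, 0, sizeof(toy_online));
	memset(toy_registered, 0, sizeof(toy_registered));

	for (nid = 0; nid < nr_possible; nid++)
		toy_online[nid] = online_all || nid < nr_populated;

	return toy_topology_init();
}
```

With 4 possible nodes and 1 populated node, toy_boot(4, 1, false)
registers one node, while toy_boot(4, 1, true) registers all four.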


>> I agree to prealloc the pgdat, I don't think we should actually set the
>> nodes online. Node onlining/offlining should be done when we do have
>> actual CPUs/memory populated.
> 
> If we keep the offline/online node state notion we are not solving an
> important aspect of the problem - confusing api.

I don't think it's that confusing. Just like we do have online and
offline CPUs. Or online and offline memory blocks. Similarly, a node is
either online or offline.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 15:34                                                         ` David Hildenbrand
@ 2021-12-07 15:56                                                           ` Michal Hocko
  2021-12-07 16:09                                                             ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 15:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 16:34:30, David Hildenbrand wrote:
> On 07.12.21 16:29, Michal Hocko wrote:
> > On Tue 07-12-21 16:09:39, David Hildenbrand wrote:
> >> On 07.12.21 14:23, Michal Hocko wrote:
> >>> On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
> >>> [...]
> >>>> But maybe I am missing something important regarding online vs. offline
> >>>> nodes that your patch changes?
> >>>
> >>> I am relying on alloc_node_data setting the node online. But if we are
> >>> to change the call to arch_alloc_node_data then the patch needs to be
> >>> more involved. Here is what I have right now. If this happens to be the
> >>> right way then there is some additional work to sync up with the hotplug
> >>> code.
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index c5952749ad40..a296e934ad2f 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >>>  	/* Initialise every node */
> >>>  	mminit_verify_pageflags_layout();
> >>>  	setup_nr_node_ids();
> >>> -	for_each_online_node(nid) {
> >>> -		pg_data_t *pgdat = NODE_DATA(nid);
> >>> +	for_each_node(nid) {
> >>> +		pg_data_t *pgdat;
> >>> +
> >>> +		if (!node_online(nid)) {
> >>> +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
> >>> +			pgdat = arch_alloc_nodedata(nid);
> >>> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> >>> +			arch_refresh_nodedata(nid, pgdat);
> >>> +			node_set_online(nid);
> >>
> >> Setting all possible nodes online might result in quite some QE noice,
> >> because all these nodes will then be visible in the sysfs and
> >> try_offline_nodes() is essentially for the trash.
> > 
> > I am not sure I follow. I believe sysfs will not get populate because I
> > do not call register_one_node.
> 
> arch/x86/kernel/topology.c:topology_init()
> 
> for_each_online_node(i)
> 	register_one_node(i);

Right you are.
 
> > You are right that try_offline_nodes will be reduce which is good imho.
> > More changes will be possible (hopefully to drop some ugly code) on top
> > of this change (or any other that achieves that there are no NULL pgdat
> > for possible nodes).
> > 
> 
> No to exposing actually offline nodes to user space via sysfs.

Why is sysfs for non-populated nodes a problem?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 15:56                                                           ` Michal Hocko
@ 2021-12-07 16:09                                                             ` David Hildenbrand
  2021-12-07 16:27                                                               ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 16:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On 07.12.21 16:56, Michal Hocko wrote:
> On Tue 07-12-21 16:34:30, David Hildenbrand wrote:
>> On 07.12.21 16:29, Michal Hocko wrote:
>>> On Tue 07-12-21 16:09:39, David Hildenbrand wrote:
>>>> On 07.12.21 14:23, Michal Hocko wrote:
>>>>> On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
>>>>> [...]
>>>>>> But maybe I am missing something important regarding online vs. offline
>>>>>> nodes that your patch changes?
>>>>>
>>>>> I am relying on alloc_node_data setting the node online. But if we are
>>>>> to change the call to arch_alloc_node_data then the patch needs to be
>>>>> more involved. Here is what I have right now. If this happens to be the
>>>>> right way then there is some additional work to sync up with the hotplug
>>>>> code.
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index c5952749ad40..a296e934ad2f 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>>>>>  	/* Initialise every node */
>>>>>  	mminit_verify_pageflags_layout();
>>>>>  	setup_nr_node_ids();
>>>>> -	for_each_online_node(nid) {
>>>>> -		pg_data_t *pgdat = NODE_DATA(nid);
>>>>> +	for_each_node(nid) {
>>>>> +		pg_data_t *pgdat;
>>>>> +
>>>>> +		if (!node_online(nid)) {
>>>>> +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
>>>>> +			pgdat = arch_alloc_nodedata(nid);
>>>>> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
>>>>> +			arch_refresh_nodedata(nid, pgdat);
>>>>> +			node_set_online(nid);
>>>>
>>>> Setting all possible nodes online might result in quite some QE noice,
>>>> because all these nodes will then be visible in the sysfs and
>>>> try_offline_nodes() is essentially for the trash.
>>>
>>> I am not sure I follow. I believe sysfs will not get populate because I
>>> do not call register_one_node.
>>
>> arch/x86/kernel/topology.c:topology_init()
>>
>> for_each_online_node(i)
>> 	register_one_node(i);
> 
> Right you are.
>  
>>> You are right that try_offline_nodes will be reduce which is good imho.
>>> More changes will be possible (hopefully to drop some ugly code) on top
>>> of this change (or any other that achieves that there are no NULL pgdat
>>> for possible nodes).
>>>
>>
>> No to exposing actually offline nodes to user space via sysfs.
> 
> Why is that a problem with the sysfs for non-populated nodes?
> 

https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/

Contains some points -- certainly nothing unfixable but it clearly shows
that users expect only nodes with actual memory and cpus to be online --
that's why we export the possible+online state to user space. My point
is to be careful with such drastic changes and do one step at a time.

I think preallocation of the pgdat is a reasonable thing to have without
changing user-space visible semantics or even in-kernel semantics.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 16:09                                                             ` David Hildenbrand
@ 2021-12-07 16:27                                                               ` Michal Hocko
  2021-12-07 16:36                                                                 ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 16:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 17:09:50, David Hildenbrand wrote:
> On 07.12.21 16:56, Michal Hocko wrote:
> > On Tue 07-12-21 16:34:30, David Hildenbrand wrote:
> >> On 07.12.21 16:29, Michal Hocko wrote:
> >>> On Tue 07-12-21 16:09:39, David Hildenbrand wrote:
> >>>> On 07.12.21 14:23, Michal Hocko wrote:
> >>>>> On Tue 07-12-21 13:28:31, David Hildenbrand wrote:
> >>>>> [...]
> >>>>>> But maybe I am missing something important regarding online vs. offline
> >>>>>> nodes that your patch changes?
> >>>>>
> >>>>> I am relying on alloc_node_data setting the node online. But if we are
> >>>>> to change the call to arch_alloc_node_data then the patch needs to be
> >>>>> more involved. Here is what I have right now. If this happens to be the
> >>>>> right way then there is some additional work to sync up with the hotplug
> >>>>> code.
> >>>>>
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index c5952749ad40..a296e934ad2f 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -8032,8 +8032,23 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >>>>>  	/* Initialise every node */
> >>>>>  	mminit_verify_pageflags_layout();
> >>>>>  	setup_nr_node_ids();
> >>>>> -	for_each_online_node(nid) {
> >>>>> -		pg_data_t *pgdat = NODE_DATA(nid);
> >>>>> +	for_each_node(nid) {
> >>>>> +		pg_data_t *pgdat;
> >>>>> +
> >>>>> +		if (!node_online(nid)) {
> >>>>> +			pr_warn("Node %d uninitialized by the platform. Please report with memory map.\n", nid);
> >>>>> +			pgdat = arch_alloc_nodedata(nid);
> >>>>> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> >>>>> +			arch_refresh_nodedata(nid, pgdat);
> >>>>> +			node_set_online(nid);
> >>>>
> >>>> Setting all possible nodes online might result in quite some QE noice,
> >>>> because all these nodes will then be visible in the sysfs and
> >>>> try_offline_nodes() is essentially for the trash.
> >>>
> >>> I am not sure I follow. I believe sysfs will not get populate because I
> >>> do not call register_one_node.
> >>
> >> arch/x86/kernel/topology.c:topology_init()
> >>
> >> for_each_online_node(i)
> >> 	register_one_node(i);
> > 
> > Right you are.
> >  
> >>> You are right that try_offline_nodes will be reduce which is good imho.
> >>> More changes will be possible (hopefully to drop some ugly code) on top
> >>> of this change (or any other that achieves that there are no NULL pgdat
> >>> for possible nodes).
> >>>
> >>
> >> No to exposing actually offline nodes to user space via sysfs.
> > 
> > Why is that a problem with the sysfs for non-populated nodes?
> > 
> 
> https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/

Thanks. It is good to be reminded that we have been circling around this
problem for quite some time without really moving forward much.

> Contains some points -- certainly nothing unfixable but it clearly shows
> that users expect only nodes with actual memory and cpus to be online --
> that's why we export the possible+online state to user space. My point
> is to be careful with such drastic changes and do one step at a time.
>
> I think preallocation of the pgdat is a reasonable thing to have without
> changing user-space visible semantics or even in-kernel semantics.

So your proposal is to drop node_set_online from the patch and add it as
a separate one which handles
	- the sysfs part (i.e. do not register a node which doesn't span a
	  physical address space)
	- the hotplug side (drop the pgdat allocation, register the node
	  lazily when the first memblocks are registered)

Makes sense?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 16:27                                                               ` Michal Hocko
@ 2021-12-07 16:36                                                                 ` Michal Hocko
  2021-12-07 16:40                                                                   ` David Hildenbrand
  2021-12-07 17:02                                                                   ` Alexey Makhalov
  0 siblings, 2 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-07 16:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 17:27:29, Michal Hocko wrote:
[...]
> So your proposal is to drop set_node_online from the patch and add it as
> a separate one which handles 
> 	- sysfs part (i.e. do not register a node which doesn't span a
> 	  physical address space)
> 	- hotplug side of (drop the pgd allocation, register node lazily
> 	  when a first memblocks are registered)

In other words, the first stage
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..f9024ba09c53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
 			build_zonelists(pgdat);
@@ -8032,8 +8036,24 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
+			pgdat = arch_alloc_nodedata(nid);
+			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_memoryless_node(nid);
+			/*
+			 * not marking this node online because we do not want to
+			 * confuse userspace by sysfs files/directories for node
+			 * without any memory attached to it (see topology_init)
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 16:36                                                                 ` Michal Hocko
@ 2021-12-07 16:40                                                                   ` David Hildenbrand
  2021-12-08  8:28                                                                     ` Michal Hocko
  2021-12-07 17:02                                                                   ` Alexey Makhalov
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 16:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On 07.12.21 17:36, Michal Hocko wrote:
> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> [...]
>> So your proposal is to drop set_node_online from the patch and add it as
>> a separate one which handles 
>> 	- sysfs part (i.e. do not register a node which doesn't span a
>> 	  physical address space)
>> 	- hotplug side of (drop the pgd allocation, register node lazily
>> 	  when a first memblocks are registered)
> 

> In other words, the first stage
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..f9024ba09c53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
>  	if (self && !node_online(self->node_id)) {
>  		build_zonelists(self);
>  	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated
> +		 * free_area_init
> +		 */
> +		for_each_node(nid) {
>  			pg_data_t *pgdat = NODE_DATA(nid);
>  
>  			build_zonelists(pgdat);
> @@ -8032,8 +8036,24 @@ void __init free_area_init(unsigned long *max_zone_pfn)
>  	/* Initialise every node */
>  	mminit_verify_pageflags_layout();
>  	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> +			pgdat = arch_alloc_nodedata(nid);

Is the buddy fully up and running at that point? I don't think so, so we
might have to allocate via memblock instead. But I might be wrong.

> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +			arch_refresh_nodedata(nid, pgdat);
> +			free_area_init_memoryless_node(nid);
> +			/*
> +			 * not marking this node online because we do not want to
> +			 * confuse userspace by sysfs files/directories for node
> +			 * without any memory attached to it (see topology_init)
> +			 */
> +			continue;
> +		}
> +
> +		pgdat = NODE_DATA(nid);
>  		free_area_init_node(nid);
>  
>  		/* Any memory on that node */
> 

Yes, and maybe in the same go, remove/rework hotadd_new_pgdat(), because
there is nothing to hotadd anymore. (We should double-check the
initialization performed in there; it might not all be necessary anymore.)
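
For reference, the first-stage idea (give every possible node a pgdat, but
leave unpopulated nodes offline) can be modelled in user space roughly as
follows. This is only a sketch: struct toy_pgdat and the toy_* helpers are
hypothetical, and calloc() stands in for whatever early allocator
(memblock or otherwise) turns out to be appropriate at that point in boot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_MAX_NODES 4

/* Hypothetical, heavily reduced stand-in for pg_data_t. */
struct toy_pgdat {
	int node_id;
	unsigned long present_pages;
};

static struct toy_pgdat *toy_node_data[TOY_MAX_NODES];	/* models NODE_DATA(nid) */
static bool toy_node_online[TOY_MAX_NODES];

/* A node the platform reported with memory: allocated and marked online. */
static int toy_platform_add_node(int nid, unsigned long pages)
{
	struct toy_pgdat *pgdat;

	assert(nid >= 0 && nid < TOY_MAX_NODES);

	pgdat = calloc(1, sizeof(*pgdat));
	if (!pgdat)
		return -1;
	pgdat->node_id = nid;
	pgdat->present_pages = pages;
	toy_node_data[nid] = pgdat;
	toy_node_online[nid] = true;
	return 0;
}

/*
 * Walk all possible nodes. Offline nodes get an empty pgdat but stay
 * offline. Returns the number of nodes preallocated this way, or -1 if
 * an allocation failed.
 */
static int toy_free_area_init(void)
{
	int nid, prealloc = 0;

	for (nid = 0; nid < TOY_MAX_NODES; nid++) {
		struct toy_pgdat *pgdat;

		if (toy_node_online[nid])
			continue;

		pgdat = calloc(1, sizeof(*pgdat));
		if (!pgdat)
			return -1;
		pgdat->node_id = nid;
		toy_node_data[nid] = pgdat;	/* present_pages stays 0 */
		prealloc++;
	}
	return prealloc;
}
```

After toy_platform_add_node(0, 1024) and toy_free_area_init(), all four
toy_node_data[] entries are non-NULL, but only node 0 is online. In the
model, a later hot-add would only have to fill in an existing structure.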

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 16:36                                                                 ` Michal Hocko
  2021-12-07 16:40                                                                   ` David Hildenbrand
@ 2021-12-07 17:02                                                                   ` Alexey Makhalov
  2021-12-07 17:13                                                                     ` David Hildenbrand
  1 sibling, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-07 17:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> [...]
>> So your proposal is to drop set_node_online from the patch and add it as
>> a separate one which handles
>> 	- sysfs part (i.e. do not register a node which doesn't span a
>> 	  physical address space)
>> 	- hotplug side of (drop the pgd allocation, register node lazily
>> 	  when a first memblocks are registered)
> 
> In other words, the first stage
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..f9024ba09c53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> 	if (self && !node_online(self->node_id)) {
> 		build_zonelists(self);
> 	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated
> +		 * free_area_init
> +		 */
> +		for_each_node(nid) {
> 			pg_data_t *pgdat = NODE_DATA(nid);
> 
> 			build_zonelists(pgdat);

Will it blow up memory usage for the nodes which might never be onlined?
I prefer the idea of init on demand.

Even now there is an existing problem.
In my experiments, I observed a _huge_ increase in memory consumption when
increasing the number of possible NUMA nodes. I’m going to report it in a
separate mail thread.

Thanks,
—-Alexey



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 17:02                                                                   ` Alexey Makhalov
@ 2021-12-07 17:13                                                                     ` David Hildenbrand
  2021-12-07 17:17                                                                       ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 17:13 UTC (permalink / raw)
  To: Alexey Makhalov, Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	Oscar Salvador, Tejun Heo, Christoph Lameter, linux-kernel,
	stable

On 07.12.21 18:02, Alexey Makhalov wrote:
> 
> 
>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
>> [...]
>>> So your proposal is to drop set_node_online from the patch and add it as
>>> a separate one which handles
>>> 	- sysfs part (i.e. do not register a node which doesn't span a
>>> 	  physical address space)
>>> 	- hotplug side of (drop the pgd allocation, register node lazily
>>> 	  when a first memblocks are registered)
>>
>> In other words, the first stage
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c5952749ad40..f9024ba09c53 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
>> 	if (self && !node_online(self->node_id)) {
>> 		build_zonelists(self);
>> 	} else {
>> -		for_each_online_node(nid) {
>> +		/*
>> +		 * All possible nodes have pgdat preallocated
>> +		 * free_area_init
>> +		 */
>> +		for_each_node(nid) {
>> 			pg_data_t *pgdat = NODE_DATA(nid);
>>
>> 			build_zonelists(pgdat);
> 
> Will it blow up memory usage for the nodes which might never be onlined?
> I prefer the idea of init on demand.
> 
> Even now there is an existing problem.
> In my experiments, I observed _huge_ memory consumption increase by increasing number
> of possible numa nodes. I’m going to report it in separate mail thread.

I already raised that PPC might be problematic in that regard. Which
architecture / setup do you have in mind that can have a lot of possible
nodes?


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 17:13                                                                     ` David Hildenbrand
@ 2021-12-07 17:17                                                                       ` Alexey Makhalov
  2021-12-07 18:03                                                                         ` David Hildenbrand
  2021-12-08  8:04                                                                         ` Michal Hocko
  0 siblings, 2 replies; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-07 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	Oscar Salvador, Tejun Heo, Christoph Lameter, linux-kernel,
	stable



> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 07.12.21 18:02, Alexey Makhalov wrote:
>> 
>> 
>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
>>> 
>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
>>> [...]
>>>> So your proposal is to drop set_node_online from the patch and add it as
>>>> a separate one which handles
>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
>>>> 	  physical address space)
>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
>>>> 	  when a first memblocks are registered)
>>> 
>>> In other words, the first stage
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c5952749ad40..f9024ba09c53 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
>>> 	if (self && !node_online(self->node_id)) {
>>> 		build_zonelists(self);
>>> 	} else {
>>> -		for_each_online_node(nid) {
>>> +		/*
>>> +		 * All possible nodes have pgdat preallocated
>>> +		 * free_area_init
>>> +		 */
>>> +		for_each_node(nid) {
>>> 			pg_data_t *pgdat = NODE_DATA(nid);
>>> 
>>> 			build_zonelists(pgdat);
>> 
>> Will it blow up memory usage for the nodes which might never be onlined?
>> I prefer the idea of init on demand.
>> 
>> Even now there is an existing problem.
>> In my experiments, I observed _huge_ memory consumption increase by increasing number
>> of possible numa nodes. I’m going to report it in separate mail thread.
> 
> I already raised that PPC might be problematic in that regard. Which
> architecture / setup do you have in mind that can have a lot of possible
> nodes?
> 
It is an x86_64 VMware VM, not a regular one but a specially configured one
(1 vCPU per node, with hot-plug support, 128 possible nodes).

Thanks,
—-Alexey

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 17:17                                                                       ` Alexey Makhalov
@ 2021-12-07 18:03                                                                         ` David Hildenbrand
  2021-12-08  8:12                                                                           ` Michal Hocko
  2021-12-08  8:04                                                                         ` Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-07 18:03 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Michal Hocko, Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	Oscar Salvador, Tejun Heo, Christoph Lameter, linux-kernel,
	stable

On 07.12.21 18:17, Alexey Makhalov wrote:
> 
> 
>> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 07.12.21 18:02, Alexey Makhalov wrote:
>>>
>>>
>>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>>
>>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
>>>> [...]
>>>>> So your proposal is to drop set_node_online from the patch and add it as
>>>>> a separate one which handles
>>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
>>>>> 	  physical address space)
>>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
>>>>> 	  when a first memblocks are registered)
>>>>
>>>> In other words, the first stage
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index c5952749ad40..f9024ba09c53 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
>>>> 	if (self && !node_online(self->node_id)) {
>>>> 		build_zonelists(self);
>>>> 	} else {
>>>> -		for_each_online_node(nid) {
>>>> +		/*
>>>> +		 * All possible nodes have pgdat preallocated
>>>> +		 * free_area_init
>>>> +		 */
>>>> +		for_each_node(nid) {
>>>> 			pg_data_t *pgdat = NODE_DATA(nid);
>>>>
>>>> 			build_zonelists(pgdat);
>>>
>>> Will it blow up memory usage for the nodes which might never be onlined?
>>> I prefer the idea of init on demand.
>>>
>>> Even now there is an existing problem.
>>> In my experiments, I observed _huge_ memory consumption increase by increasing number
>>> of possible numa nodes. I’m going to report it in separate mail thread.
>>
>> I already raised that PPC might be problematic in that regard. Which
>> architecture / setup do you have in mind that can have a lot of possible
>> nodes?
>>
> It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
> with hot-plug support, 128 possible nodes)  

I thought the pgdat would be smaller but I just gave it a test:

On my system, pg_data_t is 173824 bytes. So 128 nodes would correspond to
21 MiB, which is indeed a lot. I assume it's due to "struct zonelist",
which has MAX_ZONES_PER_ZONELIST == (MAX_NUMNODES * MAX_NR_ZONES) zone
references ...

-- 
Thanks,

David / dhildenb


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 17:17                                                                       ` Alexey Makhalov
  2021-12-07 18:03                                                                         ` David Hildenbrand
@ 2021-12-08  8:04                                                                         ` Michal Hocko
  2021-12-08  8:19                                                                           ` Alexey Makhalov
  1 sibling, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  8:04 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: David Hildenbrand, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 17:17:27, Alexey Makhalov wrote:
> 
> 
> > On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
> > 
> > On 07.12.21 18:02, Alexey Makhalov wrote:
> >> 
> >> 
> >>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>> 
> >>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> >>> [...]
> >>>> So your proposal is to drop set_node_online from the patch and add it as
> >>>> a separate one which handles
> >>>> 	- sysfs part (i.e. do not register a node which doesn't span a
> >>>> 	  physical address space)
> >>>> 	- hotplug side of (drop the pgd allocation, register node lazily
> >>>> 	  when a first memblocks are registered)
> >>> 
> >>> In other words, the first stage
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index c5952749ad40..f9024ba09c53 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> >>> 	if (self && !node_online(self->node_id)) {
> >>> 		build_zonelists(self);
> >>> 	} else {
> >>> -		for_each_online_node(nid) {
> >>> +		/*
> >>> +		 * All possible nodes have pgdat preallocated
> >>> +		 * free_area_init
> >>> +		 */
> >>> +		for_each_node(nid) {
> >>> 			pg_data_t *pgdat = NODE_DATA(nid);
> >>> 
> >>> 			build_zonelists(pgdat);
> >> 
> >> Will it blow up memory usage for the nodes which might never be onlined?
> >> I prefer the idea of init on demand.
> >> 
> >> Even now there is an existing problem.
> >> In my experiments, I observed _huge_ memory consumption increase by increasing number
> >> of possible numa nodes. I’m going to report it in separate mail thread.
> > 
> > I already raised that PPC might be problematic in that regard. Which
> > architecture / setup do you have in mind that can have a lot of possible
> > nodes?
> > 
> It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
> with hot-plug support, 128 possible nodes)  

This is slightly tangential, but could you elaborate on this setup and
the reasoning behind it? I was already curious when you mentioned this
previously. Why would you want so many nodes, mapped 1:1 to CPUs?
What is the resulting NUMA topology?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 18:03                                                                         ` David Hildenbrand
@ 2021-12-08  8:12                                                                           ` Michal Hocko
  2021-12-08  8:24                                                                             ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  8:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 19:03:28, David Hildenbrand wrote:
> On 07.12.21 18:17, Alexey Makhalov wrote:
> > 
> > 
> >> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 07.12.21 18:02, Alexey Makhalov wrote:
> >>>
> >>>
> >>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>>>
> >>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> >>>> [...]
> >>>>> So your proposal is to drop set_node_online from the patch and add it as
> >>>>> a separate one which handles
> >>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
> >>>>> 	  physical address space)
> >>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
> >>>>> 	  when a first memblocks are registered)
> >>>>
> >>>> In other words, the first stage
> >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>> index c5952749ad40..f9024ba09c53 100644
> >>>> --- a/mm/page_alloc.c
> >>>> +++ b/mm/page_alloc.c
> >>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> >>>> 	if (self && !node_online(self->node_id)) {
> >>>> 		build_zonelists(self);
> >>>> 	} else {
> >>>> -		for_each_online_node(nid) {
> >>>> +		/*
> >>>> +		 * All possible nodes have pgdat preallocated
> >>>> +		 * free_area_init
> >>>> +		 */
> >>>> +		for_each_node(nid) {
> >>>> 			pg_data_t *pgdat = NODE_DATA(nid);
> >>>>
> >>>> 			build_zonelists(pgdat);
> >>>
> >>> Will it blow up memory usage for the nodes which might never be onlined?
> >>> I prefer the idea of init on demand.
> >>>
> >>> Even now there is an existing problem.
> >>> In my experiments, I observed _huge_ memory consumption increase by increasing number
> >>> of possible numa nodes. I’m going to report it in separate mail thread.
> >>
> >> I already raised that PPC might be problematic in that regard. Which
> >> architecture / setup do you have in mind that can have a lot of possible
> >> nodes?
> >>
> > It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
> > with hot-plug support, 128 possible nodes)  
> 
> I thought the pgdat would be smaller but I just gave it a test:

Yes, pgdat is quite large! Just the embedded zones can eat a lot.

> On my system, pg_data_t is 173824 bytes. So 128 nodes would correspond to
> 21 MiB, which is indeed a lot. I assume it's due to "struct zonelist",
> which has MAX_ZONES_PER_ZONELIST == (MAX_NUMNODES * MAX_NR_ZONES) zone
> references ...

This is what pahole tells me
struct pglist_data {
        struct zone                node_zones[4] __attribute__((__aligned__(64))); /*     0  5632 */
        /* --- cacheline 88 boundary (5632 bytes) --- */
        struct zonelist            node_zonelists[1];    /*  5632    80 */
	[...]
        /* size: 6400, cachelines: 100, members: 27 */
        /* sum members: 6369, holes: 5, sum holes: 31 */

with my particular config (which is !NUMA). I haven't really checked
whether there are other places which might scale with MAX_NUMNODES or
something like that.

Anyway, is 21 MiB of wasted space on a 128-node machine really
noteworthy?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:04                                                                         ` Michal Hocko
@ 2021-12-08  8:19                                                                           ` Alexey Makhalov
  2021-12-08  8:30                                                                             ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-08  8:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

Hi Michal,

> On Dec 8, 2021, at 12:04 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Tue 07-12-21 17:17:27, Alexey Makhalov wrote:
>> 
>> 
>>> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>> On 07.12.21 18:02, Alexey Makhalov wrote:
>>>> 
>>>> 
>>>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>>> 
>>>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
>>>>> [...]
>>>>>> So your proposal is to drop set_node_online from the patch and add it as
>>>>>> a separate one which handles
>>>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
>>>>>> 	  physical address space)
>>>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
>>>>>> 	  when a first memblocks are registered)
>>>>> 
>>>>> In other words, the first stage
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index c5952749ad40..f9024ba09c53 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
>>>>> 	if (self && !node_online(self->node_id)) {
>>>>> 		build_zonelists(self);
>>>>> 	} else {
>>>>> -		for_each_online_node(nid) {
>>>>> +		/*
>>>>> +		 * All possible nodes have pgdat preallocated
>>>>> +		 * free_area_init
>>>>> +		 */
>>>>> +		for_each_node(nid) {
>>>>> 			pg_data_t *pgdat = NODE_DATA(nid);
>>>>> 
>>>>> 			build_zonelists(pgdat);
>>>> 
>>>> Will it blow up memory usage for the nodes which might never be onlined?
>>>> I prefer the idea of init on demand.
>>>> 
>>>> Even now there is an existing problem.
>>>> In my experiments, I observed _huge_ memory consumption increase by increasing number
>>>> of possible numa nodes. I’m going to report it in separate mail thread.
>>> 
>>> I already raised that PPC might be problematic in that regard. Which
>>> architecture / setup do you have in mind that can have a lot of possible
>>> nodes?
>>> 
>> It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
>> with hot-plug support, 128 possible nodes)
> 
> This is slightly tangent but could you elaborate more on this setup and
> reasoning behind it. I was already curious when you mentioned this
> previously. Why would you want to have so many nodes and having 1:1 with
> CPUs. What is the resulting NUMA topology?

This setup with 128 nodes was used purely for development purposes. That is when the issue
with hot-adding NUMA nodes was found. The original issue is present even with a realistic
number of nodes.

Thanks,
--Alexey

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:12                                                                           ` Michal Hocko
@ 2021-12-08  8:24                                                                             ` David Hildenbrand
  2021-12-08  8:34                                                                               ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-08  8:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On 08.12.21 09:12, Michal Hocko wrote:
> On Tue 07-12-21 19:03:28, David Hildenbrand wrote:
>> On 07.12.21 18:17, Alexey Makhalov wrote:
>>>
>>>
>>>> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 07.12.21 18:02, Alexey Makhalov wrote:
>>>>>
>>>>>
>>>>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
>>>>>>
>>>>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
>>>>>> [...]
>>>>>>> So your proposal is to drop set_node_online from the patch and add it as
>>>>>>> a separate one which handles
>>>>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
>>>>>>> 	  physical address space)
>>>>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
>>>>>>> 	  when a first memblocks are registered)
>>>>>>
>>>>>> In other words, the first stage
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index c5952749ad40..f9024ba09c53 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
>>>>>> 	if (self && !node_online(self->node_id)) {
>>>>>> 		build_zonelists(self);
>>>>>> 	} else {
>>>>>> -		for_each_online_node(nid) {
>>>>>> +		/*
>>>>>> +		 * All possible nodes have pgdat preallocated
>>>>>> +		 * free_area_init
>>>>>> +		 */
>>>>>> +		for_each_node(nid) {
>>>>>> 			pg_data_t *pgdat = NODE_DATA(nid);
>>>>>>
>>>>>> 			build_zonelists(pgdat);
>>>>>
>>>>> Will it blow up memory usage for the nodes which might never be onlined?
>>>>> I prefer the idea of init on demand.
>>>>>
>>>>> Even now there is an existing problem.
>>>>> In my experiments, I observed _huge_ memory consumption increase by increasing number
>>>>> of possible numa nodes. I’m going to report it in separate mail thread.
>>>>
>>>> I already raised that PPC might be problematic in that regard. Which
>>>> architecture / setup do you have in mind that can have a lot of possible
>>>> nodes?
>>>>
>>> It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
>>> with hot-plug support, 128 possible nodes)  
>>
>> I thought the pgdat would be smaller but I just gave it a test:
> 
> Yes, pgdat is quite large! Just the embedded zones can eat a lot.
> 
>> On my system, pg_data_t is 173824 bytes. So 128 nodes would correspond to
>> 21 MiB, which is indeed a lot. I assume it's due to "struct zonelist",
>> which has MAX_ZONES_PER_ZONELIST == (MAX_NUMNODES * MAX_NR_ZONES) zone
>> references ...
> 
> This is what pahole tells me
> struct pglist_data {
>         struct zone                node_zones[4] __attribute__((__aligned__(64))); /*     0  5632 */
>         /* --- cacheline 88 boundary (5632 bytes) --- */
>         struct zonelist            node_zonelists[1];    /*  5632    80 */
> 	[...]
>         /* size: 6400, cachelines: 100, members: 27 */
>         /* sum members: 6369, holes: 5, sum holes: 31 */
> 
> with my particular config (which is !NUMA). I haven't really checked
> whether there are other places which might scale with MAX_NUMNODES or
> something like that.
> 
> Anyway, is 21MB of wasted space for 128 Node machine something really
> noteworthy?
> 

I think we might soon see setups (again, CXL is an example, but also
when providing a dynamic amount of performance-differentiated memory
via virtio-mem) where this will most probably matter. With
performance-differentiated memory we'll see a lot more nodes getting
used in general, and a lot more nodes eventually getting hotplugged.

Whether 128 nodes is realistic, I cannot tell.

We could optimize by allocating some members dynamically. For example
we'll never need MAX_NUMNODES entries, but only the number of possible
nodes.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 16:40                                                                   ` David Hildenbrand
@ 2021-12-08  8:28                                                                     ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  8:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 07-12-21 17:40:36, David Hildenbrand wrote:
> On 07.12.21 17:36, Michal Hocko wrote:
> > On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> > [...]
> >> So your proposal is to drop set_node_online from the patch and add it as
> >> a separate one which handles 
> >> 	- sysfs part (i.e. do not register a node which doesn't span a
> >> 	  physical address space)
> >> 	- hotplug side of (drop the pgd allocation, register node lazily
> >> 	  when a first memblocks are registered)
> > 
> 
> > In other words, the first stage
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c5952749ad40..f9024ba09c53 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> >  	if (self && !node_online(self->node_id)) {
> >  		build_zonelists(self);
> >  	} else {
> > -		for_each_online_node(nid) {
> > +		/*
> > +		 * All possible nodes have pgdat preallocated
> > +		 * free_area_init
> > +		 */
> > +		for_each_node(nid) {
> >  			pg_data_t *pgdat = NODE_DATA(nid);
> >  
> >  			build_zonelists(pgdat);
> > @@ -8032,8 +8036,24 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> >  	/* Initialise every node */
> >  	mminit_verify_pageflags_layout();
> >  	setup_nr_node_ids();
> > -	for_each_online_node(nid) {
> > -		pg_data_t *pgdat = NODE_DATA(nid);
> > +	for_each_node(nid) {
> > +		pg_data_t *pgdat;
> > +
> > +		if (!node_online(nid)) {
> > +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> > +			pgdat = arch_alloc_nodedata(nid);
> 
> > Is the buddy fully up and running at that point? I don't think so, so we
> > might have to allocate via memblock instead. But I might be wrong.

No. Not only is the page allocator not ready, but the slab allocator used
by the generic implementation is not up yet either. I will look deeper into
this later today, but I suspect the only choice is to use the memblock
allocator, the same way the arch-specific code allocates pgdats.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:19                                                                           ` Alexey Makhalov
@ 2021-12-08  8:30                                                                             ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  8:30 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: David Hildenbrand, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Wed 08-12-21 08:19:16, Alexey Makhalov wrote:
> Hi Michal,
> 
> > On Dec 8, 2021, at 12:04 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Tue 07-12-21 17:17:27, Alexey Makhalov wrote:
> >> 
> >> 
> >>> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
> >>> 
> >>> On 07.12.21 18:02, Alexey Makhalov wrote:
> >>>> 
> >>>> 
> >>>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>>>> 
> >>>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> >>>>> [...]
> >>>>>> So your proposal is to drop set_node_online from the patch and add it as
> >>>>>> a separate one which handles
> >>>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
> >>>>>> 	  physical address space)
> >>>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
> >>>>>> 	  when a first memblocks are registered)
> >>>>> 
> >>>>> In other words, the first stage
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index c5952749ad40..f9024ba09c53 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> >>>>> 	if (self && !node_online(self->node_id)) {
> >>>>> 		build_zonelists(self);
> >>>>> 	} else {
> >>>>> -		for_each_online_node(nid) {
> >>>>> +		/*
> >>>>> +		 * All possible nodes have pgdat preallocated
> >>>>> +		 * free_area_init
> >>>>> +		 */
> >>>>> +		for_each_node(nid) {
> >>>>> 			pg_data_t *pgdat = NODE_DATA(nid);
> >>>>> 
> >>>>> 			build_zonelists(pgdat);
> >>>> 
> >>>> Will it blow up memory usage for the nodes which might never be onlined?
> >>>> I prefer the idea of init on demand.
> >>>> 
> >>>> Even now there is an existing problem.
> >>>> In my experiments, I observed _huge_ memory consumption increase by increasing number
> >>>> of possible numa nodes. I’m going to report it in separate mail thread.
> >>> 
> >>> I already raised that PPC might be problematic in that regard. Which
> >>> architecture / setup do you have in mind that can have a lot of possible
> >>> nodes?
> >>> 
> >> It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
> >> with hot-plug support, 128 possible nodes)
> > 
> > This is slightly tangent but could you elaborate more on this setup and
> > reasoning behind it. I was already curious when you mentioned this
> > previously. Why would you want to have so many nodes and having 1:1 with
> > CPUs. What is the resulting NUMA topology?
> 
> This setup with 128 nodes was used purely for development purposes. That is when the issue
> with hot adding numa nodes was found.

OK, I see.

> Original issue presents even with feasible number of nodes.

Yes, the issue is currently independent of the number of offline nodes.
The number of nodes only matters for the amount of memory wasted
if we are to allocate a pgdat for each possible node.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:24                                                                             ` David Hildenbrand
@ 2021-12-08  8:34                                                                               ` Michal Hocko
  2021-12-08  8:38                                                                                 ` David Hildenbrand
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  8:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Wed 08-12-21 09:24:39, David Hildenbrand wrote:
> On 08.12.21 09:12, Michal Hocko wrote:
> > On Tue 07-12-21 19:03:28, David Hildenbrand wrote:
> >> On 07.12.21 18:17, Alexey Makhalov wrote:
> >>>
> >>>
> >>>> On Dec 7, 2021, at 9:13 AM, David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 07.12.21 18:02, Alexey Makhalov wrote:
> >>>>>
> >>>>>
> >>>>>> On Dec 7, 2021, at 8:36 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>>>>>
> >>>>>> On Tue 07-12-21 17:27:29, Michal Hocko wrote:
> >>>>>> [...]
> >>>>>>> So your proposal is to drop set_node_online from the patch and add it as
> >>>>>>> a separate one which handles
> >>>>>>> 	- sysfs part (i.e. do not register a node which doesn't span a
> >>>>>>> 	  physical address space)
> >>>>>>> 	- hotplug side of (drop the pgd allocation, register node lazily
> >>>>>>> 	  when a first memblocks are registered)
> >>>>>>
> >>>>>> In other words, the first stage
> >>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>>> index c5952749ad40..f9024ba09c53 100644
> >>>>>> --- a/mm/page_alloc.c
> >>>>>> +++ b/mm/page_alloc.c
> >>>>>> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> >>>>>> 	if (self && !node_online(self->node_id)) {
> >>>>>> 		build_zonelists(self);
> >>>>>> 	} else {
> >>>>>> -		for_each_online_node(nid) {
> >>>>>> +		/*
> >>>>>> +		 * All possible nodes have pgdat preallocated
> >>>>>> +		 * free_area_init
> >>>>>> +		 */
> >>>>>> +		for_each_node(nid) {
> >>>>>> 			pg_data_t *pgdat = NODE_DATA(nid);
> >>>>>>
> >>>>>> 			build_zonelists(pgdat);
> >>>>>
> >>>>> Will it blow up memory usage for the nodes which might never be onlined?
> >>>>> I prefer the idea of init on demand.
> >>>>>
> >>>>> Even now there is an existing problem.
> >>>>> In my experiments, I observed _huge_ memory consumption increase by increasing number
> >>>>> of possible numa nodes. I’m going to report it in separate mail thread.
> >>>>
> >>>> I already raised that PPC might be problematic in that regard. Which
> >>>> architecture / setup do you have in mind that can have a lot of possible
> >>>> nodes?
> >>>>
> >>> It is x86_64 VMware VM, not the regular one, but specially configured (1 vCPU per node,
> >>> with hot-plug support, 128 possible nodes)  
> >>
> >> I thought the pgdat would be smaller but I just gave it a test:
> > 
> > Yes, pgdat is quite large! Just the embedded zones can eat a lot.
> > 
> >> On my system, pg_data_t is 173824 bytes. So 128 nodes would correspond to
> >> 21 MiB, which is indeed a lot. I assume it's due to "struct zonelist",
> >> which has MAX_ZONES_PER_ZONELIST == (MAX_NUMNODES * MAX_NR_ZONES) zone
> >> references ...
> > 
> > This is what pahole tells me
> > struct pglist_data {
> >         struct zone                node_zones[4] __attribute__((__aligned__(64))); /*     0  5632 */
> >         /* --- cacheline 88 boundary (5632 bytes) --- */
> >         struct zonelist            node_zonelists[1];    /*  5632    80 */
> > 	[...]
> >         /* size: 6400, cachelines: 100, members: 27 */
> >         /* sum members: 6369, holes: 5, sum holes: 31 */
> > 
> > with my particular config (which is !NUMA). I haven't really checked
> > whether there are other places which might scale with MAX_NUMNODES or
> > something like that.
> > 
> > Anyway, is 21MB of wasted space for 128 Node machine something really
> > noteworthy?
> > 
> 
> I think we might soon see setups (again, CXL is an example, but also
> when providing a dynamic amount of performance differentiated memory
> via virtio-mem) where this will most probably matter. With performance
> differentiated memory we'll see a lot more nodes getting used in
> general, and a lot more nodes eventually getting hotplugged.

There are certainly machines with many nodes. E.g. SLES kernels are
built with CONFIG_NODES_SHIFT=10, which allows a lot of potential nodes.
And I have seen really large machines with many nodes, but those usually
come with a lot of memory and they do not tend to have unpopulated
nodes AFAIR.

> If 128 nodes is realistic, I cannot tell.
> 
> We could optimize by allocating some members dynamically. For example
> we'll never need MAX_NUMNODES entries, but only the number of possible
> nodes.

Yes, agreed. Scaling with MAX_NUMNODES is almost always wasteful.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:34                                                                               ` Michal Hocko
@ 2021-12-08  8:38                                                                                 ` David Hildenbrand
  0 siblings, 0 replies; 98+ messages in thread
From: David Hildenbrand @ 2021-12-08  8:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

>>
>> I think we might soon see setups (again, CXL is an example, but also
>> when providing a dynamic amount of performance differentiated memory
>> via virtio-mem) where this will most probably matter. With performance
>> differentiated memory we'll see a lot more nodes getting used in
>> general, and a lot more nodes eventually getting hotplugged.
> 
> There are certainly machines with many nodes. E.g. SLES kernels are
> built with CONFIG_NODES_SHIFT=10 which is a lot of potential nodes.
> And I have seen really large machines with many nodes but those usually
> come with a lot of memory and they do not tend to have unpopulated
> nodes AFAIR.

Right, and that is about to change as nodes are getting used to represent
memory with differing performance characteristics or individual devices,
rather than the traditional "this is a socket" setup: we'll see more and
more small (virtual) machines with multiple nodes and eventually many
possible nodes.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 10:54                                           ` Michal Hocko
  2021-12-07 11:08                                             ` David Hildenbrand
@ 2021-12-08  8:54                                             ` Michal Hocko
  2021-12-08  8:57                                               ` Alexey Makhalov
  2021-12-09  2:16                                               ` Alexey Makhalov
  2021-12-09 10:48                                             ` Michal Hocko
  2 siblings, 2 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  8:54 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

Alexey,
this is still not finalized, but it would really help if you could give
it a spin on your setup. I still have to think about how to transition
from a memoryless node to a standard node (in the hotplug code). Also, there
might be other surprises on the way.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..8ed8db2ccb13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * in free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
 			build_zonelists(pgdat);
@@ -8032,8 +8036,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
+
+			/* Allocator not initialized yet */
+			pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
+			if (!pgdat) {
+				pr_err("Cannot allocate %zuB for node %d.\n",
+						sizeof(*pgdat), nid);
+				continue;
+			}
+			/* TODO do we need this for memoryless nodes */
+			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_memoryless_node(nid);
+			/*
+			 * not marking this node online because we do not want to
+			 * confuse userspace by sysfs files/directories for node
+			 * without any memory attached to it (see topology_init)
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:54                                             ` Michal Hocko
@ 2021-12-08  8:57                                               ` Alexey Makhalov
  2021-12-08  9:55                                                 ` Michal Hocko
  2021-12-09  2:16                                               ` Alexey Makhalov
  1 sibling, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-08  8:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable



> On Dec 8, 2021, at 12:54 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> Alexey,
> this is still not finalized but it would really help if you could give
> it a spin on your setup. I still have to think about how to transition
> from a memoryless node to standard node (in hotplug code). Also there
> might be other surprises on the way.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..8ed8db2ccb13 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> 	if (self && !node_online(self->node_id)) {
> 		build_zonelists(self);
> 	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated
> +		 * free_area_init
> +		 */
> +		for_each_node(nid) {
> 			pg_data_t *pgdat = NODE_DATA(nid);
> 
> 			build_zonelists(pgdat);
> @@ -8032,8 +8036,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> 	/* Initialise every node */
> 	mminit_verify_pageflags_layout();
> 	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> +
> +			/* Allocator not initialized yet */
> +			pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
> +			if (!pgdat) {
> +				pr_err("Cannot allocate %zuB for node %d.\n",
> +						sizeof(*pgdat), nid);
> +				continue;
> +			}
> +			/* TODO do we need this for memoryless nodes */
> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +			arch_refresh_nodedata(nid, pgdat);
> +			free_area_init_memoryless_node(nid);
> +			/*
> +			 * not marking this node online because we do not want to
> +			 * confuse userspace by sysfs files/directories for node
> +			 * without any memory attached to it (see topology_init)
> +			 */
> +			continue;
> +		}
> +
> +		pgdat = NODE_DATA(nid);
> 		free_area_init_node(nid);
> 
> 		/* Any memory on that node */
> 

Sure Michal, I’ll give it a spin.

Thanks for your attention to this topic.

Regarding memory waste:
Here is what I found while using a VM with 128 possible NUMA nodes.
My Linux build on a VM with only one NUMA node boots with 192 MB of RAM,
but with 128 nodes it requires 1 GB of RAM just to boot. It is a server distro
with a minimal set of systemd services and no UI.

meminfo shows:
1 node case: Percpu:            53760 kB
128 nodes:   Percpu:           718048 kB !!!

Initial analysis of multi-node memory consumption showed at least this difference:

Every memcg allocates mem_cgroup_per_node info for all possible nodes.
Each mem_cgroup_per_node has per-CPU stats.
That means each memcg allocates 128*sizeof(struct mem_cgroup_per_node) + 16384*sizeof(struct lruvec_stats_percpu).

See: mem_cgroup_alloc() -> alloc_mem_cgroup_per_node_info()

There is also an old comment about it in alloc_mem_cgroup_per_node_info():
        /*
         * This routine is called against possible nodes.
         * But it's BUG to call kmalloc() against offline node.
         *
         * TODO: this routine can waste much memory for nodes which will
         *       never be onlined. It's better to use memory hotplug callback
         *       function.
         */
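
As a rough illustration of the arithmetic above, the per-memcg cost can be
sketched in C. The function names are hypothetical, and the structure sizes are
passed in as parameters because they depend on kernel version and config. The
factor 16384 is assumed here to be 128 possible nodes times 128 possible CPUs;
the message does not state the CPU count explicitly.

```c
#include <stddef.h>

/*
 * Number of per-CPU stat instances one memcg ends up with:
 * one instance per (possible node, possible CPU) pair.
 * Hypothetical helper, not a kernel function.
 */
static size_t memcg_percpu_instances(size_t possible_nodes, size_t possible_cpus)
{
	return possible_nodes * possible_cpus;
}

/*
 * Approximate bytes one memcg spends on per-node info:
 *   nodes * per_node_size + nodes * cpus * percpu_stats_size
 * The two sizes stand in for sizeof(struct mem_cgroup_per_node) and
 * sizeof(struct lruvec_stats_percpu); they are assumptions supplied by
 * the caller.
 */
static size_t memcg_per_node_bytes(size_t possible_nodes, size_t possible_cpus,
				   size_t per_node_size, size_t percpu_stats_size)
{
	return possible_nodes * per_node_size +
	       memcg_percpu_instances(possible_nodes, possible_cpus) *
	       percpu_stats_size;
}
```

With 128 possible nodes and an assumed 128 possible CPUs,
memcg_percpu_instances() gives the 16384 multiplier quoted above, while a
single-node system with the same CPU count would need only 128 instances.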

Regards,
—Alexey


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:57                                               ` Alexey Makhalov
@ 2021-12-08  9:55                                                 ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-08  9:55 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Wed 08-12-21 08:57:28, Alexey Makhalov wrote:
> 
> 
> > On Dec 8, 2021, at 12:54 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > Alexey,
> > this is still not finalized but it would really help if you could give
> > it a spin on your setup. I still have to think about how to transition
> > from a memoryless node to standard node (in hotplug code). Also there
> > might be other surprises on the way.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c5952749ad40..8ed8db2ccb13 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> > 	if (self && !node_online(self->node_id)) {
> > 		build_zonelists(self);
> > 	} else {
> > -		for_each_online_node(nid) {
> > +		/*
> > +		 * All possible nodes have pgdat preallocated
> > +		 * free_area_init
> > +		 */
> > +		for_each_node(nid) {
> > 			pg_data_t *pgdat = NODE_DATA(nid);
> > 
> > 			build_zonelists(pgdat);
> > @@ -8032,8 +8036,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> > 	/* Initialise every node */
> > 	mminit_verify_pageflags_layout();
> > 	setup_nr_node_ids();
> > -	for_each_online_node(nid) {
> > -		pg_data_t *pgdat = NODE_DATA(nid);
> > +	for_each_node(nid) {
> > +		pg_data_t *pgdat;
> > +
> > +		if (!node_online(nid)) {
> > +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> > +
> > +			/* Allocator not initialized yet */
> > +			pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
> > +			if (!pgdat) {
> > +				pr_err("Cannot allocate %zuB for node %d.\n",
> > +						sizeof(*pgdat), nid);
> > +				continue;
> > +			}
> > +			/* TODO do we need this for memoryless nodes */
> > +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> > +			arch_refresh_nodedata(nid, pgdat);
> > +			free_area_init_memoryless_node(nid);
> > +			/*
> > +			 * not marking this node online because we do not want to
> > +			 * confuse userspace by sysfs files/directories for node
> > +			 * without any memory attached to it (see topology_init)
> > +			 */
> > +			continue;
> > +		}
> > +
> > +		pgdat = NODE_DATA(nid);
> > 		free_area_init_node(nid);
> > 
> > 		/* Any memory on that node */
> > 
> 
> Sure Michal, I’ll give it a spin.

Thanks!

> Thanks for your attention to this topic.
> 
> Regarding memory waste:
> Here is what I found while using a VM with 128 possible NUMA nodes.
> My Linux build on a VM with only one NUMA node boots with 192 MB of RAM,
> but with 128 nodes it requires 1 GB of RAM just to boot. It is a server distro
> with a minimal set of systemd services and no UI.
> 
> meminfo shows:
> 1 node case: Percpu:            53760 kB
> 128 nodes:   Percpu:           718048 kB !!!
> 
> Initial analysis of multi-node memory consumption showed at least this difference:
> 
> Every memcg allocates mem_cgroup_per_node info for all possible nodes.
> Each mem_cgroup_per_node has per-CPU stats.
> That means each memcg allocates 128*sizeof(struct mem_cgroup_per_node) + 16384*sizeof(struct lruvec_stats_percpu).
> 
> See: mem_cgroup_alloc() -> alloc_mem_cgroup_per_node_info()
> 
> There is also an old comment about it in alloc_mem_cgroup_per_node_info():
>         /*
>          * This routine is called against possible nodes.
>          * But it's BUG to call kmalloc() against offline node.
>          *
>          * TODO: this routine can waste much memory for nodes which will
>          *       never be onlined. It's better to use memory hotplug callback
>          *       function.
>          */

Please report that separately. There are likely more places like that.
I do not think many subsystems (including MM) optimize for very sparse
possible node masks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-08  8:54                                             ` Michal Hocko
  2021-12-08  8:57                                               ` Alexey Makhalov
@ 2021-12-09  2:16                                               ` Alexey Makhalov
  2021-12-09  8:46                                                 ` Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-09  2:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

Hi Michal,


> On Dec 8, 2021, at 12:54 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> Alexey,
> this is still not finalized but it would really help if you could give
> it a spin on your setup. I still have to think about how to transition
> from a memoryless node to standard node (in hotplug code). Also there
> might be other surprises on the way.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..8ed8db2ccb13 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
> 	if (self && !node_online(self->node_id)) {
> 		build_zonelists(self);
> 	} else {
> -		for_each_online_node(nid) {
> +		/*
> +		 * All possible nodes have pgdat preallocated
> +		 * free_area_init
> +		 */
> +		for_each_node(nid) {
> 			pg_data_t *pgdat = NODE_DATA(nid);
> 
> 			build_zonelists(pgdat);
> @@ -8032,8 +8036,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> 	/* Initialise every node */
> 	mminit_verify_pageflags_layout();
> 	setup_nr_node_ids();
> -	for_each_online_node(nid) {
> -		pg_data_t *pgdat = NODE_DATA(nid);
> +	for_each_node(nid) {
> +		pg_data_t *pgdat;
> +
> +		if (!node_online(nid)) {
> +			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> +
> +			/* Allocator not initialized yet */
> +			pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
> +			if (!pgdat) {
> +				pr_err("Cannot allocate %zuB for node %d.\n",
> +						sizeof(*pgdat), nid);
> +				continue;
> +			}
> +			/* TODO do we need this for memoryless nodes */
> +			pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
> +			arch_refresh_nodedata(nid, pgdat);
> +			free_area_init_memoryless_node(nid);
> +			/*
> +			 * not marking this node online because we do not want to
> +			 * confuse userspace by sysfs files/directories for node
> +			 * without any memory attached to it (see topology_init)
> +			 */
> +			continue;
> +		}
> +
> +		pgdat = NODE_DATA(nid);
> 		free_area_init_node(nid);
> 
> 		/* Any memory on that node */


After applying this patch, the kernel panics in early boot with:
[    0.081838] Initmem setup node 0 [mem 0x0000000000001000-0x000000007fffffff]
[    0.081842] Initmem setup node 1 [mem 0x0000000080000000-0x000000013fffffff]
[    0.081844] Node 2 uninitialized by the platform. Please report with boot dmesg.
[    0.081877] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    0.081879] #PF: supervisor read access in kernel mode
[    0.081882] #PF: error_code(0x0000) - not-present page
[    0.081884] PGD 0 P4D 0
[    0.081887] Oops: 0000 [#1] SMP PTI
[    0.081890] CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0+ #33
[    0.081893] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
[    0.081896] RIP: 0010:pcpu_alloc+0x330/0x850
[    0.081903] Code: c7 c7 e4 38 5b 82 e8 5f b5 60 00 81 7d ac c0 0c 00 00 0f 85 f1 04 00 00 48
[    0.081906] RSP: 0000:ffffffff82003dc0 EFLAGS: 00010046
[    0.081909] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000cc0
[    0.081911] RDX: 0000000000000003 RSI: 0000000000000006 RDI: ffffffff825b38e4
[    0.081913] RBP: ffffffff82003e40 R08: ffff88813ffb7480 R09: 0000000000001000
[    0.081915] R10: 0000000000001000 R11: 000000013ffff000 R12: 0000000000000001
[    0.081917] R13: 0000000001a2c000 R14: 0000000000000000 R15: 0000000000000003
[    0.081919] FS:  0000000000000000(0000) GS:ffffffff822ee000(0000) knlGS:0000000000000000
[    0.081921] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.081923] CR2: 0000000000000000 CR3: 000000000200a000 CR4: 00000000000606b0
[    0.081946] Call Trace:
[    0.081951]  __alloc_percpu+0x15/0x20
[    0.081954]  free_area_init+0x270/0x300
[    0.081960]  zone_sizes_init+0x44/0x46
[    0.081965]  paging_init+0x23/0x25
[    0.081969]  setup_arch+0x5aa/0x668
[    0.081973]  start_kernel+0x53/0x5b6
[    0.081978]  x86_64_start_reservations+0x24/0x26
[    0.081983]  x86_64_start_kernel+0x70/0x74
[    0.081986]  secondary_startup_64_no_verify+0xb0/0xbb
[    0.081991] Modules linked in:
[    0.081993] CR2: 0000000000000000
[    0.081996] random: get_random_bytes called from oops_exit+0x39/0x60 with crng_init=0


pcpu_alloc+0x330 is
/root/linux-5.15.0/mm/percpu.c:1833
        if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
    359e:       48 63 05 00 00 00 00    movslq 0x0(%rip),%rax        # 35a5 <pcpu_alloc+0x325>
                        35a1: R_X86_64_PC32     .data..ro_after_init+0x5c
    35a5:       48 c1 e0 04             shl    $0x4,%rax
    35a9:       48 03 05 00 00 00 00    add    0x0(%rip),%rax        # 35b0 <pcpu_alloc+0x330>
                        35ac: R_X86_64_PC32     pcpu_chunk_lists-0x4
list_empty():
/root/linux-5.15.0/./include/linux/list.h:282
        return READ_ONCE(head->next) == head;
    35b0:       48 8b 10                mov    (%rax),%rdx                     <— rax == 0



free_area_init() -> /* added by patch */ alloc_percpu() -> pcpu_alloc():
        /*
         * No space left.  Create a new chunk.  We don't want multiple
         * tasks to create chunks simultaneously.  Serialize and create iff
         * there's still no empty chunk after grabbing the mutex.
         */
        if (is_atomic) {
                err = "atomic alloc failed, no space left";
                goto fail;
        }

        if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {                     <— &pcpu_chunk_lists[pcpu_free_slot]) == NULL
                chunk = pcpu_create_chunk(pcpu_gfp);
                if (!chunk) {
                        err = "failed to allocate new chunk";
                        goto fail;
                }

                spin_lock_irqsave(&pcpu_lock, flags);
                pcpu_chunk_relocate(chunk, -1);
        } else { 


This patch calls alloc_percpu() from setup_arch() while the percpu allocator is not yet initialized (before setup_per_cpu_areas()).
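
The disassembly above can be read as a simple address computation: the slot
index is shifted left by 4 (the size of a two-pointer list head on x86-64) and
added to the array base, and list_empty() then reads the `next` field at
offset 0. A minimal userspace sketch of that computation follows; the type and
function names are hypothetical, and integer arithmetic is used so that an
unset (zero) base can be modelled safely.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head (two pointers). */
struct list_head_model {
	struct list_head_model *next, *prev;
};

/*
 * Address that list_empty() reads first: the `next` field at offset 0 of
 * the list head in slot `free_slot` of the array starting at `lists_base`.
 * Hypothetical helper that mirrors the shl/add pair in the disassembly.
 */
static uintptr_t first_read_address(uintptr_t lists_base, int free_slot)
{
	return lists_base +
	       (uintptr_t)free_slot * sizeof(struct list_head_model);
}
```

With a zero base and slot 0 the result is address 0, which matches the
reported CR2 value and is consistent with neither variable having been set up
yet at this point in boot.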

Thanks,
—Alexey


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09  2:16                                               ` Alexey Makhalov
@ 2021-12-09  8:46                                                 ` Michal Hocko
  2021-12-09  9:28                                                   ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-09  8:46 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Thu 09-12-21 02:16:17, Alexey Makhalov wrote:
> This patch calls alloc_percpu() from setup_arch() while percpu
> allocator is not yet initialized (before setup_per_cpu_areas()).

Yeah, I hadn't realized the percpu allocator is not available at that point.
I was not really sure about that. Could you try with the alloc_percpu dropped?

Thanks for testing!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09  8:46                                                 ` Michal Hocko
@ 2021-12-09  9:28                                                   ` Alexey Makhalov
  2021-12-09  9:56                                                     ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-09  9:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable



> On Dec 9, 2021, at 12:46 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Thu 09-12-21 02:16:17, Alexey Makhalov wrote:
>> This patch calls alloc_percpu() from setup_arch() while percpu
>> allocator is not yet initialized (before setup_per_cpu_areas()).
> 
> Yeah, I haven't realized the pcp is not available. I was not really sure
> about that. Could you try with the alloc_percpu dropped?
> 
> Thanks for testing!
> -- 
> Michal Hocko
> SUSE Labs

It boots now. dmesg has these new messages:

[    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
[    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
...
[    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
[    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]

vCPU/node hot add works.
Onlining works as well, but with a warning. I do not think it is related to the patch:
[   36.838838] CPU4 has been hot-added
[   36.838987] acpi_processor_hotadd_init:205 cpu 4, node 4, online 0, ndata 00000000e9c7f79b
[   48.480498] Built 4 zonelists, mobility grouping on.  Total pages: 961440
[   48.480508] Policy zone: Normal
[   48.508318] smpboot: Booting Node 4 Processor 4 APIC 0x8
[   48.509255] Disabled fast string operations
[   48.509807] smpboot: CPU 4 Converting physical 8 to logical package 4
[   48.509825] smpboot: CPU 4 Converting physical 0 to logical die 4
[   48.510040] WARNING: workqueue cpumask: online intersect > possible intersect
[   48.510324] vmware: vmware-stealtime: cpu 4, pa 3e667000
[   48.511311] Will online and init hotplugged CPU: 4

Hot remove does not quite work. It might be an issue in the ACPI/firmware code or the hypervisor. Debugging…

Do you want me to perform any specific tests?

Regards,
—Alexey

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09  9:28                                                   ` Alexey Makhalov
@ 2021-12-09  9:56                                                     ` Michal Hocko
  2021-12-09 10:23                                                       ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-09  9:56 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Thu 09-12-21 09:28:55, Alexey Makhalov wrote:
> 
> 
> > On Dec 9, 2021, at 12:46 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Thu 09-12-21 02:16:17, Alexey Makhalov wrote:
> >> This patch calls alloc_percpu() from setup_arch() while percpu
> >> allocator is not yet initialized (before setup_per_cpu_areas()).
> > 
> > Yeah, I haven't realized the pcp is not available. I was not really sure
> > about that. Could you try with the alloc_percpu dropped?
> > 
> > Thanks for testing!
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> It boots now. dmesg has these new messages:
> 
> [    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
> [    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> ...
> [    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
> [    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]

Interesting that only those two didn't get a proper arch specific
initialization. Could you check why? I assume init_cpu_to_node
doesn't see any CPU pointing at this node. Wondering why that would be
the case but that can be a bug in the affinity tables.

> vCPU/node hot add works.
> Onlining works as well, but with warning. I do not think it is related to the patch:
> [   36.838838] CPU4 has been hot-added
> [   36.838987] acpi_processor_hotadd_init:205 cpu 4, node 4, online 0, ndata 00000000e9c7f79b
> [   48.480498] Built 4 zonelists, mobility grouping on.  Total pages: 961440
> [   48.480508] Policy zone: Normal
> [   48.508318] smpboot: Booting Node 4 Processor 4 APIC 0x8
> [   48.509255] Disabled fast string operations
> [   48.509807] smpboot: CPU 4 Converting physical 8 to logical package 4
> [   48.509825] smpboot: CPU 4 Converting physical 0 to logical die 4
> [   48.510040] WARNING: workqueue cpumask: online intersect > possible intersect

I will double check. There are changes required on the hotplug side. I
would like to see that this one doesn't blow up before diving there.

> [   48.510324] vmware: vmware-stealtime: cpu 4, pa 3e667000
> [   48.511311] Will online and init hotplugged CPU: 4
> 
> Hot remove does not quite work. It might be issue in ACPI/Firmware code or Hypervisor. Debugging…
> 
> Do you want me to perform any specific tests?

No, not really. AFAIU your issue has been reproducible during boot and
that seems to be fixed. I will work on the hotplug side of things
and post something resembling a real patch soon. That would also require
memory hotplug testing.

Thanks for your help!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09  9:56                                                     ` Michal Hocko
@ 2021-12-09 10:23                                                       ` Alexey Makhalov
  2021-12-09 13:29                                                         ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-09 10:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

[-- Attachment #1: Type: text/plain, Size: 2392 bytes --]



> On Dec 9, 2021, at 1:56 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Thu 09-12-21 09:28:55, Alexey Makhalov wrote:
>> 
>> 
>> [    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
>> [    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
>> ...
>> [    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
>> [    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]
> 
> Interesting that only those two didn't get a proper arch specific
> initialization. Could you check why? I assume init_cpu_to_node
> doesn't see any CPU pointing at this node. Wondering why that would be
> the case but that can be a bug in the affinity tables.

My bad, I trimmed the log too much. Not just these two: all possible but not
present nodes from 4 to 127 print this message.


> 
>> vCPU/node hot add works.
>> Onlining works as well, but with warning. I do not think it is related to the patch:
>> [   36.838838] CPU4 has been hot-added
>> [   36.838987] acpi_processor_hotadd_init:205 cpu 4, node 4, online 0, ndata 00000000e9c7f79b
>> [   48.480498] Built 4 zonelists, mobility grouping on.  Total pages: 961440
>> [   48.480508] Policy zone: Normal
>> [   48.508318] smpboot: Booting Node 4 Processor 4 APIC 0x8
>> [   48.509255] Disabled fast string operations
>> [   48.509807] smpboot: CPU 4 Converting physical 8 to logical package 4
>> [   48.509825] smpboot: CPU 4 Converting physical 0 to logical die 4
>> [   48.510040] WARNING: workqueue cpumask: online intersect > possible intersect
> 
> I will double check. There are changes required on the hotplug side. I
> would like to see that this one doesn't blow up before diving there.
> 
>> [   48.510324] vmware: vmware-stealtime: cpu 4, pa 3e667000
>> [   48.511311] Will online and init hotplugged CPU: 4
>> 
>> Hot remove does not quite work. It might be issue in ACPI/Firmware code or Hypervisor. Debugging…
>> 
>> Do you want me to perform any specific tests?
> 
> No, not really. AFAIU your issue has been reproducible during boot and
> that seems to be fixed. I will work on the hotplug side of the things
> and post something resembling a real patch soon. That would require also
> memory hotplug testing.
I can help you with memory hotplug testing if needed.

Thanks,
—Alexey

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-07 10:54                                           ` Michal Hocko
  2021-12-07 11:08                                             ` David Hildenbrand
  2021-12-08  8:54                                             ` Michal Hocko
@ 2021-12-09 10:48                                             ` Michal Hocko
  2021-12-13 15:06                                               ` Michal Hocko
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
  2 siblings, 2 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-09 10:48 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable, Nico Pache

[Cc Nico who has reported a similar problem]

Another attempt to handle the issue. Can you give this a try please?
David, could you check whether anything hotplug related is missing
at this stage? Once we are settled on this fix I would like to get rid
of the node_state (offline/online) but that is work on top.
---
From 36782ebaab2eaec637627506ab627c554bb948de Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 9 Dec 2021 10:00:02 +0100
Subject: [PATCH] mm: handle uninitialized numa nodes gracefully

We have had several reports [1][2][3] that the page allocator blows up when
an allocation from a possible node is requested. The underlying reason
is that NODE_DATA for the specific node is not allocated.

NUMA specific initialization is arch specific and it can vary a lot.
E.g. x86 tries to initialize all nodes that have some cpu affinity (see
init_cpu_to_node) but this can be insufficient because the node might be
cpuless for example.

One way to address this problem would be to check for !node_online nodes
when trying to get a zonelist and silently fall back to another node.
That is unfortunately adding a branch into allocator hot path and it
doesn't handle any other potential NODE_DATA users.

This patch takes a different approach (following the lead of [3]) and it
preallocates pgdat for all possible nodes in arch independent code
- free_area_init. All uninitialized nodes are treated as memoryless
nodes. node_state of the node is not changed because that would lead to
other side effects - e.g. sysfs representation of such a node and from
past discussions [4] it is known that some tools might have problems
digesting that.

Newly allocated pgdat only gets a minimal initialization and the rest of
the work is expected to be done by the memory hotplug - hotadd_new_pgdat
(renamed to hotadd_init_pgdat).

generic_alloc_nodedata is changed to use the memblock allocator because
neither page nor slab allocators are available at the stage when all
pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
use the early boot allocator. The only arch specific implementation is
ia64 and that is changed to use the early allocator as well.

Reported-by: Alexey Makhalov <amakhalov@vmware.com>
Reported-by: Nico Pache <npache@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
[2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
[3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
---
 arch/ia64/mm/discontig.c       |  2 +-
 include/linux/memory_hotplug.h |  5 ++---
 mm/memory_hotplug.c            | 21 +++++++++------------
 mm/page_alloc.c                | 34 +++++++++++++++++++++++++++++++---
 4 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 791d4176e4a6..9a7a09e0aa52 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -613,7 +613,7 @@ pg_data_t *arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
 
-	return kzalloc(size, GFP_KERNEL);
+	return memblock_alloc(size, SMP_CACHE_BYTES);
 }
 
 void arch_free_nodedata(pg_data_t *pgdat)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index be48e003a518..38f8d33f0884 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -176,14 +176,13 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
 
 #ifdef CONFIG_NUMA
 /*
- * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
- * XXX: kmalloc_node() can't work well to get new node's memory at this time.
+ * XXX: node aware allocation can't work well to get new node's memory at this time.
  *	Because, pgdat for the new node is not allocated/initialized yet itself.
  *	To use new node's memory, more consideration will be necessary.
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 852041f6be41..2d38a431f62f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1161,19 +1161,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-static pg_data_t __ref *hotadd_new_pgdat(int nid)
+static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
 	struct pglist_data *pgdat;
 
 	pgdat = NODE_DATA(nid);
-	if (!pgdat) {
-		pgdat = arch_alloc_nodedata(nid);
-		if (!pgdat)
-			return NULL;
 
+	/*
+	 * NODE_DATA is preallocated (free_area_init) but its internal
+	 * state is not allocated completely. Add missing pieces.
+	 * Completely offline nodes stay around and they just need
+	 * reinitialization.
+	 */
+	if (!pgdat->per_cpu_nodestats) {
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
-		arch_refresh_nodedata(nid, pgdat);
 	} else {
 		int cpu;
 		/*
@@ -1192,8 +1194,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
 		}
 	}
 
-	/* we can use NODE_DATA(nid) from here */
-	pgdat->node_id = nid;
 	pgdat->node_start_pfn = 0;
 
 	/* init node's zones as empty zones, we don't have any present pages.*/
@@ -1245,7 +1245,7 @@ static int __try_online_node(int nid, bool set_node_online)
 	if (node_online(nid))
 		return 0;
 
-	pgdat = hotadd_new_pgdat(nid);
+	pgdat = hotadd_init_pgdat(nid);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
 		ret = -ENOMEM;
@@ -1444,9 +1444,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	return ret;
 error:
-	/* rollback pgdat allocation and others */
-	if (new_node)
-		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
 error_mem_hotplug_end:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..f2ceffadf4eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * in free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
 			build_zonelists(pgdat);
@@ -8032,8 +8036,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
+
+			/* Allocator not initialized yet */
+			pgdat = arch_alloc_nodedata(nid);
+			if (!pgdat) {
+				pr_err("Cannot allocate %zuB for node %d.\n",
+						sizeof(*pgdat), nid);
+				continue;
+			}
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_memoryless_node(nid);
+			/*
+			 * not marking this node online because we do not want to
+			 * confuse userspace by sysfs files/directories for node
+			 * without any memory attached to it (see topology_init)
+			 * The pgdat will get fully initialized when a memory is
+			 * hotpluged into it by hotadd_init_pgdat
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
2.30.2

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09 10:23                                                       ` Alexey Makhalov
@ 2021-12-09 13:29                                                         ` Michal Hocko
  2021-12-09 19:01                                                           ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-09 13:29 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Thu 09-12-21 10:23:52, Alexey Makhalov wrote:
> 
> 
> > On Dec 9, 2021, at 1:56 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Thu 09-12-21 09:28:55, Alexey Makhalov wrote:
> >> 
> >> 
> >> [    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
> >> [    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> >> ...
> >> [    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
> >> [    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]
> > 
> > Interesting that only those two didn't get a proper arch specific
> > initialization. Could you check why? I assume init_cpu_to_node
> > doesn't see any CPU pointing at this node. Wondering why that would be
> > the case but that can be a bug in the affinity tables.
> 
> My bad shrinking. Not just these 2, but all possible and not present nodes from 4 to 127
> are having this message.

Does that mean that your possible (but offline) cpus do not set their
affinity?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09 13:29                                                         ` Michal Hocko
@ 2021-12-09 19:01                                                           ` Alexey Makhalov
  2021-12-10  9:11                                                             ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-09 19:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable



> On Dec 9, 2021, at 5:29 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Thu 09-12-21 10:23:52, Alexey Makhalov wrote:
>> 
>> 
>>> On Dec 9, 2021, at 1:56 AM, Michal Hocko <mhocko@suse.com> wrote:
>>> 
>>> On Thu 09-12-21 09:28:55, Alexey Makhalov wrote:
>>>> 
>>>> 
>>>> [    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
>>>> [    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
>>>> ...
>>>> [    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
>>>> [    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]
>>> 
>>> Interesting that only those two didn't get a proper arch specific
>>> initialization. Could you check why? I assume init_cpu_to_node
>>> doesn't see any CPU pointing at this node. Wondering why that would be
>>> the case but that can be a bug in the affinity tables.
>> 
>> My bad shrinking. Not just these 2, but all possible and not present nodes from 4 to 127
>> are having this message.
> 
> Does that mean that your possible (but offline) cpus do not set their
> affinity?
> 
Hi Michal,

I didn’t quite get the question here. Do you mean scheduler affinity for offlined/not present CPUs?
From the patch, this message should be printed for every possible offlined node:
	for_each_node(nid) {
...
		if (!node_online(nid)) {
			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);

Thanks,
—Alexey



* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09 19:01                                                           ` Alexey Makhalov
@ 2021-12-10  9:11                                                             ` Michal Hocko
  2021-12-17 12:53                                                               ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-10  9:11 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Thu 09-12-21 19:01:03, Alexey Makhalov wrote:
> 
> 
> > On Dec 9, 2021, at 5:29 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Thu 09-12-21 10:23:52, Alexey Makhalov wrote:
> >> 
> >> 
> >>> On Dec 9, 2021, at 1:56 AM, Michal Hocko <mhocko@suse.com> wrote:
> >>> 
> >>> On Thu 09-12-21 09:28:55, Alexey Makhalov wrote:
> >>>> 
> >>>> 
> >>>> [    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
> >>>> [    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> >>>> ...
> >>>> [    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
> >>>> [    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]
> >>> 
> >>> Interesting that only those two didn't get a proper arch specific
> >>> initialization. Could you check why? I assume init_cpu_to_node
> >>> doesn't see any CPU pointing at this node. Wondering why that would be
> >>> the case but that can be a bug in the affinity tables.
> >> 
> >> My bad shrinking. Not just these 2, but all possible and not present nodes from 4 to 127
> >> are having this message.
> > 
> > Does that mean that your possible (but offline) cpus do not set their
> > affinity?
> > 
> Hi Michal,
> 
> I didn’t quite get the question here. Do you mean scheduler affinity for offlined/not present CPUs?
> From the patch, this message should be printed for every possible offlined node:
> 	for_each_node(nid) {
> ...
> 		if (!node_online(nid)) {
> 			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);

Sure, let me expand on this a bit. X86 initialization code
(init_cpu_to_node) does
        for_each_possible_cpu(cpu) {
                int node = numa_cpu_node(cpu);

                if (node == NUMA_NO_NODE)
                        continue;

                if (!node_online(node))
                        init_memory_less_node(node);

                numa_set_node(cpu, node);
        }

which means that a memoryless node is left uninitialized when either
	- your offline CPUs are not listed in possible cpus for some
	  reason
	- or they do not have any node affinity (numa_cpu_node is
	  NUMA_NO_NODE).

Could you check what is the reason in your particular case please?
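
To make the two cases concrete, here is a small userspace model of that
loop. It is only a sketch: struct cpu_desc, struct node_state and the
model_init_cpu_to_node/count_uninitialized helpers are made up for
illustration and are not kernel interfaces. Only the control flow
mirrors the snippet above.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS		4
#define MAX_NUMNODES	4
#define NUMA_NO_NODE	(-1)

/* Hypothetical per-CPU description, standing in for the firmware tables. */
struct cpu_desc {
	bool possible;	/* CPU is part of the possible mask */
	int node;	/* node affinity, or NUMA_NO_NODE */
};

/* Hypothetical per-node state tracked by this model. */
struct node_state {
	bool online;		/* node was brought up with memory */
	bool initialized;	/* node has its data set up */
};

/*
 * Model of the loop: only nodes referenced by a possible CPU with a
 * valid affinity get the memoryless initialization.
 */
void model_init_cpu_to_node(const struct cpu_desc *cpus,
			    struct node_state *nodes)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int node;

		if (!cpus[cpu].possible)
			continue;	/* never visited by for_each_possible_cpu */

		node = cpus[cpu].node;
		if (node == NUMA_NO_NODE)
			continue;	/* no affinity, nothing to initialize */

		assert(node >= 0 && node < MAX_NUMNODES);
		if (!nodes[node].online)
			nodes[node].initialized = true;	/* init_memory_less_node */
	}
}

/* Count the nodes that the loop above never set up. */
int count_uninitialized(const struct node_state *nodes)
{
	int count = 0;

	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (!nodes[nid].initialized)
			count++;
	return count;
}
```

In this model, a node whose only CPU is missing from the possible mask,
or whose CPUs all report NUMA_NO_NODE, stays uninitialized after the
loop has run.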

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-09 10:48                                             ` Michal Hocko
@ 2021-12-13 15:06                                               ` Michal Hocko
  2021-12-13 15:07                                                 ` David Hildenbrand
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-13 15:06 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable, Nico Pache

On Thu 09-12-21 11:48:42, Michal Hocko wrote:
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 852041f6be41..2d38a431f62f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1161,19 +1161,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
>  }
>  
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> -static pg_data_t __ref *hotadd_new_pgdat(int nid)
> +static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  {
>  	struct pglist_data *pgdat;
>  
>  	pgdat = NODE_DATA(nid);
> -	if (!pgdat) {
> -		pgdat = arch_alloc_nodedata(nid);
> -		if (!pgdat)
> -			return NULL;
>  
> +	/*
> +	 * NODE_DATA is preallocated (free_area_init) but its internal
> +	 * state is not allocated completely. Add missing pieces.
> +	 * Completely offline nodes stay around and they just need
> +	 * reintialization.
> +	 */
> +	if (!pgdat->per_cpu_nodestats) {
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
> -		arch_refresh_nodedata(nid, pgdat);

This should really be 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 42211485bcf3..2daa88ce8c80 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1173,7 +1173,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
 	 * Completely offline nodes stay around and they just need
 	 * reintialization.
 	 */
-	if (!pgdat->per_cpu_nodestats) {
+	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
 	} else {
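
The reason for comparing against &boot_nodestats instead of NULL, as far
as I understand the boot path: a pgdat that went through the early init
has per_cpu_nodestats pointing at the static boot_nodestats, so it is
never NULL. A userspace sketch of that distinction follows. The
model_* helpers, null_check_would_allocate and the simplified struct
layouts are assumptions for illustration only, and calloc stands in for
alloc_percpu.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures. */
struct per_cpu_nodestat {
	long diff[4];
};

struct pglist_data {
	struct per_cpu_nodestat *per_cpu_nodestats;
};

/* Shared placeholder used until a node gets its own stats area. */
struct per_cpu_nodestat boot_nodestats;

/* Model of the early init: point at the shared placeholder. */
void model_boot_init(struct pglist_data *pgdat)
{
	pgdat->per_cpu_nodestats = &boot_nodestats;
}

/*
 * Model of the hotadd path.
 * Returns 1 if a private area was allocated, 0 if an existing private
 * area was reset, -1 on allocation failure.
 */
int model_hotadd_init(struct pglist_data *pgdat)
{
	assert(pgdat && pgdat->per_cpu_nodestats);

	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
		struct per_cpu_nodestat *p = calloc(1, sizeof(*p));

		if (!p)
			return -1;
		pgdat->per_cpu_nodestats = p;
		return 1;
	}

	/* The node was online before: only clear the stale counters. */
	memset(pgdat->per_cpu_nodestats, 0, sizeof(*pgdat->per_cpu_nodestats));
	return 0;
}

/* What the earlier "!pgdat->per_cpu_nodestats" test would have decided. */
int null_check_would_allocate(const struct pglist_data *pgdat)
{
	return pgdat->per_cpu_nodestats == NULL;
}
```

In the model, the NULL test never fires for a preallocated pgdat, so the
node would keep sharing boot_nodestats, while the sentinel comparison
allocates a private area on the first hotadd and only resets it on later
ones.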
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-13 15:06                                               ` Michal Hocko
@ 2021-12-13 15:07                                                 ` David Hildenbrand
  2021-12-14  8:38                                                   ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-13 15:07 UTC (permalink / raw)
  To: Michal Hocko, Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	Oscar Salvador, Tejun Heo, Christoph Lameter, linux-kernel,
	stable, Nico Pache

On 13.12.21 16:06, Michal Hocko wrote:
> On Thu 09-12-21 11:48:42, Michal Hocko wrote:
> [...]
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 852041f6be41..2d38a431f62f 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1161,19 +1161,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
>>  }
>>  
>>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
>> -static pg_data_t __ref *hotadd_new_pgdat(int nid)
>> +static pg_data_t __ref *hotadd_init_pgdat(int nid)
>>  {
>>  	struct pglist_data *pgdat;
>>  
>>  	pgdat = NODE_DATA(nid);
>> -	if (!pgdat) {
>> -		pgdat = arch_alloc_nodedata(nid);
>> -		if (!pgdat)
>> -			return NULL;
>>  
>> +	/*
>> +	 * NODE_DATA is preallocated (free_area_init) but its internal
>> +	 * state is not allocated completely. Add missing pieces.
>> +	 * Completely offline nodes stay around and they just need
>> +	 * reintialization.
>> +	 */
>> +	if (!pgdat->per_cpu_nodestats) {
>>  		pgdat->per_cpu_nodestats =
>>  			alloc_percpu(struct per_cpu_nodestat);
>> -		arch_refresh_nodedata(nid, pgdat);
> 
> This should really be 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 42211485bcf3..2daa88ce8c80 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1173,7 +1173,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
>  	 * Completely offline nodes stay around and they just need
>  	 * reintialization.
>  	 */
> -	if (!pgdat->per_cpu_nodestats) {
> +	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
>  	} else {
> 

I'll try giving this some churn later this week -- busy with other stuff.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-13 15:07                                                 ` David Hildenbrand
@ 2021-12-14  8:38                                                   ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-14  8:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexey Makhalov, Dennis Zhou, Eric Dumazet, linux-mm,
	Andrew Morton, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable, Nico Pache

On Mon 13-12-21 16:07:18, David Hildenbrand wrote:
> On 13.12.21 16:06, Michal Hocko wrote:
> > On Thu 09-12-21 11:48:42, Michal Hocko wrote:
> > [...]
> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >> index 852041f6be41..2d38a431f62f 100644
> >> --- a/mm/memory_hotplug.c
> >> +++ b/mm/memory_hotplug.c
> >> @@ -1161,19 +1161,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
> >>  }
> >>  
> >>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> >> -static pg_data_t __ref *hotadd_new_pgdat(int nid)
> >> +static pg_data_t __ref *hotadd_init_pgdat(int nid)
> >>  {
> >>  	struct pglist_data *pgdat;
> >>  
> >>  	pgdat = NODE_DATA(nid);
> >> -	if (!pgdat) {
> >> -		pgdat = arch_alloc_nodedata(nid);
> >> -		if (!pgdat)
> >> -			return NULL;
> >>  
> >> +	/*
> >> +	 * NODE_DATA is preallocated (free_area_init) but its internal
> >> +	 * state is not allocated completely. Add missing pieces.
> >> +	 * Completely offline nodes stay around and they just need
> >> +	 * reintialization.
> >> +	 */
> >> +	if (!pgdat->per_cpu_nodestats) {
> >>  		pgdat->per_cpu_nodestats =
> >>  			alloc_percpu(struct per_cpu_nodestat);
> >> -		arch_refresh_nodedata(nid, pgdat);
> > 
> > This should really be 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 42211485bcf3..2daa88ce8c80 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1173,7 +1173,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
> >  	 * Completely offline nodes stay around and they just need
> >  	 * reintialization.
> >  	 */
> > -	if (!pgdat->per_cpu_nodestats) {
> > +	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
> >  		pgdat->per_cpu_nodestats =
> >  			alloc_percpu(struct per_cpu_nodestat);
> >  	} else {
> > 
> 
> I'll try giving this some churn later this week -- busy with other stuff.

Please hang on, this still needs to be done slightly differently. I will
post something more closely resembling a final patch later today. For
testing purposes this should be sufficient for now.
-- 
Michal Hocko
SUSE Labs

* [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2021-12-09 10:48                                             ` Michal Hocko
  2021-12-13 15:06                                               ` Michal Hocko
@ 2021-12-14 10:07                                               ` Michal Hocko
  2021-12-14 10:07                                                 ` [PATCH v2 1/4] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG Michal Hocko
                                                                   ` (5 more replies)
  1 sibling, 6 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-14 10:07 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Alexey Makhalov
  Cc: LKML, linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache

Hi,
this should be the full bundle for now. I have ended up with 4 patches.
The primary fix is patch 2 (should be reasonably easy to backport to
older kernels if there is any need for that). Patches 3 and 4 are mere
clean ups.

I will repost once this can get some testing from Alexey. Shouldn't be
too much different from http://lkml.kernel.org/r/YbHfBgPQMkjtuHYF@dhcp22.suse.cz
with the follow up fix squashed in.

I would really appreciate hearing more about http://lkml.kernel.org/r/YbMZsczMGpChaWz0@dhcp22.suse.cz
because I would like to add that information to the changelog as well.

Thanks for the review and testing.



* [PATCH v2 1/4] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
@ 2021-12-14 10:07                                                 ` Michal Hocko
  2021-12-14 10:07                                                 ` [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully Michal Hocko
                                                                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-14 10:07 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Alexey Makhalov
  Cc: LKML, linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is a preparatory patch and it doesn't introduce any functional
change. It merely moves arch_alloc_nodedata (and co) out of the
CONFIG_MEMORY_HOTPLUG section because the following patch will need to
call it from the generic MM code.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/ia64/mm/discontig.c       |   2 -
 include/linux/memory_hotplug.h | 119 ++++++++++++++++-----------------
 2 files changed, 59 insertions(+), 62 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 791d4176e4a6..8dc8a554f774 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -608,7 +608,6 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
 pg_data_t *arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
@@ -626,7 +625,6 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
 	pgdat_list[update_node] = update_pgdat;
 	scatter_node_data();
 }
-#endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index be48e003a518..4355983b364d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,6 +16,65 @@ struct memory_group;
 struct resource;
 struct vmem_altmap;
 
+#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
+/*
+ * For supporting node-hotadd, we have to allocate a new pgdat.
+ *
+ * If an arch has generic style NODE_DATA(),
+ * node_data[nid] = kzalloc() works well. But it depends on the architecture.
+ *
+ * In general, generic_alloc_nodedata() is used.
+ * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
+ *
+ */
+extern pg_data_t *arch_alloc_nodedata(int nid);
+extern void arch_free_nodedata(pg_data_t *pgdat);
+extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
+
+#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
+#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
+#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
+
+#ifdef CONFIG_NUMA
+/*
+ * XXX: node aware allocation can't work well to get new node's memory at this time.
+ *	Because, pgdat for the new node is not allocated/initialized yet itself.
+ *	To use new node's memory, more consideration will be necessary.
+ */
+#define generic_alloc_nodedata(nid)				\
+({								\
+	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+})
+/*
+ * This definition is just for error path in node hotadd.
+ * For node hotremove, we have to replace this.
+ */
+#define generic_free_nodedata(pgdat)	kfree(pgdat)
+
+extern pg_data_t *node_data[];
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+	node_data[nid] = pgdat;
+}
+
+#else /* !CONFIG_NUMA */
+
+/* never called */
+static inline pg_data_t *generic_alloc_nodedata(int nid)
+{
+	BUG();
+	return NULL;
+}
+static inline void generic_free_nodedata(pg_data_t *pgdat)
+{
+}
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
 
@@ -154,66 +213,6 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	      struct mhp_params *params);
 #endif /* ARCH_HAS_ADD_PAGES */
 
-#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
-/*
- * For supporting node-hotadd, we have to allocate a new pgdat.
- *
- * If an arch has generic style NODE_DATA(),
- * node_data[nid] = kzalloc() works well. But it depends on the architecture.
- *
- * In general, generic_alloc_nodedata() is used.
- * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
- *
- */
-extern pg_data_t *arch_alloc_nodedata(int nid);
-extern void arch_free_nodedata(pg_data_t *pgdat);
-extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
-
-#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
-#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
-#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
-
-#ifdef CONFIG_NUMA
-/*
- * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
- * XXX: kmalloc_node() can't work well to get new node's memory at this time.
- *	Because, pgdat for the new node is not allocated/initialized yet itself.
- *	To use new node's memory, more consideration will be necessary.
- */
-#define generic_alloc_nodedata(nid)				\
-({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
-})
-/*
- * This definition is just for error path in node hotadd.
- * For node hotremove, we have to replace this.
- */
-#define generic_free_nodedata(pgdat)	kfree(pgdat)
-
-extern pg_data_t *node_data[];
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-	node_data[nid] = pgdat;
-}
-
-#else /* !CONFIG_NUMA */
-
-/* never called */
-static inline pg_data_t *generic_alloc_nodedata(int nid)
-{
-	BUG();
-	return NULL;
-}
-static inline void generic_free_nodedata(pg_data_t *pgdat)
-{
-}
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-}
-#endif /* CONFIG_NUMA */
-#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
 void get_online_mems(void);
 void put_online_mems(void);
 
-- 
2.30.2


* [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
  2021-12-14 10:07                                                 ` [PATCH v2 1/4] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG Michal Hocko
@ 2021-12-14 10:07                                                 ` Michal Hocko
  2021-12-14 10:33                                                   ` Christoph Lameter
  2021-12-15  4:47                                                   ` kernel test robot
  2021-12-14 10:07                                                 ` [PATCH v2 3/4] mm, memory_hotplug: drop arch_free_nodedata Michal Hocko
                                                                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-14 10:07 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Alexey Makhalov
  Cc: LKML, linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

We have had several reports [1][2][3] that the page allocator blows up
when an allocation from a possible node is requested. The underlying
reason is that NODE_DATA for the specific node is not allocated.

NUMA specific initialization is arch specific and it can vary a lot.
E.g. x86 tries to initialize all nodes that have some cpu affinity (see
init_cpu_to_node) but this can be insufficient because the node might be
cpuless for example.

One way to address this problem would be to check for !node_online nodes
when trying to get a zonelist and silently fall back to another node.
Unfortunately, that adds a branch to the allocator hot path and it
doesn't handle any other potential NODE_DATA users.
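
For illustration only, the rejected alternative could look roughly like
the userspace model below. model_node_zonelist, the fallback_nid
parameter and the trimmed-down structures are hypothetical. The point is
just where the extra branch would sit.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 4

/* Trimmed-down stand-ins for the kernel structures. */
struct zonelist {
	int nid;
};

struct pglist_data {
	struct zonelist node_zonelists;
};

struct pglist_data *node_data[MAX_NUMNODES];
bool node_online_map[MAX_NUMNODES];

/*
 * Zonelist lookup with a silent fallback. The online check runs on
 * every call, which is the hot-path cost described above.
 */
struct zonelist *model_node_zonelist(int nid, int fallback_nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	assert(fallback_nid >= 0 && fallback_nid < MAX_NUMNODES);
	assert(node_online_map[fallback_nid] && node_data[fallback_nid]);

	if (!node_online_map[nid])	/* extra branch per allocation */
		nid = fallback_nid;

	return &node_data[nid]->node_zonelists;
}
```

Any other code that dereferences node_data[nid] directly would still
need its own check, which is the second drawback mentioned above.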

This patch takes a different approach (following the lead of [3]) and
preallocates pgdat for all possible nodes in arch-independent code,
free_area_init. All uninitialized nodes are treated as memoryless
nodes. The node_state of the node is not changed because that would lead
to other side effects, e.g. a sysfs representation of such a node, and
from past discussions [4] it is known that some tools might have
problems digesting that.

Newly allocated pgdat only gets a minimal initialization and the rest of
the work is expected to be done by the memory hotplug - hotadd_new_pgdat
(renamed to hotadd_init_pgdat).

generic_alloc_nodedata is changed to use the memblock allocator because
neither page nor slab allocators are available at the stage when all
pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can
use the early boot allocator. The only arch specific implementation is
ia64 and that is changed to use the early allocator as well.
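
As a rough userspace model of the approach described above (not the
kernel code itself): model_free_area_init, the bool arrays standing in
for the node masks and the reduced struct pglist_data are assumptions
made for this sketch, and calloc stands in for the early boot allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_NUMNODES 4

/* Reduced stand-in for the kernel's pglist_data. */
struct pglist_data {
	int node_id;
	unsigned long node_present_pages;
};

struct pglist_data *node_data[MAX_NUMNODES];
bool node_possible_map[MAX_NUMNODES];
bool node_online_map[MAX_NUMNODES];

/*
 * Walk all possible nodes. Online nodes already have a pgdat from the
 * platform code. Every other possible node gets a zeroed pgdat, i.e. it
 * looks like a memoryless node, and is deliberately left offline.
 * Returns the number of nodes preallocated by this call.
 */
int model_free_area_init(void)
{
	int preallocated = 0;

	for (int nid = 0; nid < MAX_NUMNODES; nid++) {
		struct pglist_data *pgdat;

		if (!node_possible_map[nid])
			continue;

		if (node_online_map[nid]) {
			assert(node_data[nid]);	/* set up by the platform */
			continue;
		}

		if (node_data[nid])
			continue;		/* already preallocated */

		pgdat = calloc(1, sizeof(*pgdat));
		if (!pgdat)
			continue;		/* report and move on */

		pgdat->node_id = nid;
		node_data[nid] = pgdat;		/* now visible to lookups */
		preallocated++;
		/* node_online_map[nid] stays false on purpose */
	}

	return preallocated;
}
```

After the call, a lookup of node_data[nid] for any possible node finds a
valid, empty pgdat, while the online mask, and with it what userspace
sees, is unchanged.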

Reported-by: Alexey Makhalov <amakhalov@vmware.com>
Reported-by: Nico Pache <npache@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
[2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
[3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
---
 arch/ia64/mm/discontig.c       |  2 +-
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 21 +++++++++------------
 mm/page_alloc.c                | 34 +++++++++++++++++++++++++++++++---
 4 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 8dc8a554f774..b4c46925792f 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -612,7 +612,7 @@ pg_data_t *arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
 
-	return kzalloc(size, GFP_KERNEL);
+	return memblock_alloc(size, SMP_CACHE_BYTES);
 }
 
 void arch_free_nodedata(pg_data_t *pgdat)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4355983b364d..cdd66bfdf855 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -44,7 +44,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 852041f6be41..9009a7b2a170 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1161,19 +1161,21 @@ static void reset_node_present_pages(pg_data_t *pgdat)
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-static pg_data_t __ref *hotadd_new_pgdat(int nid)
+static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
 	struct pglist_data *pgdat;
 
 	pgdat = NODE_DATA(nid);
-	if (!pgdat) {
-		pgdat = arch_alloc_nodedata(nid);
-		if (!pgdat)
-			return NULL;
 
+	/*
+	 * NODE_DATA is preallocated (free_area_init) but its internal
+	 * state is not allocated completely. Add missing pieces.
+	 * Completely offline nodes stay around and they just need
+	 * reintialization.
+	 */
+	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
-		arch_refresh_nodedata(nid, pgdat);
 	} else {
 		int cpu;
 		/*
@@ -1192,8 +1194,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid)
 		}
 	}
 
-	/* we can use NODE_DATA(nid) from here */
-	pgdat->node_id = nid;
 	pgdat->node_start_pfn = 0;
 
 	/* init node's zones as empty zones, we don't have any present pages.*/
@@ -1245,7 +1245,7 @@ static int __try_online_node(int nid, bool set_node_online)
 	if (node_online(nid))
 		return 0;
 
-	pgdat = hotadd_new_pgdat(nid);
+	pgdat = hotadd_init_pgdat(nid);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
 		ret = -ENOMEM;
@@ -1444,9 +1444,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 
 	return ret;
 error:
-	/* rollback pgdat allocation and others */
-	if (new_node)
-		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
 error_mem_hotplug_end:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..f2ceffadf4eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6382,7 +6382,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
 			build_zonelists(pgdat);
@@ -8032,8 +8036,32 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
+
+			/* Allocator not initialized yet */
+			pgdat = arch_alloc_nodedata(nid);
+			if (!pgdat) {
+				pr_err("Cannot allocate %zuB for node %d.\n",
+						sizeof(*pgdat), nid);
+				continue;
+			}
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_memoryless_node(nid);
+			/*
+			 * not marking this node online because we do not want to
+			 * confuse userspace by sysfs files/directories for node
+			 * without any memory attached to it (see topology_init)
+			 * The pgdat will get fully initialized when a memory is
+			 * hotpluged into it by hotadd_init_pgdat
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);
 
 		/* Any memory on that node */
-- 
2.30.2


* [PATCH v2 3/4] mm, memory_hotplug: drop arch_free_nodedata
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
  2021-12-14 10:07                                                 ` [PATCH v2 1/4] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG Michal Hocko
  2021-12-14 10:07                                                 ` [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully Michal Hocko
@ 2021-12-14 10:07                                                 ` Michal Hocko
  2021-12-14 10:07                                                 ` [PATCH v2 4/4] mm, memory_hotplug: reorganize new pgdat initialization Michal Hocko
                                                                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-14 10:07 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Alexey Makhalov
  Cc: LKML, linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug
used to allocate pgdat when memory was added to a node
(hotadd_init_pgdat). arch_free_nodedata was only used in the
failure path because once the pgdat is exported (made visible
via NODE_DATA(nid)) it cannot really be freed, as there is no
synchronization available for that.

pgdat is now allocated for each possible node, so memory hotplug
never needs to use arch_free_nodedata. Drop it.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/ia64/mm/discontig.c       |  5 -----
 include/linux/memory_hotplug.h |  3 ---
 mm/memory_hotplug.c            | 10 ----------
 3 files changed, 18 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index b4c46925792f..f177390fdee1 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -615,11 +615,6 @@ pg_data_t *arch_alloc_nodedata(int nid)
 	return memblock_alloc(size, SMP_CACHE_BYTES);
 }
 
-void arch_free_nodedata(pg_data_t *pgdat)
-{
-	kfree(pgdat);
-}
-
 void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
 {
 	pgdat_list[update_node] = update_pgdat;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index cdd66bfdf855..60f09d3ebb3d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -24,17 +24,14 @@ struct vmem_altmap;
  * node_data[nid] = kzalloc() works well. But it depends on the architecture.
  *
  * In general, generic_alloc_nodedata() is used.
- * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
  *
  */
 extern pg_data_t *arch_alloc_nodedata(int nid);
-extern void arch_free_nodedata(pg_data_t *pgdat);
 extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
 
 #else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
 #define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
-#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
 
 #ifdef CONFIG_NUMA
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9009a7b2a170..2daa88ce8c80 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1216,16 +1216,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
 	return pgdat;
 }
 
-static void rollback_node_hotadd(int nid)
-{
-	pg_data_t *pgdat = NODE_DATA(nid);
-
-	arch_refresh_nodedata(nid, NULL);
-	free_percpu(pgdat->per_cpu_nodestats);
-	arch_free_nodedata(pgdat);
-}
-
-
 /*
  * __try_online_node - online a node if offlined
  * @nid: the node ID
-- 
2.30.2


^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v2 4/4] mm, memory_hotplug: reorganize new pgdat initialization
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
                                                                   ` (2 preceding siblings ...)
  2021-12-14 10:07                                                 ` [PATCH v2 3/4] mm, memory_hotplug: drop arch_free_nodedata Michal Hocko
@ 2021-12-14 10:07                                                 ` Michal Hocko
  2021-12-17 14:51                                                 ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully David Hildenbrand
  2022-01-10 21:16                                                 ` Rafael Aquini
  5 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-14 10:07 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Alexey Makhalov
  Cc: LKML, linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When a !node_online node is brought up it needs a hotplug specific
initialization because the node could either be not yet initialized or
have been recycled after a previous hotremove. hotadd_init_pgdat is
responsible for that.

Internal pgdat state is currently initialized in two places:
	- hotadd_init_pgdat
	- free_area_init_core_hotplug
There is no clear rule for what should go where, but this patch chooses to
move the whole internal state initialization into free_area_init_core_hotplug.
hotadd_init_pgdat is still responsible for pulling all the parts together,
most notably for initializing zonelists because those depend on the overall
topology.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 28 +++-------------------------
 mm/page_alloc.c                | 25 +++++++++++++++++++++++--
 3 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 60f09d3ebb3d..76bf2de86def 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
 extern void clear_zone_contiguous(struct zone *zone);
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-extern void __ref free_area_init_core_hotplug(int nid);
+extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
 extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
 extern int add_memory_resource(int nid, struct resource *resource,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2daa88ce8c80..ddae307152b8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1165,39 +1165,16 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
 {
 	struct pglist_data *pgdat;
 
-	pgdat = NODE_DATA(nid);
-
 	/*
 	 * NODE_DATA is preallocated (free_area_init) but its internal
 	 * state is not allocated completely. Add missing pieces.
 	 * Completely offline nodes stay around and they just need
 	 * reintialization.
 	 */
-	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
-		pgdat->per_cpu_nodestats =
-			alloc_percpu(struct per_cpu_nodestat);
-	} else {
-		int cpu;
-		/*
-		 * Reset the nr_zones, order and highest_zoneidx before reuse.
-		 * Note that kswapd will init kswapd_highest_zoneidx properly
-		 * when it starts in the near future.
-		 */
-		pgdat->nr_zones = 0;
-		pgdat->kswapd_order = 0;
-		pgdat->kswapd_highest_zoneidx = 0;
-		for_each_online_cpu(cpu) {
-			struct per_cpu_nodestat *p;
-
-			p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
-			memset(p, 0, sizeof(*p));
-		}
-	}
-
-	pgdat->node_start_pfn = 0;
+	pgdat = NODE_DATA(nid);
 
 	/* init node's zones as empty zones, we don't have any present pages.*/
-	free_area_init_core_hotplug(nid);
+	free_area_init_core_hotplug(pgdat);
 
 	/*
 	 * The node we allocated has no zone fallback lists. For avoiding
@@ -1209,6 +1186,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
 	 * When memory is hot-added, all the memory is in offline state. So
 	 * clear all zones' present_pages because they will be updated in
 	 * online_pages() and offline_pages().
+	 * TODO: should be in free_area_init_core_hotplug?
 	 */
 	reset_node_managed_pages(pgdat);
 	reset_node_present_pages(pgdat);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2ceffadf4eb..34743dcd2d66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7442,12 +7442,33 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
  * NOTE: this function is only called during memory hotplug
  */
 #ifdef CONFIG_MEMORY_HOTPLUG
-void __ref free_area_init_core_hotplug(int nid)
+void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
 {
+	int nid = pgdat->node_id;
 	enum zone_type z;
-	pg_data_t *pgdat = NODE_DATA(nid);
+	int cpu;
 
 	pgdat_init_internals(pgdat);
+
+	if (pgdat->per_cpu_nodestats == &boot_nodestats)
+		pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);
+
+	/*
+	 * Reset the nr_zones, order and highest_zoneidx before reuse.
+	 * Note that kswapd will init kswapd_highest_zoneidx properly
+	 * when it starts in the near future.
+	 */
+	pgdat->nr_zones = 0;
+	pgdat->kswapd_order = 0;
+	pgdat->kswapd_highest_zoneidx = 0;
+	pgdat->node_start_pfn = 0;
+	for_each_online_cpu(cpu) {
+		struct per_cpu_nodestat *p;
+
+		p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
+		memset(p, 0, sizeof(*p));
+	}
+
 	for (z = 0; z < MAX_NR_ZONES; z++)
 		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
 }
-- 
2.30.2
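
The consolidated initialization in free_area_init_core_hotplug() follows an
"allocate on first use, reset on every use" pattern keyed on a boot-time
placeholder. A minimal userspace sketch of that pattern, assuming
hypothetical model_* names, a fixed MODEL_NR_CPUS and calloc() in place of
alloc_percpu(); the allocation failure check is an addition of this sketch
and is not part of the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NR_CPUS 4		/* hypothetical CPU count */

struct model_nodestat {
	long counters[4];
};

/* Shared placeholder a node points at until it is hotplugged the first time. */
static struct model_nodestat model_boot_stats[MODEL_NR_CPUS];

struct model_pgdat {
	int nr_zones;
	int kswapd_order;
	unsigned long node_start_pfn;
	struct model_nodestat *stats;
};

/* Returns 0 on success, -1 if the first-time allocation fails. */
static int model_init_core_hotplug(struct model_pgdat *pgdat)
{
	assert(pgdat != NULL);

	/* First bring-up: replace the shared placeholder with private counters. */
	if (pgdat->stats == model_boot_stats) {
		struct model_nodestat *stats;

		stats = calloc(MODEL_NR_CPUS, sizeof(*stats));
		if (!stats)
			return -1;
		pgdat->stats = stats;
	}

	/* Always reset state, so a recycled node starts from scratch as well. */
	pgdat->nr_zones = 0;
	pgdat->kswapd_order = 0;
	pgdat->node_start_pfn = 0;
	memset(pgdat->stats, 0, MODEL_NR_CPUS * sizeof(*pgdat->stats));
	return 0;
}
```

Calling model_init_core_hotplug() a second time on the same descriptor keeps
the already allocated counters and only clears them, which corresponds to the
"recycled after hotremove" case described in the changelog.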


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2021-12-14 10:07                                                 ` [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully Michal Hocko
@ 2021-12-14 10:33                                                   ` Christoph Lameter
  2021-12-14 10:38                                                     ` Michal Hocko
  2021-12-15  4:47                                                   ` kernel test robot
  1 sibling, 1 reply; 98+ messages in thread
From: Christoph Lameter @ 2021-12-14 10:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, Alexey Makhalov, LKML,
	linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Nico Pache, Michal Hocko

On Tue, 14 Dec 2021, Michal Hocko wrote:

> This patch takes a different approach (following a lead of [3]) and it
> pre allocates pgdat for all possible nodes in arch independent code
> - free_area_init. All uninitialized nodes are treated as memoryless
> nodes. node_state of the node is not changed because that would lead to
> other side effects - e.g. sysfs representation of such a node and from
> past discussions [4] it is known that some tools might have problems
> digesting that.

Would it be possible to define a pgdat statically and place it in read
only memory? Populate with values that ensure that the page allocator
does not blow up but does a defined fallback.

Point the pgdat for all nodes not online to that readonly pgdat?

Maybe that would save some memory. When the node comes online then a real
pgdat could be allocated.
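
To make the suggestion concrete, here is a rough userspace sketch of the
idea. It is not kernel code and every model_* name is hypothetical: all nodes
that are not online point at one shared const descriptor, and a private
descriptor is allocated only when a node is brought online.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_MAX_NODES 8	/* hypothetical possible-node count */

struct model_pgdat {
	int node_id;
	unsigned long present_pages;
};

/* One immutable descriptor shared by every node that is not online. */
static const struct model_pgdat model_offline_pgdat = {
	.node_id = -1,		/* a shared descriptor cannot carry a per-node id */
	.present_pages = 0,
};

static const struct model_pgdat *model_node_data[MODEL_MAX_NODES];

static void model_point_all_at_placeholder(void)
{
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		model_node_data[nid] = &model_offline_pgdat;
}

/* Swap the placeholder for a real descriptor. Returns 0 on success, -1 on failure. */
static int model_node_bring_online(int nid, unsigned long pages)
{
	struct model_pgdat *pgdat;

	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	if (model_node_data[nid] != &model_offline_pgdat)
		return 0;	/* already has a private descriptor */

	pgdat = calloc(1, sizeof(*pgdat));
	if (!pgdat)
		return -1;
	pgdat->node_id = nid;
	pgdat->present_pages = pages;
	model_node_data[nid] = pgdat;
	return 0;
}

/* Safe for any node id in range: offline nodes simply report no memory. */
static bool model_node_has_memory(int nid)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	return model_node_data[nid]->present_pages > 0;
}
```

Even in this toy model the shared descriptor cannot hold a per-node id, so
any reader that depends on per-node fields would need special handling.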


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2021-12-14 10:33                                                   ` Christoph Lameter
@ 2021-12-14 10:38                                                     ` Michal Hocko
  2022-01-14  0:24                                                       ` Wei Yang
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-14 10:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, David Hildenbrand, Alexey Makhalov, LKML,
	linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Nico Pache

On Tue 14-12-21 11:33:41, Christoph Lameter wrote:
> On Tue, 14 Dec 2021, Michal Hocko wrote:
> 
> > This patch takes a different approach (following a lead of [3]) and it
> > pre allocates pgdat for all possible nodes in arch independent code
> > - free_area_init. All uninitialized nodes are treated as memoryless
> > nodes. node_state of the node is not changed because that would lead to
> > other side effects - e.g. sysfs representation of such a node and from
> > past discussions [4] it is known that some tools might have problems
> > digesting that.
> 
> Would it be possible to define a pgdat statically and place it in read
> only memory? Populate with values that ensure that the page allocator
> does not blow up but does a defined fallback.
> 
> Point the pgdat for all nodes not online to that readonly pgdat?
> 
> Maybe that would save some memory. When the node comes online then a real
> pgdat could be allocated.

This is certainly possible but it is also more complex. I am aiming for
the simplest possible solution at this stage. The reason I am not that
concerned about memory overhead (even though the pgdat is a large data
structure) is that these unpopulated nodes are rather rare. We might see
more of them in the future but we are not quite there yet, so I do not
think this is a major obstacle for now.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2021-12-14 10:07                                                 ` [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully Michal Hocko
  2021-12-14 10:33                                                   ` Christoph Lameter
@ 2021-12-15  4:47                                                   ` kernel test robot
  2021-12-15 10:12                                                     ` Michal Hocko
  1 sibling, 1 reply; 98+ messages in thread
From: kernel test robot @ 2021-12-15  4:47 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, David Hildenbrand, Alexey Makhalov
  Cc: kbuild-all, Linux Memory Management List, LKML, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter

Hi Michal,

I love your patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memory_hotplug-make-arch_alloc_nodedata-independent-on-CONFIG_MEMORY_HOTPLUG/20211214-190817
base:   https://github.com/hnaz/linux-mm master
config: ia64-defconfig (https://download.01.org/0day-ci/archive/20211215/202112151219.xAI8NaQR-lkp@intel.com/config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/65c560a3ac2561750c1dc71213f042e660b9bbc0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Michal-Hocko/mm-memory_hotplug-make-arch_alloc_nodedata-independent-on-CONFIG_MEMORY_HOTPLUG/20211214-190817
        git checkout 65c560a3ac2561750c1dc71213f042e660b9bbc0
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> WARNING: modpost: vmlinux.o(.text+0x566a2): Section mismatch in reference from the function arch_alloc_nodedata() to the function .init.text:memblock_alloc_try_nid()
The function arch_alloc_nodedata() references
the function __init memblock_alloc_try_nid().
This is often because arch_alloc_nodedata lacks a __init
annotation or the annotation of memblock_alloc_try_nid is wrong.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2021-12-15  4:47                                                   ` kernel test robot
@ 2021-12-15 10:12                                                     ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-15 10:12 UTC (permalink / raw)
  To: kernel test robot
  Cc: Andrew Morton, David Hildenbrand, Alexey Makhalov, kbuild-all,
	Linux Memory Management List, LKML, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Christoph Lameter

On Wed 15-12-21 12:47:16, kernel test robot wrote:
> Hi Michal,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on hnaz-mm/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memory_hotplug-make-arch_alloc_nodedata-independent-on-CONFIG_MEMORY_HOTPLUG/20211214-190817
> base:   https://github.com/hnaz/linux-mm master
> config: ia64-defconfig (https://download.01.org/0day-ci/archive/20211215/202112151219.xAI8NaQR-lkp@intel.com/config)
> compiler: ia64-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/0day-ci/linux/commit/65c560a3ac2561750c1dc71213f042e660b9bbc0
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Michal-Hocko/mm-memory_hotplug-make-arch_alloc_nodedata-independent-on-CONFIG_MEMORY_HOTPLUG/20211214-190817
>         git checkout 65c560a3ac2561750c1dc71213f042e660b9bbc0
>         # save the config file to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>, old ones prefixed by <<):
> 
> >> WARNING: modpost: vmlinux.o(.text+0x566a2): Section mismatch in reference from the function arch_alloc_nodedata() to the function .init.text:memblock_alloc_try_nid()
> The function arch_alloc_nodedata() references
> the function __init memblock_alloc_try_nid().
> This is often because arch_alloc_nodedata lacks a __init
> annotation or the annotation of memblock_alloc_try_nid is wrong.

Thanks for the report. This should do the trick. I will fold it into the
patch.

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index b4c46925792f..dd0cf4834eaa 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -608,7 +608,7 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
 
-pg_data_t *arch_alloc_nodedata(int nid)
+pg_data_t * __init arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-10  9:11                                                             ` Michal Hocko
@ 2021-12-17 12:53                                                               ` Michal Hocko
  2021-12-21  5:46                                                                 ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-17 12:53 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Fri 10-12-21 10:11:14, Michal Hocko wrote:
> On Thu 09-12-21 19:01:03, Alexey Makhalov wrote:
> > 
> > 
> > > On Dec 9, 2021, at 5:29 AM, Michal Hocko <mhocko@suse.com> wrote:
> > > 
> > > On Thu 09-12-21 10:23:52, Alexey Makhalov wrote:
> > >> 
> > >> 
> > >>> On Dec 9, 2021, at 1:56 AM, Michal Hocko <mhocko@suse.com> wrote:
> > >>> 
> > >>> On Thu 09-12-21 09:28:55, Alexey Makhalov wrote:
> > >>>> 
> > >>>> 
> > >>>> [    0.081777] Node 4 uninitialized by the platform. Please report with boot dmesg.
> > >>>> [    0.081790] Initmem setup node 4 [mem 0x0000000000000000-0x0000000000000000]
> > >>>> ...
> > >>>> [    0.086441] Node 127 uninitialized by the platform. Please report with boot dmesg.
> > >>>> [    0.086454] Initmem setup node 127 [mem 0x0000000000000000-0x0000000000000000]
> > >>> 
> > >>> Interesting that only those two didn't get a proper arch specific
> > >>> initialization. Could you check why? I assume init_cpu_to_node
> > >>> doesn't see any CPU pointing at this node. Wondering why that would be
> > >>> the case but that can be a bug in the affinity tables.
> > >> 
> > >> Sorry, I trimmed the log too much. Not just these 2, but all possible but not present nodes from 4 to 127
> > >> print this message.
> > > 
> > > Does that mean that your possible (but offline) cpus do not set their
> > > affinity?
> > > 
> > Hi Michal,
> > 
> > I didn’t quite get the question here. Do you mean scheduler affinity for offlined/not present CPUs?
> > From the patch, this message should be printed for every possible offlined node:
> > 	for_each_node(nid) {
> > ...
> > 		if (!node_online(nid)) {
> > 			pr_warn("Node %d uninitialized by the platform. Please report with boot dmesg.\n", nid);
> 
> Sure, let me expand on this a bit. X86 initialization code
> (init_cpu_to_node) does
>         for_each_possible_cpu(cpu) {
>                 int node = numa_cpu_node(cpu);
> 
>                 if (node == NUMA_NO_NODE)
>                         continue;
> 
>                 if (!node_online(node))
>                         init_memory_less_node(node);
> 
>                 numa_set_node(cpu, node);
>         }
> 
> which means that a memoryless node is not initialized when either
> 	- your offline CPUs are not listed in possible cpus for some
> 	  reason
> 	- or they do not have any node affinity (numa_cpu_node is
> 	  NUMA_NO_NODE).
> 
> Could you check what is the reason in your particular case please?

Did you have time to look into this Alexey?
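
For reference, the two cases above can be reproduced with a small userspace
model of the quoted loop. All model_* names and the array-based inputs are
assumptions of this sketch, not x86 code:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS	4	/* hypothetical CPU id space */
#define MODEL_MAX_NODES	4	/* hypothetical node id space */
#define MODEL_NO_NODE	(-1)	/* stands in for NUMA_NO_NODE */

/*
 * Model of the quoted loop: a memoryless node is initialized only if some
 * possible CPU reports an affinity to it. Fills node_initialized[] and
 * returns the number of nodes that were left uninitialized.
 */
static int model_init_cpu_to_node(const int cpu_node[MODEL_NR_CPUS],
				  const bool cpu_possible[MODEL_NR_CPUS],
				  const bool node_online[MODEL_MAX_NODES],
				  bool node_initialized[MODEL_MAX_NODES])
{
	int missing = 0;

	/* Online nodes have been initialized by the platform already. */
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		node_initialized[nid] = node_online[nid];

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		int node;

		if (!cpu_possible[cpu])
			continue;	/* case 1: CPU not in the possible mask */

		node = cpu_node[cpu];
		if (node == MODEL_NO_NODE)
			continue;	/* case 2: CPU has no node affinity */

		assert(node >= 0 && node < MODEL_MAX_NODES);
		if (!node_online[node])
			node_initialized[node] = true;	/* init_memory_less_node() */
	}

	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		if (!node_initialized[nid])
			missing++;
	return missing;
}
```

A node that no possible CPU points at stays uninitialized in this model,
which would match the "uninitialized by the platform" messages for nodes 4
to 127 if either of the two conditions holds on the affected system.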
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
                                                                   ` (3 preceding siblings ...)
  2021-12-14 10:07                                                 ` [PATCH v2 4/4] mm, memory_hotplug: reorganize new pgdat initialization Michal Hocko
@ 2021-12-17 14:51                                                 ` David Hildenbrand
  2021-12-21  9:51                                                   ` Michal Hocko
  2022-01-10 21:16                                                 ` Rafael Aquini
  5 siblings, 1 reply; 98+ messages in thread
From: David Hildenbrand @ 2021-12-17 14:51 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Alexey Makhalov
  Cc: LKML, linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador,
	Tejun Heo, Christoph Lameter, Nico Pache

On 14.12.21 11:07, Michal Hocko wrote:
> Hi,
> this should be the full bundle for now. I have ended up with 4 patches.
> The primary fix is patch 2 (should be reasonably easy to backport to
> older kernels if there is any need for that). Patches 3 and 4 are mere
> clean ups.
> 
> I will repost once this can get some testing from Alexey. Shouldn't be
> too much different from http://lkml.kernel.org/r/YbHfBgPQMkjtuHYF@dhcp22.suse.cz
> with the follow up fix squashed in.
> 
> I would really appreciate to hear more about http://lkml.kernel.org/r/YbMZsczMGpChaWz0@dhcp22.suse.cz
> because I would like to add that information to the changelog as well.
> 
> Thanks for the review and testing.

Playing with memory hotplug only (only one hotpluggable node is possible
with QEMU right now, as only one will get added to SRAT with the hotplug
range).

Start with one empty node:

#! /bin/bash
sudo qemu/build/qemu-system-x86_64 \
    --enable-kvm \
    -m 8G,slots=2,maxmem=16G \
    -object memory-backend-ram,id=mem0,size=4G \
    -object memory-backend-ram,id=mem1,size=4G \
    -numa node,cpus=0-1,nodeid=0,memdev=mem0 \
    -numa node,cpus=2-3,nodeid=1,memdev=mem1 \
    -numa node,nodeid=2 \
    -smp 4 \
    -drive file=/home/dhildenb/git/Fedora-Cloud-Base-33-1.2.x86_64.qcow2,format=qcow2,if=virtio \
    -cpu host \
    -machine q35 \
    -nographic \
    -nodefaults \
    -monitor unix:/var/tmp/monitor,server,nowait \
    -chardev stdio,id=serial,signal=off \
    -device isa-serial,chardev=serial

1. Guest state when booting

[    0.002506] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.002508] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.002510] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[    0.002511] SRAT: PXM 1 -> APIC 0x03 -> Node 1
[    0.002513] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.002515] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[    0.002517] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x17fffffff]
[    0.002518] ACPI: SRAT: Node 1 PXM 1 [mem 0x180000000-0x27fffffff]
[    0.002520] ACPI: SRAT: Node 2 PXM 2 [mem 0x280000000-0x4ffffffff] hotplug
[    0.002523] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[    0.002525] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x17fffffff] -> [mem 0x00000000-0x17fffffff]
[    0.002533] NODE_DATA(0) allocated [mem 0x17ffd5000-0x17fffffff]
[    0.002716] NODE_DATA(1) allocated [mem 0x27ffd5000-0x27fffffff]
[    0.017960] Zone ranges:
[    0.017966]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.017969]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.017971]   Normal   [mem 0x0000000100000000-0x000000027fffffff]
[    0.017972]   Device   empty
[    0.017974] Movable zone start for each node
[    0.017976] Early memory node ranges
[    0.017977]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.017979]   node   0: [mem 0x0000000000100000-0x000000007ffd5fff]
[    0.017980]   node   0: [mem 0x0000000100000000-0x000000017fffffff]
[    0.017982]   node   1: [mem 0x0000000180000000-0x000000027fffffff]
[    0.017984] Initmem setup node 0 [mem 0x0000000000001000-0x000000017fffffff]
[    0.017990] Initmem setup node 1 [mem 0x0000000180000000-0x000000027fffffff]
[    0.017993] Node 2 uninitialized by the platform. Please report with boot dmesg.
[    0.018008] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
[    0.018011] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.018031] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.023622] On node 0, zone Normal: 42 pages in unavailable ranges

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 3921 MB
node 0 free: 3638 MB
node 1 cpus: 2 3
node 1 size: 4022 MB
node 1 free: 3519 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
# cat /sys/devices/system/node/online 
0-1
# cat /sys/devices/system/node/possible 
0-2


3. Hotplug a DIMM and online it to ZONE_MOVABLE

# echo online_movable > /sys/devices/system/memory/auto_online_blocks 


$ echo "object_add memory-backend-ram,id=hmem0,size=8G" | sudo nc -U /var/tmp/monitor ; echo
$ echo "device_add pc-dimm,id=dimm0,memdev=hmem0,node=2" | sudo nc -U /var/tmp/monitor ; echo


4. Guest state after hotplug

[  334.541452] Built 2 zonelists, mobility grouping on.  Total pages: 1999733
[  334.541908] Policy zone: Normal
[  334.559853] Fallback order for Node 0: 0 2 1 
[  334.560234] Fallback order for Node 1: 1 2 0 
[  334.560524] Fallback order for Node 2: 2 0 1 
[  334.560810] Built 3 zonelists, mobility grouping on.  Total pages: 2032501
[  334.561281] Policy zone: Normal

# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3921 MB
node 0 free: 3529 MB
node 1 cpus: 2 3
node 1 size: 4022 MB
node 1 free: 3564 MB
node 2 cpus:
node 2 size: 8192 MB
node 2 free: 8192 MB
node distances:
node   0   1   2 
  0:  10  20  20 
  1:  20  10  20 
  2:  20  20  10 
# cat /sys/devices/system/node/online 
0-2
# cat /sys/devices/system/node/possible 
0-2
# cat /sys/devices/system/node/has_memory 
0-2
# cat /sys/devices/system/node/has_normal_memory 
0-1
# cat /sys/devices/system/node/has_cpu 
0-1


5. Unplug DIMM

$ echo "device_del dimm0" | sudo nc -U /var/tmp/monitor ; echo


6. Guest state after unplug

[  494.218938] Fallback order for Node 0: 0 2 1 
[  494.219315] Fallback order for Node 1: 1 2 0 
[  494.219626] Fallback order for Node 2: 2 0 1 
[  494.220430] Built 3 zonelists, mobility grouping on.  Total pages: 1999736
[  494.221024] Policy zone: Normal

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 3921 MB
node 0 free: 3661 MB
node 1 cpus: 2 3
node 1 size: 4022 MB
node 1 free: 3565 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
# cat /sys/devices/system/node/online 
0-1
# cat /sys/devices/system/node/possible 
0-2


7. Hotplug DIMM + online to ZONE_NORMAL

# echo online_kernel > /sys/devices/system/memory/auto_online_blocks 

$ echo "device_add pc-dimm,id=dimm0,memdev=hmem0,node=2" | sudo nc -U /var/tmp/monitor ; echo


8. Guest state after hotplug

# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3921 MB
node 0 free: 3534 MB
node 1 cpus: 2 3
node 1 size: 4022 MB
node 1 free: 3567 MB
node 2 cpus:
node 2 size: 8192 MB
node 2 free: 8192 MB
node distances:
node   0   1   2 
  0:  10  20  20 
  1:  20  10  20 
  2:  20  20  10 

# cat /sys/devices/system/node/online 
0-2
# cat /sys/devices/system/node/possible 
0-2
# cat /sys/devices/system/node/has_memory 
0-2
# cat /sys/devices/system/node/has_normal_memory 
0-2
# cat /sys/devices/system/node/has_cpu
0-1



No surprises found so far. I'll be mostly offline for the next 2 weeks,
so an official review might take some more time.
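
The online/possible checks above could be automated with a small helper that
turns a sysfs node list such as "0-2" into a bitmask. This is only a sketch:
model_parse_node_list is a hypothetical name, and the limit of 64 node ids is
an assumption made so that the result fits into one unsigned long long.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a list such as "0-2" or "0,2-3" (optionally ending in a newline, as
 * read from sysfs) into a bitmask with one bit per node id.
 * Returns 0 on success, -1 on malformed input or ids outside 0..63.
 */
static int model_parse_node_list(const char *s, unsigned long long *mask)
{
	assert(s != NULL && mask != NULL);

	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		lo = strtol(s, &end, 10);
		if (end == s)
			return -1;	/* no digits where a node id was expected */
		hi = lo;
		s = end;

		if (*s == '-') {	/* range "lo-hi" */
			s++;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
			s = end;
		}

		if (lo < 0 || hi < lo || hi >= 64)
			return -1;
		for (long n = lo; n <= hi; n++)
			*mask |= 1ULL << n;

		if (*s == ',')
			s++;		/* next list element */
		else if (*s && *s != '\n')
			return -1;	/* unexpected trailing character */
	}
	return 0;
}
```

With the values from step 1, parsing "0-2" (possible) and "0-1" (online) and
computing possible & ~online yields 0x4, i.e. node 2 is the only possible
node that is not online.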

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-17 12:53                                                               ` Michal Hocko
@ 2021-12-21  5:46                                                                 ` Alexey Makhalov
  2021-12-21  9:46                                                                   ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-21  5:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

Hi Michal,

The patchset looks good to me. I didn’t find any issues during testing.
I have one concern regarding the dmesg output. Do you think this message is
appropriate when a possible node is not yet present?
Or is it only an issue for virtual machines?

  Node XX uninitialized by the platform. Please report with boot dmesg.
  Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]

Thanks,
—Alexey



^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-21  5:46                                                                 ` Alexey Makhalov
@ 2021-12-21  9:46                                                                   ` Michal Hocko
  2021-12-21 20:23                                                                     ` Alexey Makhalov
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-21  9:46 UTC (permalink / raw)
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 21-12-21 05:46:16, Alexey Makhalov wrote:
> Hi Michal,
> 
> The patchset looks good to me. I didn’t find any issues during the testing.

Thanks a lot. Can I add your Tested-by: tag?

> I have one concern regarding the dmesg output. Do you think this message is
> appropriate when a possible node is not yet present?
> Or is it only an issue for virtual machines?
> 
>   Node XX uninitialized by the platform. Please report with boot dmesg.
>   Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]

AFAIU the Initmem part of the output is what concerns you, right? Yeah,
that really is more cryptic than necessary. Does this look any better?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34743dcd2d66..7e18a924be7e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7618,9 +7618,14 @@ static void __init free_area_init_node(int nid)
 	pgdat->node_start_pfn = start_pfn;
 	pgdat->per_cpu_nodestats = NULL;
 
-	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
-		(u64)start_pfn << PAGE_SHIFT,
-		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
+	if (start_pfn != end_pfn) {
+		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
+			(u64)start_pfn << PAGE_SHIFT,
+			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
+	} else {
+		pr_info("Initmem setup node %d as memoryless\n", nid);
+	}
+
 	calculate_node_totalpages(pgdat, start_pfn, end_pfn);
 
 	alloc_node_mem_map(pgdat);
-- 
Michal Hocko
SUSE Labs
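
Note: the branch added by the diff above can be tried outside the kernel.
What follows is only a minimal userspace sketch, not the kernel code. The
helper name format_initmem_msg() and the fixed PAGE_SHIFT of 12 are
assumptions made for illustration. The explicit "0x%016" format stands in
for the kernel's %#018Lx, because the C library printf omits the 0x prefix
for a zero value, while the dmesg lines quoted in this thread show it. The
trailing newline of the kernel message is left out.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: builds the text that the patched
 * free_area_init_node() would log for node 'nid'.
 * Returns the snprintf() result.
 */
static int format_initmem_msg(char *buf, size_t len, int nid,
			      unsigned long start_pfn, unsigned long end_pfn)
{
	if (start_pfn != end_pfn) {
		uint64_t start = (uint64_t)start_pfn << PAGE_SHIFT;
		uint64_t end = end_pfn ?
			((uint64_t)end_pfn << PAGE_SHIFT) - 1 : 0;

		/* Node spans some memory: print the physical range. */
		return snprintf(buf, len,
				"Initmem setup node %d [mem 0x%016" PRIx64
				"-0x%016" PRIx64 "]", nid, start, end);
	}

	/* Empty span: say so instead of printing an all-zero range. */
	return snprintf(buf, len, "Initmem setup node %d as memoryless", nid);
}
```

Calling the helper with start_pfn equal to end_pfn yields the "as
memoryless" text; any other pair yields the "[mem ...]" range form.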


* Re: [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2021-12-17 14:51                                                 ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully David Hildenbrand
@ 2021-12-21  9:51                                                   ` Michal Hocko
  2022-01-02  7:14                                                     ` Mike Rapoport
  0 siblings, 1 reply; 98+ messages in thread
From: Michal Hocko @ 2021-12-21  9:51 UTC
  To: David Hildenbrand
  Cc: Andrew Morton, Alexey Makhalov, LKML, linux-mm, Dennis Zhou,
	Eric Dumazet, Oscar Salvador, Tejun Heo, Christoph Lameter,
	Nico Pache

On Fri 17-12-21 15:51:31, David Hildenbrand wrote:
[...]
> No surprises found so far. I'll be mostly offline for the next 2 weeks,
> so an official review might take some more time.

Thanks a lot for the testing and a very instructive step-by-step howto.
I will note it down.

Don't worry about the review and enjoy the xmas break. I will likely
resubmit early next year.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-21  9:46                                                                   ` Michal Hocko
@ 2021-12-21 20:23                                                                     ` Alexey Makhalov
  2021-12-22 11:41                                                                       ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Alexey Makhalov @ 2021-12-21 20:23 UTC
  To: Michal Hocko
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable



> On Dec 21, 2021, at 1:46 AM, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Tue 21-12-21 05:46:16, Alexey Makhalov wrote:
>> Hi Michal,
>> 
>> The patchset looks good to me. I didn’t find any issues during the testing.
> 
> Thanks a lot. Can I add your Tested-by: tag?
Sure, thanks.

> 
>> I have one concern regarding dmesg output. Do you think this messaging is
>> valid if possible node is not yet present?
>> Or is it only the issue for virtual machines?
>> 
>>  Node XX uninitialized by the platform. Please report with boot dmesg.
>>  Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
> 
> AFAIU the Initmem part of the output is what concerns you, right? Yeah,
The first line, actually: the sentence “Please report with boot dmesg.” But
there is nothing to fix, at least for VMs.

> that really is more cryptic than necessary. Does this look any better?
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 34743dcd2d66..7e18a924be7e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7618,9 +7618,14 @@ static void __init free_area_init_node(int nid)
> 	pgdat->node_start_pfn = start_pfn;
> 	pgdat->per_cpu_nodestats = NULL;
> 
> -	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> -		(u64)start_pfn << PAGE_SHIFT,
> -		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	if (start_pfn != end_pfn) {
> +		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> +			(u64)start_pfn << PAGE_SHIFT,
> +			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> +	} else {
> +		pr_info("Initmem setup node %d as memoryless\n", nid);
> +	}
> +
> 	calculate_node_totalpages(pgdat, start_pfn, end_pfn);
> 
> 	alloc_node_mem_map(pgdat);
Second line looks much better.

Thank you,
—Alexey



* Re: [PATCH v3] mm: fix panic in __alloc_pages
  2021-12-21 20:23                                                                     ` Alexey Makhalov
@ 2021-12-22 11:41                                                                       ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2021-12-22 11:41 UTC
  To: Alexey Makhalov
  Cc: Dennis Zhou, Eric Dumazet, linux-mm, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Tejun Heo, Christoph Lameter,
	linux-kernel, stable

On Tue 21-12-21 20:23:34, Alexey Makhalov wrote:
> 
> 
> > On Dec 21, 2021, at 1:46 AM, Michal Hocko <mhocko@suse.com> wrote:
> > 
> > On Tue 21-12-21 05:46:16, Alexey Makhalov wrote:
> >> Hi Michal,
> >> 
> >> The patchset looks good to me. I didn’t find any issues during the testing.
> > 
> > Thanks a lot. Can I add your Tested-by: tag?
> Sure, thanks.

Thanks, I will add those then.
 
> >> I have one concern regarding dmesg output. Do you think this messaging is
> >> valid if possible node is not yet present?
> >> Or is it only the issue for virtual machines?
> >> 
> >>  Node XX uninitialized by the platform. Please report with boot dmesg.
> >>  Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]
> > 
> > AFAIU the Initmem part of the output is what concerns you, right? Yeah,
> First line actually, this sentence “Please report with boot dmesg.”. But
> there is nothing to fix, at least for VMs.

I am still not sure because at least x86 aims to handle that in the
platform code. David has given us a way to trigger this from kvm/qemu so
I will play with that. I can certainly change the wording but this whole
thing was meant to do a fixup after the arch-specific code has initialized
everything.

> > that really is more cryptic than necessary. Does this look any better?
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 34743dcd2d66..7e18a924be7e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -7618,9 +7618,14 @@ static void __init free_area_init_node(int nid)
> > 	pgdat->node_start_pfn = start_pfn;
> > 	pgdat->per_cpu_nodestats = NULL;
> > 
> > -	pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> > -		(u64)start_pfn << PAGE_SHIFT,
> > -		end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> > +	if (start_pfn != end_pfn) {
> > +		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
> > +			(u64)start_pfn << PAGE_SHIFT,
> > +			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
> > +	} else {
> > +		pr_info("Initmem setup node %d as memoryless\n", nid);
> > +	}
> > +
> > 	calculate_node_totalpages(pgdat, start_pfn, end_pfn);
> > 
> > 	alloc_node_mem_map(pgdat);
> Second line looks much better.

OK, I will fold that in. I think it is more descriptive as well.
> 
> Thank you,
> —Alexey
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2021-12-21  9:51                                                   ` Michal Hocko
@ 2022-01-02  7:14                                                     ` Mike Rapoport
  2022-01-10 17:16                                                       ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Mike Rapoport @ 2022-01-02  7:14 UTC
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, Alexey Makhalov, LKML,
	linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Christoph Lameter, Nico Pache

Hi Michal,

On Tue, Dec 21, 2021 at 10:51:14AM +0100, Michal Hocko wrote:
> On Fri 17-12-21 15:51:31, David Hildenbrand wrote:
> [...]
> > No surprises found so far. I'll be mostly offline for the next 2 weeks,
> > so an official review might take some more time.
> 
> Thanks a lot for the testing and a very instructive step by step howto.
> I will note it down.
> 
> Don't worry about the review and enjoy the xmas break. I will likely
> resubmit early next year.

Can you please cc me on that?
I'm way behind on linux-mm, wouldn't want to miss this.

> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2022-01-02  7:14                                                     ` Mike Rapoport
@ 2022-01-10 17:16                                                       ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2022-01-10 17:16 UTC
  To: Mike Rapoport
  Cc: David Hildenbrand, Andrew Morton, Alexey Makhalov, LKML,
	linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Christoph Lameter, Nico Pache

On Sun 02-01-22 09:14:45, Mike Rapoport wrote:
> Hi Michal,
> 
> On Tue, Dec 21, 2021 at 10:51:14AM +0100, Michal Hocko wrote:
> > On Fri 17-12-21 15:51:31, David Hildenbrand wrote:
> > [...]
> > > No surprises found so far. I'll be mostly offline for the next 2 weeks,
> > > so an official review might take some more time.
> > 
> > Thanks a lot for the testing and a very instructive step by step howto.
> > I will note it down.
> > 
> > Don't worry about the review and enjoy the xmas break. I will likely
> > resubmit early next year.
> 
> Can you please cc me on that?
> I'm way behind on linux-mm, wouldn't want to miss this.

Sure thing. I plan to repost after the merge window.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
                                                                   ` (4 preceding siblings ...)
  2021-12-17 14:51                                                 ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully David Hildenbrand
@ 2022-01-10 21:16                                                 ` Rafael Aquini
  2022-01-11  8:34                                                   ` Michal Hocko
  5 siblings, 1 reply; 98+ messages in thread
From: Rafael Aquini @ 2022-01-10 21:16 UTC
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, Alexey Makhalov, LKML,
	linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Christoph Lameter, Nico Pache

On Tue, Dec 14, 2021 at 11:07:28AM +0100, Michal Hocko wrote:
> Hi,
> this should be the full bundle for now. I have ended up with 4 patches.
> The primary fix is patch 2 (should be reasonably easy to backport to
> older kernels if there is any need for that). Patches 3 and 4 are mere
> clean ups.
>
> I will repost once this can get some testing from Alexey. Shouldn't be
> too much different from http://lkml.kernel.org/r/YbHfBgPQMkjtuHYF@dhcp22.suse.cz
> with the follow up fix squashed in.
> 
> I would really appreciate to hear more about http://lkml.kernel.org/r/YbMZsczMGpChaWz0@dhcp22.suse.cz
> because I would like to add that information to the changelog as well.
> 
> Thanks for the review and testing.
> 

FWIW, you can add my Acked-by on your repost, Michal.

I reviewed your patches and tested them against that PPC crash on boot 
described at https://lore.kernel.org/all/YdxoXhTqCmVrT0R5@optiplex-fbsd/

Everything has worked like a charm, AFAICT.

Thank you for letting me know about these patches, and thanks for
working on them as a follow-up to that problem reported by Nico.

-- Rafael



* Re: [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully
  2022-01-10 21:16                                                 ` Rafael Aquini
@ 2022-01-11  8:34                                                   ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2022-01-11  8:34 UTC
  To: Rafael Aquini
  Cc: Andrew Morton, David Hildenbrand, Alexey Makhalov, LKML,
	linux-mm, Dennis Zhou, Eric Dumazet, Oscar Salvador, Tejun Heo,
	Christoph Lameter, Nico Pache

On Mon 10-01-22 16:16:06, Rafael Aquini wrote:
> On Tue, Dec 14, 2021 at 11:07:28AM +0100, Michal Hocko wrote:
> > Hi,
> > this should be the full bundle for now. I have ended up with 4 patches.
> > The primary fix is patch 2 (should be reasonably easy to backport to
> > older kernels if there is any need for that). Patches 3 and 4 are mere
> > clean ups.
> >
> > I will repost once this can get some testing from Alexey. Shouldn't be
> > too much different from http://lkml.kernel.org/r/YbHfBgPQMkjtuHYF@dhcp22.suse.cz
> > with the follow up fix squashed in.
> > 
> > I would really appreciate to hear more about http://lkml.kernel.org/r/YbMZsczMGpChaWz0@dhcp22.suse.cz
> > because I would like to add that information to the changelog as well.
> > 
> > Thanks for the review and testing.
> > 
> 
> FWIW, you can add my Acked-by on your repost Michal.
> 
> I reviewed your patches and tested them against that PPC crash on boot 
> described at https://lore.kernel.org/all/YdxoXhTqCmVrT0R5@optiplex-fbsd/
> 
> Everything has worked like a charm, AFAICT.
> 
> Thank you for letting me know about these patches, and thanks for
> working on them as a follow-up to that problem reported by Nico.

Thanks a lot for the review and testing, Rafael!

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2021-12-14 10:38                                                     ` Michal Hocko
@ 2022-01-14  0:24                                                       ` Wei Yang
  2022-01-14 10:01                                                         ` Michal Hocko
  0 siblings, 1 reply; 98+ messages in thread
From: Wei Yang @ 2022-01-14  0:24 UTC
  To: Michal Hocko
  Cc: Christoph Lameter, Andrew Morton, David Hildenbrand,
	Alexey Makhalov, LKML, linux-mm, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Nico Pache

On Tue, Dec 14, 2021 at 11:38:47AM +0100, Michal Hocko wrote:
>On Tue 14-12-21 11:33:41, Christoph Lameter wrote:
>> On Tue, 14 Dec 2021, Michal Hocko wrote:
>> 
>> > This patch takes a different approach (following a lead of [3]) and it
>> > pre allocates pgdat for all possible nodes in an arch independent code
>> > - free_area_init. All uninitialized nodes are treated as memoryless
>> > nodes. node_state of the node is not changed because that would lead to
>> > other side effects - e.g. sysfs representation of such a node and from
>> > past discussions [4] it is known that some tools might have problems
>> > digesting that.
>> 
>> Would it be possible to define a pgdat statically and place it in read
>> only memory? Populate with values that ensure that the page allocator
>> does not blow up but does a defined fallback.
>> 
>> Point the pgdat for all nodes not online to that readonly pgdat?
>> 
>> Maybe that would save some memory. When the node comes online then a real
>> pgdat could be allocated.
>
>This is certainly possible but also it is more complex. I aim for as
>simple as possible at this stage. The reason I am not concerned about
>memory overhead so much (even though the pgdat is a large data
>structure) is that these unpopulated nodes are rather rare. We might see
>more of them in the future but we are not quite there yet so I do not
>think this is a major obstacle for now.

Another thing: NODE_DATA can still be NULL if this allocation fails, and a
NULL NODE_DATA is the very problem we want to address here.

This is not urgent, but we may need to address it later.

>
>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me


* Re: [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully
  2022-01-14  0:24                                                       ` Wei Yang
@ 2022-01-14 10:01                                                         ` Michal Hocko
  0 siblings, 0 replies; 98+ messages in thread
From: Michal Hocko @ 2022-01-14 10:01 UTC
  To: Wei Yang
  Cc: Christoph Lameter, Andrew Morton, David Hildenbrand,
	Alexey Makhalov, LKML, linux-mm, Dennis Zhou, Eric Dumazet,
	Oscar Salvador, Tejun Heo, Nico Pache

On Fri 14-01-22 00:24:15, Wei Yang wrote:
> On Tue, Dec 14, 2021 at 11:38:47AM +0100, Michal Hocko wrote:
> >On Tue 14-12-21 11:33:41, Christoph Lameter wrote:
> >> On Tue, 14 Dec 2021, Michal Hocko wrote:
> >> 
> >> > This patch takes a different approach (following a lead of [3]) and it
> >> > pre allocates pgdat for all possible nodes in an arch independent code
> >> > - free_area_init. All uninitialized nodes are treated as memoryless
> >> > nodes. node_state of the node is not changed because that would lead to
> >> > other side effects - e.g. sysfs representation of such a node and from
> >> > past discussions [4] it is known that some tools might have problems
> >> > digesting that.
> >> 
> >> Would it be possible to define a pgdat statically and place it in read
> >> only memory? Populate with values that ensure that the page allocator
> >> does not blow up but does a defined fallback.
> >> 
> >> Point the pgdat for all nodes not online to that readonly pgdat?
> >> 
> >> Maybe that would save some memory. When the node comes online then a real
> >> pgdat could be allocated.
> >
> >This is certainly possible but also it is more complex. I aim for as
> >simple as possible at this stage. The reason I am not concerned about
> >memory overhead so much (even though the pgdat is a large data
> >structure) is that these unpopulated nodes are rather rare. We might see
> >more of them in the future but we are not quite there yet so I do not
> >think this is a major obstacle for now.
> 
> Another thing is we still have a chance to get NULL NODE_DATA if we failed to
> allocate it. And this is the problem we want to address here.

A system so short on memory this early in boot that this allocation fails
is very unlikely to finish booting. I do not think we can do any reasonable
allocation failure handling here.

-- 
Michal Hocko
SUSE Labs
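
Note: the exchange above can be illustrated with a small userspace model.
This is a sketch under stated assumptions, not the patch itself: struct
pgdat_sketch, node_data[], MAX_NODES, fixup_uninitialized_nodes() and the
injectable allocator are all hypothetical names. It models the approach
described in the quoted changelog (give every possible node without a
pgdat a zeroed, memoryless one) and, following Wei's remark, leaves the
slot NULL when the allocation fails instead of attempting any recovery.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 4 /* hypothetical number of possible nodes */

/* Heavily reduced stand-in for the kernel's pgdat. */
struct pgdat_sketch {
	int node_id;
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages; /* 0 means memoryless */
};

/* Stand-in for NODE_DATA(nid); NULL means "not initialized". */
static struct pgdat_sketch *node_data[MAX_NODES];

typedef void *(*alloc_fn)(size_t size);

/*
 * Give every possible node that has no pgdat yet a zeroed one.
 * Returns how many nodes are still left without a pgdat because
 * the allocator failed.
 */
static int fixup_uninitialized_nodes(alloc_fn alloc)
{
	int missing = 0;

	for (int nid = 0; nid < MAX_NODES; nid++) {
		struct pgdat_sketch *pgdat;

		if (node_data[nid])
			continue; /* already set up by platform code */

		pgdat = alloc(sizeof(*pgdat));
		if (!pgdat) {
			missing++; /* node_data[nid] stays NULL */
			continue;
		}

		memset(pgdat, 0, sizeof(*pgdat)); /* zero span: memoryless */
		pgdat->node_id = nid;
		node_data[nid] = pgdat;
	}

	return missing;
}

/* Allocator that always fails, to exercise the failure path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Passing failing_alloc shows the case Wei describes, where uninitialized
nodes keep a NULL entry; passing malloc fills every remaining slot with a
memoryless pgdat.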


Thread overview: 98+ messages
2021-11-01 20:13 [PATCH] mm: fix panic in __alloc_pages Alexey Makhalov
2021-11-01 20:38 ` Matthew Wilcox
2021-11-02  7:47 ` Michal Hocko
2021-11-02  8:12   ` David Hildenbrand
2021-11-02  8:48     ` Alexey Makhalov
2021-11-02  9:04       ` Michal Hocko
2021-11-02  9:24         ` David Hildenbrand
2021-11-02 10:34           ` Alexey Makhalov
2021-11-02 11:00             ` David Hildenbrand
2021-11-02 11:44               ` Michal Hocko
2021-11-02 12:06                 ` David Hildenbrand
2021-11-02 12:27                   ` Michal Hocko
2021-11-02 12:39                     ` David Hildenbrand
2021-11-02 13:25                       ` Michal Hocko
2021-11-02 13:41                         ` David Hildenbrand
2021-11-02 14:12                           ` Michal Hocko
2021-11-02 14:44                             ` David Hildenbrand
2021-11-02 13:52                         ` Oscar Salvador
2021-11-02 14:35                           ` Michal Hocko
2021-11-08  6:12                   ` Alexey Makhalov
2021-11-08  6:36                     ` [PATCH v2] " Alexey Makhalov
2021-11-08  8:32                       ` David Hildenbrand
2021-11-08 20:23                         ` [PATCH v3] " Alexey Makhalov
2021-11-09  2:08                           ` Eric Dumazet
2021-11-09  7:03                             ` David Hildenbrand
2021-11-09 16:55                               ` Eric Dumazet
2021-11-09 17:15                             ` Michal Hocko
2021-11-09 19:06                               ` Dennis Zhou
2021-11-09 19:54                                 ` Michal Hocko
2021-11-16  1:31                                   ` Alexey Makhalov
2021-11-16  9:17                                     ` Michal Hocko
2021-11-16 20:22                                       ` Alexey Makhalov
2021-11-18  8:35                                         ` Michal Hocko
2021-12-07 10:54                                           ` Michal Hocko
2021-12-07 11:08                                             ` David Hildenbrand
2021-12-07 12:13                                               ` Michal Hocko
2021-12-07 12:28                                                 ` David Hildenbrand
2021-12-07 13:23                                                   ` Michal Hocko
2021-12-07 15:09                                                     ` David Hildenbrand
2021-12-07 15:29                                                       ` Michal Hocko
2021-12-07 15:34                                                         ` David Hildenbrand
2021-12-07 15:56                                                           ` Michal Hocko
2021-12-07 16:09                                                             ` David Hildenbrand
2021-12-07 16:27                                                               ` Michal Hocko
2021-12-07 16:36                                                                 ` Michal Hocko
2021-12-07 16:40                                                                   ` David Hildenbrand
2021-12-08  8:28                                                                     ` Michal Hocko
2021-12-07 17:02                                                                   ` Alexey Makhalov
2021-12-07 17:13                                                                     ` David Hildenbrand
2021-12-07 17:17                                                                       ` Alexey Makhalov
2021-12-07 18:03                                                                         ` David Hildenbrand
2021-12-08  8:12                                                                           ` Michal Hocko
2021-12-08  8:24                                                                             ` David Hildenbrand
2021-12-08  8:34                                                                               ` Michal Hocko
2021-12-08  8:38                                                                                 ` David Hildenbrand
2021-12-08  8:04                                                                         ` Michal Hocko
2021-12-08  8:19                                                                           ` Alexey Makhalov
2021-12-08  8:30                                                                             ` Michal Hocko
2021-12-08  8:54                                             ` Michal Hocko
2021-12-08  8:57                                               ` Alexey Makhalov
2021-12-08  9:55                                                 ` Michal Hocko
2021-12-09  2:16                                               ` Alexey Makhalov
2021-12-09  8:46                                                 ` Michal Hocko
2021-12-09  9:28                                                   ` Alexey Makhalov
2021-12-09  9:56                                                     ` Michal Hocko
2021-12-09 10:23                                                       ` Alexey Makhalov
2021-12-09 13:29                                                         ` Michal Hocko
2021-12-09 19:01                                                           ` Alexey Makhalov
2021-12-10  9:11                                                             ` Michal Hocko
2021-12-17 12:53                                                               ` Michal Hocko
2021-12-21  5:46                                                                 ` Alexey Makhalov
2021-12-21  9:46                                                                   ` Michal Hocko
2021-12-21 20:23                                                                     ` Alexey Makhalov
2021-12-22 11:41                                                                       ` Michal Hocko
2021-12-09 10:48                                             ` Michal Hocko
2021-12-13 15:06                                               ` Michal Hocko
2021-12-13 15:07                                                 ` David Hildenbrand
2021-12-14  8:38                                                   ` Michal Hocko
2021-12-14 10:07                                               ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully Michal Hocko
2021-12-14 10:07                                                 ` [PATCH v2 1/4] mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG Michal Hocko
2021-12-14 10:07                                                 ` [PATCH v2 2/4] mm: handle uninitialized numa nodes gracefully Michal Hocko
2021-12-14 10:33                                                   ` Christoph Lameter
2021-12-14 10:38                                                     ` Michal Hocko
2022-01-14  0:24                                                       ` Wei Yang
2022-01-14 10:01                                                         ` Michal Hocko
2021-12-15  4:47                                                   ` kernel test robot
2021-12-15 10:12                                                     ` Michal Hocko
2021-12-14 10:07                                                 ` [PATCH v2 3/4] mm, memory_hotplug: drop arch_free_nodedata Michal Hocko
2021-12-14 10:07                                                 ` [PATCH v2 4/4] mm, memory_hotplug: reorganize new pgdat initialization Michal Hocko
2021-12-17 14:51                                                 ` [PATCH v2 0/4] mm, memory_hotplug: handle unitialized numa node gracefully David Hildenbrand
2021-12-21  9:51                                                   ` Michal Hocko
2022-01-02  7:14                                                     ` Mike Rapoport
2022-01-10 17:16                                                       ` Michal Hocko
2022-01-10 21:16                                                 ` Rafael Aquini
2022-01-11  8:34                                                   ` Michal Hocko
2021-11-08 10:37                       ` [PATCH v2] mm: fix panic in __alloc_pages Michal Hocko
2021-11-02  9:40         ` [PATCH] " Alexey Makhalov
2021-11-02  9:40         ` Michal Hocko
