LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
@ 2021-09-09 14:16 yaozhenguo
  2021-09-15  3:50 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: yaozhenguo @ 2021-09-09 14:16 UTC (permalink / raw)
  To: mike.kravetz
  Cc: corbet, akpm, yaozhenguo, willy, linux-kernel, linux-doc,
	linux-mm, yaozhenguo

We can specify the number of hugepages to allocate at boot. But the
hugepages is balanced in all nodes at present. In some scenarios,
we only need hugepages in one node. For example: DPDK needs hugepages
which are in the same node as NIC. if DPDK needs four hugepages of 1G
size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
in kernel cmdline. But, only four hugepages are used. The others should
be free after boot. If the system memory is low(for example: 64G), it will
be an impossible task. So, Extending hugepages parameter to support
specifying hugepages at a specific node.
For example add following parameter:

hugepagesz=1G hugepages=0:1,1:3

It will allocate 1 hugepage in node0 and 3 hugepages in node1.

Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
---
v3 -> v4: changes
    - fix wrong behavior for parameter: hugepages=0:1,1:3 default_hugepagesz=1G
    - make the change of documentation more reasonable
---
 .../admin-guide/kernel-parameters.txt         |   8 +-
 Documentation/admin-guide/mm/hugetlbpage.rst  |  12 +-
 include/linux/hugetlb.h                       |   1 +
 mm/hugetlb.c                                  | 122 +++++++++++++++++-
 4 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f..a2046b2c5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1588,9 +1588,11 @@
 			the number of pages of hugepagesz to be allocated.
 			If this is the first HugeTLB parameter on the command
 			line, it specifies the number of pages to allocate for
-			the default huge page size.  See also
-			Documentation/admin-guide/mm/hugetlbpage.rst.
-			Format: <integer>
+			the default huge page size. If using node format, the
+			number of pages to allocate per-node can be specified.
+			See also Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: <integer> or (node format)
+				<node>:<integer>[,<node>:<integer>]
 
 	hugepagesz=
 			[HW] The size of the HugeTLB pages.  This is used in
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144..d70828c07 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -128,7 +128,9 @@ hugepages
 	implicitly specifies the number of huge pages of default size to
 	allocate.  If the number of huge pages of default size is implicitly
 	specified, it can not be overwritten by a hugepagesz,hugepages
-	parameter pair for the default size.
+	parameter pair for the default size.  This parameter also has a
+	node format.  The node format specifies the number of huge pages
+	to allocate on specific nodes.
 
 	For example, on an architecture with 2M default huge page size::
 
@@ -138,6 +140,14 @@ hugepages
 	indicating that the hugepages=512 parameter is ignored.  If a hugepages
 	parameter is preceded by an invalid hugepagesz parameter, it will
 	be ignored.
+
+	Node format example::
+
+		hugepagesz=2M hugepages=0:1,1:2
+
+	It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
+	If the node number is invalid,  the parameter will be ignored.
+
 default_hugepagesz
 	Specify the default huge page size.  This parameter can
 	only be specified once on the command line.  default_hugepagesz can
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f7ca1a387..5939ecd4f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -605,6 +605,7 @@ struct hstate {
 	unsigned long nr_overcommit_huge_pages;
 	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int max_huge_pages_node[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfc940d52..c92ab09cf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
 static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
+static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -2842,10 +2843,75 @@ static void __init gather_bootmem_prealloc(void)
 	}
 }
 
+static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
+{
+	unsigned long i;
+	char buf[32];
+
+	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
+		if (hstate_is_gigantic(h)) {
+			struct huge_bootmem_page *m;
+			void *addr;
+
+			addr = memblock_alloc_try_nid_raw(
+					huge_page_size(h), huge_page_size(h),
+					0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+			if (!addr)
+				break;
+			m = addr;
+			BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
+			/*
+			 * Put them into a private list first because mem_map
+			 * is not up yet
+			 */
+			INIT_LIST_HEAD(&m->list);
+			list_add(&m->list, &huge_boot_pages);
+			m->hstate = h;
+		} else {
+			struct page *page;
+
+			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
+			page = alloc_fresh_huge_page(h, gfp_mask, nid,
+					&node_states[N_MEMORY], NULL);
+			if (!page)
+				break;
+			put_page(page); /* free it into the hugepage allocator */
+		}
+		cond_resched();
+	}
+	if (i == h->max_huge_pages_node[nid])
+		return;
+
+	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
+		h->max_huge_pages_node[nid], buf, nid, i);
+	h->max_huge_pages_node[nid] = i;
+	h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
+}
+
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
 	nodemask_t *node_alloc_noretry;
+	bool hugetlb_node_set = false;
+
+	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
+	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
+		pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+		return;
+	}
+
+	/* do node alloc */
+	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
+		if (h->max_huge_pages_node[i] > 0) {
+			hugetlb_hstate_alloc_pages_onenode(h, i);
+			hugetlb_node_set = true;
+		}
+	}
+
+	if (hugetlb_node_set)
+		return;
 
 	if (!hstate_is_gigantic(h)) {
 		/*
@@ -2867,10 +2933,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
-			if (hugetlb_cma_size) {
-				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
-				goto free;
-			}
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_pool_huge_page(h,
@@ -2887,7 +2949,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
-free:
 	kfree(node_alloc_noretry);
 }
 
@@ -3578,6 +3639,11 @@ static int __init hugetlb_init(void)
 			}
 			default_hstate.max_huge_pages =
 				default_hstate_max_huge_pages;
+
+			for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
+				if (default_hugepages_in_node[i] > 0)
+					default_hstate.max_huge_pages_node[i] =
+						default_hugepages_in_node[i];
 		}
 	}
 
@@ -3649,6 +3715,10 @@ static int __init hugepages_setup(char *s)
 {
 	unsigned long *mhp;
 	static unsigned long *last_mhp;
+	unsigned int node = NUMA_NO_NODE;
+	int count;
+	unsigned long tmp;
+	char *p = s;
 
 	if (!parsed_valid_hugepagesz) {
 		pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s);
@@ -3672,8 +3742,37 @@ static int __init hugepages_setup(char *s)
 		return 0;
 	}
 
-	if (sscanf(s, "%lu", mhp) <= 0)
-		*mhp = 0;
+	while (*p) {
+		count = 0;
+		if (sscanf(p, "%lu%n", &tmp, &count) != 1)
+			goto invalid;
+		/* Parameter is node format */
+		if (p[count] == ':') {
+			node = tmp;
+			p += count + 1;
+			if (node < 0 ||
+				node >= nodes_weight(node_states[N_MEMORY]))
+				goto invalid;
+			/* Parse hugepages */
+			if (sscanf(p, "%lu%n", &tmp, &count) != 1)
+				goto invalid;
+			if (!hugetlb_max_hstate)
+				default_hugepages_in_node[node] = tmp;
+			else
+				parsed_hstate->max_huge_pages_node[node] = tmp;
+			*mhp += tmp;
+			/* Go to parse next node*/
+			if (p[count] == ',')
+				p += count + 1;
+			else
+				break;
+		} else {
+			if (p != s)
+				goto invalid;
+			*mhp = tmp;
+			break;
+		}
+	}
 
 	/*
 	 * Global state is always initialized later in hugetlb_init.
@@ -3686,6 +3785,10 @@ static int __init hugepages_setup(char *s)
 	last_mhp = mhp;
 
 	return 1;
+
+invalid:
+	pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
+	return 0;
 }
 __setup("hugepages=", hugepages_setup);
 
@@ -3747,6 +3850,7 @@ __setup("hugepagesz=", hugepagesz_setup);
 static int __init default_hugepagesz_setup(char *s)
 {
 	unsigned long size;
+	int i;
 
 	parsed_valid_hugepagesz = false;
 	if (parsed_default_hugepagesz) {
@@ -3775,6 +3879,10 @@ static int __init default_hugepagesz_setup(char *s)
 	 */
 	if (default_hstate_max_huge_pages) {
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
+		for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
+			if (default_hugepages_in_node[i] > 0)
+				default_hstate.max_huge_pages_node[i] =
+					default_hugepages_in_node[i];
 		if (hstate_is_gigantic(&default_hstate))
 			hugetlb_hstate_alloc_pages(&default_hstate);
 		default_hstate_max_huge_pages = 0;
-- 
2.27.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-09 14:16 [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation yaozhenguo
@ 2021-09-15  3:50 ` Andrew Morton
  2021-09-15 13:11   ` zhenguo yao
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2021-09-15  3:50 UTC (permalink / raw)
  To: yaozhenguo
  Cc: mike.kravetz, corbet, yaozhenguo, willy, linux-kernel, linux-doc,
	linux-mm

On Thu,  9 Sep 2021 22:16:55 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:

> We can specify the number of hugepages to allocate at boot. But the
> hugepages is balanced in all nodes at present. In some scenarios,
> we only need hugepages in one node. For example: DPDK needs hugepages
> which are in the same node as NIC. if DPDK needs four hugepages of 1G
> size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
> in kernel cmdline. But, only four hugepages are used. The others should
> be free after boot. If the system memory is low(for example: 64G), it will
> be an impossible task. So, Extending hugepages parameter to support
> specifying hugepages at a specific node.
> For example add following parameter:
> 
> hugepagesz=1G hugepages=0:1,1:3
> 
> It will allocate 1 hugepage in node0 and 3 hugepages in node1.
> 
> ...
>
> @@ -2842,10 +2843,75 @@ static void __init gather_bootmem_prealloc(void)
>  	}
>  }
>  
> +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> +{
> +	unsigned long i;
> +	char buf[32];
> +
> +	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> +		if (hstate_is_gigantic(h)) {
> +			struct huge_bootmem_page *m;
> +			void *addr;
> +
> +			addr = memblock_alloc_try_nid_raw(
> +					huge_page_size(h), huge_page_size(h),
> +					0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> +			if (!addr)
> +				break;
> +			m = addr;
> +			BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));

We try very hard to avoid adding BUG calls.  Is there any way in which
this code can emit a WARNing then permit the kernel to keep operating?

> +			/*
> +			 * Put them into a private list first because mem_map
> +			 * is not up yet
> +			 */
> +			INIT_LIST_HEAD(&m->list);
> +			list_add(&m->list, &huge_boot_pages);
> +			m->hstate = h;
> +		} else {
> +			struct page *page;
> +
> +			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +
> +			page = alloc_fresh_huge_page(h, gfp_mask, nid,
> +					&node_states[N_MEMORY], NULL);
> +			if (!page)
> +				break;
> +			put_page(page); /* free it into the hugepage allocator */
> +		}
> +		cond_resched();
> +	}
> +	if (i == h->max_huge_pages_node[nid])
> +		return;
> +
> +	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> +	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> +		h->max_huge_pages_node[nid], buf, nid, i);
> +	h->max_huge_pages_node[nid] = i;
> +	h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> +}
> +


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-15  3:50 ` Andrew Morton
@ 2021-09-15 13:11   ` zhenguo yao
  2021-09-15 22:03     ` Mike Kravetz
  0 siblings, 1 reply; 7+ messages in thread
From: zhenguo yao @ 2021-09-15 13:11 UTC (permalink / raw)
  To: Andrew Morton, Mike Kravetz
  Cc: corbet, yaozhenguo, Matthew Wilcox, linux-kernel, linux-doc,
	Linux Memory Management List

Andrew Morton <akpm@linux-foundation.org> 于2021年9月15日周三 上午11:50写道:
>
> On Thu,  9 Sep 2021 22:16:55 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
>
> > We can specify the number of hugepages to allocate at boot. But the
> > hugepages is balanced in all nodes at present. In some scenarios,
> > we only need hugepages in one node. For example: DPDK needs hugepages
> > which are in the same node as NIC. if DPDK needs four hugepages of 1G
> > size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
> > in kernel cmdline. But, only four hugepages are used. The others should
> > be free after boot. If the system memory is low(for example: 64G), it will
> > be an impossible task. So, Extending hugepages parameter to support
> > specifying hugepages at a specific node.
> > For example add following parameter:
> >
> > hugepagesz=1G hugepages=0:1,1:3
> >
> > It will allocate 1 hugepage in node0 and 3 hugepages in node1.
> >
> > ...
> >
> > @@ -2842,10 +2843,75 @@ static void __init gather_bootmem_prealloc(void)
> >       }
> >  }
> >
> > +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> > +{
> > +     unsigned long i;
> > +     char buf[32];
> > +
> > +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> > +             if (hstate_is_gigantic(h)) {
> > +                     struct huge_bootmem_page *m;
> > +                     void *addr;
> > +
> > +                     addr = memblock_alloc_try_nid_raw(
> > +                                     huge_page_size(h), huge_page_size(h),
> > +                                     0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> > +                     if (!addr)
> > +                             break;
> > +                     m = addr;
> > +                     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
>
> We try very hard to avoid adding BUG calls.  Is there any way in which
> this code can emit a WARNing then permit the kernel to keep operating?
>
Maybe we can rewrite it as below:
                        if (WARN(!IS_ALIGNED(virt_to_phys(m),
huge_page_size(h)),
                                "HugeTLB: page addr:%p is not aligned\n", m))
                                break;
@Mike,  Do you think it's OK?
> > +                     /*
> > +                      * Put them into a private list first because mem_map
> > +                      * is not up yet
> > +                      */
> > +                     INIT_LIST_HEAD(&m->list);
> > +                     list_add(&m->list, &huge_boot_pages);
> > +                     m->hstate = h;
> > +             } else {
> > +                     struct page *page;
> > +
> > +                     gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> > +
> > +                     page = alloc_fresh_huge_page(h, gfp_mask, nid,
> > +                                     &node_states[N_MEMORY], NULL);
> > +                     if (!page)
> > +                             break;
> > +                     put_page(page); /* free it into the hugepage allocator */
> > +             }
> > +             cond_resched();
> > +     }
> > +     if (i == h->max_huge_pages_node[nid])
> > +             return;
> > +
> > +     string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> > +     pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> > +             h->max_huge_pages_node[nid], buf, nid, i);
> > +     h->max_huge_pages_node[nid] = i;
> > +     h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> > +}
> > +
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-15 13:11   ` zhenguo yao
@ 2021-09-15 22:03     ` Mike Kravetz
  2021-09-15 22:05       ` Mike Kravetz
  2021-09-15 22:38       ` Mike Kravetz
  0 siblings, 2 replies; 7+ messages in thread
From: Mike Kravetz @ 2021-09-15 22:03 UTC (permalink / raw)
  To: zhenguo yao, Andrew Morton
  Cc: corbet, yaozhenguo, Matthew Wilcox, linux-kernel, linux-doc,
	Linux Memory Management List

On 9/15/21 6:11 AM, zhenguo yao wrote:
> Andrew Morton <akpm@linux-foundation.org> 于2021年9月15日周三 上午11:50写道:
>>
>> On Thu,  9 Sep 2021 22:16:55 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
>>
>>> We can specify the number of hugepages to allocate at boot. But the
>>> hugepages is balanced in all nodes at present. In some scenarios,
>>> we only need hugepages in one node. For example: DPDK needs hugepages
>>> which are in the same node as NIC. if DPDK needs four hugepages of 1G
>>> size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
>>> in kernel cmdline. But, only four hugepages are used. The others should
>>> be free after boot. If the system memory is low(for example: 64G), it will
>>> be an impossible task. So, Extending hugepages parameter to support
>>> specifying hugepages at a specific node.
>>> For example add following parameter:
>>>
>>> hugepagesz=1G hugepages=0:1,1:3
>>>
>>> It will allocate 1 hugepage in node0 and 3 hugepages in node1.
>>>
>>> ...
>>>
>>> @@ -2842,10 +2843,75 @@ static void __init gather_bootmem_prealloc(void)
>>>       }
>>>  }
>>>
>>> +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>>> +{
>>> +     unsigned long i;
>>> +     char buf[32];
>>> +
>>> +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
>>> +             if (hstate_is_gigantic(h)) {
>>> +                     struct huge_bootmem_page *m;
>>> +                     void *addr;
>>> +
>>> +                     addr = memblock_alloc_try_nid_raw(
>>> +                                     huge_page_size(h), huge_page_size(h),
>>> +                                     0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
>>> +                     if (!addr)
>>> +                             break;
>>> +                     m = addr;
>>> +                     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
>>
>> We try very hard to avoid adding BUG calls.  Is there any way in which
>> this code can emit a WARNing then permit the kernel to keep operating?
>>
> Maybe we can rewrite it as below:
>                         if (WARN(!IS_ALIGNED(virt_to_phys(m),
> huge_page_size(h)),
>                                 "HugeTLB: page addr:%p is not aligned\n", m))
>                                 break;
> @Mike,  Do you think it's OK?

Sorry, I have not yet reviewed the latest version of this patch.
Quick thought on this question.

The required alignment passed to memblock_alloc_try_nid_raw() is
huge_page_size(h).  Therefore, we know the virtual address m is
huge_page_size(h) aligned.  The BUG is just checking to make sure
the physical address associated with the virtual address is aligned
the same.  I really do not see how this could not be the case.
In fact, the memblock allocator finds a physical address with the
required alignment and then returns phys_to_virt(alloc).
Someone please correct me if I am wrong.  Otherwise, we can drop
the BUG.
Adding Mike Rapport on Cc:

This allocation code and the associated BUG was copied from
__alloc_bootmem_huge_page().  The BUG was added 12 years ago before
the memblock allocator existed and we were using the bootmem allocator.
If there is no need for a BUG in hugetlb_hstate_alloc_pages_onenode,
there is no need for one in __alloc_bootmem_huge_page.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-15 22:03     ` Mike Kravetz
@ 2021-09-15 22:05       ` Mike Kravetz
  2021-09-17 13:59         ` Mike Rapoport
  2021-09-15 22:38       ` Mike Kravetz
  1 sibling, 1 reply; 7+ messages in thread
From: Mike Kravetz @ 2021-09-15 22:05 UTC (permalink / raw)
  To: zhenguo yao, Andrew Morton
  Cc: corbet, yaozhenguo, Matthew Wilcox, linux-kernel, linux-doc,
	Linux Memory Management List, Mike Rapoport

Now, really CC'ing Mike, and sorry for misspelling your name

On 9/15/21 3:03 PM, Mike Kravetz wrote:
> On 9/15/21 6:11 AM, zhenguo yao wrote:
>> Andrew Morton <akpm@linux-foundation.org> 于2021年9月15日周三 上午11:50写道:
>>>
>>> On Thu,  9 Sep 2021 22:16:55 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
>>>
>>>> We can specify the number of hugepages to allocate at boot. But the
>>>> hugepages is balanced in all nodes at present. In some scenarios,
>>>> we only need hugepages in one node. For example: DPDK needs hugepages
>>>> which are in the same node as NIC. if DPDK needs four hugepages of 1G
>>>> size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
>>>> in kernel cmdline. But, only four hugepages are used. The others should
>>>> be free after boot. If the system memory is low(for example: 64G), it will
>>>> be an impossible task. So, Extending hugepages parameter to support
>>>> specifying hugepages at a specific node.
>>>> For example add following parameter:
>>>>
>>>> hugepagesz=1G hugepages=0:1,1:3
>>>>
>>>> It will allocate 1 hugepage in node0 and 3 hugepages in node1.
>>>>
>>>> ...
>>>>
>>>> @@ -2842,10 +2843,75 @@ static void __init gather_bootmem_prealloc(void)
>>>>       }
>>>>  }
>>>>
>>>> +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>>>> +{
>>>> +     unsigned long i;
>>>> +     char buf[32];
>>>> +
>>>> +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
>>>> +             if (hstate_is_gigantic(h)) {
>>>> +                     struct huge_bootmem_page *m;
>>>> +                     void *addr;
>>>> +
>>>> +                     addr = memblock_alloc_try_nid_raw(
>>>> +                                     huge_page_size(h), huge_page_size(h),
>>>> +                                     0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
>>>> +                     if (!addr)
>>>> +                             break;
>>>> +                     m = addr;
>>>> +                     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
>>>
>>> We try very hard to avoid adding BUG calls.  Is there any way in which
>>> this code can emit a WARNing then permit the kernel to keep operating?
>>>
>> Maybe we can rewrite it as below:
>>                         if (WARN(!IS_ALIGNED(virt_to_phys(m),
>> huge_page_size(h)),
>>                                 "HugeTLB: page addr:%p is not aligned\n", m))
>>                                 break;
>> @Mike,  Do you think it's OK?
> 
> Sorry, I have not yet reviewed the latest version of this patch.
> Quick thought on this question.
> 
> The required alignment passed to memblock_alloc_try_nid_raw() is
> huge_page_size(h).  Therefore, we know the virtual address m is
> huge_page_size(h) aligned.  The BUG is just checking to make sure
> the physical address associated with the virtual address is aligned
> the same.  I really do not see how this could not be the case.
> In fact, the memblock allocator finds a physical address with the
> required alignment and then returns phys_to_virt(alloc).
> Someone please correct me if I am wrong.  Otherwise, we can drop
> the BUG.
> Adding Mike Rapport on Cc:
> 
> This allocation code and the associated BUG was copied from
> __alloc_bootmem_huge_page().  The BUG was added 12 years ago before
> the memblock allocator existed and we were using the bootmem allocator.
> If there is no need for a BUG in hugetlb_hstate_alloc_pages_onenode,
> there is no need for one in __alloc_bootmem_huge_page.
> 


-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-15 22:03     ` Mike Kravetz
  2021-09-15 22:05       ` Mike Kravetz
@ 2021-09-15 22:38       ` Mike Kravetz
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Kravetz @ 2021-09-15 22:38 UTC (permalink / raw)
  To: zhenguo yao, Andrew Morton
  Cc: corbet, yaozhenguo, Matthew Wilcox, linux-kernel, linux-doc,
	Linux Memory Management List

On 9/15/21 3:03 PM, Mike Kravetz wrote:
> On 9/15/21 6:11 AM, zhenguo yao wrote:
>> Andrew Morton <akpm@linux-foundation.org> 于2021年9月15日周三 上午11:50写道:
>>>
>>> On Thu,  9 Sep 2021 22:16:55 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
>>>
>>>> +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>>>> +{
>>>> +     unsigned long i;
>>>> +     char buf[32];
>>>> +
>>>> +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
>>>> +             if (hstate_is_gigantic(h)) {
>>>> +                     struct huge_bootmem_page *m;
>>>> +                     void *addr;
>>>> +
>>>> +                     addr = memblock_alloc_try_nid_raw(
>>>> +                                     huge_page_size(h), huge_page_size(h),
>>>> +                                     0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
>>>> +                     if (!addr)
>>>> +                             break;
>>>> +                     m = addr;
>>>> +                     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
>>>
>>> We try very hard to avoid adding BUG calls.  Is there any way in which
>>> this code can emit a WARNing then permit the kernel to keep operating?
>>>
>> Maybe we can rewrite it as below:
>>                         if (WARN(!IS_ALIGNED(virt_to_phys(m),
>> huge_page_size(h)),
>>                                 "HugeTLB: page addr:%p is not aligned\n", m))
>>                                 break;
>> @Mike,  Do you think it's OK?
> 
> Sorry, I have not yet reviewed the latest version of this patch.
> Quick thought on this question.
> 
> The required alignment passed to memblock_alloc_try_nid_raw() is
> huge_page_size(h).  Therefore, we know the virtual address m is
> huge_page_size(h) aligned.  The BUG is just checking to make sure
> the physical address associated with the virtual address is aligned
> the same.  I really do not see how this could not be the case.
> In fact, the memblock allocator finds a physical address with the
> required alignment and then returns phys_to_virt(alloc).
> Someone please correct me if I am wrong.  Otherwise, we can drop
> the BUG.
> Adding Mike Rapport on Cc:
> 
> This allocation code and the associated BUG was copied from
> __alloc_bootmem_huge_page().  The BUG was added 12 years ago before
> the memblock allocator existed and we were using the bootmem allocator.
> If there is no need for a BUG in hugetlb_hstate_alloc_pages_onenode,
> there is no need for one in __alloc_bootmem_huge_page.

One additional thought.

Architectures can provide their own version of alloc_bootmem_huge_page.
powerpc is the only architecture doing so today.  If an architecture
does provide their own version of alloc_bootmem_huge_page, I do not
think we should/can use hugetlb_hstate_alloc_pages_onenode to allocate
node specific gigantic huge pages.

I think we need to disable this feature for such architectures,
-OR-
provide some method to do architecture specific node allocations of
gigantic pages.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-15 22:05       ` Mike Kravetz
@ 2021-09-17 13:59         ` Mike Rapoport
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Rapoport @ 2021-09-17 13:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: zhenguo yao, Andrew Morton, corbet, yaozhenguo, Matthew Wilcox,
	linux-kernel, linux-doc, Linux Memory Management List

Hi Mike,

On Wed, Sep 15, 2021 at 03:05:41PM -0700, Mike Kravetz wrote:
> Now, really CC'ing Mike, and sorry for misspelling your name
> 
> On 9/15/21 3:03 PM, Mike Kravetz wrote:
> > On 9/15/21 6:11 AM, zhenguo yao wrote:
> >> Andrew Morton <akpm@linux-foundation.org> 于2021年9月15日周三 上午11:50写道:
> >>>
> >>> On Thu,  9 Sep 2021 22:16:55 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
> >>>
> >>>> We can specify the number of hugepages to allocate at boot. But the
> >>>> hugepages is balanced in all nodes at present. In some scenarios,
> >>>> we only need hugepages in one node. For example: DPDK needs hugepages
> >>>> which are in the same node as NIC. if DPDK needs four hugepages of 1G
> >>>> size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
> >>>> in kernel cmdline. But, only four hugepages are used. The others should
> >>>> be free after boot. If the system memory is low(for example: 64G), it will
> >>>> be an impossible task. So, Extending hugepages parameter to support
> >>>> specifying hugepages at a specific node.
> >>>> For example add following parameter:
> >>>>
> >>>> hugepagesz=1G hugepages=0:1,1:3
> >>>>
> >>>> It will allocate 1 hugepage in node0 and 3 hugepages in node1.
> >>>>
> >>>> ...
> >>>>
> >>>> @@ -2842,10 +2843,75 @@ static void __init gather_bootmem_prealloc(void)
> >>>>       }
> >>>>  }
> >>>>
> >>>> +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> >>>> +{
> >>>> +     unsigned long i;
> >>>> +     char buf[32];
> >>>> +
> >>>> +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> >>>> +             if (hstate_is_gigantic(h)) {
> >>>> +                     struct huge_bootmem_page *m;
> >>>> +                     void *addr;
> >>>> +
> >>>> +                     addr = memblock_alloc_try_nid_raw(
> >>>> +                                     huge_page_size(h), huge_page_size(h),
> >>>> +                                     0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> >>>> +                     if (!addr)
> >>>> +                             break;
> >>>> +                     m = addr;
> >>>> +                     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
> >>>
> >>> We try very hard to avoid adding BUG calls.  Is there any way in which
> >>> this code can emit a WARNing then permit the kernel to keep operating?
> >>>
> >> Maybe we can rewrite it as below:
> >>                         if (WARN(!IS_ALIGNED(virt_to_phys(m),
> >> huge_page_size(h)),
> >>                                 "HugeTLB: page addr:%p is not aligned\n", m))
> >>                                 break;
> >> @Mike,  Do you think it's OK?
> > 
> > Sorry, I have not yet reviewed the latest version of this patch.
> > Quick thought on this question.
> > 
> > The required alignment passed to memblock_alloc_try_nid_raw() is
> > huge_page_size(h).  Therefore, we know the virtual address m is
> > huge_page_size(h) aligned.  The BUG is just checking to make sure
> > the physical address associated with the virtual address is aligned
> > the same.  I really do not see how this could not be the case.
> > In fact, the memblock allocator finds a physical address with the
> > required alignment and then returns phys_to_virt(alloc).
> > Someone please correct me if I am wrong.  Otherwise, we can drop
> > the BUG.

I agree with your analysis and I also think the BUG() can be dropped
entirely as well as the BUG() in __alloc_bootmem_huge_page().

> > Adding Mike Rapport on Cc:
> > 
> > This allocation code and the associated BUG was copied from
> > __alloc_bootmem_huge_page().  The BUG was added 12 years ago before
> > the memblock allocator existed and we were using the bootmem allocator.
> > If there is no need for a BUG in hugetlb_hstate_alloc_pages_onenode,
> > there is no need for one in __alloc_bootmem_huge_page.

Hmm, even bootmem had alignment guaranties so it seems to me that the BUG()
was over-protective even then.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-09-17 13:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-09 14:16 [PATCH v4] hugetlbfs: Extend the definition of hugepages parameter to support node allocation yaozhenguo
2021-09-15  3:50 ` Andrew Morton
2021-09-15 13:11   ` zhenguo yao
2021-09-15 22:03     ` Mike Kravetz
2021-09-15 22:05       ` Mike Kravetz
2021-09-17 13:59         ` Mike Rapoport
2021-09-15 22:38       ` Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).