* [PATCH] hugetlbfs: add hugepages_node kernel parameter
@ 2021-08-20  3:05 yaozhenguo
  2021-08-22 22:19 ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: yaozhenguo @ 2021-08-20  3:05 UTC (permalink / raw)
  To: mike.kravetz, corbet, akpm
  Cc: yaozhenguo, linux-kernel, linux-doc, linux-mm, yaozhenguo

We can specify the number of hugepages to allocate at boot, but at
present the hugepages are balanced across all nodes. In some scenarios
we only need hugepages on one node. For example, DPDK needs hugepages
on the same node as the NIC. If DPDK needs four 1G hugepages on node1
and the system has 16 NUMA nodes, we must reserve 64 hugepages on the
kernel cmdline, yet only four of them are used. The others should be
freed after boot. If system memory is low (for example, 64G), this is
an impossible task. So, add a hugepages_node kernel parameter to specify
the node on which hugepages are allocated at boot.
For example, add the following parameters:

hugepagesz=1G hugepages_node=1 hugepages=4

It will allocate 4 hugepages on node1 at boot.

Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 include/linux/hugetlb.h                       |   1 +
 mm/hugetlb.c                                  | 109 +++++++++++++++++-
 3 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f..1f85f2b3d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1583,6 +1583,12 @@
 			hugepages using the CMA allocator. If enabled, the
 			boot-time allocation of gigantic hugepages is skipped.
 
+	hugepages_node=	[HW] Node number of hugepages to allocate at boot.
+			This is used in conjunction with hugepages (below).
+			The pair hugepages_node=X hugepages=Y specifies
+			the number of hugepages to allocate on NUMA node X.
+			Format: <integer>
+
 	hugepages=	[HW] Number of HugeTLB pages to allocate at boot.
 			If this follows hugepagesz (below), it specifies
 			the number of pages of hugepagesz to be allocated.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f7ca1a387..5939ecd4f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -605,6 +605,7 @@ struct hstate {
 	unsigned long nr_overcommit_huge_pages;
 	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int max_huge_pages_node[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfc940d52..1f50f866c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -66,6 +66,8 @@ static struct hstate * __initdata parsed_hstate;
 static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
+static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
+static int parsed_huge_pages_node __initdata = NUMA_NO_NODE;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -2842,10 +2844,68 @@ static void __init gather_bootmem_prealloc(void)
 	}
 }
 
-static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
+static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid, bool *gns)
+{
+	unsigned long i;
+
+	for (i = 0; i < h->max_huge_pages_node[nid]; i++) {
+		if (hstate_is_gigantic(h)) {
+			struct huge_bootmem_page *m;
+			void *addr;
+
+			addr = memblock_alloc_try_nid_raw(
+					huge_page_size(h), huge_page_size(h),
+					0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+			if (!addr)
+				break;
+			m = addr;
+			BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
+			/* Put them into a private list first because mem_map is not up yet */
+			INIT_LIST_HEAD(&m->list);
+			list_add(&m->list, &huge_boot_pages);
+			m->hstate = h;
+			*gns = true;
+		} else {
+			struct page *page;
+
+			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
+			page = alloc_fresh_huge_page(h, gfp_mask, nid,
+					&node_states[N_MEMORY], NULL);
+			if (page)
+				put_page(page); /* free it into the hugepage allocator */
+
+		}
+	}
+	if (hstate_is_gigantic(h)) {
+		h->max_huge_pages_node[nid] = 0;
+	}
+}
+
+static void __init hugetlb_hstate_alloc_pages(struct hstate *h, int nid)
 {
 	unsigned long i;
 	nodemask_t *node_alloc_noretry;
+	bool hugetlb_node_set = false;
+	bool gigantic_node_set = false;
+
+	/* do node alloc */
+	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
+		if (h->max_huge_pages_node[i] > 0)
+			hugetlb_hstate_alloc_pages_onenode(h, i, &gigantic_node_set);
+		/* gigantic_node_set distinguishes a per-node request from a
+		 * whole-system request for a gigantic hstate
+		 */
+		if (gigantic_node_set || h->nr_huge_pages_node[i] > 0)
+			hugetlb_node_set = true;
+	}
+
+	/* nid != NUMA_NO_NODE prevents extra pages from being allocated in a
+	 * gigantic hstate, for example:
+	 *     hugepagesz=1G hugepages_node=0 hugepages=4 hugepages_node=1 hugepages=0
+	 */
+	if (hugetlb_node_set || nid != NUMA_NO_NODE)
+		return;
 
 	if (!hstate_is_gigantic(h)) {
 		/*
@@ -2901,7 +2961,7 @@ static void __init hugetlb_init_hstates(void)
 
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
-			hugetlb_hstate_alloc_pages(h);
+			hugetlb_hstate_alloc_pages(h, NUMA_NO_NODE);
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
 }
@@ -3580,6 +3640,9 @@ static int __init hugetlb_init(void)
 				default_hstate_max_huge_pages;
 		}
 	}
+	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
+		if (default_hugepages_in_node[i] > 0)
+			default_hstate.max_huge_pages_node[i] = default_hugepages_in_node[i];
 
 	hugetlb_cma_check();
 	hugetlb_init_hstates();
@@ -3663,9 +3726,16 @@ static int __init hugepages_setup(char *s)
 	 * default_hugepagesz.
 	 */
 	else if (!hugetlb_max_hstate)
-		mhp = &default_hstate_max_huge_pages;
+		if (parsed_huge_pages_node == NUMA_NO_NODE)
+			mhp = &default_hstate_max_huge_pages;
+		else
+			mhp = (unsigned long *)&(default_hugepages_in_node[parsed_huge_pages_node]);
 	else
-		mhp = &parsed_hstate->max_huge_pages;
+		if (parsed_huge_pages_node == NUMA_NO_NODE)
+			mhp = &parsed_hstate->max_huge_pages;
+		else
+			mhp = (unsigned long *)
+				&(parsed_hstate->max_huge_pages_node[parsed_huge_pages_node]);
 
 	if (mhp == last_mhp) {
 		pr_warn("HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=%s\n", s);
@@ -3675,20 +3745,47 @@ static int __init hugepages_setup(char *s)
 	if (sscanf(s, "%lu", mhp) <= 0)
 		*mhp = 0;
 
+	if (parsed_huge_pages_node != NUMA_NO_NODE) {
+		if (!hugetlb_max_hstate)
+			default_hstate_max_huge_pages += *mhp;
+		else
+			parsed_hstate->max_huge_pages += *mhp;
+	}
 	/*
 	 * Global state is always initialized later in hugetlb_init.
 	 * But we need to allocate gigantic hstates here early to still
 	 * use the bootmem allocator.
 	 */
 	if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate))
-		hugetlb_hstate_alloc_pages(parsed_hstate);
+		hugetlb_hstate_alloc_pages(parsed_hstate, parsed_huge_pages_node);
 
+	parsed_huge_pages_node = NUMA_NO_NODE;
 	last_mhp = mhp;
 
 	return 1;
 }
 __setup("hugepages=", hugepages_setup);
 
+static int __init hugetlb_node_setup(char *s)
+{
+	int ret;
+
+	if (!parsed_valid_hugepagesz) {
+		pr_warn("hugepages_node=%s preceded by an unsupported hugepagesz, ignoring\n", s);
+		parsed_valid_hugepagesz = true;
+		return 1;
+	}
+
+	ret = kstrtoint(s, 0, &parsed_huge_pages_node);
+	if (ret < 0 || parsed_huge_pages_node < 0 || parsed_huge_pages_node >= MAX_NUMNODES) {
+		pr_warn("hugepages_node = %d is invalid\n", parsed_huge_pages_node);
+		parsed_huge_pages_node = NUMA_NO_NODE;
+	}
+
+	return 1;
+}
+__setup("hugepages_node=", hugetlb_node_setup);
+
 /*
  * hugepagesz command line processing
  * A specific huge page size can only be specified once with hugepagesz.
@@ -3776,7 +3873,7 @@ static int __init default_hugepagesz_setup(char *s)
 	if (default_hstate_max_huge_pages) {
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
 		if (hstate_is_gigantic(&default_hstate))
-			hugetlb_hstate_alloc_pages(&default_hstate);
+			hugetlb_hstate_alloc_pages(&default_hstate, NUMA_NO_NODE);
 		default_hstate_max_huge_pages = 0;
 	}
 
-- 
2.27.0



* Re: [PATCH] hugetlbfs: add hugepages_node kernel parameter
  2021-08-20  3:05 [PATCH] hugetlbfs: add hugepages_node kernel parameter yaozhenguo
@ 2021-08-22 22:19 ` Andrew Morton
  2021-08-22 22:27   ` Matthew Wilcox
  2021-08-23  2:04   ` zhenguo yao
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Morton @ 2021-08-22 22:19 UTC (permalink / raw)
  To: yaozhenguo
  Cc: mike.kravetz, corbet, yaozhenguo, linux-kernel, linux-doc, linux-mm

On Fri, 20 Aug 2021 11:05:36 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:

> We can specify the number of hugepages to allocate at boot. But the
> hugepages is balanced in all nodes at present. In some scenarios,
> we only need hugepags in one node. For example: DPDK needs hugepages
> which is in the same node as NIC. if DPDK needs four hugepags of 1G
> size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
> in kernel cmdline. But, only four hugepages is used. The others should
> be free after boot.If the system memory is low(for example: 64G), it will
> be an impossible task. So, add hugepages_node kernel parameter to specify
> node number of hugepages to allocate at boot.
> For example add following parameter:
> 
> hugepagesz=1G hugepages_node=1 hugepages=4
> 
> It will allocate 4 hugepags in node1 at boot.

If we're going to do this, shouldn't we permit more than one node?

	hugepages_nodes=1,2,5


* Re: [PATCH] hugetlbfs: add hugepages_node kernel parameter
  2021-08-22 22:19 ` Andrew Morton
@ 2021-08-22 22:27   ` Matthew Wilcox
  2021-08-23  1:57     ` zhenguo yao
  2021-08-23 16:52     ` Mike Kravetz
  2021-08-23  2:04   ` zhenguo yao
  1 sibling, 2 replies; 6+ messages in thread
From: Matthew Wilcox @ 2021-08-22 22:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: yaozhenguo, mike.kravetz, corbet, yaozhenguo, linux-kernel,
	linux-doc, linux-mm

On Sun, Aug 22, 2021 at 03:19:52PM -0700, Andrew Morton wrote:
> On Fri, 20 Aug 2021 11:05:36 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
> 
> > We can specify the number of hugepages to allocate at boot. But the
> > hugepages is balanced in all nodes at present. In some scenarios,
> > we only need hugepags in one node. For example: DPDK needs hugepages
> > which is in the same node as NIC. if DPDK needs four hugepags of 1G
> > size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
> > in kernel cmdline. But, only four hugepages is used. The others should
> > be free after boot.If the system memory is low(for example: 64G), it will
> > be an impossible task. So, add hugepages_node kernel parameter to specify
> > node number of hugepages to allocate at boot.
> > For example add following parameter:
> > 
> > hugepagesz=1G hugepages_node=1 hugepages=4
> > 
> > It will allocate 4 hugepags in node1 at boot.
> 
> If were going to do this, shouldn't we permit more than one node?
> 
> 	hugepages_nodes=1,2,5

I'd think we'd be better off expanding the definition of hugepages.
eg:

hugepagesz=1G hugepages=1:4,3:8,5:2

would say to allocate 4 pages from node 1, 8 pages from node 3 and 2
pages from node 5.
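
A minimal userspace sketch of how a `node:count` list in this format might be parsed. The function name `parse_node_counts`, the `MAX_NODES` limit, and the choice to reject the whole string on any error are assumptions made for illustration; none of them come from the kernel or from a posted patch.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 8	/* hypothetical; stands in for the kernel's MAX_NUMNODES */

/*
 * Parse "node:count[,node:count...]" into per_node[].
 * Returns 0 on success, -1 on malformed input or an out-of-range node.
 * per_node[] is written only when the whole string is valid.
 */
static int parse_node_counts(const char *s, unsigned long per_node[MAX_NODES])
{
	unsigned long tmp[MAX_NODES] = { 0 };
	char *end;

	for (;;) {
		unsigned long node, count;

		/* Require a digit so strtoul() cannot accept signs or spaces. */
		if (!isdigit((unsigned char)*s))
			return -1;
		errno = 0;
		node = strtoul(s, &end, 10);
		if (errno || *end != ':' || node >= MAX_NODES)
			return -1;

		s = end + 1;
		if (!isdigit((unsigned char)*s))
			return -1;
		errno = 0;
		count = strtoul(s, &end, 10);
		if (errno)
			return -1;
		tmp[node] = count;	/* a repeated node keeps the last count */

		if (*end == '\0')
			break;
		if (*end != ',')
			return -1;
		s = end + 1;
	}

	memcpy(per_node, tmp, sizeof(tmp));
	return 0;
}
```

Parsing into a temporary array first means a bad entry late in the string cannot leave a half-applied request behind.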


* Re: [PATCH] hugetlbfs: add hugepages_node kernel parameter
  2021-08-22 22:27   ` Matthew Wilcox
@ 2021-08-23  1:57     ` zhenguo yao
  2021-08-23 16:52     ` Mike Kravetz
  1 sibling, 0 replies; 6+ messages in thread
From: zhenguo yao @ 2021-08-23  1:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: mike.kravetz, corbet, akpm, linux-kernel, linux-doc, linux-mm

Yes, expanding hugepages is more elegant. I will change it in
the next version.

Matthew Wilcox <willy@infradead.org> wrote on Mon, Aug 23, 2021 at 6:28 AM:
>
> On Sun, Aug 22, 2021 at 03:19:52PM -0700, Andrew Morton wrote:
> > On Fri, 20 Aug 2021 11:05:36 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
> >
> > > We can specify the number of hugepages to allocate at boot. But the
> > > hugepages is balanced in all nodes at present. In some scenarios,
> > > we only need hugepags in one node. For example: DPDK needs hugepages
> > > which is in the same node as NIC. if DPDK needs four hugepags of 1G
> > > size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
> > > in kernel cmdline. But, only four hugepages is used. The others should
> > > be free after boot.If the system memory is low(for example: 64G), it will
> > > be an impossible task. So, add hugepages_node kernel parameter to specify
> > > node number of hugepages to allocate at boot.
> > > For example add following parameter:
> > >
> > > hugepagesz=1G hugepages_node=1 hugepages=4
> > >
> > > It will allocate 4 hugepags in node1 at boot.
> >
> > If were going to do this, shouldn't we permit more than one node?
> >
> >       hugepages_nodes=1,2,5
>
> I'd think we'd be better off expanding the definition of hugepages.
> eg:
>
> hugepagesz=1G hugepages=1:4,3:8,5:2
>
> would say to allocate 4 pages from node 1, 8 pages from node 3 and 2
> pages from node 5.


* Re: [PATCH] hugetlbfs: add hugepages_node kernel parameter
  2021-08-22 22:19 ` Andrew Morton
  2021-08-22 22:27   ` Matthew Wilcox
@ 2021-08-23  2:04   ` zhenguo yao
  1 sibling, 0 replies; 6+ messages in thread
From: zhenguo yao @ 2021-08-23  2:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mike.kravetz, corbet, linux-kernel, linux-doc, linux-mm,
	yaozhenguo, Matthew Wilcox

OK, it's better to add this functionality in a more concise way. I will
do that in the next version.

Andrew Morton <akpm@linux-foundation.org> wrote on Mon, Aug 23, 2021 at 6:19 AM:
>
> On Fri, 20 Aug 2021 11:05:36 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
>
> > We can specify the number of hugepages to allocate at boot. But the
> > hugepages is balanced in all nodes at present. In some scenarios,
> > we only need hugepags in one node. For example: DPDK needs hugepages
> > which is in the same node as NIC. if DPDK needs four hugepags of 1G
> > size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
> > in kernel cmdline. But, only four hugepages is used. The others should
> > be free after boot.If the system memory is low(for example: 64G), it will
> > be an impossible task. So, add hugepages_node kernel parameter to specify
> > node number of hugepages to allocate at boot.
> > For example add following parameter:
> >
> > hugepagesz=1G hugepages_node=1 hugepages=4
> >
> > It will allocate 4 hugepags in node1 at boot.
>
> If were going to do this, shouldn't we permit more than one node?
>
>         hugepages_nodes=1,2,5


* Re: [PATCH] hugetlbfs: add hugepages_node kernel parameter
  2021-08-22 22:27   ` Matthew Wilcox
  2021-08-23  1:57     ` zhenguo yao
@ 2021-08-23 16:52     ` Mike Kravetz
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2021-08-23 16:52 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: yaozhenguo, corbet, yaozhenguo, linux-kernel, linux-doc, linux-mm

On 8/22/21 3:27 PM, Matthew Wilcox wrote:
> On Sun, Aug 22, 2021 at 03:19:52PM -0700, Andrew Morton wrote:
>> On Fri, 20 Aug 2021 11:05:36 +0800 yaozhenguo <yaozhenguo1@gmail.com> wrote:
>>
>>> We can specify the number of hugepages to allocate at boot. But the
>>> hugepages is balanced in all nodes at present. In some scenarios,
>>> we only need hugepags in one node. For example: DPDK needs hugepages
>>> which is in the same node as NIC. if DPDK needs four hugepags of 1G
>>> size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
>>> in kernel cmdline. But, only four hugepages is used. The others should
>>> be free after boot.If the system memory is low(for example: 64G), it will
>>> be an impossible task. So, add hugepages_node kernel parameter to specify
>>> node number of hugepages to allocate at boot.
>>> For example add following parameter:
>>>
>>> hugepagesz=1G hugepages_node=1 hugepages=4
>>>
>>> It will allocate 4 hugepags in node1 at boot.
>>
>> If were going to do this, shouldn't we permit more than one node?
>>
>> 	hugepages_nodes=1,2,5
> 
> I'd think we'd be better off expanding the definition of hugepages.
> eg:
> 
> hugepagesz=1G hugepages=1:4,3:8,5:2
> 
> would say to allocate 4 pages from node 1, 8 pages from node 3 and 2
> pages from node 5.

Thanks Matthew and Andrew!

I was trying to wrap my head around the big issue before making any
suggestions.  It is true that the desired functionality of allocating
huge pages from a specific node is lacking today.

I like the idea of expanding the definition of hugepages so that nodes
can be specified.

One word of caution.  It is easy to make mistakes when taking data
directly from the user on the command line.  For example, in the follow-on
patch I do not believe node is checked against MAX_NUMNODES, so it may
write beyond the end of the array.  Also, we need to think about what
the behavior should be if part of the
'node1:count1,node2:count2,node3:count3' string is invalid?  Suppose
node2 is invalid.  Do we still allocate from node1 and node3, or do we
just discard the entire string?  During a recent rewrite of hugetlb
command line processing, I had a matrix of different command line
options, order, values and expected results.  We need to be as thorough
when adding this new option.
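
To make the two candidate behaviors concrete, here is a rough userspace sketch that applies already-parsed requests under either policy. The names `node_req`, `apply_requests`, `DISCARD_ALL`, `SKIP_INVALID`, and the `MAX_NODES` limit are all hypothetical and exist only for this illustration; which policy the kernel should adopt is exactly the open question above.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical; stands in for the kernel's MAX_NUMNODES */

struct node_req {
	int node;
	unsigned long count;
};

enum policy {
	DISCARD_ALL,	/* one bad entry invalidates the whole string */
	SKIP_INVALID,	/* bad entries are dropped, good ones are kept */
};

/* Bounds check that keeps per_node[] writes inside the array. */
static bool node_valid(int node)
{
	return node >= 0 && node < MAX_NODES;
}

/* Apply requests to per_node[]; returns how many entries were applied. */
static size_t apply_requests(const struct node_req *req, size_t n,
			     enum policy p, unsigned long per_node[MAX_NODES])
{
	size_t i, applied = 0;

	if (p == DISCARD_ALL) {
		/* Validate everything before touching per_node[]. */
		for (i = 0; i < n; i++)
			if (!node_valid(req[i].node))
				return 0;
	}

	for (i = 0; i < n; i++) {
		if (!node_valid(req[i].node))
			continue;
		per_node[req[i].node] = req[i].count;
		applied++;
	}
	return applied;
}
```

With requests for nodes 1, 99, and 5, `DISCARD_ALL` applies nothing, while `SKIP_INVALID` applies the entries for nodes 1 and 5 and ignores node 99.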

I'll take a closer look at the proposed patch in the next few days.
-- 
Mike Kravetz

