LKML Archive on lore.kernel.org
* [PATCH v2 0/4] hugetlb: add demote/split page functionality
@ 2021-09-23 17:53 Mike Kravetz
  2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-23 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton, Mike Kravetz

The concurrent use of multiple hugetlb page sizes on a single system
is becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools.  This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup.  In addition, preallocating
huge pages minimizes the issue of memory fragmentation that increases
the longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
is preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become
depleted while larger hugetlb pages still remain.  In such situations,
it is desirable to convert larger hugetlb pages to smaller hugetlb pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  For example, to convert 50 1GB pages to 2MB pages
on x86:
gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages

On an idle system this operation is fairly reliable: the number of 2MB
pages increases as expected and the operation takes a second or two.
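
The expected count of 25600 follows from the ratio of the two page
sizes.  A minimal sketch of that arithmetic, where the function name
smaller_page_count is only an illustration and sizes are given in kB as
in the hugepages-<size>kB directory names:

```shell
#!/bin/sh
# Number of smaller huge pages covering the same memory as `count`
# larger huge pages.  All sizes are in kB.
# NOTE: smaller_page_count is a hypothetical helper, not part of any tool.
smaller_page_count() (
    large_kb=$1 small_kb=$2 count=$3
    echo $(( count * large_kb / small_kb ))
)

# 50 pages of 1GB (1048576 kB) expressed as 2MB (2048 kB) pages
smaller_page_count 1048576 2048 50    # prints 25600
```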

However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.

In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:

Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real    0m42.092s
user    0m0.008s
sys     0m41.467s

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.

Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
- demote_size   Target page size for demotion, a smaller huge page size.
		File can be written to choose a smaller huge page size if
		multiple are available.
- demote        Writable number of hugetlb pages to be demoted

To demote 50 1GB huge pages, one would:
cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
echo 50 > .../hugepages-1048576kB/demote

Only hugetlb pages which are free at the time of the request can be demoted.
Demotion does not add to the complexity of surplus pages and honors reserved
huge pages.  Therefore, when a value is written to the sysfs demote file,
that value is only the maximum number of pages which will be demoted.  It
is possible fewer will actually be demoted.  The recently introduced
per-hstate mutex is used to synchronize demote operations with other
operations that modify hugetlb pools.
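
The "maximum number of pages" rule can be sketched as simple
arithmetic: a request is capped by the free pages that are not
reserved.  This is a simplified userspace model under that assumption,
not the kernel code, and max_demotable is a made-up name:

```shell
#!/bin/sh
# Upper bound on the pages a demote request can act on: free pages that
# are not reserved, capped by the requested count, and never negative.
# NOTE: max_demotable is a hypothetical helper used for illustration.
max_demotable() (
    requested=$1 free=$2 resv=$3
    avail=$(( free - resv ))
    if [ "$avail" -lt 0 ]; then
        avail=0
    fi
    if [ "$requested" -lt "$avail" ]; then
        echo "$requested"
    else
        echo "$avail"
    fi
)

# Request 50 with 60 free pages of which 20 are reserved
max_demotable 50 60 20    # prints 40
```

Even this bound is only an estimate, since pool counts may change while
a demote request is in progress.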

Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages are
used to back VMs on x86.  Both issues of long allocation times and not
necessarily getting the expected number of smaller huge pages after a free
and allocate cycle have been experienced.  The occurrence of these issues
is dependent on other activity within the host and can not be predicted.

RESEND -> v2
  - Removed optimizations for vmemmap optimized pages
  - Make demote_size writable
  - Removed demote interfaces for smallest huge page size
  - Updated documentation and commit messages
  - Fixed build break for !CONFIG_ARCH_HAS_GIGANTIC_PAGE

v1 -> RESEND
  - Rebase on next-20210816
  - Fix a few typos in commit messages

RFC -> v1
  - Provides basic support for vmemmap optimized pages
  - Takes speculative page references into account
  - Updated Documentation file
  - Added optimizations for vmemmap optimized pages

Mike Kravetz (4):
  hugetlb: add demote hugetlb page sysfs interfaces
  hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  hugetlb: add demote bool to gigantic page routines
  hugetlb: add hugetlb demote page support

 Documentation/admin-guide/mm/hugetlbpage.rst |  30 +-
 include/linux/hugetlb.h                      |   8 +
 mm/hugetlb.c                                 | 332 +++++++++++++++++--
 3 files changed, 344 insertions(+), 26 deletions(-)

-- 
2.31.1



* [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-23 17:53 [PATCH v2 0/4] hugetlb: add demote/split page functionality Mike Kravetz
@ 2021-09-23 17:53 ` Mike Kravetz
  2021-09-23 21:24   ` Andrew Morton
                     ` (2 more replies)
  2021-09-23 17:53 ` [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-23 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton, Mike Kravetz

Two new sysfs files are added to demote hugetlb pages.  These files are
both per-hugetlb page size and per node.  Files are:
  demote_size - The size in kB that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)

By default, demote_size is the next smaller huge page size.  Any valid
huge page size smaller than the current huge page size may be written
to this file.  When huge pages are demoted, they are demoted to this
size.
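
The default selection can be modeled as "the largest supported size
that is strictly smaller than the current size".  A small sketch of
that rule, with sizes in kB; default_demote_size is a hypothetical name
and the size lists below are example inputs only:

```shell
#!/bin/sh
# Pick the default demote size for `current_kb` from a list of supported
# huge page sizes (all in kB).  Prints nothing if no smaller size exists.
# NOTE: default_demote_size is a hypothetical helper for illustration.
default_demote_size() (
    current_kb=$1
    shift
    best=0
    for size_kb in "$@"; do
        if [ "$size_kb" -lt "$current_kb" ] && [ "$size_kb" -gt "$best" ]; then
            best=$size_kb
        fi
    done
    if [ "$best" -gt 0 ]; then
        echo "$best"
    fi
)

# With 2MB and 1GB sizes available, 1GB pages default to 2MB
default_demote_size 1048576 2048 1048576    # prints 2048
```

With an example list of 64kB, 2MB, 32MB and 1GB sizes, the same call
for 1GB pages would pick 32MB (32768).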

Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
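
Since fewer pages than requested may be demoted, a caller can compare
nr_hugepages before and after the write.  A rough sketch, assuming the
caller passes the hugepages-<size>kB directory to operate on;
demote_and_count is a hypothetical helper name:

```shell
#!/bin/sh
# Write `count` to the demote file in `dir` and print how many pages
# were actually demoted, measured as the drop in nr_hugepages.
# NOTE: demote_and_count is a hypothetical helper; `dir` is whatever
# hugepages-<size>kB directory the caller chooses.
demote_and_count() (
    dir=$1 count=$2
    before=$(cat "$dir/nr_hugepages")
    echo "$count" > "$dir/demote"
    after=$(cat "$dir/nr_hugepages")
    echo $(( before - after ))
)
```

The result is only meaningful if nothing else resizes the pool between
the two reads.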

NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.

It also provides documentation for the new interfaces.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 Documentation/admin-guide/mm/hugetlbpage.rst |  30 +++-
 include/linux/hugetlb.h                      |   1 +
 mm/hugetlb.c                                 | 155 ++++++++++++++++++-
 3 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..0e123a347e1e 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@ will exist, of the form::
 
 	hugepages-${size}kB
 
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist.  In addition, two additional interfaces for demoting huge
+pages may exist::
 
+        demote
+        demote_size
 	nr_hugepages
 	nr_hugepages_mempolicy
 	nr_overcommit_hugepages
@@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
 	resv_hugepages
 	surplus_hugepages
 
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages.  For example, the x86 architecture supports both
+1GB and 2MB huge page sizes.  A 1GB huge page can be split into 512
+2MB huge pages.  Demote interfaces are not available for the smallest
+huge page size.  The demote interfaces are:
+
+demote_size
+        is the size of demoted pages.  When a page is demoted a corresponding
+        number of huge pages of demote_size will be created.  By default,
+        demote_size is set to the next smaller huge page size.  If there are
+        multiple smaller huge page sizes, demote_size can be set to any of
+        these smaller sizes.  Only huge page sizes less than the current huge
+        page size are allowed.
+
+demote
+        is used to demote a number of huge pages.  A user with root privileges
+        can write to this file.  It may not be possible to demote the
+        requested number of huge pages.  To determine how many pages were
+        actually demoted, compare the value of nr_hugepages before and after
+        writing to the demote interface.  demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
 
 .. _mem_policy_and_hp_alloc:
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..f2c3979efd69 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -596,6 +596,7 @@ struct hstate {
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
+	unsigned int demote_order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6378c1066459..c76ee0bd6374 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h;
+	struct hstate *h, *h2;
 
 	for_each_hstate(h) {
 		if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,17 @@ static void __init hugetlb_init_hstates(void)
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
+
+		/*
+		 * Set demote order for each hstate.  Note that
+		 * h->demote_order is initially 0.
+		 */
+		for_each_hstate(h2) {
+			if (h2 == h)
+				continue;
+			if (h2->order < h->order && h2->order > h->demote_order)
+				h->demote_order = h2->order;
+		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
 }
@@ -3235,9 +3246,29 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	return 0;
 }
 
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+	__must_hold(&hugetlb_lock)
+{
+	int rc = 0;
+
+	lockdep_assert_held(&hugetlb_lock);
+
+	/* We should never get here if no demote order */
+	if (!h->demote_order)
+		return rc;
+
+	/*
+	 * TODO - demote fucntionality will be added in subsequent patch
+	 */
+	return rc;
+}
+
 #define HSTATE_ATTR_RO(_name) \
 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 
+#define HSTATE_ATTR_WO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
 #define HSTATE_ATTR(_name) \
 	static struct kobj_attribute _name##_attr = \
 		__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,6 +3464,112 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
+static ssize_t demote_store(struct kobject *kobj,
+	       struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long nr_demote;
+	unsigned long nr_available;
+	nodemask_t nodes_allowed, *n_mask;
+	struct hstate *h;
+	int err;
+	int nid;
+
+	err = kstrtoul(buf, 10, &nr_demote);
+	if (err)
+		return err;
+	h = kobj_to_hstate(kobj, &nid);
+
+	/* Synchronize with other sysfs operations modifying huge pages */
+	mutex_lock(&h->resize_lock);
+
+	spin_lock_irq(&hugetlb_lock);
+	if (nid != NUMA_NO_NODE) {
+		nr_available = h->free_huge_pages_node[nid];
+		init_nodemask_of_node(&nodes_allowed, nid);
+		n_mask = &nodes_allowed;
+	} else {
+		nr_available = h->free_huge_pages;
+		n_mask = &node_states[N_MEMORY];
+	}
+	nr_available -= h->resv_huge_pages;
+	if (nr_available <= 0)
+		goto out;
+	nr_demote = min(nr_available, nr_demote);
+
+	while (nr_demote) {
+		if (!demote_pool_huge_page(h, n_mask))
+			break;
+
+		/*
+		 * We may have dropped the lock in the routines to
+		 * demote/free a page.  Recompute nr_demote as counts could
+		 * have changed and we want to make sure we do not demote
+		 * a reserved huge page.
+		 */
+		nr_demote--;
+		if (nid != NUMA_NO_NODE)
+			nr_available = h->free_huge_pages_node[nid];
+		else
+			nr_available = h->free_huge_pages;
+		nr_available -= h->resv_huge_pages;
+		if (nr_available <= 0)
+			nr_demote = 0;
+		else
+			nr_demote = min(nr_available, nr_demote);
+	}
+
+out:
+	spin_unlock_irq(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
+
+	return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h;
+	unsigned long demote_size;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	demote_size = h->demote_order;
+
+	return sysfs_emit(buf, "%lukB\n",
+			(unsigned long)(PAGE_SIZE << h->demote_order) / SZ_1K);
+}
+
+static ssize_t demote_size_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct hstate *h, *t_hstate;
+	unsigned long demote_size;
+	unsigned int demote_order;
+	int nid;
+
+	demote_size = (unsigned long)memparse(buf, NULL);
+
+	t_hstate = size_to_hstate(demote_size);
+	if (!t_hstate)
+		return -EINVAL;
+	demote_order = t_hstate->order;
+
+	/* demote order must be smaller hstate order */
+	h = kobj_to_hstate(kobj, &nid);
+	if (demote_order >= h->order)
+		return -EINVAL;
+
+	/* resize_lock synchronizes access to demote size and writes */
+	mutex_lock(&h->resize_lock);
+	h->demote_order = demote_order;
+	mutex_unlock(&h->resize_lock);
+
+	return count;
+}
+HSTATE_ATTR(demote_size);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
@@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
 	.attrs = hstate_attrs,
 };
 
+static struct attribute *hstate_demote_attrs[] = {
+	&demote_size_attr.attr,
+	&demote_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group hstate_demote_attr_group = {
+	.attrs = hstate_demote_attrs,
+};
+
 static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 				    struct kobject **hstate_kobjs,
 				    const struct attribute_group *hstate_attr_group)
@@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 		hstate_kobjs[hi] = NULL;
 	}
 
+	if (h->demote_order) {
+		if (sysfs_create_group(hstate_kobjs[hi],
+					&hstate_demote_attr_group))
+			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+	}
+
 	return retval;
 }
 
-- 
2.31.1



* [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  2021-09-23 17:53 [PATCH v2 0/4] hugetlb: add demote/split page functionality Mike Kravetz
  2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
@ 2021-09-23 17:53 ` Mike Kravetz
  2021-09-24  9:36   ` David Hildenbrand
  2021-09-23 17:53 ` [PATCH v2 3/4] hugetlb: add demote bool to gigantic page routines Mike Kravetz
  2021-09-23 17:53 ` [PATCH v2 4/4] hugetlb: add hugetlb demote page support Mike Kravetz
  3 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2021-09-23 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton, Mike Kravetz

When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size.  For example, on x86 a 1G page
can be demoted to 512 2M pages.  However, gigantic pages can potentially
be allocated from CMA.  If a gigantic page which was allocated from CMA
is demoted, the corresponding demoted pages need to be returned to CMA.

In order to track hugetlb pages that need to be returned to CMA, add the
hugetlb specific flag HPageCma.  The flag is set when a huge page is
allocated from CMA and is transferred to any demoted pages.  Non-gigantic
huge page freeing code checks for the flag and takes appropriate action.

This also requires a change to CMA reservations for gigantic pages.
Currently, the 'order_per_bit' is set to the gigantic page size.
However, if gigantic pages can be demoted this needs to be set to the
order of the smallest huge page.  At CMA reservation time we do not know
the smallest huge page size, so use HUGETLB_PAGE_ORDER.
Also, prohibit demotion to huge page sizes smaller than
HUGETLB_PAGE_ORDER.
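
The effect of changing 'order_per_bit' can be illustrated by counting
the bits needed to track a CMA area.  This sketch assumes x86-64 with
4kB base pages, where a 1GB page is order 18 and HUGETLB_PAGE_ORDER is
9, and assumes one bitmap bit covers 2^order_per_bit base pages.  The
name cma_bitmap_bits is made up for the example:

```shell
#!/bin/sh
# Bits needed to track an area of `area_kb` when each bit covers
# 2^order_per_bit base pages of `base_page_kb` each.
# NOTE: cma_bitmap_bits is a hypothetical helper for illustration.
cma_bitmap_bits() (
    area_kb=$1 base_page_kb=$2 order_per_bit=$3
    echo $(( (area_kb / base_page_kb) >> order_per_bit ))
)

# A 1GB area tracked at gigantic page granularity (order 18)
cma_bitmap_bits 1048576 4 18    # prints 1
# The same area tracked at 2MB granularity (order 9)
cma_bitmap_bits 1048576 4 9     # prints 512
```

With the finer granularity, each 2MB page demoted from a 1GB CMA
allocation maps to its own bit and can be released individually.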

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  7 +++++
 mm/hugetlb.c            | 64 +++++++++++++++++++++++++++++------------
 2 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f2c3979efd69..08668b9f5f71 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -533,6 +533,11 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * HPG_freed - Set when page is on the free lists.
  *	Synchronization: hugetlb_lock held for examination and modification.
  * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
+ * HPG_cma - Set if huge page was directly allocated from CMA area via
+ *	cma_alloc.  Initially set for gigantic page cma allocations, but can
+ *	be set in non-gigantic pages if gigantic pages are demoted.
+ *	Synchronization: Only accessed or modified when there is only one
+ *	reference to the page at allocation, free or demote time.
  */
 enum hugetlb_page_flags {
 	HPG_restore_reserve = 0,
@@ -540,6 +545,7 @@ enum hugetlb_page_flags {
 	HPG_temporary,
 	HPG_freed,
 	HPG_vmemmap_optimized,
+	HPG_cma,
 	__NR_HPAGEFLAGS,
 };
 
@@ -586,6 +592,7 @@ HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
 HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
+HPAGEFLAG(Cma, cma)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c76ee0bd6374..c3f7da8f0c68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1272,6 +1272,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
@@ -1283,16 +1284,12 @@ static void destroy_compound_gigantic_page(struct page *page,
 
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
-	/*
-	 * If the page isn't allocated using the cma allocator,
-	 * cma_release() returns false.
-	 */
 #ifdef CONFIG_CMA
-	if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
-		return;
+	if (HPageCma(page))
+		cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order);
+	else
 #endif
-
-	free_contig_range(page_to_pfn(page), 1 << order);
+		free_contig_range(page_to_pfn(page), 1 << order);
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
@@ -1311,8 +1308,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		if (hugetlb_cma[nid]) {
 			page = cma_alloc(hugetlb_cma[nid], nr_pages,
 					huge_page_order(h), true);
-			if (page)
+			if (page) {
+				SetHPageCma(page);
 				return page;
+			}
 		}
 
 		if (!(gfp_mask & __GFP_THISNODE)) {
@@ -1322,8 +1321,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 
 				page = cma_alloc(hugetlb_cma[node], nr_pages,
 						huge_page_order(h), true);
-				if (page)
+				if (page) {
+					SetHPageCma(page);
 					return page;
+				}
 			}
 		}
 	}
@@ -1480,6 +1481,20 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
 	} else {
+#ifdef CONFIG_CMA
+		/*
+		 * Could be a page that was demoted from a gigantic page
+		 * which was allocated in a CMA area.
+		 */
+		if (HPageCma(page)) {
+			destroy_compound_gigantic_page(page,
+					huge_page_order(h));
+			if (!cma_release(hugetlb_cma[page_to_nid(page)], page,
+					1 << huge_page_order(h)))
+				VM_BUG_ON_PAGE(1, page);
+			return;
+		}
+#endif
 		__free_pages(page, huge_page_order(h));
 	}
 }
@@ -2997,14 +3012,19 @@ static void __init hugetlb_init_hstates(void)
 			hugetlb_hstate_alloc_pages(h);
 
 		/*
-		 * Set demote order for each hstate.  Note that
-		 * h->demote_order is initially 0.
+		 * Set demote order for each hstate.  hstates are not ordered,
+		 * so this is brute force.  Note that h->demote_order is
+		 * initially 0.  If cma is used for gigantic pages, the smallest
+		 * demote size is HUGETLB_PAGE_ORDER.
 		 */
-		for_each_hstate(h2) {
-			if (h2 == h)
-				continue;
-			if (h2->order < h->order && h2->order > h->demote_order)
-				h->demote_order = h2->order;
+		if (!hugetlb_cma_size || !(h->order <= HUGETLB_PAGE_ORDER)) {
+			for_each_hstate(h2) {
+				if (h2 == h)
+					continue;
+				if (h2->order < h->order &&
+				    h2->order > h->demote_order)
+					h->demote_order = h2->order;
+			}
 		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
@@ -3555,6 +3575,8 @@ static ssize_t demote_size_store(struct kobject *kobj,
 	if (!t_hstate)
 		return -EINVAL;
 	demote_order = t_hstate->order;
+	if (demote_order < HUGETLB_PAGE_ORDER)
+		return -EINVAL;
 
 	/* demote order must be smaller hstate order */
 	h = kobj_to_hstate(kobj, &nid);
@@ -6563,7 +6585,13 @@ void __init hugetlb_cma_reserve(int order)
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+		/*
+		 * Note that 'order per bit' is based on smallest size that
+		 * may be returned to CMA allocator in the case of
+		 * huge page demotion.
+		 */
+		res = cma_declare_contiguous_nid(0, size, 0,
+						PAGE_SIZE << HUGETLB_PAGE_ORDER,
 						 0, false, name,
 						 &hugetlb_cma[nid], nid);
 		if (res) {
-- 
2.31.1



* [PATCH v2 3/4] hugetlb: add demote bool to gigantic page routines
  2021-09-23 17:53 [PATCH v2 0/4] hugetlb: add demote/split page functionality Mike Kravetz
  2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
  2021-09-23 17:53 ` [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
@ 2021-09-23 17:53 ` Mike Kravetz
  2021-09-23 17:53 ` [PATCH v2 4/4] hugetlb: add hugetlb demote page support Mike Kravetz
  3 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-23 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton, Mike Kravetz

The routines remove_hugetlb_page and destroy_compound_gigantic_page
will remove a gigantic page and make the set of base pages ready to be
returned to a lower level allocator.  In the process of doing this, they
make all base pages reference counted.

The routine prep_compound_gigantic_page creates a gigantic page from a
set of base pages.  It assumes that all these base pages are reference
counted.

During demotion, a gigantic page will be split into huge pages of a
smaller size.  This logically involves use of the routines
remove_hugetlb_page and destroy_compound_gigantic_page, followed by
prep_compound*_page for each smaller huge page.

When pages are reference counted (ref count > 0), additional
speculative ref counts could be taken.  This could result in errors
while demoting a huge page.  Quite a bit of code would need to be
created to handle all possible issues.

Instead of dealing with the possibility of speculative ref counts, avoid
the possibility by keeping ref counts at zero during the demote process.
Add a boolean 'demote' to the routines remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page.  If the
boolean is set, the remove and destroy routines will not reference count
pages and the prep routine will not expect reference counted pages.

'*_for_demote' wrappers of the routines will be added in a subsequent
patch where this functionality is used.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3f7da8f0c68..2317d411243d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1261,8 +1261,8 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		nr_nodes--)
 
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
+static void __destroy_compound_gigantic_page(struct page *page,
+					unsigned int order, bool demote)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1274,7 +1274,8 @@ static void destroy_compound_gigantic_page(struct page *page,
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		p->mapping = NULL;
 		clear_compound_head(p);
-		set_page_refcounted(p);
+		if (!demote)
+			set_page_refcounted(p);
 	}
 
 	set_compound_order(page, 0);
@@ -1282,6 +1283,12 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
+static void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, false);
+}
+
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 #ifdef CONFIG_CMA
@@ -1354,12 +1361,15 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 
 /*
  * Remove hugetlb page from lists, and update dtor so that page appears
- * as just a compound page.  A reference is held on the page.
+ * as just a compound page.
+ *
+ * A reference is held on the page, except in the case of demote.
  *
  * Must be called with hugetlb lock held.
  */
-static void remove_hugetlb_page(struct hstate *h, struct page *page,
-							bool adjust_surplus)
+static void __remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus,
+							bool demote)
 {
 	int nid = page_to_nid(page);
 
@@ -1397,8 +1407,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	 *
 	 * This handles the case where more than one ref is held when and
 	 * after update_and_free_page is called.
+	 *
+	 * In the case of demote we do not ref count the page as it will soon
+	 * be turned into a page of smaller size.
 	 */
-	set_page_refcounted(page);
+	if (!demote)
+		set_page_refcounted(page);
 	if (hstate_is_gigantic(h))
 		set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	else
@@ -1408,6 +1422,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	h->nr_huge_pages_node[nid]--;
 }
 
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, false);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1679,7 +1699,8 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
+								bool demote)
 {
 	int i, j;
 	int nr_pages = 1 << order;
@@ -1717,10 +1738,16 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 		 * the set of pages can not be converted to a gigantic page.
 		 * The caller who allocated the pages should then discard the
 		 * pages using the appropriate free interface.
+		 *
+		 * In the case of demote, the ref count will be zero.
 		 */
-		if (!page_ref_freeze(p, 1)) {
-			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
-			goto out_error;
+		if (!demote) {
+			if (!page_ref_freeze(p, 1)) {
+				pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+				goto out_error;
+			}
+		} else {
+			VM_BUG_ON_PAGE(page_count(p), p);
 		}
 		set_page_count(p, 0);
 		set_compound_head(p, page);
@@ -1745,6 +1772,11 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 	return false;
 }
 
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, false);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
-- 
2.31.1



* [PATCH v2 4/4] hugetlb: add hugetlb demote page support
  2021-09-23 17:53 [PATCH v2 0/4] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-09-23 17:53 ` [PATCH v2 3/4] hugetlb: add demote bool to gigantic page routines Mike Kravetz
@ 2021-09-23 17:53 ` Mike Kravetz
  2021-09-24  9:44   ` David Hildenbrand
  3 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2021-09-23 17:53 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton, Mike Kravetz

Demote page functionality will split a huge page into a number of huge
pages of a smaller size.  For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages.  Demotion is done 'in place' by simply
splitting the huge page.
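
The 'in place' split can be pictured as walking the base pages of the
original huge page in steps of the target page size.  A small model of
that walk, using page orders (log2 of the number of base pages);
split_offsets is a hypothetical name and this is not the kernel code:

```shell
#!/bin/sh
# Print the base-page offset at which each smaller huge page starts
# when one page of order `order` is split into pages of `target_order`.
# NOTE: split_offsets is a hypothetical helper for illustration.
split_offsets() (
    order=$1 target_order=$2
    total=$(( 1 << order ))
    step=$(( 1 << target_order ))
    i=0
    while [ "$i" -lt "$total" ]; do
        echo "$i"
        i=$(( i + step ))
    done
)

# Tiny example: an order 2 page (4 base pages) split into order 1 pages
split_offsets 2 1    # prints 0 then 2
```

Assuming 4kB base pages on x86, a 1GB page is order 18 and a 2MB page
is order 9, so "split_offsets 18 9" yields 512 starting offsets.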

Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page for use
by demote code.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2317d411243d..ab7bd0434057 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1260,7 +1260,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+/* used to demote non-gigantic huge pages as well */
 static void __destroy_compound_gigantic_page(struct page *page,
 					unsigned int order, bool demote)
 {
@@ -1283,6 +1283,13 @@ static void __destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
+static void destroy_compound_gigantic_page_for_demote(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, true);
+}
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void destroy_compound_gigantic_page(struct page *page,
 					unsigned int order)
 {
@@ -1428,6 +1435,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	__remove_hugetlb_page(h, page, adjust_surplus, false);
 }
 
+static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, true);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1777,6 +1790,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 	return __prep_compound_gigantic_page(page, order, false);
 }
 
+static bool prep_compound_gigantic_page_for_demote(struct page *page,
+							unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, true);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
@@ -3298,9 +3317,55 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	return 0;
 }
 
+static int demote_free_huge_page(struct hstate *h, struct page *page)
+{
+	int i, nid = page_to_nid(page);
+	struct hstate *target_hstate;
+	bool cma_page = HPageCma(page);
+
+	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+	remove_hugetlb_page_for_demote(h, page, false);
+	spin_unlock_irq(&hugetlb_lock);
+
+	if (alloc_huge_page_vmemmap(h, page)) {
+		/* Allocation of vmemmmap failed, we can not demote page */
+		spin_lock_irq(&hugetlb_lock);
+		set_page_refcounted(page);
+		add_hugetlb_page(h, page, false);
+		return 1;
+	}
+
+	/*
+	 * Use destroy_compound_gigantic_page_for_demote for all huge page
+	 * sizes as it will not ref count pages.
+	 */
+	destroy_compound_gigantic_page_for_demote(page, huge_page_order(h));
+
+	for (i = 0; i < pages_per_huge_page(h);
+				i += pages_per_huge_page(target_hstate)) {
+		if (hstate_is_gigantic(target_hstate))
+			prep_compound_gigantic_page_for_demote(page + i,
+							target_hstate->order);
+		else
+			prep_compound_page(page + i, target_hstate->order);
+		set_page_private(page + i, 0);
+		set_page_refcounted(page + i);
+		prep_new_huge_page(target_hstate, page + i, nid);
+		if (cma_page)
+			SetHPageCma(page + i);
+		put_page(page + i);
+	}
+
+	spin_lock_irq(&hugetlb_lock);
+	return 0;
+}
+
 static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 	__must_hold(&hugetlb_lock)
 {
+	int nr_nodes, node;
+	struct page *page;
 	int rc = 0;
 
 	lockdep_assert_held(&hugetlb_lock);
@@ -3309,9 +3374,15 @@ static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 	if (!h->demote_order)
 		return rc;
 
-	/*
-	 * TODO - demote fucntionality will be added in subsequent patch
-	 */
+	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+		if (!list_empty(&h->hugepage_freelists[node])) {
+			page = list_entry(h->hugepage_freelists[node].next,
+					struct page, lru);
+			rc = !demote_free_huge_page(h, page);
+			break;
+		}
+	}
+
 	return rc;
 }
 
-- 
2.31.1



* Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
@ 2021-09-23 21:24   ` Andrew Morton
  2021-09-23 22:08     ` Mike Kravetz
  2021-09-24  7:08   ` Aneesh Kumar K.V
  2021-09-24  9:28   ` David Hildenbrand
  2 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2021-09-23 21:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Aneesh Kumar K . V

On Thu, 23 Sep 2021 10:53:44 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> Two new sysfs files are added to demote hugetlb pages.  These files are
> both per-hugetlb page size and per node.  Files are:
>   demote_size - The size in Kb that pages are demoted to. (read-write)
>   demote - The number of huge pages to demote. (write-only)
> 
> By default, demote_size is the next smallest huge page size.  Valid huge
> page sizes less than huge page size may be written to this file.  When
> huge pages are demoted, they are demoted to this size.
> 
> Writing a value to demote will result in an attempt to demote that
> number of hugetlb pages to an appropriate number of demote_size pages.
> 
> NOTE: Demote interfaces are only provided for huge page sizes if there
> is a smaller target demote huge page size.  For example, on x86 1GB huge
> pages will have demote interfaces.  2MB huge pages will not have demote
> interfaces.
> 
> This patch does not provide full demote functionality.  It only provides
> the sysfs interfaces.
> 
> It also provides documentation for the new interfaces.
> 
> ...
>
> +static ssize_t demote_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	unsigned long nr_demote;
> +	unsigned long nr_available;
> +	nodemask_t nodes_allowed, *n_mask;
> +	struct hstate *h;
> +	int err;
> +	int nid;
> +
> +	err = kstrtoul(buf, 10, &nr_demote);
> +	if (err)
> +		return err;
> +	h = kobj_to_hstate(kobj, &nid);
> +
> +	/* Synchronize with other sysfs operations modifying huge pages */
> +	mutex_lock(&h->resize_lock);
> +
> +	spin_lock_irq(&hugetlb_lock);
> +	if (nid != NUMA_NO_NODE) {
> +		nr_available = h->free_huge_pages_node[nid];
> +		init_nodemask_of_node(&nodes_allowed, nid);
> +		n_mask = &nodes_allowed;
> +	} else {
> +		nr_available = h->free_huge_pages;
> +		n_mask = &node_states[N_MEMORY];
> +	}
> +	nr_available -= h->resv_huge_pages;
> +	if (nr_available <= 0)
> +		goto out;
> +	nr_demote = min(nr_available, nr_demote);
> +
> +	while (nr_demote) {
> +		if (!demote_pool_huge_page(h, n_mask))
> +			break;
> +
> +		/*
> +		 * We may have dropped the lock in the routines to
> +		 * demote/free a page.  Recompute nr_demote as counts could
> +		 * have changed and we want to make sure we do not demote
> +		 * a reserved huge page.
> +		 */

This comment doesn't become true until patch #4, and is a bit confusing
in patch #1.  Also, saying "the lock" is far less helpful than saying
"hugetlb_lock"!


> +		nr_demote--;
> +		if (nid != NUMA_NO_NODE)
> +			nr_available = h->free_huge_pages_node[nid];
> +		else
> +			nr_available = h->free_huge_pages;
> +		nr_available -= h->resv_huge_pages;
> +		if (nr_available <= 0)
> +			nr_demote = 0;
> +		else
> +			nr_demote = min(nr_available, nr_demote);
> +	}
> +
> +out:
> +	spin_unlock_irq(&hugetlb_lock);

How long can we spend with IRQs disabled here (after patch #4!)?

> +	mutex_unlock(&h->resize_lock);
> +
> +	return len;
> +}
> +HSTATE_ATTR_WO(demote);
> +



* Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-23 21:24   ` Andrew Morton
@ 2021-09-23 22:08     ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-23 22:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Aneesh Kumar K . V

On 9/23/21 2:24 PM, Andrew Morton wrote:
> On Thu, 23 Sep 2021 10:53:44 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> Two new sysfs files are added to demote hugetlb pages.  These files are
>> both per-hugetlb page size and per node.  Files are:
>>   demote_size - The size in Kb that pages are demoted to. (read-write)
>>   demote - The number of huge pages to demote. (write-only)
>>
>> By default, demote_size is the next smallest huge page size.  Valid huge
>> page sizes less than huge page size may be written to this file.  When
>> huge pages are demoted, they are demoted to this size.
>>
>> Writing a value to demote will result in an attempt to demote that
>> number of hugetlb pages to an appropriate number of demote_size pages.
>>
>> NOTE: Demote interfaces are only provided for huge page sizes if there
>> is a smaller target demote huge page size.  For example, on x86 1GB huge
>> pages will have demote interfaces.  2MB huge pages will not have demote
>> interfaces.
>>
>> This patch does not provide full demote functionality.  It only provides
>> the sysfs interfaces.
>>
>> It also provides documentation for the new interfaces.
>>
>> ...
>>
>> +static ssize_t demote_store(struct kobject *kobj,
>> +	       struct kobj_attribute *attr, const char *buf, size_t len)
>> +{
>> +	unsigned long nr_demote;
>> +	unsigned long nr_available;
>> +	nodemask_t nodes_allowed, *n_mask;
>> +	struct hstate *h;
>> +	int err;
>> +	int nid;
>> +
>> +	err = kstrtoul(buf, 10, &nr_demote);
>> +	if (err)
>> +		return err;
>> +	h = kobj_to_hstate(kobj, &nid);
>> +
>> +	/* Synchronize with other sysfs operations modifying huge pages */
>> +	mutex_lock(&h->resize_lock);
>> +
>> +	spin_lock_irq(&hugetlb_lock);
>> +	if (nid != NUMA_NO_NODE) {
>> +		nr_available = h->free_huge_pages_node[nid];
>> +		init_nodemask_of_node(&nodes_allowed, nid);
>> +		n_mask = &nodes_allowed;
>> +	} else {
>> +		nr_available = h->free_huge_pages;
>> +		n_mask = &node_states[N_MEMORY];
>> +	}
>> +	nr_available -= h->resv_huge_pages;
>> +	if (nr_available <= 0)
>> +		goto out;
>> +	nr_demote = min(nr_available, nr_demote);
>> +
>> +	while (nr_demote) {
>> +		if (!demote_pool_huge_page(h, n_mask))
>> +			break;
>> +
>> +		/*
>> +		 * We may have dropped the lock in the routines to
>> +		 * demote/free a page.  Recompute nr_demote as counts could
>> +		 * have changed and we want to make sure we do not demote
>> +		 * a reserved huge page.
>> +		 */
> 
> This comment doesn't become true until patch #4, and is a bit confusing
> in patch #1.  Also, saying "the lock" is far less helpful than saying
> "hugetlb_lock"!

Right.  That is the result of slicing and dicing working code to create
individual patches.  Sorry.  I will correct.

The comment is also not 100% accurate.  demote_pool_huge_page will
always drop hugetlb_lock except in the quick error case which is not
really interesting.  This helps answer your next question.

> 
> 
>> +		nr_demote--;
>> +		if (nid != NUMA_NO_NODE)
>> +			nr_available = h->free_huge_pages_node[nid];
>> +		else
>> +			nr_available = h->free_huge_pages;
>> +		nr_available -= h->resv_huge_pages;
>> +		if (nr_available <= 0)
>> +			nr_demote = 0;
>> +		else
>> +			nr_demote = min(nr_available, nr_demote);
>> +	}
>> +
>> +out:
>> +	spin_unlock_irq(&hugetlb_lock);
> 
> How long can we spend with IRQs disabled here (after patch #4!)?

Not very long.  We will drop the lock on page demote.  This is because
we need to potentially allocate vmemmap pages.  We will actually go
through quite a few acquire/drop lock cycles for each demoted page.
Something like:
	dequeue page to be demoted
	drop lock
	potentially allocate vmemmap pages
	for each page of demoted size
		prep page
		acquire lock
		enqueue page to new pool
		drop lock
	reacquire lock

This is 'no worse' than the lock cycling that happens with existing pool
adjustment mechanisms such as "echo > nr_hugepages".

The updated comment will point out that there is little need to worry
about lock hold/irq disable time.
-- 
Mike Kravetz

>> +	mutex_unlock(&h->resize_lock);
>> +
>> +	return len;
>> +}
>> +HSTATE_ATTR_WO(demote);
>> +
> 


* Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
  2021-09-23 21:24   ` Andrew Morton
@ 2021-09-24  7:08   ` Aneesh Kumar K.V
  2021-09-29 18:22     ` Mike Kravetz
  2021-09-24  9:28   ` David Hildenbrand
  2 siblings, 1 reply; 17+ messages in thread
From: Aneesh Kumar K.V @ 2021-09-24  7:08 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

Mike Kravetz <mike.kravetz@oracle.com> writes:

> Two new sysfs files are added to demote hugetlb pages.  These files are
> both per-hugetlb page size and per node.  Files are:
>   demote_size - The size in Kb that pages are demoted to. (read-write)
>   demote - The number of huge pages to demote. (write-only)
>
> By default, demote_size is the next smallest huge page size.  Valid huge
> page sizes less than huge page size may be written to this file.  When
> huge pages are demoted, they are demoted to this size.
>
> Writing a value to demote will result in an attempt to demote that
> number of hugetlb pages to an appropriate number of demote_size pages.
>
> NOTE: Demote interfaces are only provided for huge page sizes if there
> is a smaller target demote huge page size.  For example, on x86 1GB huge
> pages will have demote interfaces.  2MB huge pages will not have demote
> interfaces.

Should we also check if the platform allows for
gigantic_page_runtime_supported() ? 

-aneesh



* Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
  2021-09-23 21:24   ` Andrew Morton
  2021-09-24  7:08   ` Aneesh Kumar K.V
@ 2021-09-24  9:28   ` David Hildenbrand
  2021-09-29 19:34     ` Mike Kravetz
  2 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2021-09-24  9:28 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 23.09.21 19:53, Mike Kravetz wrote:
> Two new sysfs files are added to demote hugetlb pages.  These files are
> both per-hugetlb page size and per node.  Files are:
>    demote_size - The size in Kb that pages are demoted to. (read-write)
>    demote - The number of huge pages to demote. (write-only)
> 
> By default, demote_size is the next smallest huge page size.  Valid huge
> page sizes less than huge page size may be written to this file.  When
> huge pages are demoted, they are demoted to this size.
> 
> Writing a value to demote will result in an attempt to demote that
> number of hugetlb pages to an appropriate number of demote_size pages.
> 
> NOTE: Demote interfaces are only provided for huge page sizes if there
> is a smaller target demote huge page size.  For example, on x86 1GB huge
> pages will have demote interfaces.  2MB huge pages will not have demote
> interfaces.
> 
> This patch does not provide full demote functionality.  It only provides
> the sysfs interfaces.
> 
> It also provides documentation for the new interfaces.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   Documentation/admin-guide/mm/hugetlbpage.rst |  30 +++-
>   include/linux/hugetlb.h                      |   1 +
>   mm/hugetlb.c                                 | 155 ++++++++++++++++++-
>   3 files changed, 183 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144e44..0e123a347e1e 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -234,8 +234,12 @@ will exist, of the form::
>   
>   	hugepages-${size}kB
>   
> -Inside each of these directories, the same set of files will exist::
> +Inside each of these directories, the set of files contained in ``/proc``
> +will exist.  In addition, two additional interfaces for demoting huge
> +pages may exist::
>   
> +        demote
> +        demote_size
>   	nr_hugepages
>   	nr_hugepages_mempolicy
>   	nr_overcommit_hugepages
> @@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
>   	resv_hugepages
>   	surplus_hugepages
>   
> -which function as described above for the default huge page-sized case.
> +The demote interfaces provide the ability to split a huge page into
> +smaller huge pages.  For example, the x86 architecture supports both
> +1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
> +2MB huge pages.  Demote interfaces are not available for the smallest
> +huge page size.  The demote interfaces are:
> +
> +demote_size
> +        is the size of demoted pages.  When a page is demoted a corresponding
> +        number of huge pages of demote_size will be created.  By default,
> +        demote_size is set to the next smaller huge page size.  If there are
> +        multiple smaller huge page sizes, demote_size can be set to any of
> +        these smaller sizes.  Only huge page sizes less then the current huge
> +        pages size are allowed.
> +
> +demote
> +        is used to demote a number of huge pages.  A user with root privileges
> +        can write to this file.  It may not be possible to demote the
> +        requested number of huge pages.  To determine how many pages were
> +        actually demoted, compare the value of nr_hugepages before and after
> +        writing to the demote interface.  demote is a write only interface.
> +
> +The interfaces which are the same as in ``/proc`` (all except demote and
> +demote_size) function as described above for the default huge page-sized case.
>   
>   .. _mem_policy_and_hp_alloc:
>   
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..f2c3979efd69 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -596,6 +596,7 @@ struct hstate {
>   	int next_nid_to_alloc;
>   	int next_nid_to_free;
>   	unsigned int order;
> +	unsigned int demote_order;
>   	unsigned long mask;
>   	unsigned long max_huge_pages;
>   	unsigned long nr_huge_pages;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6378c1066459..c76ee0bd6374 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   
>   static void __init hugetlb_init_hstates(void)
>   {
> -	struct hstate *h;
> +	struct hstate *h, *h2;
>   
>   	for_each_hstate(h) {
>   		if (minimum_order > huge_page_order(h))
> @@ -2995,6 +2995,17 @@ static void __init hugetlb_init_hstates(void)
>   		/* oversize hugepages were init'ed in early boot */
>   		if (!hstate_is_gigantic(h))
>   			hugetlb_hstate_alloc_pages(h);
> +
> +		/*
> +		 * Set demote order for each hstate.  Note that
> +		 * h->demote_order is initially 0.
> +		 */
> +		for_each_hstate(h2) {
> +			if (h2 == h)
> +				continue;
> +			if (h2->order < h->order && h2->order > h->demote_order)
> +				h->demote_order = h2->order;
> +		}
>   	}
>   	VM_BUG_ON(minimum_order == UINT_MAX);
>   }
> @@ -3235,9 +3246,29 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	return 0;
>   }
>   
> +static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> +	__must_hold(&hugetlb_lock)
> +{
> +	int rc = 0;
> +
> +	lockdep_assert_held(&hugetlb_lock);
> +
> +	/* We should never get here if no demote order */
> +	if (!h->demote_order)
> +		return rc;
> +
> +	/*
> +	 * TODO - demote fucntionality will be added in subsequent patch
> +	 */
> +	return rc;
> +}
> +
>   #define HSTATE_ATTR_RO(_name) \
>   	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>   
> +#define HSTATE_ATTR_WO(_name) \
> +	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
> +
>   #define HSTATE_ATTR(_name) \
>   	static struct kobj_attribute _name##_attr = \
>   		__ATTR(_name, 0644, _name##_show, _name##_store)
> @@ -3433,6 +3464,112 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
>   }
>   HSTATE_ATTR_RO(surplus_hugepages);
>   
> +static ssize_t demote_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	unsigned long nr_demote;
> +	unsigned long nr_available;
> +	nodemask_t nodes_allowed, *n_mask;
> +	struct hstate *h;
> +	int err;
> +	int nid;
> +
> +	err = kstrtoul(buf, 10, &nr_demote);
> +	if (err)
> +		return err;
> +	h = kobj_to_hstate(kobj, &nid);
> +
> +	/* Synchronize with other sysfs operations modifying huge pages */
> +	mutex_lock(&h->resize_lock);
> +
> +	spin_lock_irq(&hugetlb_lock);
> +	if (nid != NUMA_NO_NODE) {
> +		nr_available = h->free_huge_pages_node[nid];
> +		init_nodemask_of_node(&nodes_allowed, nid);
> +		n_mask = &nodes_allowed;
> +	} else {
> +		nr_available = h->free_huge_pages;
> +		n_mask = &node_states[N_MEMORY];
> +	}
> +	nr_available -= h->resv_huge_pages;
> +	if (nr_available <= 0)
> +		goto out;
> +	nr_demote = min(nr_available, nr_demote);
> +
> +	while (nr_demote) {
> +		if (!demote_pool_huge_page(h, n_mask))
> +			break;
> +
> +		/*
> +		 * We may have dropped the lock in the routines to
> +		 * demote/free a page.  Recompute nr_demote as counts could
> +		 * have changed and we want to make sure we do not demote
> +		 * a reserved huge page.
> +		 */
> +		nr_demote--;
> +		if (nid != NUMA_NO_NODE)
> +			nr_available = h->free_huge_pages_node[nid];
> +		else
> +			nr_available = h->free_huge_pages;
> +		nr_available -= h->resv_huge_pages;
> +		if (nr_available <= 0)
> +			nr_demote = 0;
> +		else
> +			nr_demote = min(nr_available, nr_demote);
> +	}
>

Wonder if you could compress that quite a bit:


...
spin_lock_irq(&hugetlb_lock);

if (nid != NUMA_NO_NODE) {
	init_nodemask_of_node(&nodes_allowed, nid);
	n_mask = &nodes_allowed;
} else {
	n_mask = &node_states[N_MEMORY];
}

while (nr_demote) {
	/*
	 * Update after each iteration because we might have temporarily
	 * dropped hugetlb_lock and our counters might have changed.
	 */
	if (nid != NUMA_NO_NODE)
		nr_available = h->free_huge_pages_node[nid];
	else
		nr_available = h->free_huge_pages;
	nr_available -= h->resv_huge_pages;
	if (nr_available <= 0)
		break;
	if (!demote_pool_huge_page(h, n_mask))
		break;
	nr_demote--;
}
spin_unlock_irq(&hugetlb_lock);

Not sure if that "nr_demote = min(nr_available, nr_demote);" logic is
really required. Once nr_available hits <= 0 we'll just stop demoting.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  2021-09-23 17:53 ` [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
@ 2021-09-24  9:36   ` David Hildenbrand
  2021-09-29 19:42     ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2021-09-24  9:36 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 23.09.21 19:53, Mike Kravetz wrote:
> When huge page demotion is fully implemented, gigantic pages can be
> demoted to a smaller huge page size.  For example, on x86 a 1G page
> can be demoted to 512 2M pages.  However, gigantic pages can potentially
> be allocated from CMA.  If a gigantic page which was allocated from CMA
> is demoted, the corresponding demoted pages needs to be returned to CMA.
> 
> In order to track hugetlb pages that need to be returned to CMA, add the
> hugetlb specific flag HPageCma.  Flag is set when a huge page is
> allocated from CMA and transferred to any demoted pages.  Non-gigantic
> huge page freeing code checks for the flag and takes appropriate action.

Do we really need that flag, or couldn't we simply always try
cma_release() and fall back to our ordinary freeing path?

IIRC, CMA knows exactly whether something was allocated via a CMA area
and can be freed via it. No need for additional tracking usually.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 4/4] hugetlb: add hugetlb demote page support
  2021-09-23 17:53 ` [PATCH v2 4/4] hugetlb: add hugetlb demote page support Mike Kravetz
@ 2021-09-24  9:44   ` David Hildenbrand
  2021-09-29 19:54     ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2021-09-24  9:44 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 23.09.21 19:53, Mike Kravetz wrote:
> Demote page functionality will split a huge page into a number of huge
> pages of a smaller size.  For example, on x86 a 1GB huge page can be
> demoted into 512 2M huge pages.  Demotion is done 'in place' by simply
> splitting the huge page.
> 
> Added '*_for_demote' wrappers for remove_hugetlb_page,
> destroy_compound_gigantic_page and prep_compound_gigantic_page for use
> by demote code.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 75 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2317d411243d..ab7bd0434057 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1260,7 +1260,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>   		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>   		nr_nodes--)
>   
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> +/* used to demote non-gigantic_huge pages as well */
>   static void __destroy_compound_gigantic_page(struct page *page,
>   					unsigned int order, bool demote)
>   {
> @@ -1283,6 +1283,13 @@ static void __destroy_compound_gigantic_page(struct page *page,
>   	__ClearPageHead(page);
>   }
>   
> +static void destroy_compound_gigantic_page_for_demote(struct page *page,
> +					unsigned int order)
> +{
> +	__destroy_compound_gigantic_page(page, order, true);
> +}
> +
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>   static void destroy_compound_gigantic_page(struct page *page,
>   					unsigned int order)
>   {
> @@ -1428,6 +1435,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
>   	__remove_hugetlb_page(h, page, adjust_surplus, false);
>   }
>   
> +static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
> +							bool adjust_surplus)
> +{
> +	__remove_hugetlb_page(h, page, adjust_surplus, true);
> +}
> +
>   static void add_hugetlb_page(struct hstate *h, struct page *page,
>   			     bool adjust_surplus)
>   {
> @@ -1777,6 +1790,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
>   	return __prep_compound_gigantic_page(page, order, false);
>   }
>   
> +static bool prep_compound_gigantic_page_for_demote(struct page *page,
> +							unsigned int order)
> +{
> +	return __prep_compound_gigantic_page(page, order, true);
> +}
> +
>   /*
>    * PageHuge() only returns true for hugetlbfs pages, but not for normal or
>    * transparent huge pages.  See the PageTransHuge() documentation for more
> @@ -3298,9 +3317,55 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>   	return 0;
>   }
>   
> +static int demote_free_huge_page(struct hstate *h, struct page *page)
> +{
> +	int i, nid = page_to_nid(page);
> +	struct hstate *target_hstate;
> +	bool cma_page = HPageCma(page);
> +
> +	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
> +
> +	remove_hugetlb_page_for_demote(h, page, false);
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	if (alloc_huge_page_vmemmap(h, page)) {
> +		/* Allocation of vmemmmap failed, we can not demote page */
> +		spin_lock_irq(&hugetlb_lock);
> +		set_page_refcounted(page);
> +		add_hugetlb_page(h, page, false);

I dislike using 0/1 as return values as it will just hide the actual issue.

This here would be -ENOMEM, right?



-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-24  7:08   ` Aneesh Kumar K.V
@ 2021-09-29 18:22     ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-29 18:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton

On 9/24/21 12:08 AM, Aneesh Kumar K.V wrote:
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> 
>> Two new sysfs files are added to demote hugetlb pages.  These files are
>> both per-hugetlb page size and per node.  Files are:
>>   demote_size - The size in Kb that pages are demoted to. (read-write)
>>   demote - The number of huge pages to demote. (write-only)
>>
>> By default, demote_size is the next smallest huge page size.  Valid huge
>> page sizes less than huge page size may be written to this file.  When
>> huge pages are demoted, they are demoted to this size.
>>
>> Writing a value to demote will result in an attempt to demote that
>> number of hugetlb pages to an appropriate number of demote_size pages.
>>
>> NOTE: Demote interfaces are only provided for huge page sizes if there
>> is a smaller target demote huge page size.  For example, on x86 1GB huge
>> pages will have demote interfaces.  2MB huge pages will not have demote
>> interfaces.
> 
> Should we also check if the platform allows for
> gigantic_page_runtime_supported() ? 
> 

Yes, thanks!

Looks like this may only be an issue for gigantic pages on Power that
are managed by firmware.  Still, it needs to be checked.  Will update.

Thanks,
-- 
Mike Kravetz


* Re: [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces
  2021-09-24  9:28   ` David Hildenbrand
@ 2021-09-29 19:34     ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-29 19:34 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 9/24/21 2:28 AM, David Hildenbrand wrote:
> On 23.09.21 19:53, Mike Kravetz wrote:
>> +    spin_lock_irq(&hugetlb_lock);
>> +    if (nid != NUMA_NO_NODE) {
>> +        nr_available = h->free_huge_pages_node[nid];
>> +        init_nodemask_of_node(&nodes_allowed, nid);
>> +        n_mask = &nodes_allowed;
>> +    } else {
>> +        nr_available = h->free_huge_pages;
>> +        n_mask = &node_states[N_MEMORY];
>> +    }
>> +    nr_available -= h->resv_huge_pages;
>> +    if (nr_available <= 0)
>> +        goto out;
>> +    nr_demote = min(nr_available, nr_demote);
>> +
>> +    while (nr_demote) {
>> +        if (!demote_pool_huge_page(h, n_mask))
>> +            break;
>> +
>> +        /*
>> +         * We may have dropped the lock in the routines to
>> +         * demote/free a page.  Recompute nr_demote as counts could
>> +         * have changed and we want to make sure we do not demote
>> +         * a reserved huge page.
>> +         */
>> +        nr_demote--;
>> +        if (nid != NUMA_NO_NODE)
>> +            nr_available = h->free_huge_pages_node[nid];
>> +        else
>> +            nr_available = h->free_huge_pages;
>> +        nr_available -= h->resv_huge_pages;
>> +        if (nr_available <= 0)
>> +            nr_demote = 0;
>> +        else
>> +            nr_demote = min(nr_available, nr_demote);
>> +    }
>>
> 
> Wonder if you could compress that quite a bit:
> 
> 
> ...
> spin_lock_irq(&hugetlb_lock);
> 
> if (nid != NUMA_NO_NODE) {
>     init_nodemask_of_node(&nodes_allowed, nid);
>     n_mask = &nodes_allowed;
> } else {
>     n_mask = &node_states[N_MEMORY];
> }
> 
> while (nr_demote) {
>     /*
>      * Update after each iteration because we might have temporarily
>      * dropped hugetlb_lock and our counters might have changed.
>      */
>     if (nid != NUMA_NO_NODE)
>         nr_available = h->free_huge_pages_node[nid];
>     else
>         nr_available = h->free_huge_pages;
>     nr_available -= h->resv_huge_pages;
>     if (nr_available <= 0)
>         break;
>     if (!demote_pool_huge_page(h, n_mask))
>         break;
>     nr_demote--;
> }
> spin_unlock_irq(&hugetlb_lock);
> 
> Not sure if that "nr_demote = min(nr_available, nr_demote);" logic is really required. Once nr_available hits <= 0 we'll just stop demoting.
> 

No, it is not needed.  Your suggested code looks much nicer.  I will
incorporate it into the next version.
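
To illustrate why the min() clamp is redundant, here is a minimal
user-space sketch of the compressed loop.  It is not the kernel code:
struct pool, demote_one() and demote_pages() are hypothetical stand-ins
for struct hstate, demote_pool_huge_page() and the sysfs store path,
and locking and NUMA node handling are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two hstate counters the loop reads. */
struct pool {
	long free_huge_pages;
	long resv_huge_pages;
};

/* Hypothetical stand-in for demote_pool_huge_page(): demote one free page. */
static bool demote_one(struct pool *p)
{
	if (p->free_huge_pages <= 0)
		return false;
	p->free_huge_pages--;
	return true;
}

/* Demote up to nr_demote pages; returns how many were actually demoted. */
static long demote_pages(struct pool *p, long nr_demote)
{
	long demoted = 0;

	while (nr_demote) {
		/* Re-read the counters on every pass; they may have changed. */
		long nr_available = p->free_huge_pages - p->resv_huge_pages;

		/* Stop before touching a reserved page. */
		if (nr_available <= 0)
			break;
		if (!demote_one(p))
			break;
		nr_demote--;
		demoted++;
	}
	assert(p->free_huge_pages >= 0);
	return demoted;
}
```

Because availability is recomputed at the top of each iteration, a
request larger than the unreserved pool simply ends at the break, so
nr_demote never needs to be clamped up front.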

Thanks!
-- 
Mike Kravetz

* Re: [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  2021-09-24  9:36   ` David Hildenbrand
@ 2021-09-29 19:42     ` Mike Kravetz
  2021-09-29 23:21       ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2021-09-29 19:42 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 9/24/21 2:36 AM, David Hildenbrand wrote:
> On 23.09.21 19:53, Mike Kravetz wrote:
>> When huge page demotion is fully implemented, gigantic pages can be
>> demoted to a smaller huge page size.  For example, on x86 a 1G page
>> can be demoted to 512 2M pages.  However, gigantic pages can potentially
>> be allocated from CMA.  If a gigantic page which was allocated from CMA
>> is demoted, the corresponding demoted pages need to be returned to CMA.
>>
>> In order to track hugetlb pages that need to be returned to CMA, add the
>> hugetlb specific flag HPageCma.  Flag is set when a huge page is
>> allocated from CMA and transferred to any demoted pages.  Non-gigantic
>> huge page freeing code checks for the flag and takes appropriate action.
> 
> Do we really need that flag or couldn't we simply always try cma_release() and fall back to our ordinary freeing path?
>
> IIRC, cma knows exactly if something was allocated via a CMA area and can be freed via it. No need for additional tracking usually.
> 

Yes, I think this is possible.
Initially, I thought the check for whether pages were part of CMA
involved a mutex or some type of locking.  But, it really is
lightweight.  So, it should not be an issue to call it in every case.

I will most likely add a !CONFIG_CMA stub for cma_release() and remove
some of the #ifdefs in hugetlb.c.
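
As a rough sketch of the stub idea (user-space C, not the actual
patch): when CONFIG_CMA is not defined, an inline cma_release() that
always reports failure lets callers use the same code with no #ifdef.
The signature below is modeled on the kernel routine but should be
treated as an assumption, and free_hugetlb_range() and enum free_route
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cma;	/* opaque in this sketch */
struct page;	/* opaque in this sketch */

#ifdef CONFIG_CMA
bool cma_release(struct cma *cma, const struct page *pages,
		 unsigned long count);
#else
/* Stub: with CMA compiled out, no page can belong to a CMA area. */
static inline bool cma_release(struct cma *cma, const struct page *pages,
			       unsigned long count)
{
	(void)cma;
	(void)pages;
	(void)count;
	return false;
}
#endif

enum free_route { FREED_TO_BUDDY, FREED_TO_CMA };

/* Hypothetical caller: try CMA first, otherwise use the ordinary path. */
static enum free_route free_hugetlb_range(struct cma *cma,
					   const struct page *pages,
					   unsigned long count)
{
	assert(count > 0);
	if (cma_release(cma, pages, count))
		return FREED_TO_CMA;
	/* The ordinary (buddy) free would happen here. */
	return FREED_TO_BUDDY;
}
```

With the stub in place the caller compiles unchanged in both
configurations, and the compiler can discard the CMA branch when CMA is
disabled.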

Thanks, I will update in next version.
-- 
Mike Kravetz

* Re: [PATCH v2 4/4] hugetlb: add hugetlb demote page support
  2021-09-24  9:44   ` David Hildenbrand
@ 2021-09-29 19:54     ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-09-29 19:54 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 9/24/21 2:44 AM, David Hildenbrand wrote:
> On 23.09.21 19:53, Mike Kravetz wrote:
>> +
>> +    if (alloc_huge_page_vmemmap(h, page)) {
>> +        /* Allocation of vmemmap failed, we cannot demote page */
>> +        spin_lock_irq(&hugetlb_lock);
>> +        set_page_refcounted(page);
>> +        add_hugetlb_page(h, page, false);
> 
> I dislike using 0/1 as return values as it will just hide the actual issue.
> 
> This here would be -ENOMEM, right?
> 

I will pass along the return value from alloc_huge_page_vmemmap().  You
are right, -ENOMEM is the only non-zero value.
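
A small user-space sketch of the difference, assuming nothing about the
final patch: alloc_vmemmap() and demote_free_page() are hypothetical
stand-ins for alloc_huge_page_vmemmap() and the demote routine.  The
point is only that the callee's negative errno is returned unchanged
instead of being collapsed to 0/1.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for alloc_huge_page_vmemmap(): 0 or -ENOMEM. */
static int alloc_vmemmap(bool have_memory)
{
	return have_memory ? 0 : -ENOMEM;
}

/* Hypothetical demote step: propagate the callee's error code as-is. */
static int demote_free_page(bool have_memory)
{
	int rc = alloc_vmemmap(have_memory);

	/* In this sketch the only failure mode is -ENOMEM. */
	assert(rc == 0 || rc == -ENOMEM);
	if (rc) {
		/* Undo steps (re-adding the page to the pool) would go here. */
		return rc;
	}

	/* Splitting the page into smaller huge pages would go here. */
	return 0;
}
```

A caller can then tell an out-of-memory failure apart from other
outcomes, rather than seeing a bare 1.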

Thanks,
-- 
Mike Kravetz

* Re: [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  2021-09-29 19:42     ` Mike Kravetz
@ 2021-09-29 23:21       ` Mike Kravetz
  2021-10-01 17:50         ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2021-09-29 23:21 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 9/29/21 12:42 PM, Mike Kravetz wrote:
> On 9/24/21 2:36 AM, David Hildenbrand wrote:
>> On 23.09.21 19:53, Mike Kravetz wrote:
>>> When huge page demotion is fully implemented, gigantic pages can be
>>> demoted to a smaller huge page size.  For example, on x86 a 1G page
>>> can be demoted to 512 2M pages.  However, gigantic pages can potentially
>>> be allocated from CMA.  If a gigantic page which was allocated from CMA
>>> is demoted, the corresponding demoted pages need to be returned to CMA.
>>>
>>> In order to track hugetlb pages that need to be returned to CMA, add the
>>> hugetlb specific flag HPageCma.  Flag is set when a huge page is
>>> allocated from CMA and transferred to any demoted pages.  Non-gigantic
>>> huge page freeing code checks for the flag and takes appropriate action.
>>
>> Do we really need that flag or couldn't we simply always try cma_release() and fall back to our ordinary freeing path?
>>
>> IIRC, cma knows exactly if something was allocated via a CMA area and can be freed via it. No need for additional tracking usually.
>>
> 
> Yes, I think this is possible.
> Initially, I thought the check for whether pages were part of CMA
> involved a mutex or some type of locking.  But, it really is
> lightweight.  So, it should not be an issue to call it in every case.

When modifying the code, I did come across one issue.  Sorry I did not
immediately remember this.

Gigantic pages are allocated as a 'set of pages' and turned into a compound
page by the hugetlb code.  They must be restored to a 'set of pages' before
calling cma_release.  You can not pass a compound page to cma_release.
Non-gigantic pages are allocated from the buddy directly as compound pages.
They are returned to buddy as a compound page.

So, the issue comes about when freeing a non-gigantic page.  We would
need to convert to a 'set of pages' before calling cma_release just to
see if cma_release succeeds.  Then, if it fails convert back to a
compound page to call __free_pages.  Conversion is somewhat expensive as
we must modify every tail page struct.

Some possible solutions:
- Create a cma_pages_valid() routine that checks whether the pages
  belong to a cma region.  Only convert to 'set of pages' if cma_pages_valid
  and we know subsequent cma_release will succeed.
- Always convert non-gigantic pages to a 'set of pages' before freeing.
  Alternatively, don't allocate compound pages from buddy and just use
  the hugetlb gigantic page prep and destroy routines for all hugetlb
  page sizes.
- Use some kind of flag as in proposed patch.

Having hugetlb just allocate a set of pages from buddy is interesting.
This would make the allocate/free code paths for gigantic and
non-gigantic pages align more closely.  It may result in overall code
simplification, not just for demote.
-- 
Mike Kravetz

* Re: [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  2021-09-29 23:21       ` Mike Kravetz
@ 2021-10-01 17:50         ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2021-10-01 17:50 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Michal Hocko, Oscar Salvador, Zi Yan, Muchun Song,
	Naoya Horiguchi, David Rientjes, Aneesh Kumar K . V,
	Andrew Morton

On 9/29/21 4:21 PM, Mike Kravetz wrote:
> On 9/29/21 12:42 PM, Mike Kravetz wrote:
>> On 9/24/21 2:36 AM, David Hildenbrand wrote:
>>> On 23.09.21 19:53, Mike Kravetz wrote:
>>>> When huge page demotion is fully implemented, gigantic pages can be
>>>> demoted to a smaller huge page size.  For example, on x86 a 1G page
>>>> can be demoted to 512 2M pages.  However, gigantic pages can potentially
>>>> be allocated from CMA.  If a gigantic page which was allocated from CMA
>>>> is demoted, the corresponding demoted pages need to be returned to CMA.
>>>>
>>>> In order to track hugetlb pages that need to be returned to CMA, add the
>>>> hugetlb specific flag HPageCma.  Flag is set when a huge page is
>>>> allocated from CMA and transferred to any demoted pages.  Non-gigantic
>>>> huge page freeing code checks for the flag and takes appropriate action.
>>>
>>> Do we really need that flag or couldn't we simply always try cma_release() and fall back to our ordinary freeing path?
>>>
>>> IIRC, cma knows exactly if something was allocated via a CMA area and can be freed via it. No need for additional tracking usually.
>>>
>>
>> Yes, I think this is possible.
>> Initially, I thought the check for whether pages were part of CMA
>> involved a mutex or some type of locking.  But, it really is
>> lightweight.  So, it should not be an issue to call it in every case.
> 
> When modifying the code, I did come across one issue.  Sorry I did not
> immediately remember this.
> 
> Gigantic pages are allocated as a 'set of pages' and turned into a compound
> page by the hugetlb code.  They must be restored to a 'set of pages' before
> calling cma_release.  You can not pass a compound page to cma_release.
> Non-gigantic pages are allocated from the buddy directly as compound pages.
> They are returned to buddy as a compound page.
> 
> So, the issue comes about when freeing a non-gigantic page.  We would
> need to convert to a 'set of pages' before calling cma_release just to
> see if cma_release succeeds.  Then, if it fails convert back to a
> compound page to call __free_pages.  Conversion is somewhat expensive as
> we must modify every tail page struct.
> 
> Some possible solutions:
> - Create a cma_pages_valid() routine that checks whether the pages
>   belong to a cma region.  Only convert to 'set of pages' if cma_pages_valid
>   and we know subsequent cma_release will succeed.
> - Always convert non-gigantic pages to a 'set of pages' before freeing.
>   Alternatively, don't allocate compound pages from buddy and just use
>   the hugetlb gigantic page prep and destroy routines for all hugetlb
>   page sizes.
> - Use some kind of flag as in proposed patch.
> 
> Having hugetlb just allocate a set of pages from buddy is interesting.
> This would make the allocate/free code paths for gigantic and
> non-gigantic pages align more closely.  It may result in overall code
> simplification, not just for demote.

Taking this approach actually got a bit ugly in alloc_and_dissolve_huge_page
which is used in migration.  Instead, I took the approach of adding a
'cma_pages_valid()' interface to check at page freeing time.  Sending
out v3 shortly.
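
For readers following along, the check can be pictured with a minimal
user-space sketch.  This is not the v3 code: struct cma_region,
pages_in_region() and choose_free_route() are hypothetical names, and a
CMA area is assumed to be a single contiguous pfn range
[base_pfn, base_pfn + count).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified descriptor of one contiguous CMA area. */
struct cma_region {
	unsigned long base_pfn;
	unsigned long count;	/* number of base pages in the area */
};

/* Cheap range test: does the block starting at pfn lie in the area? */
static bool pages_in_region(const struct cma_region *r,
			    unsigned long pfn, unsigned long nr)
{
	if (!r)
		return false;
	if (pfn < r->base_pfn || pfn >= r->base_pfn + r->count)
		return false;
	/* A block that starts inside the area must also end inside it. */
	assert(pfn + nr <= r->base_pfn + r->count);
	return true;
}

enum free_target { TARGET_BUDDY, TARGET_CMA };

/*
 * Decide the free path before doing any compound-page conversion, so
 * the tail page structs are only rewritten when the CMA release is
 * known to apply.
 */
static enum free_target choose_free_route(const struct cma_region *r,
					  unsigned long pfn,
					  unsigned long nr)
{
	return pages_in_region(r, pfn, nr) ? TARGET_CMA : TARGET_BUDDY;
}
```

The test is two comparisons and takes no lock, which is what makes it
reasonable to run on every non-gigantic page free.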
-- 
Mike Kravetz

Thread overview: 17+ messages
2021-09-23 17:53 [PATCH v2 0/4] hugetlb: add demote/split page functionality Mike Kravetz
2021-09-23 17:53 ` [PATCH v2 1/4] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
2021-09-23 21:24   ` Andrew Morton
2021-09-23 22:08     ` Mike Kravetz
2021-09-24  7:08   ` Aneesh Kumar K.V
2021-09-29 18:22     ` Mike Kravetz
2021-09-24  9:28   ` David Hildenbrand
2021-09-29 19:34     ` Mike Kravetz
2021-09-23 17:53 ` [PATCH v2 2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
2021-09-24  9:36   ` David Hildenbrand
2021-09-29 19:42     ` Mike Kravetz
2021-09-29 23:21       ` Mike Kravetz
2021-10-01 17:50         ` Mike Kravetz
2021-09-23 17:53 ` [PATCH v2 3/4] hugetlb: add demote bool to gigantic page routines Mike Kravetz
2021-09-23 17:53 ` [PATCH v2 4/4] hugetlb: add hugetlb demote page support Mike Kravetz
2021-09-24  9:44   ` David Hildenbrand
2021-09-29 19:54     ` Mike Kravetz
