LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
@ 2021-08-16 22:49 Mike Kravetz
  2021-08-16 22:49 ` [PATCH 1/8] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
                   ` (9 more replies)
  0 siblings, 10 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

This is a resend of PATCHes sent here [4].  There was some discussion
and interest when the RFC [5] was sent, but little after that.  The
resend is just a rebase of [4] to next-20210816 with a few typos in
commmit messages fixed.

Original Cover Letter
---------------------
The concurrent use of multiple hugetlb page sizes on a single system
is becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs in such environments, it is
sometimes desirable to preallocate hugetlb pools.  This avoids the delay
and uncertainty of allocating hugetlb pages at VM startup.  In addition,
preallocating huge pages minimizes the issue of memory fragmentation that
increases the longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become
depleted while larger hugetlb pages still remain.  In such situations,
it may be desirable to convert larger hugetlb pages to smaller hugetlb
pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  However, there are two issues with this approach:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting a hugetlb page to
pages of a smaller size.  For example, on x86 one 1G page can be
demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
- demote_size   Read only target page size for demotion
- demote        Writable number of hugetlb pages to be demoted

Only hugetlb pages which are free at the time of the request can be demoted.
Demotion does not add to the complexity surplus pages.  Demotion also honors
reserved huge pages.  Therefore, when a value is written to the sysfs demote
file, that value is only the maximum number of pages which will be demoted.
It is possible fewer will actually be demoted.

If demote_size is PAGESIZE, demote will simply free pages to the buddy
allocator.

Real world use cases
--------------------
There are groups today using hugetlb pages to back VMs on x86.  Their
use case is as described above.  They have experienced the issues with
performance and not necessarily getting the excepted number smaller huge
pages after free/allocate cycle.

Note to reviewers
-----------------
Patches 1-5 provide the basic demote functionality.  They are built on
	next-20210721.
Patch 3 deals with this issue of speculative page references as
	discussed in [1] and [2].  It builds on the ideas used in
	patches currently in mmotm.  There have been few comments on
	those patches in mmotm, so I do not feel the approach has been
	well vetted.
Patches 6-8 are an optimization to deal with vmemmap optimized pages.
	This was discussed in the RFC.  IMO, the code may not be worth
	the benefit.  They could be dropped with no loss of
	functionality.  In addition, Muchun has recently sent patches to
	further optimize hugetlb vmemmap reduction by only requiring one
	vmemmap page per huge page [3].  These patches do not take Muchun's
	new patches into account.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20210710002441.167759-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210714091800.42645-1-songmuchun@bytedance.com/
[4] https://lore.kernel.org/linux-mm/20210721230511.201823-1-mike.kravetz@oracle.com/
[5] https://lore.kernel.org/linux-mm/20210309001855.142453-1-mike.kravetz@oracle.com/

v1 -> RESEND
  - Rebase on next-20210816
  - Fix a few typos in commit messages
RFC -> v1
  - Provides basic support for vmemmap optimized pages
  - Takes speculative page references into account
  - Updated Documentation file
  - Added optimizations for vmemmap optimized pages

Mike Kravetz (8):
  hugetlb: add demote hugetlb page sysfs interfaces
  hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  hugetlb: add demote bool to gigantic page routines
  hugetlb: add hugetlb demote page support
  hugetlb: document the demote sysfs interfaces
  hugetlb: vmemmap optimizations when demoting hugetlb pages
  hugetlb: prepare destroy and prep routines for vmemmap optimized pages
  hugetlb: Optimized demote vmemmap optimizatized pages

 Documentation/admin-guide/mm/hugetlbpage.rst |  29 +-
 include/linux/hugetlb.h                      |   8 +
 include/linux/mm.h                           |   4 +
 mm/hugetlb.c                                 | 327 +++++++++++++++++--
 mm/hugetlb_vmemmap.c                         |  72 +++-
 mm/hugetlb_vmemmap.h                         |  16 +
 mm/sparse-vmemmap.c                          | 123 ++++++-
 7 files changed, 538 insertions(+), 41 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/8] hugetlb: add demote hugetlb page sysfs interfaces
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 22:49 ` [PATCH 2/8] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

Two new sysfs files are added to demote hugtlb pages.  These files are
both per-hugetlb page size and per node.  Files are:
  demote_size - The size in Kb that pages are demoted to. (read-only)
  demote - The number of huge pages to demote. (write-only)

Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces and uses existing code to free pages to the buddy
allocator if demote_size == PAGESIZE.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |   1 +
 mm/hugetlb.c            | 121 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f7ca1a3870ea..d96e11ce986c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -596,6 +596,7 @@ struct hstate {
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
+	unsigned int demote_order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6337697f7ee4..0f16306993b3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h;
+	struct hstate *h, *h2;
 
 	for_each_hstate(h) {
 		if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,17 @@ static void __init hugetlb_init_hstates(void)
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
+
+		/*
+		 * Set demote order for each hstate.  Note that
+		 * h->demote_order is initially 0.
+		 */
+		for_each_hstate(h2) {
+			if (h2 == h)
+				continue;
+			if (h2->order < h->order && h2->order > h->demote_order)
+				h->demote_order = h2->order;
+		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
 }
@@ -3235,9 +3246,36 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	return 0;
 }
 
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+	__must_hold(&hugetlb_lock)
+{
+	int rc = 0;
+
+	lockdep_assert_held(&hugetlb_lock);
+	/* If no demote order, free to buddy */
+	if (!h->demote_order) {
+		struct page *page = remove_pool_huge_page(h, nodes_allowed, 0);
+
+		if (!page)
+			return rc;
+		spin_unlock_irq(&hugetlb_lock);
+		update_and_free_page(h, page, false);
+		spin_lock_irq(&hugetlb_lock);
+		return 1;
+	}
+
+	/*
+	 * TODO - demote fucntionality will be added in subsequent patch
+	 */
+	return rc;
+}
+
 #define HSTATE_ATTR_RO(_name) \
 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 
+#define HSTATE_ATTR_WO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
 #define HSTATE_ATTR(_name) \
 	static struct kobj_attribute _name##_attr = \
 		__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,12 +3471,91 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
+static ssize_t demote_store(struct kobject *kobj,
+	       struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long nr_demote;
+	unsigned long nr_available;
+	nodemask_t nodes_allowed, *n_mask;
+	struct hstate *h;
+	int err;
+	int nid;
+
+	err = kstrtoul(buf, 10, &nr_demote);
+	if (err)
+		return err;
+	h = kobj_to_hstate(kobj, &nid);
+
+	/* Synchronize with other sysfs operations modifying huge pages */
+	mutex_lock(&h->resize_lock);
+
+	spin_lock_irq(&hugetlb_lock);
+	if (nid != NUMA_NO_NODE) {
+		nr_available = h->free_huge_pages_node[nid];
+		init_nodemask_of_node(&nodes_allowed, nid);
+		n_mask = &nodes_allowed;
+	} else {
+		nr_available = h->free_huge_pages;
+		n_mask = &node_states[N_MEMORY];
+	}
+	nr_available -= h->resv_huge_pages;
+	if (nr_available <= 0)
+		goto out;
+	nr_demote = min(nr_available, nr_demote);
+
+	while (nr_demote) {
+		if (!demote_pool_huge_page(h, n_mask))
+			break;
+
+		/*
+		 * We may have dropped the lock in the routines to
+		 * demote/free a page.  Recompute nr_demote as counts could
+		 * have changed and we want to make sure we do not demote
+		 * a reserved huge page.
+		 */
+		nr_demote--;
+		if (nid != NUMA_NO_NODE)
+			nr_available = h->free_huge_pages_node[nid];
+		else
+			nr_available = h->free_huge_pages;
+		nr_available -= h->resv_huge_pages;
+		if (nr_available <= 0)
+			nr_demote = 0;
+		else
+			nr_demote = min(nr_available, nr_demote);
+	}
+
+out:
+	spin_unlock_irq(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
+
+	return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct hstate *h;
+	unsigned long demote_size;
+	int nid;
+
+	h = kobj_to_hstate(kobj, &nid);
+	demote_size = h->demote_order;
+
+	return sysfs_emit(buf, "%lukB\n",
+			(unsigned long)(PAGE_SIZE << h->demote_order) / SZ_1K);
+}
+HSTATE_ATTR_RO(demote_size);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
 	&free_hugepages_attr.attr,
 	&resv_hugepages_attr.attr,
 	&surplus_hugepages_attr.attr,
+	&demote_size_attr.attr,
+	&demote_attr.attr,
 #ifdef CONFIG_NUMA
 	&nr_hugepages_mempolicy_attr.attr,
 #endif
@@ -3508,6 +3625,8 @@ static struct attribute *per_node_hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&free_hugepages_attr.attr,
 	&surplus_hugepages_attr.attr,
+	&demote_size_attr.attr,
+	&demote_attr.attr,
 	NULL,
 };
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 2/8] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
  2021-08-16 22:49 ` [PATCH 1/8] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 22:49 ` [PATCH 3/8] hugetlb: add demote bool to gigantic page routines Mike Kravetz
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size.  For example, on x86 a 1G page
can be demoted to 512 2M pages.  However, gigantic pages can potentially
be allocated from CMA.  If a gigantic page which was allocated from CMA
is demoted, the corresponding demoted pages needs to be returned to CMA.

In order to track hugetlb pages that need to be returned to CMA, add the
hugetlb specific flag HPageCma.  Flag is set when a huge page is
allocated from CMA and transferred to any demoted pages.  Non-gigantic
huge page freeing code checks for the flag and takes appropriate action.

This also requires a change to CMA reservations for gigantic pages.
Currently, the 'order_per_bit' is set to the gigantic page size.
However, if gigantic pages can be demoted this needs to be set to the
order of the smallest huge page.  At CMA reservation time we do not know
the size of the smallest huge page size, so use HUGETLB_PAGE_ORDER.
Also, prohibit demotion to huge page sizes smaller than
HUGETLB_PAGE_ORDER.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  7 +++++++
 mm/hugetlb.c            | 46 ++++++++++++++++++++++++++++++-----------
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d96e11ce986c..60aa7e9fe2b9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -533,6 +533,11 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * HPG_freed - Set when page is on the free lists.
  *	Synchronization: hugetlb_lock held for examination and modification.
  * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
+ * HPG_cma - Set if huge page was directly allocated from CMA area via
+ *	cma_alloc.  Initially set for gigantic page cma allocations, but can
+ *	be set in non-gigantic pages if gigantic pages are demoted.
+ *	Synchronization: Only accessed or modified when there is only one
+ *	reference to the page at allocation, free or demote time.
  */
 enum hugetlb_page_flags {
 	HPG_restore_reserve = 0,
@@ -540,6 +545,7 @@ enum hugetlb_page_flags {
 	HPG_temporary,
 	HPG_freed,
 	HPG_vmemmap_optimized,
+	HPG_cma,
 	__NR_HPAGEFLAGS,
 };
 
@@ -586,6 +592,7 @@ HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
 HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
+HPAGEFLAG(Cma, cma)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f16306993b3..47b4b4d1a8f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1272,6 +1272,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
@@ -1283,16 +1284,12 @@ static void destroy_compound_gigantic_page(struct page *page,
 
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
-	/*
-	 * If the page isn't allocated using the cma allocator,
-	 * cma_release() returns false.
-	 */
 #ifdef CONFIG_CMA
-	if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
-		return;
+	if (HPageCma(page))
+		cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order);
+	else
 #endif
-
-	free_contig_range(page_to_pfn(page), 1 << order);
+		free_contig_range(page_to_pfn(page), 1 << order);
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
@@ -1311,8 +1308,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		if (hugetlb_cma[nid]) {
 			page = cma_alloc(hugetlb_cma[nid], nr_pages,
 					huge_page_order(h), true);
-			if (page)
+			if (page) {
+				SetHPageCma(page);
 				return page;
+			}
 		}
 
 		if (!(gfp_mask & __GFP_THISNODE)) {
@@ -1322,8 +1321,10 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 
 				page = cma_alloc(hugetlb_cma[node], nr_pages,
 						huge_page_order(h), true);
-				if (page)
+				if (page) {
+					SetHPageCma(page);
 					return page;
+				}
 			}
 		}
 	}
@@ -1480,6 +1481,20 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
 	} else {
+#ifdef CONFIG_CMA
+		/*
+		 * Could be a page that was demoted from a gigantic page
+		 * which was allocated in a CMA area.
+		 */
+		if (HPageCma(page)) {
+			destroy_compound_gigantic_page(page,
+					huge_page_order(h));
+			if (!cma_release(hugetlb_cma[page_to_nid(page)], page,
+					1 << huge_page_order(h)))
+				VM_BUG_ON_PAGE(1, page);
+			return;
+		}
+#endif
 		__free_pages(page, huge_page_order(h));
 	}
 }
@@ -3003,7 +3018,8 @@ static void __init hugetlb_init_hstates(void)
 		for_each_hstate(h2) {
 			if (h2 == h)
 				continue;
-			if (h2->order < h->order && h2->order > h->demote_order)
+			if (h2->order >= HUGETLB_PAGE_ORDER &&
+			    h2->order < h->order && h2->order > h->demote_order)
 				h->demote_order = h2->order;
 		}
 	}
@@ -6518,7 +6534,13 @@ void __init hugetlb_cma_reserve(int order)
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+		/*
+		 * Note that 'order per bit' is based on smallest size that
+		 * may be returned to CMA allocator in the case of
+		 * huge page demotion.
+		 */
+		res = cma_declare_contiguous_nid(0, size, 0,
+						PAGE_SIZE << HUGETLB_PAGE_ORDER,
 						 0, false, name,
 						 &hugetlb_cma[nid], nid);
 		if (res) {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 3/8] hugetlb: add demote bool to gigantic page routines
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
  2021-08-16 22:49 ` [PATCH 1/8] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
  2021-08-16 22:49 ` [PATCH 2/8] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 22:49 ` [PATCH 4/8] hugetlb: add hugetlb demote page support Mike Kravetz
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

The routines remove_hugetlb_page and destroy_compound_gigantic_page
will remove a gigantic page and make the set of base pages ready to be
returned to a lower level allocator.  In the process of doing this, they
make all base pages reference counted.

The routine prep_compound_gigantic_page creates a gigantic page from a
set of base pages.  It assumes that all these base pages are reference
counted.

During demotion, a gigantic page will be split into huge pages of a
smaller size.  This logically involves use of the routines,
remove_hugetlb_page, and destroy_compound_gigantic_page followed by
prep_compound*_page for each smaller huge page.

When pages are reference counted (ref count >= 0), additional
speculative ref counts could be taken.  This could result in errors
while demoting a huge page.  Quite a bit of code would need to be
created to handle all possible issues.

Instead of dealing with the possibility of speculative ref counts, avoid
the possibility by keeping ref counts at zero during the demote process.
Add a boolean 'demote' to the routines remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page.  If the
boolean is set, the remove and destroy routines will not reference count
pages and the prep routine will not expect reference counted pages.

'*_for_demote' wrappers of the routines will be added in a subsequent
patch where this functionality is used.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 47b4b4d1a8f9..2f2d5002fe73 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1261,8 +1261,8 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		nr_nodes--)
 
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
+static void __destroy_compound_gigantic_page(struct page *page,
+					unsigned int order, bool demote)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1274,7 +1274,8 @@ static void destroy_compound_gigantic_page(struct page *page,
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		p->mapping = NULL;
 		clear_compound_head(p);
-		set_page_refcounted(p);
+		if (!demote)
+			set_page_refcounted(p);
 	}
 
 	set_compound_order(page, 0);
@@ -1282,6 +1283,12 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
+static void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, false);
+}
+
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 #ifdef CONFIG_CMA
@@ -1354,12 +1361,15 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 
 /*
  * Remove hugetlb page from lists, and update dtor so that page appears
- * as just a compound page.  A reference is held on the page.
+ * as just a compound page.
+ *
+ * A reference is held on the page, except in the case of demote.
  *
  * Must be called with hugetlb lock held.
  */
-static void remove_hugetlb_page(struct hstate *h, struct page *page,
-							bool adjust_surplus)
+static void __remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus,
+							bool demote)
 {
 	int nid = page_to_nid(page);
 
@@ -1397,8 +1407,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	 *
 	 * This handles the case where more than one ref is held when and
 	 * after update_and_free_page is called.
+	 *
+	 * In the case of demote we do not ref count the page as it will soon
+	 * be turned into a page of smaller size.
 	 */
-	set_page_refcounted(page);
+	if (!demote)
+		set_page_refcounted(page);
 	if (hstate_is_gigantic(h))
 		set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	else
@@ -1408,6 +1422,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	h->nr_huge_pages_node[nid]--;
 }
 
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, false);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1679,7 +1699,8 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
+								bool demote)
 {
 	int i, j;
 	int nr_pages = 1 << order;
@@ -1717,10 +1738,16 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 		 * the set of pages can not be converted to a gigantic page.
 		 * The caller who allocated the pages should then discard the
 		 * pages using the appropriate free interface.
+		 *
+		 * In the case of demote, the ref count will be zero.
 		 */
-		if (!page_ref_freeze(p, 1)) {
-			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
-			goto out_error;
+		if (!demote) {
+			if (!page_ref_freeze(p, 1)) {
+				pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+				goto out_error;
+			}
+		} else {
+			VM_BUG_ON_PAGE(page_count(p), p);
 		}
 		set_page_count(p, 0);
 		set_compound_head(p, page);
@@ -1745,6 +1772,11 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 	return false;
 }
 
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, false);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 4/8] hugetlb: add hugetlb demote page support
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-08-16 22:49 ` [PATCH 3/8] hugetlb: add demote bool to gigantic page routines Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 22:49 ` [PATCH 5/8] hugetlb: document the demote sysfs interfaces Mike Kravetz
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

Demote page functionality will split a huge page into a number of huge
pages of a smaller size.  For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages.  Demotion is done 'in place' by simply
splitting the huge page.

Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page for use
by demote code.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 73 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2f2d5002fe73..6d43523a1a4d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1289,6 +1289,12 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__destroy_compound_gigantic_page(page, order, false);
 }
 
+static void destroy_compound_gigantic_page_for_demote(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, true);
+}
+
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 #ifdef CONFIG_CMA
@@ -1428,6 +1434,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 	__remove_hugetlb_page(h, page, adjust_surplus, false);
 }
 
+static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, true);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1777,6 +1789,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 	return __prep_compound_gigantic_page(page, order, false);
 }
 
+static bool prep_compound_gigantic_page_for_demote(struct page *page,
+							unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, true);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
@@ -3294,9 +3312,55 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	return 0;
 }
 
+static int demote_free_huge_page(struct hstate *h, struct page *page)
+{
+	int i, nid = page_to_nid(page);
+	struct hstate *target_hstate;
+	bool cma_page = HPageCma(page);
+
+	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+	remove_hugetlb_page_for_demote(h, page, false);
+	spin_unlock_irq(&hugetlb_lock);
+
+	if (alloc_huge_page_vmemmap(h, page)) {
+		/* Allocation of vmemmmap failed, we can not demote page */
+		spin_lock_irq(&hugetlb_lock);
+		set_page_refcounted(page);
+		add_hugetlb_page(h, page, false);
+		return 1;
+	}
+
+	/*
+	 * Use destroy_compound_gigantic_page_for_demote for all huge page
+	 * sizes as it will not ref count pages.
+	 */
+	destroy_compound_gigantic_page_for_demote(page, huge_page_order(h));
+
+	for (i = 0; i < pages_per_huge_page(h);
+				i += pages_per_huge_page(target_hstate)) {
+		if (hstate_is_gigantic(target_hstate))
+			prep_compound_gigantic_page_for_demote(page + i,
+							target_hstate->order);
+		else
+			prep_compound_page(page + i, target_hstate->order);
+		set_page_private(page + i, 0);
+		set_page_refcounted(page + i);
+		prep_new_huge_page(target_hstate, page + i, nid);
+		if (cma_page)
+			SetHPageCma(page + i);
+		put_page(page + i);
+	}
+
+	spin_lock_irq(&hugetlb_lock);
+	return 0;
+}
+
 static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 	__must_hold(&hugetlb_lock)
 {
+	int nr_nodes, node;
+	struct page *page;
 	int rc = 0;
 
 	lockdep_assert_held(&hugetlb_lock);
@@ -3312,9 +3376,15 @@ static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 		return 1;
 	}
 
-	/*
-	 * TODO - demote fucntionality will be added in subsequent patch
-	 */
+	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+		if (!list_empty(&h->hugepage_freelists[node])) {
+			page = list_entry(h->hugepage_freelists[node].next,
+					struct page, lru);
+			rc = !demote_free_huge_page(h, page);
+			break;
+		}
+	}
+
 	return rc;
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 5/8] hugetlb: document the demote sysfs interfaces
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (3 preceding siblings ...)
  2021-08-16 22:49 ` [PATCH 4/8] hugetlb: add hugetlb demote page support Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 23:28   ` Andrew Morton
  2021-09-21 13:52   ` Aneesh Kumar K.V
  2021-08-16 22:49 ` [PATCH 6/8] hugetlb: vmemmap optimizations when demoting hugetlb pages Mike Kravetz
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

Describe demote and demote_size interfaces.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 Documentation/admin-guide/mm/hugetlbpage.rst | 29 ++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..902059a0257b 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@ will exist, of the form::
 
 	hugepages-${size}kB
 
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist.  In addition, two additional interfaces for demoting huge
+pages will exist::
 
+        demote
+        demote_size
 	nr_hugepages
 	nr_hugepages_mempolicy
 	nr_overcommit_hugepages
@@ -243,7 +247,28 @@ Inside each of these directories, the same set of files will exist::
 	resv_hugepages
 	surplus_hugepages
 
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages.  For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
+2MB huge pages.  The demote interfaces are:
+
+demote_size
+        is the size of demoted pages.  When a page is demoted a corresponding
+        number of huge pages of demote_size will be created.  For huge pages
+        of the smallest supported size (2MB on x86), demote_size will be the
+        system page size (PAGE_SIZE).  If demote_size is the system page size
+        then demoting a page will simply free the huge page.  demote_size is
+        a read only interface.
+
+demote
+        is used to demote a number of huge pages.  A user with root privileges
+        can write to this file.  It may not be possible to demote the
+        requested number of huge pages.  To determine how many pages were
+        actually demoted, compare the value of nr_hugepages before and after
+        writing to the demote interface.  demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` function as described
+above for the default huge page-sized case.
 
 .. _mem_policy_and_hp_alloc:
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 6/8] hugetlb: vmemmap optimizations when demoting hugetlb pages
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (4 preceding siblings ...)
  2021-08-16 22:49 ` [PATCH 5/8] hugetlb: document the demote sysfs interfaces Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 22:49 ` [PATCH 7/8] hugetlb: prepare destroy and prep routines for vmemmap optimized pages Mike Kravetz
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

When demoting a hugetlb page with optimized vmemmap, we allocate vmemmap
for the entire address range represented by the huge page.  Then, we
split the huge page into huge pages of a smaller size.  When preparing
these smaller huge pages for use, we optimize vmemmap for their
associated address range and free vmemmap pages.

It is sub-optimal to allocate all the vmemmap pages associated with the
original page and then free most of those pages when preparing the
smaller huge pages after the split.  Instead, calculate the number of
vmemmap pages needed after the split and just allocate that number of
pages.  Then, only populate those areas of the vmmap required for the
smaller huge pages.

Introduce two new routines:
- demote_huge_page_vmemmap - This has knowledge of the hugetlb demote
  process and will calculate the number of huge pages needed for the
  smaller huge pages after the split.  It also creates a 'demote_mask'
  that is used to indicate where within the address range of the
  original huge page the smaller huge pages will reside.  This mask is
  used to map vmemmap pages within the range.
- vmemmap_remap_demote - This is the routine which actually allocates
  the vmemmap pages and performs vmemmap manipulations based on the
  passed demote_mask.

These routines will be used in a subsequent patch.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/mm.h   |   4 ++
 mm/hugetlb_vmemmap.c |  60 +++++++++++++++++++++
 mm/hugetlb_vmemmap.h |   6 +++
 mm/sparse-vmemmap.c  | 123 +++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 189 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4df7b0a437a8..5302ab4aa260 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3172,6 +3172,10 @@ int vmemmap_remap_free(unsigned long start, unsigned long end,
 		       unsigned long reuse);
 int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 			unsigned long reuse, gfp_t gfp_mask);
+int vmemmap_remap_demote(unsigned long start, unsigned long end,
+			unsigned long reuse, unsigned long demote_nr_pages,
+			unsigned long demote_mask,
+			unsigned long demote_map_pages, gfp_t gfp_mask);
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c540c21e26f5..c82d60398c16 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -264,6 +264,66 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 		SetHPageVmemmapOptimized(head);
 }
 
+/*
+ * vmammap pages will be allocated and mapped such that this range which
+ * previously represented a single huge page will now represent a set of
+ * pages of a smaller size.
+ */
+int demote_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	int ret;
+	unsigned long vmemmap_addr = (unsigned long)head;
+	unsigned long vmemmap_end, vmemmap_reuse;
+	unsigned long demote_mask;
+	unsigned long demote_nr_pages;
+	struct hstate *target;
+
+	VM_BUG_ON(!h->demote_order);
+	if (!HPageVmemmapOptimized(head))
+		return 0;
+
+	target = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+	/* Number of vmemmap pages required to demote page */
+	demote_nr_pages = pages_per_huge_page(h) / pages_per_huge_page(target);
+	demote_nr_pages *= RESERVE_VMEMMAP_NR;
+	demote_nr_pages -= RESERVE_VMEMMAP_NR;	/* pages currently present */
+
+	/*
+	 * mask to identify where within the range new smaller pages will
+	 * reside.  This will be used to map new vmemmap pages.
+	 */
+	demote_mask = ((unsigned long) pages_per_huge_page(target) *
+			sizeof(struct page)) - 1;
+
+	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
+	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
+	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+	/*
+	 * The range [@vmemmap_addr, @vmemmap_end) represents a single huge
+	 * page of size h->order.  It is vmemmap optimized and is only mapped
+	 * with RESERVE_VMEMMAP_NR pages.  The huge page will be split into
+	 * multiple pages of a smaller size (h->demote_order).  vmemmap pages
+	 * must be callocated for each of these smaller size pages and
+	 * appropriately mapped.
+	 */
+	ret = vmemmap_remap_demote(vmemmap_addr, vmemmap_end, vmemmap_reuse,
+				  demote_nr_pages, demote_mask,
+				  RESERVE_VMEMMAP_NR,
+				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
+
+	if (!ret) {
+		int i;
+
+		for (i = pages_per_huge_page(target);
+					i < pages_per_huge_page(h);
+					i += pages_per_huge_page(target))
+			SetHPageVmemmapOptimized(head + i);
+	}
+
+	return ret;
+}
+
 void __init hugetlb_vmemmap_init(struct hstate *h)
 {
 	unsigned int nr_pages = pages_per_huge_page(h);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index cb2bef8f9e73..44382504efc3 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -13,6 +13,7 @@
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 int alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+int demote_huge_page_vmemmap(struct hstate *h, struct page *head);
 void hugetlb_vmemmap_init(struct hstate *h);
 
 /*
@@ -33,6 +34,11 @@ static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
 
+static inline int demote_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	return 0;
+}
+
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
 }
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bdce883f9286..ac2681bf006b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -43,6 +43,15 @@
  * @reuse_addr:		the virtual address of the @reuse_page page.
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
  *			or is mapped from.
+ * @demote_mask		demote specific.  mask to know virtual address of
+ *			where to start mapping pages during a demote operation.
+ * @demote_map_pages	demote specific.  number of pages which mapped for
+ *			each demoted page.
+ * @demote_tmp_count	demote specific.  counter for the number of pages
+ *			mapped per page.
+ * @demote_tmp_addr	demote specific.  when more then one page must be
+ *			mapped for each demoted size page, virtual address
+ *			of the next page to be mapped.
  */
 struct vmemmap_remap_walk {
 	void (*remap_pte)(pte_t *pte, unsigned long addr,
@@ -51,6 +60,10 @@ struct vmemmap_remap_walk {
 	struct page *reuse_page;
 	unsigned long reuse_addr;
 	struct list_head *vmemmap_pages;
+	unsigned long demote_mask;
+	unsigned long demote_map_pages;
+	unsigned long demote_tmp_count;
+	unsigned long demote_tmp_addr;
 };
 
 static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
@@ -262,6 +275,51 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
 }
 
+static void vmemmap_demote_pte(pte_t *pte, unsigned long addr,
+				struct vmemmap_remap_walk *walk)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	struct page *page;
+	void *to;
+
+	if (!(addr & walk->demote_mask)) {
+		/* head page */
+		page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+		list_del(&page->lru);
+		to = page_to_virt(page);
+		copy_page(to, (void *)walk->reuse_addr);
+		set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+		/*
+		 * after mapping head page, set demote_tmp_reuse for
+		 * the following tail page to be mapped (if any).
+		 */
+		walk->demote_tmp_count = walk->demote_map_pages;
+		if (--walk->demote_tmp_count)
+			walk->demote_tmp_addr = addr + PAGE_SIZE;
+	} else if (addr == walk->demote_tmp_addr) {
+		/* first tall page */
+		page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+		list_del(&page->lru);
+		to = page_to_virt(page);
+		copy_page(to, (void *)walk->reuse_addr);
+		set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+		if (--walk->demote_tmp_count) {
+			walk->demote_tmp_addr = addr + PAGE_SIZE;
+		} else {
+			walk->demote_tmp_addr = 0UL;
+			/* remaining tail pages mapped to this page */
+			walk->reuse_page = page;
+		}
+	} else {
+		/* remaining tail pages */
+		pgprot_t pgprot = PAGE_KERNEL_RO;
+		pte_t entry = mk_pte(walk->reuse_page, pgprot);
+
+		page = pte_page(*pte);
+		set_pte_at(&init_mm, addr, pte, entry);
+	}
+}
+
 /**
  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
  *			to the page which @reuse is mapped to, then free vmemmap
@@ -327,11 +385,9 @@ int vmemmap_remap_free(unsigned long start, unsigned long end,
 	return ret;
 }
 
-static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
+static int __alloc_vmemmap_pages(unsigned long nr_pages, int nid,
 				   gfp_t gfp_mask, struct list_head *list)
 {
-	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
-	int nid = page_to_nid((struct page *)start);
 	struct page *page, *next;
 
 	while (nr_pages--) {
@@ -348,6 +404,21 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
 	return -ENOMEM;
 }
 
+static int alloc_vmemmap_pages(unsigned long nr_pages, int nid,
+				   gfp_t gfp_mask, struct list_head *list)
+{
+	return __alloc_vmemmap_pages(nr_pages, nid, gfp_mask, list);
+}
+
+static int alloc_vmemmap_range(unsigned long start, unsigned long end,
+				   gfp_t gfp_mask, struct list_head *list)
+{
+	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
+	int nid = page_to_nid((struct page *)start);
+
+	return __alloc_vmemmap_pages(nr_pages, nid, gfp_mask, list);
+}
+
 /**
  * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
  *			 to the page which is from the @vmemmap_pages
@@ -374,7 +445,51 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 	/* See the comment in the vmemmap_remap_free(). */
 	BUG_ON(start - reuse != PAGE_SIZE);
 
-	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
+	if (alloc_vmemmap_range(start, end, gfp_mask, &vmemmap_pages))
+		return -ENOMEM;
+
+	mmap_read_lock(&init_mm);
+	vmemmap_remap_range(reuse, end, &walk);
+	mmap_read_unlock(&init_mm);
+
+	return 0;
+}
+
+/**
+ * vmemmap_remap_demote - remap the optimized vmemmap virtual address range
+ *		for a huge page to accommodate splitting that huge
+ *		page into pages of a smaller size.  That smaller size
+ *		is specified by demote_mask.
+ * @start:	start address of the vmemmap virtual address range to remap
+ *		for smaller pages.
+ * @end:	end address of the vmemmap virtual address range to remap.
+ * @reuse:	reuse address.
+ * @demote_nr_pages: number of vmammap pages to allocate for remapping.
+ * @demote_mask: mask specifying where to perform remapping within the passed
+ *		 range.
+ * @demote_map_pages: number of pages to map for each demoted page
+ * @gfp_mask:	GFP flag for allocating vmemmap pages.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int vmemmap_remap_demote(unsigned long start, unsigned long end,
+			unsigned long reuse, unsigned long demote_nr_pages,
+			unsigned long demote_mask,
+			unsigned long demote_map_pages, gfp_t gfp_mask)
+{
+	LIST_HEAD(vmemmap_pages);
+	int nid = page_to_nid((struct page *)start);
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= vmemmap_demote_pte,
+		.reuse_addr	= reuse,
+		.vmemmap_pages	= &vmemmap_pages,
+		.demote_mask	= demote_mask,
+		.demote_map_pages = demote_map_pages,
+	};
+
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
+
+	if (alloc_vmemmap_pages(demote_nr_pages, nid, gfp_mask, &vmemmap_pages))
 		return -ENOMEM;
 
 	mmap_read_lock(&init_mm);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 7/8] hugetlb: prepare destroy and prep routines for vmemmap optimized pages
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (5 preceding siblings ...)
  2021-08-16 22:49 ` [PATCH 6/8] hugetlb: vmemmap optimizations when demoting hugetlb pages Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 22:49 ` [PATCH 8/8] hugetlb: Optimized demote vmemmap optimizatized pages Mike Kravetz
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

When optimally demoting huge pages with vmemmap optimizations, the
routines which destroy and prep hugetlb pages need to be modified.
Currently, these routines expect all vmemmap pages to be present as
they will write to all page structs for tail pages.  To optimallly
handle demotion of huge pages not all vmemmap pages will be present.
Only those pages required for the demoted pages will be present.
Therefore, the destroy and prep routines must only write to struct pages
for which vmammap pages are present.

Modify destroy_compound_gigantic_page_for_demote and
prep_compound_gigantic_page_for_demote to take vmemmap optimized pages
into account.  Use the hugetlb specific flag HPageVmemmapOptimized to
determine if this special processing is needed.

These modifications will be used in subsequent patches where vmemmap
optimizations for demote are fully enabled.

Also modify the routine free_huge_page_vmemmap to immediately return if
the passed page is already optimized.  With demotion, prep_new_huge_page
can be called for vmemmap optimized pages.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c         | 17 +++++++++++++++--
 mm/hugetlb_vmemmap.c | 12 ++----------
 mm/hugetlb_vmemmap.h | 10 ++++++++++
 3 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d43523a1a4d..259b840718f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1265,12 +1265,18 @@ static void __destroy_compound_gigantic_page(struct page *page,
 					unsigned int order, bool demote)
 {
 	int i;
-	int nr_pages = 1 << order;
+	int nr_pages;
 	struct page *p = page + 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
 	atomic_set(compound_pincount_ptr(page), 0);
 
+	if (demote && HPageVmemmapOptimized(page)) {
+		clear_compound_head(page);
+		nr_pages = RESERVE_VMEMMAP_SIZE / sizeof(struct page);
+	} else
+		nr_pages = 1 << order;
+
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		p->mapping = NULL;
 		clear_compound_head(p);
@@ -1527,6 +1533,7 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 			return;
 		}
 #endif
+		prep_compound_page(page, huge_page_order(h));
 		__free_pages(page, huge_page_order(h));
 	}
 }
@@ -1715,9 +1722,14 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
 								bool demote)
 {
 	int i, j;
-	int nr_pages = 1 << order;
+	int nr_pages;
 	struct page *p = page + 1;
 
+	if (demote && HPageVmemmapOptimized(page))
+		nr_pages = RESERVE_VMEMMAP_SIZE / sizeof(struct page);
+	else
+		nr_pages = 1 << order;
+
 	/* we rely on prep_new_huge_page to set the destructor */
 	set_compound_order(page, order);
 	__ClearPageReserved(page);
@@ -1761,6 +1773,7 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
 		} else {
 			VM_BUG_ON_PAGE(page_count(p), p);
 		}
+		p->mapping = TAIL_MAPPING;
 		set_page_count(p, 0);
 		set_compound_head(p, page);
 	}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c82d60398c16..01c2cc959824 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -172,16 +172,6 @@
 
 #include "hugetlb_vmemmap.h"
 
-/*
- * There are a lot of struct page structures associated with each HugeTLB page.
- * For tail pages, the value of compound_head is the same. So we can reuse first
- * page of tail page structures. We map the virtual addresses of the remaining
- * pages of tail page structures to the first tail page struct, and then free
- * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
- */
-#define RESERVE_VMEMMAP_NR		2U
-#define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
-
 bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
 
 static int __init early_hugetlb_free_vmemmap_param(char *buf)
@@ -250,6 +240,8 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 
 	if (!free_vmemmap_pages_per_hpage(h))
 		return;
+	if (HPageVmemmapOptimized(head))	/* possible for demote */
+		return;
 
 	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
 	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 44382504efc3..36274bf0256c 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -10,6 +10,16 @@
 #define _LINUX_HUGETLB_VMEMMAP_H
 #include <linux/hugetlb.h>
 
+/*
+ * There are a lot of struct page structures associated with each HugeTLB page.
+ * For tail pages, the value of compound_head is the same. So we can reuse first
+ * page of tail page structures. We map the virtual addresses of the remaining
+ * pages of tail page structures to the first tail page struct, and then free
+ * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
+ */
+#define RESERVE_VMEMMAP_NR		2U
+#define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
+
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 int alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 8/8] hugetlb: Optimized demote vmemmap optimizatized pages
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (6 preceding siblings ...)
  2021-08-16 22:49 ` [PATCH 7/8] hugetlb: prepare destroy and prep routines for vmemmap optimized pages Mike Kravetz
@ 2021-08-16 22:49 ` Mike Kravetz
  2021-08-16 23:23 ` [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Andrew Morton
  2021-08-16 23:27 ` Andrew Morton
  9 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-16 22:49 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

Put all the pieces together to optimize the process of demoting vmemmap
optimized pages.

Instead of allocating all vmemmap pages for a page to be demoted, use
the demote_huge_page_vmemmap routine which will only allocate/map
pages needed for the demoted pages.

For vmemmap optimized pages, use the destroy_compound_gigantic_page and
prep_compound_gigantic_page routines during demote.  These routines can
deal with vmemmap optimized pages, and know which page structs are
writable.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 259b840718f1..77052ab464b1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3330,13 +3330,14 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
 	int i, nid = page_to_nid(page);
 	struct hstate *target_hstate;
 	bool cma_page = HPageCma(page);
+	bool vmemmap_optimized = HPageVmemmapOptimized(page);
 
 	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
 
 	remove_hugetlb_page_for_demote(h, page, false);
 	spin_unlock_irq(&hugetlb_lock);
 
-	if (alloc_huge_page_vmemmap(h, page)) {
+	if (demote_huge_page_vmemmap(h, page)) {
 		/* Allocation of vmemmmap failed, we can not demote page */
 		spin_lock_irq(&hugetlb_lock);
 		set_page_refcounted(page);
@@ -3348,16 +3349,36 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
 	 * Use destroy_compound_gigantic_page_for_demote for all huge page
 	 * sizes as it will not ref count pages.
 	 */
-	destroy_compound_gigantic_page_for_demote(page, huge_page_order(h));
+	if (vmemmap_optimized)
+		/*
+		 * If page is vmemmmap optimized, then demote_huge_page_vmemmap
+		 * added vmammap for each smaller page of target order size.
+		 * We must update/destroy all each of these smaller pages.
+		 */
+		for (i = 0; i < pages_per_huge_page(h);
+					i += pages_per_huge_page(target_hstate))
+			destroy_compound_gigantic_page_for_demote(page + i,
+					huge_page_order(target_hstate));
+	else
+		destroy_compound_gigantic_page_for_demote(page,
+							huge_page_order(h));
 
 	for (i = 0; i < pages_per_huge_page(h);
 				i += pages_per_huge_page(target_hstate)) {
-		if (hstate_is_gigantic(target_hstate))
+		/*
+		 * Use gigantic page prep for vmemmap_optimized pages of
+		 * all sizes as it has special vmemmap logic.  The generic
+		 * prep routine does not and should not know about hugetlb
+		 * vmemmap optimizations.
+		 */
+		if (hstate_is_gigantic(target_hstate) || vmemmap_optimized)
 			prep_compound_gigantic_page_for_demote(page + i,
 							target_hstate->order);
 		else
 			prep_compound_page(page + i, target_hstate->order);
 		set_page_private(page + i, 0);
+		if (vmemmap_optimized)
+			SetHPageVmemmapOptimized(page + i);
 		set_page_refcounted(page + i);
 		prep_new_huge_page(target_hstate, page + i, nid);
 		if (cma_page)
-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (7 preceding siblings ...)
  2021-08-16 22:49 ` [PATCH 8/8] hugetlb: Optimized demote vmemmap optimizatized pages Mike Kravetz
@ 2021-08-16 23:23 ` Andrew Morton
  2021-08-17  0:17   ` Mike Kravetz
  2021-08-16 23:27 ` Andrew Morton
  9 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2021-08-16 23:23 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> This is a resend of PATCHes sent here [4].  There was some discussion
> and interest when the RFC [5] was sent, but little after that.  The
> resend is just a rebase of [4] to next-20210816 with a few typos in
> commmit messages fixed.
> 
> Original Cover Letter
> ---------------------
> The concurrent use of multiple hugetlb page sizes on a single system
> is becoming more common.  One of the reasons is better TLB support for
> gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
> being used to back VMs in hosting environments.
> 
> When using hugetlb pages to back VMs in such environments, it is
> sometimes desirable to preallocate hugetlb pools.  This avoids the delay
> and uncertainty of allocating hugetlb pages at VM startup.  In addition,
> preallocating huge pages minimizes the issue of memory fragmentation that
> increases the longer the system is up and running.
> 
> In such environments, a combination of larger and smaller hugetlb pages
> are preallocated in anticipation of backing VMs of various sizes.  Over
> time, the preallocated pool of smaller hugetlb pages may become
> depleted while larger hugetlb pages still remain.  In such situations,
> it may be desirable to convert larger hugetlb pages to smaller hugetlb
> pages.
> 
> Converting larger to smaller hugetlb pages can be accomplished today by
> first freeing the larger page to the buddy allocator and then allocating
> the smaller pages.  However, there are two issues with this approach:
> 1) This process can take quite some time, especially if allocation of
>    the smaller pages is not immediate and requires migration/compaction.
> 2) There is no guarantee that the total size of smaller pages allocated
>    will match the size of the larger page which was freed.  This is
>    because the area freed by the larger page could quickly be
>    fragmented.
> 
> To address these issues, introduce the concept of hugetlb page demotion.
> Demotion provides a means of 'in place' splitting a hugetlb page to
> pages of a smaller size.  For example, on x86 one 1G page can be
> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
> - demote_size   Read only target page size for demotion

Should this be "write only"?  If not, I'm confused.

If "yes" then "write only" would be a misnomer - clearly this file is
readable (looks at demote_size_show()).

> - demote        Writable number of hugetlb pages to be demoted

So how does this interface work?  Write the target size to
`demote_size', write the number of to-be-demoted larger pages to
`demote' and then the operation happens?

If so, how does one select which size pages should be selected for
the demotion?

And how does one know the operation has completed so the sysfs files
can be reloaded for another operation?

> Only hugetlb pages which are free at the time of the request can be demoted.
> Demotion does not add to the complexity surplus pages.  Demotion also honors
> reserved huge pages.  Therefore, when a value is written to the sysfs demote
> file, that value is only the maximum number of pages which will be demoted.
> It is possible fewer will actually be demoted.
> 
> If demote_size is PAGESIZE, demote will simply free pages to the buddy
> allocator.
> 
> Real world use cases
> --------------------
> There are groups today using hugetlb pages to back VMs on x86.  Their
> use case is as described above.  They have experienced the issues with
> performance and not necessarily getting the excepted number smaller huge

("expected")

> pages after free/allocate cycle.
> 

It seems odd to add the interfaces in patch 1 then document them in
patch 5.  Why not add-and-document in a single patch?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
                   ` (8 preceding siblings ...)
  2021-08-16 23:23 ` [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Andrew Morton
@ 2021-08-16 23:27 ` Andrew Morton
  2021-08-17  0:46   ` Mike Kravetz
  9 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2021-08-16 23:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

Also, pushback...

On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> 
> Real world use cases
> --------------------
> There are groups today using hugetlb pages to back VMs on x86.  Their
> use case is as described above.  They have experienced the issues with
> performance and not necessarily getting the excepted number smaller huge

("number of")

> pages after free/allocate cycle.
> 

It really is a ton of new code.  I think we're owed much more detail
about the problem than the above.  To be confident that all this
material is truly justified?

Also, some selftests and benchmarking code in tools/testing/selftests
would be helpful?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/8] hugetlb: document the demote sysfs interfaces
  2021-08-16 22:49 ` [PATCH 5/8] hugetlb: document the demote sysfs interfaces Mike Kravetz
@ 2021-08-16 23:28   ` Andrew Morton
  2021-08-17  1:04     ` Mike Kravetz
  2021-09-21 13:52   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2021-08-16 23:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On Mon, 16 Aug 2021 15:49:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> Describe demote and demote_size interfaces.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst | 29 ++++++++++++++++++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144e44..902059a0257b 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -234,8 +234,12 @@ will exist, of the form::
>  
>  	hugepages-${size}kB
>  
> -Inside each of these directories, the same set of files will exist::
> +Inside each of these directories, the set of files contained in ``/proc``
> +will exist.  In addition, two additional interfaces for demoting huge
> +pages will exist::
>  
> +        demote
> +        demote_size
>  	nr_hugepages
>  	nr_hugepages_mempolicy
>  	nr_overcommit_hugepages
> @@ -243,7 +247,28 @@ Inside each of these directories, the same set of files will exist::
>  	resv_hugepages
>  	surplus_hugepages
>  
> -which function as described above for the default huge page-sized case.
> +The demote interfaces provide the ability to split a huge page into
> +smaller huge pages.  For example, the x86 architecture supports both
> +1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
> +2MB huge pages.  The demote interfaces are:
> +
> +demote_size
> +        is the size of demoted pages.  When a page is demoted a corresponding
> +        number of huge pages of demote_size will be created.  For huge pages
> +        of the smallest supported size (2MB on x86), demote_size will be the
> +        system page size (PAGE_SIZE).  If demote_size is the system page size
> +        then demoting a page will simply free the huge page.  demote_size is
> +        a read only interface.
> +
> +demote
> +        is used to demote a number of huge pages.  A user with root privileges
> +        can write to this file.  It may not be possible to demote the
> +        requested number of huge pages.  To determine how many pages were
> +        actually demoted, compare the value of nr_hugepages before and after
> +        writing to the demote interface.  demote is a write only interface.
> +
> +The interfaces which are the same as in ``/proc`` function as described
> +above for the default huge page-sized case.

Are these new demote interfaces duplicated in /proc? 
Documentation/admin-guide/mm/hugetlbpage.rst says "The ``/proc``
interfaces discussed above have been retained for backwards
compatibility.", so new interfaces need not appear in /proc?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-16 23:23 ` [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Andrew Morton
@ 2021-08-17  0:17   ` Mike Kravetz
  2021-08-17  0:39     ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-08-17  0:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On 8/16/21 4:23 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> This is a resend of PATCHes sent here [4].  There was some discussion
>> and interest when the RFC [5] was sent, but little after that.  The
>> resend is just a rebase of [4] to next-20210816 with a few typos in
>> commmit messages fixed.
>>
>> Original Cover Letter
>> ---------------------
>> The concurrent use of multiple hugetlb page sizes on a single system
>> is becoming more common.  One of the reasons is better TLB support for
>> gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
>> being used to back VMs in hosting environments.
>>
>> When using hugetlb pages to back VMs in such environments, it is
>> sometimes desirable to preallocate hugetlb pools.  This avoids the delay
>> and uncertainty of allocating hugetlb pages at VM startup.  In addition,
>> preallocating huge pages minimizes the issue of memory fragmentation that
>> increases the longer the system is up and running.
>>
>> In such environments, a combination of larger and smaller hugetlb pages
>> are preallocated in anticipation of backing VMs of various sizes.  Over
>> time, the preallocated pool of smaller hugetlb pages may become
>> depleted while larger hugetlb pages still remain.  In such situations,
>> it may be desirable to convert larger hugetlb pages to smaller hugetlb
>> pages.
>>
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages.  However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>>    the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>>    will match the size of the larger page which was freed.  This is
>>    because the area freed by the larger page could quickly be
>>    fragmented.
>>
>> To address these issues, introduce the concept of hugetlb page demotion.
>> Demotion provides a means of 'in place' splitting a hugetlb page to
>> pages of a smaller size.  For example, on x86 one 1G page can be
>> demoted to 512 2M pages.  Page demotion is controlled via sysfs files.
>> - demote_size   Read only target page size for demotion
> 
> Should this be "write only"?  If not, I'm confused.
> 
> If "yes" then "write only" would be a misnomer - clearly this file is
> readable (looks at demote_size_show()).
> 

It is read only and is there mostly as information for the user.  When
they demote a page, this is the size to which the page will be demoted.

For example,
# pwd
/sys/kernel/mm/hugepages/hugepages-1048576kB
# cat demote_size
2048kB
# pwd
/sys/kernel/mm/hugepages/hugepages-2048kB
# cat demote_size
4kB

The "demote size" is not user configurable.  Although, that is
something brought up by Oscar previously.  I did not directly address
this in the RFC.  My bad.  However, I do not like the idea of making
demote_size writable/selectable.  My concern would be someone changing
the value and not resetting.  It certainly is something that can be done
with minor code changes.

>> - demote        Writable number of hugetlb pages to be demoted
> 
> So how does this interface work?  Write the target size to
> `demote_size', write the number of to-be-demoted larger pages to
> `demote' and then the operation happens?
> 
> If so, how does one select which size pages should be selected for
> the demotion?

The location in the sysfs directory tells you what size pages will be
demoted.  For example,

echo 5 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote

says to demote 5 1GB pages.

demote files are also in node specific directories so you can even pick
huge pages from a specific node.

echo 5 >
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/demote

> 
> And how does one know the operation has completed so the sysfs files
> can be reloaded for another operation?
> 

When the write to the file is complete, the operation has completed.
Not exactly sure what you mean by reloading the sysfs files for
another operation?

>> Only hugetlb pages which are free at the time of the request can be demoted.
>> Demotion does not add to the complexity surplus pages.  Demotion also honors
>> reserved huge pages.  Therefore, when a value is written to the sysfs demote
>> file, that value is only the maximum number of pages which will be demoted.
>> It is possible fewer will actually be demoted.
>>
>> If demote_size is PAGESIZE, demote will simply free pages to the buddy
>> allocator.
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86.  Their
>> use case is as described above.  They have experienced the issues with
>> performance and not necessarily getting the excepted number smaller huge
> 
> ("expected")

yes, will fix typo

> 
>> pages after free/allocate cycle.
>>
> 
> It seems odd to add the interfaces in patch 1 then document them in
> patch 5.  Why not add-and-document in a single patch?
> 

Yes, makes sense.  Will combine these.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17  0:17   ` Mike Kravetz
@ 2021-08-17  0:39     ` Andrew Morton
  2021-08-17  0:58       ` Mike Kravetz
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2021-08-17  0:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On Mon, 16 Aug 2021 17:17:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > 
> > And how does one know the operation has completed so the sysfs files
> > can be reloaded for another operation?
> > 
> 
> When the write to the file is complete, the operation has completed.
> Not exactly sure what you mean by reloading the sysfs files for
> another operation?

If userspace wishes to perform another demote operation, it must wait
for the preceding one to complete.

Presumably if thread A is blocked in a write to `demote' and thread B
concurrently tries to perform a demotion, thread B will be blocked as
well?  Was this tested?

Lots of things are to be added to changelogs & documentation, methinks.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-16 23:27 ` Andrew Morton
@ 2021-08-17  0:46   ` Mike Kravetz
  2021-08-17  1:46     ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-08-17  0:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On 8/16/21 4:27 PM, Andrew Morton wrote:
> Also, pushback...

That is welcome.  I only have the one specific use case mentioned here.

> 
> On Mon, 16 Aug 2021 15:49:45 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>
>> Real world use cases
>> --------------------
>> There are groups today using hugetlb pages to back VMs on x86.  Their
>> use case is as described above.  They have experienced the issues with
>> performance and not necessarily getting the excepted number smaller huge
> 
> ("number of")

thanks, another typo to fix.

> 
>> pages after free/allocate cycle.
>>
> 
> It really is a ton of new code.  I think we're owed much more detail
> about the problem than the above.  To be confident that all this
> material is truly justified?

The desired functionality for this specific use case is to simply
convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned

"Converting larger to smaller hugetlb pages can be accomplished today by
 first freeing the larger page to the buddy allocator and then allocating
 the smaller pages.  However, there are two issues with this approach:
 1) This process can take quite some time, especially if allocation of
    the smaller pages is not immediate and requires migration/compaction.
 2) There is no guarantee that the total size of smaller pages allocated
    will match the size of the larger page which was freed.  This is
    because the area freed by the larger page could quickly be
    fragmented."

These two issues have been experienced in practice.

A big chunk of the code changes (aprox 50%) is for the vmemmap
optimizations.  This is also the most complex part of the changes.
I added the code as interaction with vmemmap reduction was discussed
during the RFC.  It is only a performance enhancement and honestly
may not be worth the cost/risk.  I will get some numbers to measure
the actual benefit.

> 
> Also, some selftests and benchmarking code in tools/testing/selftests
> would be helpful?
> 

Will do.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17  0:39     ` Andrew Morton
@ 2021-08-17  0:58       ` Mike Kravetz
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-17  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On 8/16/21 5:39 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 17:17:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>>
>>> And how does one know the operation has completed so the sysfs files
>>> can be reloaded for another operation?
>>>
>>
>> When the write to the file is complete, the operation has completed.
>> Not exactly sure what you mean by reloading the sysfs files for
>> another operation?
> 
> If userspace wishes to perform another demote operation, it must wait
> for the preceding one to complete.
> 
> Presumably if thread A is blocked in a write to `demote' and thread B
> concurrently tries to perform a demotion, thread B will be blocked as
> well?  Was this tested?

I must admit that I did not specifically test this.  However, the
patch series to make freeing of hugetlb pages IRQ safe added a
(per-hstate)mutex that is taken when sysfs files modify the number of
huge pages.  demote writes take this mutex (patch 1).  That synchronizes
not only concurrent writes to demote but also concurrent modifications
of nr_hugepages.

> 
> Lots of things are to be added to changelogs & documentation, methinks.
> 

OK.

Thank you for taking a look at this!
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/8] hugetlb: document the demote sysfs interfaces
  2021-08-16 23:28   ` Andrew Morton
@ 2021-08-17  1:04     ` Mike Kravetz
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-08-17  1:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On 8/16/21 4:28 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 15:49:50 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> Describe demote and demote_size interfaces.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  Documentation/admin-guide/mm/hugetlbpage.rst | 29 ++++++++++++++++++--
>>  1 file changed, 27 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
>> index 8abaeb144e44..902059a0257b 100644
>> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
>> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
>> @@ -234,8 +234,12 @@ will exist, of the form::
>>  
>>  	hugepages-${size}kB
>>  
>> -Inside each of these directories, the same set of files will exist::
>> +Inside each of these directories, the set of files contained in ``/proc``
>> +will exist.  In addition, two additional interfaces for demoting huge
>> +pages will exist::
>>  
>> +        demote
>> +        demote_size
>>  	nr_hugepages
>>  	nr_hugepages_mempolicy
>>  	nr_overcommit_hugepages
>> @@ -243,7 +247,28 @@ Inside each of these directories, the same set of files will exist::
>>  	resv_hugepages
>>  	surplus_hugepages
>>  
>> -which function as described above for the default huge page-sized case.
>> +The demote interfaces provide the ability to split a huge page into
>> +smaller huge pages.  For example, the x86 architecture supports both
>> +1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
>> +2MB huge pages.  The demote interfaces are:
>> +
>> +demote_size
>> +        is the size of demoted pages.  When a page is demoted a corresponding
>> +        number of huge pages of demote_size will be created.  For huge pages
>> +        of the smallest supported size (2MB on x86), demote_size will be the
>> +        system page size (PAGE_SIZE).  If demote_size is the system page size
>> +        then demoting a page will simply free the huge page.  demote_size is
>> +        a read only interface.
>> +
>> +demote
>> +        is used to demote a number of huge pages.  A user with root privileges
>> +        can write to this file.  It may not be possible to demote the
>> +        requested number of huge pages.  To determine how many pages were
>> +        actually demoted, compare the value of nr_hugepages before and after
>> +        writing to the demote interface.  demote is a write only interface.
>> +
>> +The interfaces which are the same as in ``/proc`` function as described
>> +above for the default huge page-sized case.
> 
> Are these new demote interfaces duplicated in /proc? 
> Documentation/admin-guide/mm/hugetlbpage.rst says "The ``/proc``
> interfaces discussed above have been retained for backwards
> compatibility.", so new interfaces need not appear in /proc?
> 

The new demote interfaces are only in sysfs, they are not /proc.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17  0:46   ` Mike Kravetz
@ 2021-08-17  1:46     ` Andrew Morton
  2021-08-17  7:30       ` David Hildenbrand
  2021-08-24 22:08       ` Mike Kravetz
  0 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2021-08-17  1:46 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes

On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > It really is a ton of new code.  I think we're owed much more detail
> > about the problem than the above.  To be confident that all this
> > material is truly justified?
> 
> The desired functionality for this specific use case is to simply
> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
> 
> "Converting larger to smaller hugetlb pages can be accomplished today by
>  first freeing the larger page to the buddy allocator and then allocating
>  the smaller pages.  However, there are two issues with this approach:
>  1) This process can take quite some time, especially if allocation of
>     the smaller pages is not immediate and requires migration/compaction.
>  2) There is no guarantee that the total size of smaller pages allocated
>     will match the size of the larger page which was freed.  This is
>     because the area freed by the larger page could quickly be
>     fragmented."
> 
> These two issues have been experienced in practice.

Well the first issue is quantifiable.  What is "some time"?  If it's
people trying to get a 5% speedup on a rare operation because hey,
bugging the kernel developers doesn't cost me anything then perhaps we
have better things to be doing.

And the second problem would benefit from some words to help us
understand how much real-world hurt this causes, and how frequently.
And let's understand what the userspace workarounds look like, etc.

> A big chunk of the code changes (aprox 50%) is for the vmemmap
> optimizations.  This is also the most complex part of the changes.
> I added the code as interaction with vmemmap reduction was discussed
> during the RFC.  It is only a performance enhancement and honestly
> may not be worth the cost/risk.  I will get some numbers to measure
> the actual benefit.

Cool.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17  1:46     ` Andrew Morton
@ 2021-08-17  7:30       ` David Hildenbrand
  2021-08-17 16:19         ` Mike Kravetz
  2021-08-24 22:08       ` Mike Kravetz
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2021-08-17  7:30 UTC (permalink / raw)
  To: Andrew Morton, Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes

On 17.08.21 03:46, Andrew Morton wrote:
> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>> It really is a ton of new code.  I think we're owed much more detail
>>> about the problem than the above.  To be confident that all this
>>> material is truly justified?
>>
>> The desired functionality for this specific use case is to simply
>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>
>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>   first freeing the larger page to the buddy allocator and then allocating
>>   the smaller pages.  However, there are two issues with this approach:
>>   1) This process can take quite some time, especially if allocation of
>>      the smaller pages is not immediate and requires migration/compaction.
>>   2) There is no guarantee that the total size of smaller pages allocated
>>      will match the size of the larger page which was freed.  This is
>>      because the area freed by the larger page could quickly be
>>      fragmented."
>>
>> These two issues have been experienced in practice.
> 
> Well the first issue is quantifiable.  What is "some time"?  If it's
> people trying to get a 5% speedup on a rare operation because hey,
> bugging the kernel developers doesn't cost me anything then perhaps we
> have better things to be doing.
> 
> And the second problem would benefit from some words to help us
> understand how much real-world hurt this causes, and how frequently.
> And let's understand what the userspace workarounds look like, etc.
> 
>> A big chunk of the code changes (aprox 50%) is for the vmemmap
>> optimizations.  This is also the most complex part of the changes.
>> I added the code as interaction with vmemmap reduction was discussed
>> during the RFC.  It is only a performance enhancement and honestly
>> may not be worth the cost/risk.  I will get some numbers to measure
>> the actual benefit.

If it really makes that much of a difference code/complexity wise, would 
it make sense to just limit denote functionality to the !vmemmap case 
for now?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17  7:30       ` David Hildenbrand
@ 2021-08-17 16:19         ` Mike Kravetz
  2021-08-17 18:49           ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-08-17 16:19 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes

On 8/17/21 12:30 AM, David Hildenbrand wrote:
> On 17.08.21 03:46, Andrew Morton wrote:
>> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>>>> It really is a ton of new code.  I think we're owed much more detail
>>>> about the problem than the above.  To be confident that all this
>>>> material is truly justified?
>>>
>>> The desired functionality for this specific use case is to simply
>>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>>
>>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>>   first freeing the larger page to the buddy allocator and then allocating
>>>   the smaller pages.  However, there are two issues with this approach:
>>>   1) This process can take quite some time, especially if allocation of
>>>      the smaller pages is not immediate and requires migration/compaction.
>>>   2) There is no guarantee that the total size of smaller pages allocated
>>>      will match the size of the larger page which was freed.  This is
>>>      because the area freed by the larger page could quickly be
>>>      fragmented."
>>>
>>> These two issues have been experienced in practice.
>>
>> Well the first issue is quantifiable.  What is "some time"?  If it's
>> people trying to get a 5% speedup on a rare operation because hey,
>> bugging the kernel developers doesn't cost me anything then perhaps we
>> have better things to be doing.
>>
>> And the second problem would benefit from some words to help us
>> understand how much real-world hurt this causes, and how frequently.
>> And let's understand what the userspace workarounds look like, etc.
>>
>>> A big chunk of the code changes (aprox 50%) is for the vmemmap
>>> optimizations.  This is also the most complex part of the changes.
>>> I added the code as interaction with vmemmap reduction was discussed
>>> during the RFC.  It is only a performance enhancement and honestly
>>> may not be worth the cost/risk.  I will get some numbers to measure
>>> the actual benefit.
> 
> If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now?
> 

Handling vmemmap optimized huge pages is not that big of a deal.  We
just use the existing functionality to populate vmemmap for the page
being demoted, and free vmemmap for resulting pages of demoted size.

This obviously is not 'optimal' for demote as we will allocate more
vmemmap pages than needed and then free the excess pages.  The complex
part is not over allocating vmemmap and only sparsely populating vmemmap
for the target pages of demote size.  This is all done in patches 6-8.
I am happy to drop these patches for now.  The are the most complex (and
ugly) of this series.  As mentioned, they do not provide any additional
functionality.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17 16:19         ` Mike Kravetz
@ 2021-08-17 18:49           ` David Hildenbrand
  0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2021-08-17 18:49 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes

On 17.08.21 18:19, Mike Kravetz wrote:
> On 8/17/21 12:30 AM, David Hildenbrand wrote:
>> On 17.08.21 03:46, Andrew Morton wrote:
>>> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>>> It really is a ton of new code.  I think we're owed much more detail
>>>>> about the problem than the above.  To be confident that all this
>>>>> material is truly justified?
>>>>
>>>> The desired functionality for this specific use case is to simply
>>>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>>>
>>>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>>>    first freeing the larger page to the buddy allocator and then allocating
>>>>    the smaller pages.  However, there are two issues with this approach:
>>>>    1) This process can take quite some time, especially if allocation of
>>>>       the smaller pages is not immediate and requires migration/compaction.
>>>>    2) There is no guarantee that the total size of smaller pages allocated
>>>>       will match the size of the larger page which was freed.  This is
>>>>       because the area freed by the larger page could quickly be
>>>>       fragmented."
>>>>
>>>> These two issues have been experienced in practice.
>>>
>>> Well the first issue is quantifiable.  What is "some time"?  If it's
>>> people trying to get a 5% speedup on a rare operation because hey,
>>> bugging the kernel developers doesn't cost me anything then perhaps we
>>> have better things to be doing.
>>>
>>> And the second problem would benefit from some words to help us
>>> understand how much real-world hurt this causes, and how frequently.
>>> And let's understand what the userspace workarounds look like, etc.
>>>
>>>> A big chunk of the code changes (aprox 50%) is for the vmemmap
>>>> optimizations.  This is also the most complex part of the changes.
>>>> I added the code as interaction with vmemmap reduction was discussed
>>>> during the RFC.  It is only a performance enhancement and honestly
>>>> may not be worth the cost/risk.  I will get some numbers to measure
>>>> the actual benefit.
>>
>> If it really makes that much of a difference code/complexity wise, would it make sense to just limit denote functionality to the !vmemmap case for now?
>>
> 
> Handling vmemmap optimized huge pages is not that big of a deal.  We
> just use the existing functionality to populate vmemmap for the page
> being demoted, and free vmemmap for resulting pages of demoted size.
> 
> This obviously is not 'optimal' for demote as we will allocate more
> vmemmap pages than needed and then free the excess pages.  The complex
> part is not over allocating vmemmap and only sparsely populating vmemmap
> for the target pages of demote size.  This is all done in patches 6-8.
> I am happy to drop these patches for now.  The are the most complex (and
> ugly) of this series.  As mentioned, they do not provide any additional
> functionality.
> 

Just looking at the diffstat, that looks like a good idea to me :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-17  1:46     ` Andrew Morton
  2021-08-17  7:30       ` David Hildenbrand
@ 2021-08-24 22:08       ` Mike Kravetz
  2021-08-27 17:22         ` Vlastimil Babka
  1 sibling, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-08-24 22:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Vlastimil Babka, Hillf Danton

Add Vlastimil and Hillf,

On 8/16/21 6:46 PM, Andrew Morton wrote:
> On Mon, 16 Aug 2021 17:46:58 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>> It really is a ton of new code.  I think we're owed much more detail
>>> about the problem than the above.  To be confident that all this
>>> material is truly justified?
>>
>> The desired functionality for this specific use case is to simply
>> convert a 1G huegtlb page to 512 2MB hugetlb pages.  As mentioned
>>
>> "Converting larger to smaller hugetlb pages can be accomplished today by
>>  first freeing the larger page to the buddy allocator and then allocating
>>  the smaller pages.  However, there are two issues with this approach:
>>  1) This process can take quite some time, especially if allocation of
>>     the smaller pages is not immediate and requires migration/compaction.
>>  2) There is no guarantee that the total size of smaller pages allocated
>>     will match the size of the larger page which was freed.  This is
>>     because the area freed by the larger page could quickly be
>>     fragmented."
>>
>> These two issues have been experienced in practice.
> 
> Well the first issue is quantifiable.  What is "some time"?  If it's
> people trying to get a 5% speedup on a rare operation because hey,
> bugging the kernel developers doesn't cost me anything then perhaps we
> have better things to be doing.

Well, I set up a test environment on a larger system to get some
numbers.  My 'load' on the system was filling the page cache with
clean pages.  The thought is that these pages could easily be reclaimed.

When trying to get numbers I hit a hugetlb page allocation stall where
__alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
least an hour).  It was very much like the symptoms addressed here:
https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/

This was on 5.14.0-rc6-next-20210820.

I'll do some more digging as this appears to be some dark corner case of
reclaim and/or compaction.  The 'good news' is that I can reproduce
this.

> And the second problem would benefit from some words to help us
> understand how much real-world hurt this causes, and how frequently.
> And let's understand what the userspace workarounds look like, etc.

The stall above was from doing a simple 'free 1GB page' followed by
'allocate 512 MB pages' from userspace.

Getting out another version of this series will be delayed, as I think
we need to address or understand this issue first.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-24 22:08       ` Mike Kravetz
@ 2021-08-27 17:22         ` Vlastimil Babka
  2021-08-27 23:04           ` Mike Kravetz
  0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2021-08-27 17:22 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Hillf Danton

On 8/25/21 00:08, Mike Kravetz wrote:
> Add Vlastimil and Hillf,
> 
> Well, I set up a test environment on a larger system to get some
> numbers.  My 'load' on the system was filling the page cache with
> clean pages.  The thought is that these pages could easily be reclaimed.
> 
> When trying to get numbers I hit a hugetlb page allocation stall where
> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
> least an hour).  It was very much like the symptoms addressed here:
> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/
> 
> This was on 5.14.0-rc6-next-20210820.
> 
> I'll do some more digging as this appears to be some dark corner case of
> reclaim and/or compaction.  The 'good news' is that I can reproduce
> this.

Interesting, let's see if that's some kind of new regression.

>> And the second problem would benefit from some words to help us
>> understand how much real-world hurt this causes, and how frequently.
>> And let's understand what the userspace workarounds look like, etc.
> 
> The stall above was from doing a simple 'free 1GB page' followed by
> 'allocate 512 MB pages' from userspace.

Is the allocation different in any way than the usual hugepage allocation
possible today?

> Getting out another version of this series will be delayed, as I think
> we need to address or understand this issue first.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-27 17:22         ` Vlastimil Babka
@ 2021-08-27 23:04           ` Mike Kravetz
  2021-08-30 10:11             ` Vlastimil Babka
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-08-27 23:04 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Hillf Danton

On 8/27/21 10:22 AM, Vlastimil Babka wrote:
> On 8/25/21 00:08, Mike Kravetz wrote:
>> Add Vlastimil and Hillf,
>>
>> Well, I set up a test environment on a larger system to get some
>> numbers.  My 'load' on the system was filling the page cache with
>> clean pages.  The thought is that these pages could easily be reclaimed.
>>
>> When trying to get numbers I hit a hugetlb page allocation stall where
>> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
>> least an hour).  It was very much like the symptoms addressed here:
>> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@oracle.com/
>>
>> This was on 5.14.0-rc6-next-20210820.
>>
>> I'll do some more digging as this appears to be some dark corner case of
>> reclaim and/or compaction.  The 'good news' is that I can reproduce
>> this.
> 
> Interesting, let's see if that's some kind of new regression.
> 
>>> And the second problem would benefit from some words to help us
>>> understand how much real-world hurt this causes, and how frequently.
>>> And let's understand what the userspace workarounds look like, etc.
>>
>> The stall above was from doing a simple 'free 1GB page' followed by
>> 'allocate 512 MB pages' from userspace.
> 
> Is the allocation different in any way than the usual hugepage allocation
> possible today?

No, it is the same.  I was just following the demote use case of free
1GB page and allocate 512 2MB pages.  Of course, you would mostly expect
that to succeed.  The exception is if something else is running and
grabs some of those 1GB worth of contiguous pages such that you can not
allocate 512 2MB pages.  My test case was to have those freed pages be
used for file I/O but kept clean so that could easily be reclaimed.

I 'may' have been over stressing the system with all CPUs doing file
reads to fill the page cache with clean pages.  I certainly need to
spend some more debug/analysis time on this.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-27 23:04           ` Mike Kravetz
@ 2021-08-30 10:11             ` Vlastimil Babka
  2021-09-02 18:17               ` Mike Kravetz
  0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2021-08-30 10:11 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Hillf Danton

On 8/28/21 01:04, Mike Kravetz wrote:
> On 8/27/21 10:22 AM, Vlastimil Babka wrote:
> I 'may' have been over stressing the system with all CPUs doing file
> reads to fill the page cache with clean pages.  I certainly need to
> spend some more debug/analysis time on this.

Hm that *could* play a role, as these will allow reclaim to make progress, but
also the reclaimed pages might be stolen immediately and compaction will return
COMPACT_SKIPPED and in should_compact_retry() we might go through this code path:

        /*      
         * compaction was skipped because there are not enough order-0 pages
         * to work with, so we retry only if it looks like reclaim can help.
         */
        if (compaction_needs_reclaim(compact_result)) {
                ret = compaction_zonelist_suitable(ac, order, alloc_flags);
                goto out;
        }       

where compaction_zonelist_suitable() will return true because it appears
reclaim can free pages to allow progress. And there are no max retries
applied for this case.
With the reclaim and compaction tracepoints it should be possible to
confirm this scenario.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-08-30 10:11             ` Vlastimil Babka
@ 2021-09-02 18:17               ` Mike Kravetz
  2021-09-06 14:40                 ` Vlastimil Babka
       [not found]                 ` <20210907085001.3773-1-hdanton@sina.com>
  0 siblings, 2 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-09-02 18:17 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Hillf Danton

On 8/30/21 3:11 AM, Vlastimil Babka wrote:
> On 8/28/21 01:04, Mike Kravetz wrote:
>> On 8/27/21 10:22 AM, Vlastimil Babka wrote:
>> I 'may' have been over stressing the system with all CPUs doing file
>> reads to fill the page cache with clean pages.  I certainly need to
>> spend some more debug/analysis time on this.
> 
> Hm that *could* play a role, as these will allow reclaim to make progress, but
> also the reclaimed pages might be stolen immediately and compaction will return
> COMPACT_SKIPPED and in should_compact_retry() we might go through this code path:
> 
>         /*      
>          * compaction was skipped because there are not enough order-0 pages
>          * to work with, so we retry only if it looks like reclaim can help.
>          */
>         if (compaction_needs_reclaim(compact_result)) {
>                 ret = compaction_zonelist_suitable(ac, order, alloc_flags);
>                 goto out;
>         }       
> 
> where compaction_zonelist_suitable() will return true because it appears
> reclaim can free pages to allow progress. And there are no max retries
> applied for this case.
> With the reclaim and compaction tracepoints it should be possible to
> confirm this scenario.

Here is some very high level information from a long stall that was
interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().

55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
[55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     

Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
should_reclaim_retry returned true 25 times and should_compact_retry returned
true 609647 times.
Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
607 jiffies spent in __alloc_pages_direct_compact.

Looks like both
reclaim retries > MAX_RECLAIM_RETRIES
and
compaction retries > MAX_COMPACT_RETRIES
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-02 18:17               ` Mike Kravetz
@ 2021-09-06 14:40                 ` Vlastimil Babka
       [not found]                 ` <20210907085001.3773-1-hdanton@sina.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2021-09-06 14:40 UTC (permalink / raw)
  To: Mike Kravetz, Andrew Morton
  Cc: linux-mm, linux-kernel, David Hildenbrand, Michal Hocko,
	Oscar Salvador, Zi Yan, Muchun Song, Naoya Horiguchi,
	David Rientjes, Hillf Danton

On 9/2/21 20:17, Mike Kravetz wrote:
> On 8/30/21 3:11 AM, Vlastimil Babka wrote:
>> On 8/28/21 01:04, Mike Kravetz wrote:
>>> On 8/27/21 10:22 AM, Vlastimil Babka wrote:
>>> I 'may' have been over stressing the system with all CPUs doing file
>>> reads to fill the page cache with clean pages.  I certainly need to
>>> spend some more debug/analysis time on this.
>>
>> Hm that *could* play a role, as these will allow reclaim to make progress, but
>> also the reclaimed pages might be stolen immediately and compaction will return
>> COMPACT_SKIPPED and in should_compact_retry() we might go through this code path:
>>
>>         /*      
>>          * compaction was skipped because there are not enough order-0 pages
>>          * to work with, so we retry only if it looks like reclaim can help.
>>          */
>>         if (compaction_needs_reclaim(compact_result)) {
>>                 ret = compaction_zonelist_suitable(ac, order, alloc_flags);
>>                 goto out;
>>         }       
>>
>> where compaction_zonelist_suitable() will return true because it appears
>> reclaim can free pages to allow progress. And there are no max retries
>> applied for this case.
>> With the reclaim and compaction tracepoints it should be possible to
>> confirm this scenario.
> 
> Here is some very high level information from a long stall that was
> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
> 
> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
> 
> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
> should_reclaim_retry returned true 25 times and should_compact_retry returned
> true 609647 times.
> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
> 607 jiffies spent in __alloc_pages_direct_compact.
> 
> Looks like both
> reclaim retries > MAX_RECLAIM_RETRIES
> and
> compaction retries > MAX_COMPACT_RETRIES
> 

Yeah AFAICS that's only possible with the scenario I suspected. I guess
we should put a limit on compact retries (maybe some multiple of
MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
clearly it doesn't (i.e. because somebody else is stealing the page like
in your test case).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
       [not found]                 ` <20210907085001.3773-1-hdanton@sina.com>
@ 2021-09-08 21:00                   ` Mike Kravetz
  2021-09-09 11:54                     ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-09-08 21:00 UTC (permalink / raw)
  To: Hillf Danton, Vlastimil Babka; +Cc: linux-mm, linux-kernel, Michal Hocko

On 9/7/21 1:50 AM, Hillf Danton wrote:
> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>> On 9/2/21 20:17, Mike Kravetz wrote:
>>>
>>> Here is some very high level information from a long stall that was
>>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
>>>
>>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
>>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
>>>
>>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
>>> should_reclaim_retry returned true 25 times and should_compact_retry returned
>>> true 609647 times.
>>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
>>> 607 jiffies spent in __alloc_pages_direct_compact.
>>>
>>> Looks like both
>>> reclaim retries > MAX_RECLAIM_RETRIES
>>> and
>>> compaction retries > MAX_COMPACT_RETRIES
>>>
>> Yeah AFAICS that's only possible with the scenario I suspected. I guess
>> we should put a limit on compact retries (maybe some multiple of
>> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
>> clearly it doesn't (i.e. because somebody else is stealing the page like
>> in your test case).
> 
> And/or clamp reclaim retries for costly orders
> 
> 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> 
> to pull down the chance for stall as low as possible.

Thanks, and sorry for not replying quickly.  I only get back to this as
time allows.

We could clamp the number of compaction and reclaim retries in
__alloc_pages_slowpath as suggested.  However, I noticed that a single
reclaim call could take a bunch of time.  As a result, I instrumented
shrink_node to see what might be happening.  Here is some information
from a long stall.  Note that I only dump stats when jiffies > 100000.

[ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
[ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
[ 8136.887643]              compaction_suitable results:
[ 8136.893276]     idx COMPACT_SKIPPED, 3557109
[ 8672.399839] shrink_node: 522076 total jiffies,  3466228 tries
[ 8672.406268]              124427720 reclaimed, 32 nr_to_reclaim
[ 8672.412782]              compaction_suitable results:
[ 8672.418421]     idx COMPACT_SKIPPED, 3466227
[ 8908.099592] __alloc_pages_slowpath: jiffies 2939938  tries 17068 cpu_tries 1   node 0 success
[ 8908.109120]     r_tries 11       c_tries 17056    reclaim 2939865  compact 9

In this case, clamping the number of retries from should_compact_retry
and should_reclaim_retry could help.  Mostly because we will not be
calling back into the reclaim code?  Notice the long amount of time spent
in shrink_node.  The 'tries' in shrink_node come about from that:

	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
				    sc))
		goto again;

compaction_suitable results is the values returned from calls to
should_continue_reclaim -> compaction_suitable.

Trying to think if there might be an intelligent way to quit early.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-08 21:00                   ` Mike Kravetz
@ 2021-09-09 11:54                     ` Michal Hocko
  2021-09-09 13:45                       ` Vlastimil Babka
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-09-09 11:54 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Hillf Danton, Vlastimil Babka, linux-mm, linux-kernel

On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
> On 9/7/21 1:50 AM, Hillf Danton wrote:
> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
> >> On 9/2/21 20:17, Mike Kravetz wrote:
> >>>
> >>> Here is some very high level information from a long stall that was
> >>> interrupted.  This was an order 9 allocation from alloc_buddy_huge_page().
> >>>
> >>> 55269.530564] __alloc_pages_slowpath: jiffies 47329325 tries 609673 cpu_tries 1   node 0 FAIL
> >>> [55269.539893]     r_tries 25       c_tries 609647   reclaim 47325161 compact 607     
> >>>
> >>> Yes, in __alloc_pages_slowpath for 47329325 jiffies before being interrupted.
> >>> should_reclaim_retry returned true 25 times and should_compact_retry returned
> >>> true 609647 times.
> >>> Almost all time (47325161 jiffies) spent in __alloc_pages_direct_reclaim, and
> >>> 607 jiffies spent in __alloc_pages_direct_compact.
> >>>
> >>> Looks like both
> >>> reclaim retries > MAX_RECLAIM_RETRIES
> >>> and
> >>> compaction retries > MAX_COMPACT_RETRIES
> >>>
> >> Yeah AFAICS that's only possible with the scenario I suspected. I guess
> >> we should put a limit on compact retries (maybe some multiple of
> >> MAX_COMPACT_RETRIES) even if it thinks that reclaim could help, while
> >> clearly it doesn't (i.e. because somebody else is stealing the page like
> >> in your test case).
> > 
> > And/or clamp reclaim retries for costly orders
> > 
> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> > 
> > to pull down the chance for stall as low as possible.
> 
> Thanks, and sorry for not replying quickly.  I only get back to this as
> time allows.
> 
> We could clamp the number of compaction and reclaim retries in
> __alloc_pages_slowpath as suggested.  However, I noticed that a single
> reclaim call could take a bunch of time.  As a result, I instrumented
> shrink_node to see what might be happening.  Here is some information
> from a long stall.  Note that I only dump stats when jiffies > 100000.
> 
> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
> [ 8136.887643]              compaction_suitable results:
> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109

Can you get a more detailed break down of where the time is spent. Also
How come the number of reclaimed pages is so excessive comparing to the
reclaim target? There is something fishy going on here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-09 11:54                     ` Michal Hocko
@ 2021-09-09 13:45                       ` Vlastimil Babka
  2021-09-09 21:31                         ` Mike Kravetz
  2021-09-10  8:20                         ` Michal Hocko
  0 siblings, 2 replies; 38+ messages in thread
From: Vlastimil Babka @ 2021-09-09 13:45 UTC (permalink / raw)
  To: Michal Hocko, Mike Kravetz; +Cc: Hillf Danton, linux-mm, linux-kernel

On 9/9/21 13:54, Michal Hocko wrote:
> On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
>> On 9/7/21 1:50 AM, Hillf Danton wrote:
>> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>> > 
>> > And/or clamp reclaim retries for costly orders
>> > 
>> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
>> > 
>> > to pull down the chance for stall as low as possible.
>> 
>> Thanks, and sorry for not replying quickly.  I only get back to this as
>> time allows.
>> 
>> We could clamp the number of compaction and reclaim retries in
>> __alloc_pages_slowpath as suggested.  However, I noticed that a single
>> reclaim call could take a bunch of time.  As a result, I instrumented
>> shrink_node to see what might be happening.  Here is some information
>> from a long stall.  Note that I only dump stats when jiffies > 100000.
>> 
>> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
>> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
>> [ 8136.887643]              compaction_suitable results:
>> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
> 
> Can you get a more detailed break down of where the time is spent. Also
> How come the number of reclaimed pages is so excessive comparing to the
> reclaim target? There is something fishy going on here.

I would say it's simply should_continue_reclaim() behaving similarly to
should_compact_retry(). We'll get compaction_suitable() returning
COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
and compaction indicates there's not enough base pages to begin with to form
a high-order pages. Since the stolen pages will appear on inactive lru, it
seems to be worth continuing reclaim to make enough free base pages for
compaction to no longer be skipped, because "inactive_lru_pages >
pages_for_compaction" is true.

So, both should_continue_reclaim() and should_compact_retry() are unable to
recognize that reclaimed pages are being stolen and limit the retries in
that case. The scenario seems to be uncommon, otherwise we'd be getting more
reports of that.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-09 13:45                       ` Vlastimil Babka
@ 2021-09-09 21:31                         ` Mike Kravetz
  2021-09-10  8:20                         ` Michal Hocko
  1 sibling, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-09-09 21:31 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko; +Cc: Hillf Danton, linux-mm, linux-kernel

On 9/9/21 6:45 AM, Vlastimil Babka wrote:
> On 9/9/21 13:54, Michal Hocko wrote:
>> On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
>>> On 9/7/21 1:50 AM, Hillf Danton wrote:
>>>> On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
>>>>
>>>> And/or clamp reclaim retries for costly orders
>>>>
>>>> 	reclaim retries = MAX_RECLAIM_RETRIES - order;
>>>>
>>>> to pull down the chance for stall as low as possible.
>>>
>>> Thanks, and sorry for not replying quickly.  I only get back to this as
>>> time allows.
>>>
>>> We could clamp the number of compaction and reclaim retries in
>>> __alloc_pages_slowpath as suggested.  However, I noticed that a single
>>> reclaim call could take a bunch of time.  As a result, I instrumented
>>> shrink_node to see what might be happening.  Here is some information
>>> from a long stall.  Note that I only dump stats when jiffies > 100000.
>>>
>>> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
>>> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
>>> [ 8136.887643]              compaction_suitable results:
>>> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
>>
>> Can you get a more detailed break down of where the time is spent. Also
>> How come the number of reclaimed pages is so excessive comparing to the
>> reclaim target? There is something fishy going on here.
> 
> I would say it's simply should_continue_reclaim() behaving similarly to
> should_compact_retry(). We'll get compaction_suitable() returning
> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
> and compaction indicates there's not enough base pages to begin with to form
> a high-order pages. Since the stolen pages will appear on inactive lru, it
> seems to be worth continuing reclaim to make enough free base pages for
> compaction to no longer be skipped, because "inactive_lru_pages >
> pages_for_compaction" is true.
> 
> So, both should_continue_reclaim() and should_compact_retry() are unable to
> recognize that reclaimed pages are being stolen and limit the retries in
> that case. The scenario seems to be uncommon, otherwise we'd be getting more
> reports of that.

Yes, I believe this is what is happening.

I honestly do not know if my test/recreation scenario is realistic.

I do know that our DB team has had issues with allocating a number of
hugetlb pages (after much uptime) taking forever or a REALLY long time.
These are all 2MB huge page allocations, so going through the normal
page allocation code path.  No idea what else is running on the system
at the time of the allocation stalls.  Unfortunately, this can not be
reproduced at will in their environment.  As a result, I have no data
and just this brief description of the issue.  When I stumbled on an
easy way to recreate, I thought it would be worth investigating/fixing.

It certainly does not seem to be a common scenario.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-09 13:45                       ` Vlastimil Babka
  2021-09-09 21:31                         ` Mike Kravetz
@ 2021-09-10  8:20                         ` Michal Hocko
  2021-09-11  0:11                           ` Mike Kravetz
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-09-10  8:20 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mike Kravetz, Hillf Danton, linux-mm, linux-kernel

On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
> On 9/9/21 13:54, Michal Hocko wrote:
> > On Wed 08-09-21 14:00:19, Mike Kravetz wrote:
> >> On 9/7/21 1:50 AM, Hillf Danton wrote:
> >> > On Mon, 6 Sep 2021 16:40:28 +0200 Vlastimil Babka wrote:
> >> > 
> >> > And/or clamp reclaim retries for costly orders
> >> > 
> >> > 	reclaim retries = MAX_RECLAIM_RETRIES - order;
> >> > 
> >> > to pull down the chance for stall as low as possible.
> >> 
> >> Thanks, and sorry for not replying quickly.  I only get back to this as
> >> time allows.
> >> 
> >> We could clamp the number of compaction and reclaim retries in
> >> __alloc_pages_slowpath as suggested.  However, I noticed that a single
> >> reclaim call could take a bunch of time.  As a result, I instrumented
> >> shrink_node to see what might be happening.  Here is some information
> >> from a long stall.  Note that I only dump stats when jiffies > 100000.
> >> 
> >> [ 8136.874706] shrink_node: 507654 total jiffies,  3557110 tries
> >> [ 8136.881130]              130596341 reclaimed, 32 nr_to_reclaim
> >> [ 8136.887643]              compaction_suitable results:
> >> [ 8136.893276]     idx COMPACT_SKIPPED, 3557109
> > 
> > Can you get a more detailed break down of where the time is spent. Also
> > How come the number of reclaimed pages is so excessive comparing to the
> > reclaim target? There is something fishy going on here.
> 
> I would say it's simply should_continue_reclaim() behaving similarly to
> should_compact_retry(). We'll get compaction_suitable() returning
> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
> and compaction indicates there's not enough base pages to begin with to form
> a high-order pages. Since the stolen pages will appear on inactive lru, it
> seems to be worth continuing reclaim to make enough free base pages for
> compaction to no longer be skipped, because "inactive_lru_pages >
> pages_for_compaction" is true.
> 
> So, both should_continue_reclaim() and should_compact_retry() are unable to
> recognize that reclaimed pages are being stolen and limit the retries in
> that case. The scenario seems to be uncommon, otherwise we'd be getting more
> reports of that.

Thanks this sounds like a viable explanation. For some reason I have
thought that a single reclaim round can take that much time but reading
again it seems like a cumulative thing.

I do agree that we should consider the above scenario when bailing out.
It is simply unreasonable to reclaim so much memory without any forward
progress.

A more robust way to address this problem (which is not really new)
would be to privatize reclaimed pages in the direct reclaim context and
reuse them in the compaction so that it doesn't have to just
pro-actively reclaim to keep watermarks good enough. A normal low order
requests would benefit from that as well because the reclaim couldn't be
stolen from them as well.

Another approach would be to detect the situation and treat reclaim
success which immediatelly leads to compaction deferral due to
watermarks as a failure.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-10  8:20                         ` Michal Hocko
@ 2021-09-11  0:11                           ` Mike Kravetz
  2021-09-13 15:50                             ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-09-11  0:11 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka; +Cc: Hillf Danton, linux-mm, linux-kernel

On 9/10/21 1:20 AM, Michal Hocko wrote:
> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>
>> I would say it's simply should_continue_reclaim() behaving similarly to
>> should_compact_retry(). We'll get compaction_suitable() returning
>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>> and compaction indicates there's not enough base pages to begin with to form
>> a high-order pages. Since the stolen pages will appear on inactive lru, it
>> seems to be worth continuing reclaim to make enough free base pages for
>> compaction to no longer be skipped, because "inactive_lru_pages >
>> pages_for_compaction" is true.
>>
>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>> recognize that reclaimed pages are being stolen and limit the retries in
>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>> reports of that.
> 
> Thanks this sounds like a viable explanation. For some reason I have
> thought that a single reclaim round can take that much time but reading
> again it seems like a cumulative thing.
> 
> I do agree that we should consider the above scenario when bailing out.
> It is simply unreasonable to reclaim so much memory without any forward
> progress.

A very simple patch to bail early eliminated the LONG stalls and
decreased the time of a bulk allocation request.

Recall, the use case is converting 1G huge pages to the corresponding
amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
amount of pages to 'convert'.  So, test case is free 50GB hugetlb pages
followed by allocate 25600 2MB hugetlb pages.  I have not done enough
runs to get anything statistically valid, but here are some ballpark
numbers.

Unmodified baseline:
-------------------
Unexpected number of 2MB pages after demote
   Expected 25600, have 19944

real    0m42.092s
user    0m0.008s
sys     0m41.467s

Retries capped by patch below:
------------------------------
Unexpected number of 2MB pages after demote
   Expected 25600, have 19395

real    0m12.951s
user    0m0.010s
sys     0m12.840s


The time of the operation is certainly better, but do note that we
allocated 549 fewer pages.  So, bailing early might cause some failures
if we continue trying.  It is all a tradeoff.  One of the reasons for
considering demote functionality is that the conversion would be done in
the hugetlb code without getting the page allocator, recalim,
compaction, etc involved.

> A more robust way to address this problem (which is not really new)
> would be to privatize reclaimed pages in the direct reclaim context and
> reuse them in the compaction so that it doesn't have to just
> pro-actively reclaim to keep watermarks good enough. A normal low order
> requests would benefit from that as well because the reclaim couldn't be
> stolen from them as well.

That does sound interesting.  Perhaps a longer term project.

> Another approach would be to detect the situation and treat reclaim
> success which immediatelly leads to compaction deferral due to
> watermarks as a failure.

I'll look into detecting and responding to this.

Simple limit retries patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..16f3055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
+	unsigned long tries = 0, max_tries = 0;
 
 	/*
 	 * We also sanity check to catch abuse of atomic reserves being used by
@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1Ul << order;
 
 retry_cpuset:
 	compaction_retries = 0;
@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	}
 
 retry:
+	tries++;
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
 				compact_result, &compact_priority,
-				&compaction_retries))
+				&compaction_retries)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after reclaimed.  It looks like we are making progress, and
+		 * compaction_retries is not incremented.  This could cause
+		 * an indefinite number of retries.  Cap the number of retries
+		 * for costly orders.
+		 */
+		if (max_tries && tries > max_tries)
+			goto nopage;
 		goto retry;
+	}
 
 
 	/* Deal with possible cpuset update races before we start OOM killing */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..519e16e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 	unsigned long file;
+	unsigned long tries = 0, max_tries = 0;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1UL << sc->order;
 
 again:
+	tries++;
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
 	nr_reclaimed = sc->nr_reclaimed;
@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-				    sc))
+				    sc)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after being reclaimed.  This can cause an indefinite number
+		 * of retries.  Limit the number of retries for costly orders.
+		 */
+		if (!current_is_kswapd() &&
+		    max_tries && tries > max_tries &&
+		    sc->nr_reclaimed > sc->nr_to_reclaim)
+			goto done;
 		goto again;
+	}
 
+done:
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
 	 * many failures to reclaim anything from them and goes to


-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-11  0:11                           ` Mike Kravetz
@ 2021-09-13 15:50                             ` Michal Hocko
  2021-09-15 16:57                               ` Mike Kravetz
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-09-13 15:50 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Vlastimil Babka, Hillf Danton, linux-mm, linux-kernel

On Fri 10-09-21 17:11:05, Mike Kravetz wrote:
[...]
> @@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (did_some_progress > 0 &&
>  			should_compact_retry(ac, order, alloc_flags,
>  				compact_result, &compact_priority,
> -				&compaction_retries))
> +				&compaction_retries)) {
> +		/*
> +		 * In one pathological case, pages can be stolen immediately
> +		 * after reclaimed.  It looks like we are making progress, and
> +		 * compaction_retries is not incremented.  This could cause
> +		 * an indefinite number of retries.  Cap the number of retries
> +		 * for costly orders.
> +		 */
> +		if (max_tries && tries > max_tries)
> +			goto nopage;
>  		goto retry;
> +	}

I do not think this is a good approach. We do not want to play with
retries numbers. If we want to go with a minimal change for now then the
compaction feedback mechanism should track the number of reclaimed pages
to satisfy watermarks and if that grows beyond reasonable (proportionaly
to the request size) then simply give up rather than keep trying again
and again.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-13 15:50                             ` Michal Hocko
@ 2021-09-15 16:57                               ` Mike Kravetz
  2021-09-17 20:44                                 ` Mike Kravetz
  0 siblings, 1 reply; 38+ messages in thread
From: Mike Kravetz @ 2021-09-15 16:57 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, Hillf Danton, linux-mm, linux-kernel

On 9/13/21 8:50 AM, Michal Hocko wrote:
> On Fri 10-09-21 17:11:05, Mike Kravetz wrote:
> [...]
>> @@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>>  	if (did_some_progress > 0 &&
>>  			should_compact_retry(ac, order, alloc_flags,
>>  				compact_result, &compact_priority,
>> -				&compaction_retries))
>> +				&compaction_retries)) {
>> +		/*
>> +		 * In one pathological case, pages can be stolen immediately
>> +		 * after reclaimed.  It looks like we are making progress, and
>> +		 * compaction_retries is not incremented.  This could cause
>> +		 * an indefinite number of retries.  Cap the number of retries
>> +		 * for costly orders.
>> +		 */
>> +		if (max_tries && tries > max_tries)
>> +			goto nopage;
>>  		goto retry;
>> +	}
> 
> I do not think this is a good approach. We do not want to play with
> retries numbers. If we want to go with a minimal change for now then the
> compaction feedback mechanism should track the number of reclaimed pages
> to satisfy watermarks and if that grows beyond reasonable (proportionaly
> to the request size) then simply give up rather than keep trying again
> and again.

I am experimenting with code similar to the patch below.  The idea is to
key off of 'COMPACT_SKIPPED' being returned.  This implies that not enough
pages are available for compaction.  One of the values in this calculation
is compact_gap() which is scaled based on allocation order.  The simple
patch is to give up if number of reclaimed pages is greater than
compact_gap().  The thought is that this implies pages are being stolen
before compaction.

Such a patch does help in my specific test.  However, the concern is
that giving up earlier may regress some workloads.  One could of course
come up with a scenario where one more retry would result in success.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..f8e3b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4413,6 +4413,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+		     unsigned long nr_reclaimed,
 		     enum compact_result compact_result,
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
@@ -4445,7 +4446,16 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 * to work with, so we retry only if it looks like reclaim can help.
 	 */
 	if (compaction_needs_reclaim(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		/*
+		 * If we reclaimed enough pages, but they are not available
+		 * now, this implies they were stolen.  Do not reclaim again
+		 * as this could go on indefinitely.
+		 */
+		if (nr_reclaimed > compact_gap(order))
+			ret = false;
+		else
+			ret = compaction_zonelist_suitable(ac, order,
+								alloc_flags);
 		goto out;
 	}
 
@@ -4504,6 +4514,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
+		     unsigned long nr_reclaimed,
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
 {
@@ -4890,6 +4901,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
+	unsigned long nr_reclaimed;
 	enum compact_priority compact_priority;
 	enum compact_result compact_result;
 	int compaction_retries;
@@ -5033,6 +5045,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 							&did_some_progress);
 	if (page)
 		goto got_pg;
+	nr_reclaimed = did_some_progress;
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
@@ -5063,7 +5076,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	 */
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
-				compact_result, &compact_priority,
+				nr_reclaimed, compact_result, &compact_priority,
 				&compaction_retries))
 		goto retry;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..d40eca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2819,10 +2819,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	}
 
 	/*
+	 * If we have reclaimed enough pages for compaction, and compaction
+	 * is not ready, this implies pages are being stolen as they are being
+	 * reclaimed.  Quit now instead of continual retrying.
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	if (!current_is_kswapd() && sc->nr_reclaimed > pages_for_compaction)
+		return false;
+
+	/*
 	 * If we have not reclaimed enough pages for compaction and the
 	 * inactive lists are large enough, continue reclaiming
 	 */
-	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);


-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
  2021-09-15 16:57                               ` Mike Kravetz
@ 2021-09-17 20:44                                 ` Mike Kravetz
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-09-17 20:44 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, Hillf Danton, linux-mm, linux-kernel

On 9/15/21 9:57 AM, Mike Kravetz wrote:
> On 9/13/21 8:50 AM, Michal Hocko wrote:
>> I do not think this is a good approach. We do not want to play with
>> retries numbers. If we want to go with a minimal change for now then the
>> compaction feedback mechanism should track the number of reclaimed pages
>> to satisfy watermarks and if that grows beyond reasonable (proportionaly
>> to the request size) then simply give up rather than keep trying again
>> and again.
> 
> I am experimenting with code similar to the patch below.  The idea is to
> key off of 'COMPACT_SKIPPED' being returned.  This implies that not enough
> pages are available for compaction.  One of the values in this calculation
> is compact_gap() which is scaled based on allocation order.  The simple
> patch is to give up if number of reclaimed pages is greater than
> compact_gap().  The thought is that this implies pages are being stolen
> before compaction.
> 
> Such a patch does help in my specific test.  However, the concern is
> that giving up earlier may regress some workloads.  One could of course
> come up with a scenario where one more retry would result in success.

With the patch below, I instrumented the number of times shrink_node and
__alloc_pages_slowpath would 'quit earlier than before'.  In my test (free
1GB pages, allocate 2MB pages), this early exit was exercised a significant
number of times:

shrink_node:
        931124 quick exits

__alloc_pages_slowpath:
        bail early 32 times

For comparison, I ran the LTP mm tests mostly because they include a few
OOM test scenarios.  Test results were unchanged, and numbers for
exiting early were much lower than my hugetlb test:

shrink_node:
        57 quick exits

__alloc_pages_slowpath:
        bail early 0 times

This at least 'looks' like the changes target the page stealing corner
case in my test, and minimally impact other scenarios.

Any suggestions for other tests to run, or tweaks to the code?
-- 
Mike Kravetz

> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9c..f8e3b0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4413,6 +4413,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  
>  static inline bool
>  should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> +		     unsigned long nr_reclaimed,
>  		     enum compact_result compact_result,
>  		     enum compact_priority *compact_priority,
>  		     int *compaction_retries)
> @@ -4445,7 +4446,16 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	 * to work with, so we retry only if it looks like reclaim can help.
>  	 */
>  	if (compaction_needs_reclaim(compact_result)) {
> -		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> +		/*
> +		 * If we reclaimed enough pages, but they are not available
> +		 * now, this implies they were stolen.  Do not reclaim again
> +		 * as this could go on indefinitely.
> +		 */
> +		if (nr_reclaimed > compact_gap(order))
> +			ret = false;
> +		else
> +			ret = compaction_zonelist_suitable(ac, order,
> +								alloc_flags);
>  		goto out;
>  	}
>  
> @@ -4504,6 +4514,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  static inline bool
>  should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
>  		     enum compact_result compact_result,
> +		     unsigned long nr_reclaimed,
>  		     enum compact_priority *compact_priority,
>  		     int *compaction_retries)
>  {
> @@ -4890,6 +4901,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	struct page *page = NULL;
>  	unsigned int alloc_flags;
>  	unsigned long did_some_progress;
> +	unsigned long nr_reclaimed;
>  	enum compact_priority compact_priority;
>  	enum compact_result compact_result;
>  	int compaction_retries;
> @@ -5033,6 +5045,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  							&did_some_progress);
>  	if (page)
>  		goto got_pg;
> +	nr_reclaimed = did_some_progress;
>  
>  	/* Try direct compaction and then allocating */
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
> @@ -5063,7 +5076,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	 */
>  	if (did_some_progress > 0 &&
>  			should_compact_retry(ac, order, alloc_flags,
> -				compact_result, &compact_priority,
> +				nr_reclaimed, compact_result, &compact_priority,
>  				&compaction_retries))
>  		goto retry;
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeae2f6..d40eca5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2819,10 +2819,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  	}
>  
>  	/*
> +	 * If we have reclaimed enough pages for compaction, and compaction
> +	 * is not ready, this implies pages are being stolen as they are being
> +	 * reclaimed.  Quit now instead of continual retrying.
> +	 */
> +	pages_for_compaction = compact_gap(sc->order);
> +	if (!current_is_kswapd() && sc->nr_reclaimed > pages_for_compaction)
> +		return false;
> +
> +	/*
>  	 * If we have not reclaimed enough pages for compaction and the
>  	 * inactive lists are large enough, continue reclaiming
>  	 */
> -	pages_for_compaction = compact_gap(sc->order);
>  	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
>  	if (get_nr_swap_pages() > 0)
>  		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/8] hugetlb: document the demote sysfs interfaces
  2021-08-16 22:49 ` [PATCH 5/8] hugetlb: document the demote sysfs interfaces Mike Kravetz
  2021-08-16 23:28   ` Andrew Morton
@ 2021-09-21 13:52   ` Aneesh Kumar K.V
  2021-09-21 17:17     ` Mike Kravetz
  1 sibling, 1 reply; 38+ messages in thread
From: Aneesh Kumar K.V @ 2021-09-21 13:52 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton,
	Mike Kravetz

Mike Kravetz <mike.kravetz@oracle.com> writes:

> Describe demote and demote_size interfaces.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst | 29 ++++++++++++++++++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144e44..902059a0257b 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -234,8 +234,12 @@ will exist, of the form::
>  
>  	hugepages-${size}kB
>  
> -Inside each of these directories, the same set of files will exist::
> +Inside each of these directories, the set of files contained in ``/proc``
> +will exist.  In addition, two additional interfaces for demoting huge
> +pages will exist::
>  
> +        demote
> +        demote_size
>  	nr_hugepages
>  	nr_hugepages_mempolicy
>  	nr_overcommit_hugepages
> @@ -243,7 +247,28 @@ Inside each of these directories, the same set of files will exist::
>  	resv_hugepages
>  	surplus_hugepages
>  
> -which function as described above for the default huge page-sized case.
> +The demote interfaces provide the ability to split a huge page into
> +smaller huge pages.  For example, the x86 architecture supports both
> +1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
> +2MB huge pages.  The demote interfaces are:
> +
> +demote_size
> +        is the size of demoted pages.  When a page is demoted a corresponding
> +        number of huge pages of demote_size will be created.  For huge pages
> +        of the smallest supported size (2MB on x86), demote_size will be the
> +        system page size (PAGE_SIZE).  If demote_size is the system page size
> +        then demoting a page will simply free the huge page.  demote_size is
> +        a read only interface.

That is an alternate interface for nr_hugepages. Will it be better to
return EINVAL on write to 'demote' file below
/sys/kernel/mm/hugepages/hugepages-2048kB ?

Or may be not expose demote possibility within 2M hugepage directory at all?


> +
> +demote
> +        is used to demote a number of huge pages.  A user with root privileges
> +        can write to this file.  It may not be possible to demote the
> +        requested number of huge pages.  To determine how many pages were
> +        actually demoted, compare the value of nr_hugepages before and after
> +        writing to the demote interface.  demote is a write only interface.
> +
> +The interfaces which are the same as in ``/proc`` function as described
> +above for the default huge page-sized case.
>  
>  .. _mem_policy_and_hp_alloc:
>  
> -- 
> 2.31.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/8] hugetlb: document the demote sysfs interfaces
  2021-09-21 13:52   ` Aneesh Kumar K.V
@ 2021-09-21 17:17     ` Mike Kravetz
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-09-21 17:17 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, linux-kernel
  Cc: David Hildenbrand, Michal Hocko, Oscar Salvador, Zi Yan,
	Muchun Song, Naoya Horiguchi, David Rientjes, Andrew Morton

On 9/21/21 6:52 AM, Aneesh Kumar K.V wrote:
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> 
>> Describe demote and demote_size interfaces.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  Documentation/admin-guide/mm/hugetlbpage.rst | 29 ++++++++++++++++++--
>>  1 file changed, 27 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
>> index 8abaeb144e44..902059a0257b 100644
>> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
>> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
>> @@ -234,8 +234,12 @@ will exist, of the form::
>>  
>>  	hugepages-${size}kB
>>  
>> -Inside each of these directories, the same set of files will exist::
>> +Inside each of these directories, the set of files contained in ``/proc``
>> +will exist.  In addition, two additional interfaces for demoting huge
>> +pages will exist::
>>  
>> +        demote
>> +        demote_size
>>  	nr_hugepages
>>  	nr_hugepages_mempolicy
>>  	nr_overcommit_hugepages
>> @@ -243,7 +247,28 @@ Inside each of these directories, the same set of files will exist::
>>  	resv_hugepages
>>  	surplus_hugepages
>>  
>> -which function as described above for the default huge page-sized case.
>> +The demote interfaces provide the ability to split a huge page into
>> +smaller huge pages.  For example, the x86 architecture supports both
>> +1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
>> +2MB huge pages.  The demote interfaces are:
>> +
>> +demote_size
>> +        is the size of demoted pages.  When a page is demoted a corresponding
>> +        number of huge pages of demote_size will be created.  For huge pages
>> +        of the smallest supported size (2MB on x86), demote_size will be the
>> +        system page size (PAGE_SIZE).  If demote_size is the system page size
>> +        then demoting a page will simply free the huge page.  demote_size is
>> +        a read only interface.
> 
> That is an alternate interface for nr_hugepages. Will it be better to
> return EINVAL on write to 'demote' file below
> /sys/kernel/mm/hugepages/hugepages-2048kB ?
> 
> Or may be not expose demote possibility within 2M hugepage directory at all?
> 

Thanks for taking a look Aneesh!

You are right. If demote_size is PAGE_SIZE, then demote is just an
alternative to freeing huge pages via the nr_hugepages interface.
So, why even provide such an interface?
It certainly would be easy to just not display demote interfaces for the
smallest huge page size.

Based on other feedback, I am also making demote_size writable.  It
makes little sense on x86 with only two huge page sizes.  However, it
might be useful on other architectures with more sizes.  demote_size can
only be a valid huge page size.  Before your comment, I was going to
allow setting demote_size to PAGE_SIZE.  Perhaps, that should not be
allowed.

Thanks for the suggestion,  I think removing support for demote_size ==
PAGE_SIZE will make the code simpler.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2021-09-21 17:18 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
2021-08-16 22:49 ` [PATCH 1/8] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
2021-08-16 22:49 ` [PATCH 2/8] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
2021-08-16 22:49 ` [PATCH 3/8] hugetlb: add demote bool to gigantic page routines Mike Kravetz
2021-08-16 22:49 ` [PATCH 4/8] hugetlb: add hugetlb demote page support Mike Kravetz
2021-08-16 22:49 ` [PATCH 5/8] hugetlb: document the demote sysfs interfaces Mike Kravetz
2021-08-16 23:28   ` Andrew Morton
2021-08-17  1:04     ` Mike Kravetz
2021-09-21 13:52   ` Aneesh Kumar K.V
2021-09-21 17:17     ` Mike Kravetz
2021-08-16 22:49 ` [PATCH 6/8] hugetlb: vmemmap optimizations when demoting hugetlb pages Mike Kravetz
2021-08-16 22:49 ` [PATCH 7/8] hugetlb: prepare destroy and prep routines for vmemmap optimized pages Mike Kravetz
2021-08-16 22:49 ` [PATCH 8/8] hugetlb: Optimized demote vmemmap optimizatized pages Mike Kravetz
2021-08-16 23:23 ` [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Andrew Morton
2021-08-17  0:17   ` Mike Kravetz
2021-08-17  0:39     ` Andrew Morton
2021-08-17  0:58       ` Mike Kravetz
2021-08-16 23:27 ` Andrew Morton
2021-08-17  0:46   ` Mike Kravetz
2021-08-17  1:46     ` Andrew Morton
2021-08-17  7:30       ` David Hildenbrand
2021-08-17 16:19         ` Mike Kravetz
2021-08-17 18:49           ` David Hildenbrand
2021-08-24 22:08       ` Mike Kravetz
2021-08-27 17:22         ` Vlastimil Babka
2021-08-27 23:04           ` Mike Kravetz
2021-08-30 10:11             ` Vlastimil Babka
2021-09-02 18:17               ` Mike Kravetz
2021-09-06 14:40                 ` Vlastimil Babka
     [not found]                 ` <20210907085001.3773-1-hdanton@sina.com>
2021-09-08 21:00                   ` Mike Kravetz
2021-09-09 11:54                     ` Michal Hocko
2021-09-09 13:45                       ` Vlastimil Babka
2021-09-09 21:31                         ` Mike Kravetz
2021-09-10  8:20                         ` Michal Hocko
2021-09-11  0:11                           ` Mike Kravetz
2021-09-13 15:50                             ` Michal Hocko
2021-09-15 16:57                               ` Mike Kravetz
2021-09-17 20:44                                 ` Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).