LKML Archive on lore.kernel.org
* [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
@ 2021-08-23 13:25 Mike Rapoport
  2021-08-23 13:25 ` [RFC PATCH 1/4] list: Support getting most recent element in list_lru Mike Rapoport
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-23 13:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Mike Rapoport, Peter Zijlstra,
	Rick Edgecombe, Vlastimil Babka, x86, linux-kernel

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

This is an early prototype of adding a cache of pte-mapped pages to the
page allocator. It survives boot and some cache shrinking, but it still
has a long way to go before it is ready for a non-RFC posting.

The example use case for the pte-mapped cache is protection of page tables:
keeping them read-only except for the designated code that is allowed to
modify them.

I'd like to get early feedback on the approach to see what would be
the best way to move forward with something like this.

This set is x86 specific at the moment because other architectures either
do not support set_memory APIs that split the direct^w linear map (e.g.
PowerPC) or only enable set_memory APIs when the linear map uses basic page
size (like arm64).

== Motivation ==

There are use cases that need to remove pages from the direct map or at
least map them with 4K granularity. Whenever this is done, e.g. with
set_memory/set_direct_map APIs, the PUD- and PMD-sized mappings in the
direct map are split into smaller pages.

To reduce the performance hit caused by the fragmentation of the direct map,
it makes sense to group and/or cache the 4K pages removed from the direct
map so that the split large pages won't be all over the place.

There were RFCs for grouped page allocations for vmalloc permissions [1]
and for using PKS to protect page tables [2] as well as an attempt to use a
pool of large pages in secretmem [3].

== Implementation overview ==

This set leverages ideas from the patches that added PKS protection to page
tables, but instead of adding per-user grouped allocations it tries to move
the cache of pte-mapped pages closer to the page allocator.

The idea is to use a gfp flag that will instruct the page allocator to use
the cache of pte-mapped pages because the caller needs to remove them from
the direct map or change their attributes. 

When the cache is empty, there is an attempt to refill it using a PMD-sized
allocation so that once the direct map is split we'll be able to use all 4K
pages made available by the split.

If the high-order allocation fails, we fall back to order-0 and mark the
entire pageblock as pte-mapped. When pages from that pageblock are freed to
the page allocator, they are put into the pte-mapped cache. There is also an
unimplemented provision to add the free pages from such a pageblock to the
pte-mapped cache along with the page whose allocation caused the split
of the pageblock.
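
The refill and fallback behaviour described above can be illustrated with a
small userspace model. This is only a sketch under stated assumptions, not
the code from the patch: struct pte_cache_model, model_alloc() and
model_free() are hypothetical names, "pages" are plain pfn numbers, and the
high_order_ok argument stands in for whether the PMD-sized allocation
succeeds. In the fallback case the model simply ignores the rest of the
pageblock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_BLOCK	512	/* 4K pages in one 2M pageblock on x86-64 */
#define NR_BLOCKS	4	/* size of the toy "physical memory" */
#define NR_PAGES	(NR_BLOCKS * PAGES_PER_BLOCK)

/* Hypothetical model of the global pte-mapped cache. */
struct pte_cache_model {
	long cache[NR_PAGES];		/* cached pfns, used as a stack */
	size_t nr_cached;
	bool pte_mapped[NR_BLOCKS];	/* plays the role of PB_pte_mapped */
	size_t next_block;		/* next pageblock not yet handed out */
};

/* Returns a pfn, or -1 when the model has no pageblocks left. */
long model_alloc(struct pte_cache_model *m, bool high_order_ok)
{
	long base;
	size_t i;

	/* Fast path: take the most recently cached page. */
	if (m->nr_cached)
		return m->cache[--m->nr_cached];

	if (m->next_block == NR_BLOCKS)
		return -1;

	/* Either way the pageblock ends up split and is marked as such. */
	base = (long)m->next_block * PAGES_PER_BLOCK;
	m->pte_mapped[m->next_block++] = true;

	/*
	 * PMD-sized refill: keep page 0 for the caller and cache the other
	 * 511 pages. In the order-0 fallback nothing is cached up front.
	 */
	if (high_order_ok)
		for (i = PAGES_PER_BLOCK - 1; i >= 1; i--)
			m->cache[m->nr_cached++] = base + (long)i;

	return base;
}

/* Returns true if the page went to the cache rather than to "buddy". */
bool model_free(struct pte_cache_model *m, long pfn)
{
	assert(pfn >= 0 && pfn < NR_PAGES);

	if (!m->pte_mapped[pfn / PAGES_PER_BLOCK])
		return false;

	assert(m->nr_cached < NR_PAGES);
	m->cache[m->nr_cached++] = pfn;
	return true;
}
```

In this model a successful PMD-sized refill leaves 511 pages in the cache
after the first allocation, while the fallback leaves the cache empty until
pages from the marked pageblock are freed.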

For now only order-0 allocations of pte-mapped pages are supported, which
prevents, for instance, allocation of PGD with PTI enabled.

The free pages in the cache may be reclaimed using a shrinker, but for now
they will remain mapped with PTEs in the direct map.

== TODOs ==

Whenever the pte-mapped cache is being shrunk, it is possible to add some
kind of compaction to move all the free pages into PMD-sized chunks, free
these chunks at once and restore the large page in the direct map.

There is also a possibility to add heuristics and knobs to control the
greediness of the cache vs memory pressure so that freed pte-mapped pages
won't necessarily be put back into the pte-mapped cache.

Another thing that can be implemented is pre-populating the pte-cache at
boot time to include the free pages that are anyway mapped by PTEs.

== Alternatives ==

The current implementation uses a single global cache.

Another option is to have per-user caches, e.g. one for the page tables,
another for vmalloc etc.  This approach provides better control of the
permissions of the pages allocated from these caches and allows the user to
decide when (if at all) the pages can be accessed, e.g. for cache
compaction. The downside of this approach is that it complicates the
freeing path. A page allocated from a dedicated cache cannot be freed with
put_page()/free_page() etc. Either it has to be freed with a dedicated API,
or there should be some back pointer in struct page so that the page
allocator will know what cache this page came from.

Yet another possibility is to make the pte-mapped cache a migratetype of
its own. Creating a new migratetype would allow higher order allocations of
pte-mapped pages, but I don't have enough understanding of page allocator
and reclaim internals to estimate the complexity associated with this
approach.

[1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
[2] https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
[3] https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org/

Mike Rapoport (2):
  mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  x86/mm: write protect (most) page tables

Rick Edgecombe (2):
  list: Support getting most recent element in list_lru
  list: Support list head not in object for list_lru

 arch/Kconfig                            |   8 +
 arch/x86/Kconfig                        |   1 +
 arch/x86/boot/compressed/ident_map_64.c |   3 +
 arch/x86/include/asm/pgalloc.h          |   2 +
 arch/x86/include/asm/pgtable.h          |  21 +-
 arch/x86/include/asm/pgtable_64.h       |  33 ++-
 arch/x86/mm/init.c                      |   2 +-
 arch/x86/mm/pgtable.c                   |  72 ++++++-
 include/asm-generic/pgalloc.h           |   2 +-
 include/linux/gfp.h                     |  11 +-
 include/linux/list_lru.h                |  26 +++
 include/linux/mm.h                      |   2 +
 include/linux/pageblock-flags.h         |  26 +++
 init/main.c                             |   1 +
 mm/internal.h                           |   3 +-
 mm/list_lru.c                           |  38 +++-
 mm/page_alloc.c                         | 261 +++++++++++++++++++++++-
 17 files changed, 496 insertions(+), 16 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 1/4] list: Support getting most recent element in list_lru
  2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
@ 2021-08-23 13:25 ` Mike Rapoport
  2021-08-23 13:25 ` [RFC PATCH 2/4] list: Support list head not in object for list_lru Mike Rapoport
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-23 13:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Mike Rapoport, Peter Zijlstra,
	Rick Edgecombe, Vlastimil Babka, x86, linux-kernel

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

In future patches, some functionality will use list_lru that also needs
to keep track of the most recently used element on a node. Since this
information is already contained within list_lru, add a function to get
it so that an additional list is not needed in the caller.

Do not support memcg aware list_lru's since it is not needed by the
intended caller.
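
For readers less familiar with the kernel list helpers, the operation being
added amounts to popping the tail of a circular doubly-linked list. The
following userspace sketch shows that idea only. The names mru_node,
mru_init(), mru_add_tail() and mru_pop_tail() are hypothetical, and the
per-node lock and nr_items accounting that list_lru keeps are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct list_head. */
struct mru_node {
	struct mru_node *prev, *next;
};

/* An empty list is a head that points at itself. */
void mru_init(struct mru_node *head)
{
	head->prev = head;
	head->next = head;
}

/* New items go to the tail, so the tail is the most recently added one. */
void mru_add_tail(struct mru_node *head, struct mru_node *item)
{
	assert(item != head);

	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Unlink and return the most recently added item, or NULL if empty. */
struct mru_node *mru_pop_tail(struct mru_node *head)
{
	struct mru_node *item = head->prev;

	if (item == head)
		return NULL;

	item->prev->next = head;
	head->prev = item->prev;

	/* Leave the removed item self-linked, as list_del_init() does. */
	item->prev = item;
	item->next = item;

	return item;
}
```

Adding two items and popping twice returns them in reverse order of
insertion, and a third pop returns NULL.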

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/list_lru.h | 13 +++++++++++++
 mm/list_lru.c            | 28 ++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 1b5fceb565df..08e07c19fd13 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -103,6 +103,19 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
  */
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_get_mru: gets and removes the tail from one of the node lists
+ * @lru: the lru pointer
+ * @nid: the node id
+ *
+ * This function removes the most recently added item from the list of the
+ * specified node id. This function should not be used if the list_lru is
+ * memcg aware.
+ *
+ * Return value: The element removed
+ */
+struct list_head *list_lru_get_mru(struct list_lru *lru, int nid);
+
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
diff --git a/mm/list_lru.c b/mm/list_lru.c
index cd58790d0fb3..c1bec58168e1 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -156,6 +156,34 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
+struct list_head *list_lru_get_mru(struct list_lru *lru, int nid)
+{
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l = &nlru->lru;
+	struct list_head *ret;
+
+	/* This function does not attempt to search through the memcg lists */
+	if (list_lru_memcg_aware(lru)) {
+		WARN_ONCE(1, "list_lru: %s not supported on memcg aware list_lrus", __func__);
+		return NULL;
+	}
+
+	spin_lock(&nlru->lock);
+	if (list_empty(&l->list)) {
+		ret = NULL;
+	} else {
+		/* Get tail */
+		ret = l->list.prev;
+		list_del_init(ret);
+
+		l->nr_items--;
+		nlru->nr_items--;
+	}
+	spin_unlock(&nlru->lock);
+
+	return ret;
+}
+
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
 {
 	list_del_init(item);
-- 
2.28.0



* [RFC PATCH 2/4] list: Support list head not in object for list_lru
  2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
  2021-08-23 13:25 ` [RFC PATCH 1/4] list: Support getting most recent element in list_lru Mike Rapoport
@ 2021-08-23 13:25 ` Mike Rapoport
  2021-08-23 13:25 ` [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages Mike Rapoport
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-23 13:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Mike Rapoport, Peter Zijlstra,
	Rick Edgecombe, Vlastimil Babka, x86, linux-kernel

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

In future patches, there will be a need to keep track of objects with
list_lru where the list_head is not in the object (will be in struct
page). Since list_lru automatically determines the node id from the
list_head, this will fail when using struct page.

So create a new function in list_lru, list_lru_add_node(), that allows
the node id of the item to be passed in. Otherwise it behaves exactly
like list_lru_add().

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/list_lru.h | 13 +++++++++++++
 mm/list_lru.c            | 10 ++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 08e07c19fd13..42c22322058b 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -90,6 +90,19 @@ void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);
  */
 bool list_lru_add(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_add_node: add an element to the lru list's tail
+ * @lru: the lru pointer
+ * @item: the item to be added.
+ * @nid: the node id of the item
+ *
+ * Like list_lru_add, but takes the node id as parameter instead of
+ * calculating it from the list_head passed in.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_add_node(struct list_lru *lru, struct list_head *item, int nid);
+
 /**
  * list_lru_del: delete an element to the lru list
  * @list_lru: the lru pointer
diff --git a/mm/list_lru.c b/mm/list_lru.c
index c1bec58168e1..f35f11ada8a1 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -112,9 +112,8 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
-bool list_lru_add(struct list_lru *lru, struct list_head *item)
+bool list_lru_add_node(struct list_lru *lru, struct list_head *item, int nid)
 {
-	int nid = page_to_nid(virt_to_page(item));
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct mem_cgroup *memcg;
 	struct list_lru_one *l;
@@ -134,6 +133,13 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 	spin_unlock(&nlru->lock);
 	return false;
 }
+
+bool list_lru_add(struct list_lru *lru, struct list_head *item)
+{
+	int nid = page_to_nid(virt_to_page(item));
+
+	return list_lru_add_node(lru, item, nid);
+}
 EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
-- 
2.28.0



* [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
  2021-08-23 13:25 ` [RFC PATCH 1/4] list: Support getting most recent element in list_lru Mike Rapoport
  2021-08-23 13:25 ` [RFC PATCH 2/4] list: Support list head not in object for list_lru Mike Rapoport
@ 2021-08-23 13:25 ` Mike Rapoport
  2021-08-23 20:29   ` Edgecombe, Rick P
                     ` (2 more replies)
  2021-08-23 13:25 ` [RFC PATCH 4/4] x86/mm: write protect (most) page tables Mike Rapoport
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-23 13:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Mike Rapoport, Peter Zijlstra,
	Rick Edgecombe, Vlastimil Babka, x86, linux-kernel

From: Mike Rapoport <rppt@linux.ibm.com>

When the __GFP_PTE_MAPPED flag is passed to an allocation request of order 0,
the allocated page will be mapped at PTE level in the direct map.

To reduce the direct map fragmentation, maintain a cache of 4K pages that
are already mapped at PTE level in the direct map. Whenever the cache
should be replenished, try to allocate a 2M page and split it into 4K pages
to localize the shattering of the direct map. If the allocation of a 2M page
fails, fall back to a single page allocation at the expense of direct map
fragmentation.

The cache registers a shrinker that releases free pages from the cache to
the page allocator.

The __GFP_PTE_MAPPED flag and the caching of 4K pages are enabled only if an
architecture selects ARCH_WANTS_PTE_MAPPED_CACHE in its Kconfig.

[
cache management is mostly copied from
https://lore.kernel.org/lkml/20210505003032.489164-4-rick.p.edgecombe@intel.com/
]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/Kconfig                    |   8 +
 arch/x86/Kconfig                |   1 +
 include/linux/gfp.h             |  11 +-
 include/linux/mm.h              |   2 +
 include/linux/pageblock-flags.h |  26 ++++
 init/main.c                     |   1 +
 mm/internal.h                   |   3 +-
 mm/page_alloc.c                 | 261 +++++++++++++++++++++++++++++++-
 8 files changed, 309 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 129df498a8e1..2db95331201b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -243,6 +243,14 @@ config ARCH_HAS_SET_MEMORY
 config ARCH_HAS_SET_DIRECT_MAP
 	bool
 
+#
+# Select if the architecture wants to minimize fragmentation of its
+# direct/linear map caused by set_memory and set_direct_map operations
+#
+config ARCH_WANTS_PTE_MAPPED_CACHE
+	bool
+	depends on ARCH_HAS_SET_MEMORY || ARCH_HAS_SET_DIRECT_MAP
+
 #
 # Select if the architecture provides the arch_dma_set_uncached symbol to
 # either provide an uncached segment alias for a DMA allocation, or
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 88fb922c23a0..9b4e6cf4a6aa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -118,6 +118,7 @@ config X86
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_HUGE_PMD_SHARE
 	select ARCH_WANT_LD_ORPHAN_WARN
+	select ARCH_WANTS_PTE_MAPPED_CACHE
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..c9006e3c67ad 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -55,8 +55,9 @@ struct vm_area_struct;
 #define ___GFP_ACCOUNT		0x400000u
 #define ___GFP_ZEROTAGS		0x800000u
 #define ___GFP_SKIP_KASAN_POISON	0x1000000u
+#define ___GFP_PTE_MAPPED	0x2000000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x2000000u
+#define ___GFP_NOLOCKDEP	0x4000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -101,12 +102,18 @@ struct vm_area_struct;
  * node with no fallbacks or placement policy enforcements.
  *
  * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_PTE_MAPPED returns a page that is mapped with a PTE in the
+ * direct map. On architectures that use higher page table levels to map
+ * physical memory, this flag will cause a split of large pages in the direct
+ * mapping. Has effect only if CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE is set.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_PTE_MAPPED ((__force gfp_t)___GFP_PTE_MAPPED)
 
 /**
  * DOC: Watermark modifiers
@@ -249,7 +256,7 @@ struct vm_area_struct;
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..350ec98b82d2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3283,5 +3283,7 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
 	return 0;
 }
 
+void pte_mapped_cache_init(void);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 973fd731a520..4faf8c542b00 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -21,6 +21,8 @@ enum pageblock_bits {
 			/* 3 bits required for migrate types */
 	PB_migrate_skip,/* If set the block is skipped by compaction */
 
+	PB_pte_mapped, /* If set the block is mapped with PTEs in direct map */
+
 	/*
 	 * Assume the bits will always align on a word. If this assumption
 	 * changes then get/set pageblock needs updating.
@@ -88,4 +90,28 @@ static inline void set_pageblock_skip(struct page *page)
 }
 #endif /* CONFIG_COMPACTION */
 
+#ifdef CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE
+#define get_pageblock_pte_mapped(page)				\
+	get_pfnblock_flags_mask(page, page_to_pfn(page),	\
+			(1 << (PB_pte_mapped)))
+#define clear_pageblock_pte_mapped(page) \
+	set_pfnblock_flags_mask(page, 0, page_to_pfn(page),	\
+			(1 << PB_pte_mapped))
+#define set_pageblock_pte_mapped(page) \
+	set_pfnblock_flags_mask(page, (1 << PB_pte_mapped),	\
+			page_to_pfn(page),			\
+			(1 << PB_pte_mapped))
+#else /* CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE */
+static inline bool get_pageblock_pte_mapped(struct page *page)
+{
+	return false;
+}
+static inline void clear_pageblock_pte_mapped(struct page *page)
+{
+}
+static inline void set_pageblock_pte_mapped(struct page *page)
+{
+}
+#endif /* CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE */
+
 #endif	/* PAGEBLOCK_FLAGS_H */
diff --git a/init/main.c b/init/main.c
index f5b8246e8aa1..c0d59a183a39 100644
--- a/init/main.c
+++ b/init/main.c
@@ -828,6 +828,7 @@ static void __init mm_init(void)
 	page_ext_init_flatmem_late();
 	kmem_cache_init();
 	kmemleak_init();
+	pte_mapped_cache_init();
 	pgtable_init();
 	debug_objects_mem_init();
 	vmalloc_init();
diff --git a/mm/internal.h b/mm/internal.h
index 31ff935b2547..0557ece6ebf4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -24,7 +24,8 @@
 			__GFP_ATOMIC)
 
 /* The GFP flags allowed during early boot */
-#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
+#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS|\
+					   __GFP_PTE_MAPPED))
 
 /* Control allocation cpuset and node placement constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 856b175c15a4..7936d8dcb80b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
 #include <linux/padata.h>
 #include <linux/khugepaged.h>
 #include <linux/buffer_head.h>
+#include <linux/set_memory.h>
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -107,6 +108,14 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/*
+ * Free page directly to the page allocator rather than check if it should
+ * be placed into the cache of pte_mapped pages.
+ * Used by the pte_mapped cache shrinker.
+ * Has effect only when pte-mapped cache is enabled
+ */
+#define FPI_NO_PTE_MAP		((__force fpi_t)BIT(2))
+
 /*
  * Don't poison memory with KASAN (only for the tag-based modes).
  * During boot, all non-reserved memblock memory is exposed to page_alloc.
@@ -225,6 +234,19 @@ static inline void set_pcppage_migratetype(struct page *page, int migratetype)
 	page->index = migratetype;
 }
 
+#ifdef CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE
+static struct page *alloc_page_pte_mapped(gfp_t gfp);
+static void free_page_pte_mapped(struct page *page);
+#else
+static inline struct page *alloc_page_pte_mapped(gfp_t gfp)
+{
+	return NULL;
+}
+static void free_page_pte_mapped(struct page *page)
+{
+}
+#endif
+
 #ifdef CONFIG_PM_SLEEP
 /*
  * The following functions are used by the suspend/hibernate code to temporarily
@@ -536,7 +558,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 	unsigned long bitidx, word_bitidx;
 	unsigned long old_word, word;
 
-	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 5);
 	BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
 
 	bitmap = get_pageblock_bitmap(page, pfn);
@@ -1352,6 +1374,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
 					   PAGE_SIZE << order);
 	}
 
+	/*
+	 * Unless we are shrinking the pte_mapped cache, return a page from
+	 * a pageblock mapped with PTEs to that cache.
+	 */
+	if (!order && !(fpi_flags & FPI_NO_PTE_MAP) &&
+	    get_pageblock_pte_mapped(page)) {
+		free_page_pte_mapped(page);
+		return false;
+	}
+
 	kernel_poison_pages(page, 1 << order);
 
 	/*
@@ -3445,6 +3477,13 @@ void free_unref_page_list(struct list_head *list)
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		pfn = page_to_pfn(page);
+
+		if (get_pageblock_pte_mapped(page)) {
+			list_del(&page->lru);
+			free_page_pte_mapped(page);
+			continue;
+		}
+
 		if (!free_unref_page_prepare(page, pfn, 0))
 			list_del(&page->lru);
 
@@ -5381,6 +5420,12 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 			&alloc_gfp, &alloc_flags))
 		return NULL;
 
+	if ((alloc_gfp & __GFP_PTE_MAPPED) && order == 0) {
+		page = alloc_page_pte_mapped(alloc_gfp);
+		if (page)
+			goto out;
+	}
+
 	/*
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
@@ -5681,6 +5726,220 @@ void free_pages_exact(void *virt, size_t size)
 }
 EXPORT_SYMBOL(free_pages_exact);
 
+#ifdef CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE
+
+struct pte_mapped_cache {
+	struct shrinker shrinker;
+	struct list_lru lru;
+	atomic_t nid_round_robin;
+	unsigned long free_cnt;
+};
+
+static struct pte_mapped_cache pte_mapped_cache;
+
+static struct page *pte_mapped_cache_get(struct pte_mapped_cache *cache)
+{
+	unsigned long start_nid, i;
+	struct list_head *head;
+	struct page *page = NULL;
+
+	start_nid = atomic_fetch_inc(&cache->nid_round_robin) % nr_node_ids;
+	for (i = 0; i < nr_node_ids; i++) {
+		int cur_nid = (start_nid + i) % nr_node_ids;
+
+		head = list_lru_get_mru(&cache->lru, cur_nid);
+		if (head) {
+			page = list_entry(head, struct page, lru);
+			break;
+		}
+	}
+
+	return page;
+}
+
+static void pte_mapped_cache_add(struct pte_mapped_cache *cache,
+				 struct page *page)
+{
+	INIT_LIST_HEAD(&page->lru);
+	list_lru_add_node(&cache->lru, &page->lru, page_to_nid(page));
+	set_page_count(page, 0);
+}
+
+static void pte_mapped_cache_add_neighbour_pages(struct page *page)
+{
+#if 0
+	/*
+	 * TODO: if pte_mapped_cache_replenish() had to fallback to order-0
+	 * allocation, the large page in the direct map will be split
+	 * anyway and if there are free pages in the same pageblock they
+	 * can be added to pte_mapped cache.
+	 */
+	unsigned int order = (1 << HUGETLB_PAGE_ORDER);
+	unsigned int nr_pages = (1 << order);
+	unsigned long pfn = page_to_pfn(page);
+	struct page *page_head = page - (pfn & (order - 1));
+
+	for (i = 0; i < nr_pages; i++) {
+		page = page_head + i;
+		if (is_free_buddy_page(page)) {
+			take_page_off_buddy(page);
+			pte_mapped_cache_add(&pte_mapped_cache, page);
+		}
+	}
+#endif
+}
+
+static struct page *pte_mapped_cache_replenish(struct pte_mapped_cache *cache,
+					       gfp_t gfp)
+{
+	unsigned int order = HUGETLB_PAGE_ORDER;
+	unsigned int nr_pages;
+	struct page *page;
+	int i, err;
+
+	gfp &= ~__GFP_PTE_MAPPED;
+
+	page = alloc_pages(gfp, order);
+	if (!page) {
+		order = 0;
+		page = alloc_pages(gfp, order);
+		if (!page)
+			return NULL;
+	}
+
+	nr_pages = (1 << order);
+	err = set_memory_4k((unsigned long)page_address(page), nr_pages);
+	if (err)
+		goto err_free_pages;
+
+	if (order)
+		split_page(page, order);
+	else
+		pte_mapped_cache_add_neighbour_pages(page);
+
+	for (i = 1; i < nr_pages; i++)
+		pte_mapped_cache_add(cache, page + i);
+
+	set_pageblock_pte_mapped(page);
+
+	return page;
+
+err_free_pages:
+	__free_pages(page, order);
+	return NULL;
+}
+
+static struct page *alloc_page_pte_mapped(gfp_t gfp)
+{
+	struct pte_mapped_cache *cache = &pte_mapped_cache;
+	struct page *page;
+
+	page = pte_mapped_cache_get(cache);
+	if (page) {
+		prep_new_page(page, 0, gfp, 0);
+		goto out;
+	}
+
+	page = pte_mapped_cache_replenish(cache, gfp);
+
+out:
+	return page;
+}
+
+static void free_page_pte_mapped(struct page *page)
+{
+	pte_mapped_cache_add(&pte_mapped_cache, page);
+}
+
+static struct pte_mapped_cache *pte_mapped_cache_from_sc(struct shrinker *sh)
+{
+	return container_of(sh, struct pte_mapped_cache, shrinker);
+}
+
+static unsigned long pte_mapped_cache_shrink_count(struct shrinker *shrinker,
+						   struct shrink_control *sc)
+{
+	struct pte_mapped_cache *cache = pte_mapped_cache_from_sc(shrinker);
+	unsigned long count = list_lru_shrink_count(&cache->lru, sc);
+
+	return count ? count : SHRINK_EMPTY;
+}
+
+static enum lru_status pte_mapped_cache_shrink_isolate(struct list_head *item,
+						       struct list_lru_one *lst,
+						       spinlock_t *lock,
+						       void *cb_arg)
+{
+	struct list_head *freeable = cb_arg;
+
+	list_lru_isolate_move(lst, item, freeable);
+
+	return LRU_REMOVED;
+}
+
+static unsigned long pte_mapped_cache_shrink_scan(struct shrinker *shrinker,
+						  struct shrink_control *sc)
+{
+	struct pte_mapped_cache *cache = pte_mapped_cache_from_sc(shrinker);
+	struct list_head *cur, *next;
+	unsigned long isolated;
+	LIST_HEAD(freeable);
+
+	isolated = list_lru_shrink_walk(&cache->lru, sc,
+					pte_mapped_cache_shrink_isolate,
+					&freeable);
+
+	list_for_each_safe(cur, next, &freeable) {
+		struct page *page = list_entry(cur, struct page, lru);
+
+		list_del(cur);
+		__free_pages_ok(page, 0, FPI_NO_PTE_MAP);
+	}
+
+	/* Every item walked gets isolated */
+	sc->nr_scanned += isolated;
+
+	return isolated;
+}
+
+static int __pte_mapped_cache_init(struct pte_mapped_cache *cache)
+{
+	int err;
+
+	err = list_lru_init(&cache->lru);
+	if (err)
+		return err;
+
+	cache->shrinker.count_objects = pte_mapped_cache_shrink_count;
+	cache->shrinker.scan_objects = pte_mapped_cache_shrink_scan;
+	cache->shrinker.seeks = DEFAULT_SEEKS;
+	cache->shrinker.flags = SHRINKER_NUMA_AWARE;
+
+	err = register_shrinker(&cache->shrinker);
+	if (err)
+		goto err_list_lru_destroy;
+
+	return 0;
+
+err_list_lru_destroy:
+	list_lru_destroy(&cache->lru);
+	return err;
+}
+
+void __init pte_mapped_cache_init(void)
+{
+	if (gfp_allowed_mask & __GFP_PTE_MAPPED)
+		return;
+
+	if (!__pte_mapped_cache_init(&pte_mapped_cache))
+		gfp_allowed_mask |= __GFP_PTE_MAPPED;
+}
+#else
+void __init pte_mapped_cache_init(void)
+{
+}
+#endif /* CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE */
+
 /**
  * nr_free_zone_pages - count number of pages beyond high watermark
  * @offset: The zone index of the highest zone
-- 
2.28.0



* [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
                   ` (2 preceding siblings ...)
  2021-08-23 13:25 ` [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages Mike Rapoport
@ 2021-08-23 13:25 ` Mike Rapoport
  2021-08-23 20:08   ` Edgecombe, Rick P
                     ` (2 more replies)
  2021-08-23 20:02 ` [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Edgecombe, Rick P
  2021-08-24 16:09 ` Vlastimil Babka
  5 siblings, 3 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-23 13:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Mike Rapoport, Peter Zijlstra,
	Rick Edgecombe, Vlastimil Babka, x86, linux-kernel

From: Mike Rapoport <rppt@linux.ibm.com>

Allocate page tables using __GFP_PTE_MAPPED so that they will have 4K PTEs
in the direct map. This allows switching the _PAGE_RW bit each time a page
table page needs to be made writable or read-only.

The writability of the page tables is toggled only in the lowest-level page
table modification functions and is immediately switched off again.

The page tables created early in the boot (including the direct map page
table) are not write protected.
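
The open-write-close pattern can be tried out in userspace with mprotect()
on an anonymous mapping. The sketch below is only an analogy, assuming a
Linux/POSIX environment: table_alloc(), table_enable_write(),
table_disable_write() and table_set_entry() are hypothetical helpers, and
the kernel code flips _PAGE_RW on the direct map alias rather than calling
mprotect().

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS with strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Userspace stand-in for enable_pgtable_write(). */
int table_enable_write(void *table, size_t len)
{
	return mprotect(table, len, PROT_READ | PROT_WRITE);
}

/* Userspace stand-in for disable_pgtable_write(). */
int table_disable_write(void *table, size_t len)
{
	return mprotect(table, len, PROT_READ);
}

/* Allocate a zeroed "table" and leave it read-only; len is page aligned. */
uint64_t *table_alloc(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	if (table_disable_write(p, len)) {
		munmap(p, len);
		return NULL;
	}

	return p;
}

/* Open the write window only around the single store, then close it. */
int table_set_entry(uint64_t *table, size_t len, size_t idx, uint64_t val)
{
	assert(idx < len / sizeof(*table));

	if (table_enable_write(table, len))
		return -1;

	table[idx] = val;

	return table_disable_write(table, len);
}
```

A store to the table outside table_set_entry() would fault, which is the
property the series is after for page table pages.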

Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/boot/compressed/ident_map_64.c |  3 ++
 arch/x86/include/asm/pgalloc.h          |  2 +
 arch/x86/include/asm/pgtable.h          | 21 +++++++-
 arch/x86/include/asm/pgtable_64.h       | 33 ++++++++++--
 arch/x86/mm/init.c                      |  2 +-
 arch/x86/mm/pgtable.c                   | 72 +++++++++++++++++++++++--
 include/asm-generic/pgalloc.h           |  2 +-
 7 files changed, 125 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index f7213d0943b8..4f7d17970688 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -349,3 +349,6 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
 	 */
 	add_identity_map(address, end);
 }
+
+void enable_pgtable_write(void *p) {}
+void disable_pgtable_write(void *p) {}
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index c7ec5bb88334..a9e2d77697a7 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -6,6 +6,8 @@
 #include <linux/mm.h>		/* for struct page */
 #include <linux/pagemap.h>
 
+#define STATIC_TABLE_KEY	1
+
 #define __HAVE_ARCH_PTE_ALLOC_ONE
 #define __HAVE_ARCH_PGD_FREE
 #include <asm-generic/pgalloc.h>
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..0cc5753983ab 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -117,6 +117,9 @@ extern pmdval_t early_pmd_flags;
 #define arch_end_context_switch(prev)	do {} while(0)
 #endif	/* CONFIG_PARAVIRT_XXL */
 
+void enable_pgtable_write(void *pg_table);
+void disable_pgtable_write(void *pg_table);
+
 /*
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
@@ -1073,7 +1076,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+	enable_pgtable_write(ptep);
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+	disable_pgtable_write(ptep);
 }
 
 #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
@@ -1123,7 +1128,9 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+	enable_pgtable_write(pmdp);
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	disable_pgtable_write(pmdp);
 }
 
 #define pud_write pud_write
@@ -1138,10 +1145,18 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmdp, pmd_t pmd)
 {
 	if (IS_ENABLED(CONFIG_SMP)) {
-		return xchg(pmdp, pmd);
+		pmd_t ret;
+
+		enable_pgtable_write(pmdp);
+		ret = xchg(pmdp, pmd);
+		disable_pgtable_write(pmdp);
+
+		return ret;
 	} else {
 		pmd_t old = *pmdp;
+		enable_pgtable_write(pmdp);
 		WRITE_ONCE(*pmdp, pmd);
+		disable_pgtable_write(pmdp);
 		return old;
 	}
 }
@@ -1224,13 +1239,17 @@ static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
  */
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
+	enable_pgtable_write(dst);
 	memcpy(dst, src, count * sizeof(pgd_t));
+	disable_pgtable_write(dst);
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
 	/* Clone the user space pgd as well */
+	enable_pgtable_write(kernel_to_user_pgdp(dst));
 	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
 	       count * sizeof(pgd_t));
+	disable_pgtable_write(kernel_to_user_pgdp(dst));
 #endif
 }
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 56d0399a0cd1..5dfcf7dbe6ac 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -64,7 +64,9 @@ void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);
 
 static inline void native_set_pte(pte_t *ptep, pte_t pte)
 {
+	enable_pgtable_write(ptep);
 	WRITE_ONCE(*ptep, pte);
+	disable_pgtable_write(ptep);
 }
 
 static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
@@ -80,7 +82,9 @@ static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
 
 static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
+	enable_pgtable_write(pmdp);
 	WRITE_ONCE(*pmdp, pmd);
+	disable_pgtable_write(pmdp);
 }
 
 static inline void native_pmd_clear(pmd_t *pmd)
@@ -91,7 +95,12 @@ static inline void native_pmd_clear(pmd_t *pmd)
 static inline pte_t native_ptep_get_and_clear(pte_t *xp)
 {
 #ifdef CONFIG_SMP
-	return native_make_pte(xchg(&xp->pte, 0));
+	pteval_t pte_val;
+
+	enable_pgtable_write(xp);
+	pte_val = xchg(&xp->pte, 0);
+	disable_pgtable_write(xp);
+	return native_make_pte(pte_val);
 #else
 	/* native_local_ptep_get_and_clear,
 	   but duplicated because of cyclic dependency */
@@ -104,7 +113,12 @@ static inline pte_t native_ptep_get_and_clear(pte_t *xp)
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 {
 #ifdef CONFIG_SMP
-	return native_make_pmd(xchg(&xp->pmd, 0));
+	pmdval_t pte_val;
+
+	enable_pgtable_write(xp);
+	pte_val = xchg(&xp->pmd, 0);
+	disable_pgtable_write(xp);
+	return native_make_pmd(pte_val);
 #else
 	/* native_local_pmdp_get_and_clear,
 	   but duplicated because of cyclic dependency */
@@ -116,7 +130,9 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 
 static inline void native_set_pud(pud_t *pudp, pud_t pud)
 {
+	enable_pgtable_write(pudp);
 	WRITE_ONCE(*pudp, pud);
+	disable_pgtable_write(pudp);
 }
 
 static inline void native_pud_clear(pud_t *pud)
@@ -127,7 +143,12 @@ static inline void native_pud_clear(pud_t *pud)
 static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 {
 #ifdef CONFIG_SMP
-	return native_make_pud(xchg(&xp->pud, 0));
+	pudval_t pte_val;
+
+	enable_pgtable_write(xp);
+	pte_val = xchg(&xp->pud, 0);
+	disable_pgtable_write(xp);
+	return native_make_pud(pte_val);
 #else
 	/* native_local_pudp_get_and_clear,
 	 * but duplicated because of cyclic dependency
@@ -144,13 +165,17 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 	pgd_t pgd;
 
 	if (pgtable_l5_enabled() || !IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION)) {
+		enable_pgtable_write(p4dp);
 		WRITE_ONCE(*p4dp, p4d);
+		disable_pgtable_write(p4dp);
 		return;
 	}
 
 	pgd = native_make_pgd(native_p4d_val(p4d));
 	pgd = pti_set_user_pgtbl((pgd_t *)p4dp, pgd);
+	enable_pgtable_write(p4dp);
 	WRITE_ONCE(*p4dp, native_make_p4d(native_pgd_val(pgd)));
+	disable_pgtable_write(p4dp);
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -160,7 +185,9 @@ static inline void native_p4d_clear(p4d_t *p4d)
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+	enable_pgtable_write(pgdp);
 	WRITE_ONCE(*pgdp, pti_set_user_pgtbl(pgdp, pgd));
+	disable_pgtable_write(pgdp);
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 75ef19aa8903..5c7e70e15199 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -120,7 +120,7 @@ __ref void *alloc_low_pages(unsigned int num)
 		unsigned int order;
 
 		order = get_order((unsigned long)num << PAGE_SHIFT);
-		return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
+		return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO | __GFP_PTE_MAPPED, order);
 	}
 
 	if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..fd6bfa361865 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -2,10 +2,13 @@
 #include <linux/mm.h>
 #include <linux/gfp.h>
 #include <linux/hugetlb.h>
+#include <linux/printk.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/fixmap.h>
 #include <asm/mtrr.h>
+#include <asm/set_memory.h>
+#include "mm_internal.h"
 
 #ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
 phys_addr_t physical_mask __ro_after_init = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
@@ -52,6 +55,7 @@ early_param("userpte", setup_userpte);
 
 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
+	enable_pgtable_write(page_address(pte));
 	pgtable_pte_page_dtor(pte);
 	paravirt_release_pte(page_to_pfn(pte));
 	paravirt_tlb_remove_table(tlb, pte);
@@ -69,6 +73,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 #ifdef CONFIG_X86_PAE
 	tlb->need_flush_all = 1;
 #endif
+	enable_pgtable_write(pmd);
 	pgtable_pmd_page_dtor(page);
 	paravirt_tlb_remove_table(tlb, page);
 }
@@ -76,6 +81,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 #if CONFIG_PGTABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
+	enable_pgtable_write(pud);
 	paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
 	paravirt_tlb_remove_table(tlb, virt_to_page(pud));
 }
@@ -83,6 +89,7 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 #if CONFIG_PGTABLE_LEVELS > 4
 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 {
+	enable_pgtable_write(p4d);
 	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
 	paravirt_tlb_remove_table(tlb, virt_to_page(p4d));
 }
@@ -145,6 +152,7 @@ static void pgd_dtor(pgd_t *pgd)
 	if (SHARED_KERNEL_PMD)
 		return;
 
+	enable_pgtable_write(pgd);
 	spin_lock(&pgd_lock);
 	pgd_list_del(pgd);
 	spin_unlock(&pgd_lock);
@@ -543,9 +551,12 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	if (pte_young(*ptep))
+	if (pte_young(*ptep)) {
+		enable_pgtable_write(ptep);
 		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
 					 (unsigned long *) &ptep->pte);
+		disable_pgtable_write(ptep);
+	}
 
 	return ret;
 }
@@ -556,9 +567,12 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	if (pmd_young(*pmdp))
+	if (pmd_young(*pmdp)) {
+		enable_pgtable_write(pmdp);
 		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
 					 (unsigned long *)pmdp);
+		disable_pgtable_write(pmdp);
+	}
 
 	return ret;
 }
@@ -567,9 +581,12 @@ int pudp_test_and_clear_young(struct vm_area_struct *vma,
 {
 	int ret = 0;
 
-	if (pud_young(*pudp))
+	if (pud_young(*pudp)) {
+		enable_pgtable_write(pudp);
 		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
 					 (unsigned long *)pudp);
+		disable_pgtable_write(pudp);
+	}
 
 	return ret;
 }
@@ -578,6 +595,7 @@ int pudp_test_and_clear_young(struct vm_area_struct *vma,
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
+	int ret;
 	/*
 	 * On x86 CPUs, clearing the accessed bit without a TLB flush
 	 * doesn't cause data corruption. [ It could cause incorrect
@@ -591,7 +609,10 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
 	 * shouldn't really matter because there's no real memory
 	 * pressure for swapout to react to. ]
 	 */
-	return ptep_test_and_clear_young(vma, address, ptep);
+	enable_pgtable_write(ptep);
+	ret = ptep_test_and_clear_young(vma, address, ptep);
+	disable_pgtable_write(ptep);
+	return ret;
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -602,7 +623,9 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
+	enable_pgtable_write(pmdp);
 	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	disable_pgtable_write(pmdp);
 	if (young)
 		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 
@@ -851,6 +874,47 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 	return 1;
 }
 
+static void pgtable_write_set(void *pg_table, bool set)
+{
+	int level = 0;
+	pte_t *pte;
+
+	/*
+	 * Skip the page tables allocated from pgt_buf break area and from
+	 * memblock
+	 */
+	if (!after_bootmem)
+		return;
+	if (!PageTable(virt_to_page(pg_table)))
+		return;
+
+	pte = lookup_address((unsigned long)pg_table, &level);
+	if (!pte || level != PG_LEVEL_4K)
+		return;
+
+	if (set) {
+		if (pte_write(*pte))
+			return;
+
+		WRITE_ONCE(*pte, pte_mkwrite(*pte));
+	} else {
+		if (!pte_write(*pte))
+			return;
+
+		WRITE_ONCE(*pte, pte_wrprotect(*pte));
+	}
+}
+
+void enable_pgtable_write(void *pg_table)
+{
+	pgtable_write_set(pg_table, true);
+}
+
+void disable_pgtable_write(void *pg_table)
+{
+	pgtable_write_set(pg_table, false);
+}
+
 #else /* !CONFIG_X86_64 */
 
 /*
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 02932efad3ab..bc71d529552e 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -4,7 +4,7 @@
 
 #ifdef CONFIG_MMU
 
-#define GFP_PGTABLE_KERNEL	(GFP_KERNEL | __GFP_ZERO)
+#define GFP_PGTABLE_KERNEL	(GFP_KERNEL | __GFP_ZERO | __GFP_PTE_MAPPED)
 #define GFP_PGTABLE_USER	(GFP_PGTABLE_KERNEL | __GFP_ACCOUNT)
 
 /**
-- 
2.28.0
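
The enable/modify/disable pattern that this patch wraps around every page
table store can be tried out in user space. The following is a minimal
sketch, not kernel code: it assumes a Linux/POSIX system and uses mprotect()
on an anonymous mapping as a stand-in for flipping _PAGE_RW in the direct
map. The names page_size(), table_alloc(), table_page(),
table_write_enable(), table_write_disable() and table_set_entry() are
hypothetical and exist only for this illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size(void)
{
	return (size_t)sysconf(_SC_PAGESIZE);
}

/* Hypothetical stand-in for a page table page that is kept read-only. */
static uint64_t *table_alloc(void)
{
	void *p = mmap(NULL, page_size(), PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : (uint64_t *)p;
}

/* Round an entry pointer down to the start of the page that holds it. */
static void *table_page(void *entry)
{
	return (void *)((uintptr_t)entry & ~((uintptr_t)page_size() - 1));
}

/* Plays the role of enable_pgtable_write() in this sketch. */
static int table_write_enable(void *entry)
{
	return mprotect(table_page(entry), page_size(),
			PROT_READ | PROT_WRITE);
}

/* Plays the role of disable_pgtable_write() in this sketch. */
static int table_write_disable(void *entry)
{
	return mprotect(table_page(entry), page_size(), PROT_READ);
}

/* Shaped like native_set_pte(): open the window, store, close it again. */
static int table_set_entry(uint64_t *entry, uint64_t val)
{
	assert(entry != NULL);

	if (table_write_enable(entry))
		return -1;
	*entry = val;
	return table_write_disable(entry);
}
```

A store that bypasses table_set_entry() faults in this model, which is the
property the patch aims for with page tables. Unlike the kernel helpers,
mprotect() also takes care of TLB invalidation, so the sketch says nothing
about flushing.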



* Re: [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
  2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
                   ` (3 preceding siblings ...)
  2021-08-23 13:25 ` [RFC PATCH 4/4] x86/mm: write protect (most) page tables Mike Rapoport
@ 2021-08-23 20:02 ` Edgecombe, Rick P
  2021-08-24 13:03   ` Mike Rapoport
  2021-08-24 16:09 ` Vlastimil Babka
  5 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2021-08-23 20:02 UTC (permalink / raw)
  To: linux-mm, rppt
  Cc: linux-kernel, peterz, keescook, Weiny, Ira, dave.hansen, vbabka,
	x86, akpm, rppt, Lutomirski, Andy

On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> This is early prototype for addition of cache of pte-mapped pages to
> the
> page allocator. It survives boot and some cache shrinking, but it's
> still
> a long way to go for it to be ready for non-RFC posting.
> 
> The example use-case for pte-mapped cache is protection of page
> tables and
> keeping them read-only except for the designated code that is allowed
> to
> modify the page tables.
> 
> I'd like to get an early feedback for the approach to see what would
> be
> the best way to move forward with something like this.
> 
> This set is x86 specific at the moment because other architectures
> either
> do not support set_memory APIs that split the direct^w linear map
> (e.g.
> PowerPC) or only enable set_memory APIs when the linear map uses
> basic page
> size (like arm64).
> 
> == Motivation ==
> 
> There are usecases that need to remove pages from the direct map or
> at
> least map them with 4K granularity. Whenever this is done e.g. with
> set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
> direct map are split into smaller pages.
> 
> To reduce the performance hit caused by the fragmentation of the
> direct map
> it makes sense to group and/or cache the 4K pages removed from the
> direct
> map so that the split large pages won't be all over the place. 
> 
If you tied this into debug page alloc, you shouldn't need to group the
pages. Are you thinking this PKS-less page table usage would be a
security feature or debug time thing?

> There were RFCs for grouped page allocations for vmalloc permissions
> [1]
> and for using PKS to protect page tables [2] as well as an attempt to
> use a
> pool of large pages in secretmem [3].
> 
> == Implementation overview ==
> 
> This set leverages ideas from the patches that added PKS protection
> to page
> tables, but instead of adding per-user grouped allocations it tries
> to move
> the cache of pte-mapped pages closer to the page allocator.
> 
> The idea is to use a gfp flag that will instruct the page allocator
> to use
> the cache of pte-mapped pages because the caller needs to remove them
> from
> the direct map or change their attributes. 
> 
> When the cache is empty there is an attempt to refill it using PMD-
> sized
> allocation so that once the direct map is split we'll be able to use
> all 4K
> pages made available by the split. 
> 
> If the high order allocation fails, we fall back to order-0 and mark
> the
> entire pageblock as pte-mapped. When pages from that pageblock are
> freed to
> the page allocator they are put into the pte-mapped cache. There is
> also
> unimplemented provision to add free pages from such pageblock to the
> pte-mapped cache along with the page that was allocated and cause the
> split
> of the pageblock.
> 
> For now only order-0 allocations of pte-mapped pages are supported,
> which
> prevents, for instance, allocation of PGD with PTI enabled.
> 
> The free pages in the cache may be reclaimed using a shrinker, but
> for now
> they will remain mapped with PTEs in the direct map.
> 
> == TODOs ==
> 
> Whenever pte-mapped cache is being shrunk, it is possible to add some
> kind
> of compaction to move all the free pages into PMD-sized chunks, free
> these
> chunks at once and restore large page in the direct map.
I had made a POC to do this a while back that hooked into the buddy
code in the page allocator, where this coalescing is already happening
for freed pages. The problem was that most pages that get their direct
map alias broken end up using a page from the same 2MB page for the
page table in the split. That direct map page table never gets freed,
so the check done when the allocated page is freed can never restore
the large page. Grouping permissioned pages OR page tables would
resolve that, and my plan was to try again after something like this
happened. It was just an experiment, but I can share it if you are
interested.
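
To make the failure mode concrete, here is a minimal user-space sketch of
the kind of check involved. It is not the POC code: the page_state enum,
PAGES_PER_PMD and can_restore_large_page() are hypothetical names, and the
kernel tracks page state quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_PMD	512	/* 2M / 4K on x86-64 */

/* Hypothetical per-page state, for illustration only. */
enum page_state { PAGE_FREE, PAGE_IN_USE, PAGE_IS_PGTABLE };

/*
 * Return true if every 4K page in the 2M-aligned block containing @pfn
 * is free, i.e. the block could be mapped by a single PMD entry again.
 */
static bool can_restore_large_page(const enum page_state *pages,
				   size_t nr_pages, size_t pfn)
{
	size_t start = pfn - (pfn % PAGES_PER_PMD);
	size_t i;

	assert(pfn < nr_pages);

	if (start + PAGES_PER_PMD > nr_pages)
		return false;

	for (i = start; i < start + PAGES_PER_PMD; i++)
		if (pages[i] != PAGE_FREE)
			return false;

	return true;
}
```

If one page in the block is marked PAGE_IS_PGTABLE because it holds the
PTEs created by the split, the check fails for every other page freed in
that block, which is the situation described above.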

> 
> There is also a possibility to add heuristics and knobs to control
> greediness of the cache vs memory pressure so that freed pte-mapped
> cache
> won't be necessarily put into the pte-mapped cache.
> 
> Another thing that can be implemented is pre-populating the pte-cache 
> at
> boot time to include the free pages that are anyway mapped by PTEs.
> 
> == Alternatives ==
> 
> Current implementation uses a single global cache.
> 
> Another option is to have per-user caches, e.g one for the page
> tables,
> another for vmalloc etc.  This approach provides better control of
> the
> permissions of the pages allocated from these caches and allows the
> user to
> decide when (if at all) the pages can be accessed, e.g. for cache
> compaction. The down side of this approach is that it complicates the
> freeing path. A page allocated from a dedicated cache cannot be freed
> with
> put_page()/free_page() etc but it has to be freed with a dedicated
> API or
> there should be some back pointer in struct page so that page
> allocator
> will know what cache this page came from.
This needs to reset the permissions before freeing, so doesn't seem too
different than freeing them a special way.
> 
> Yet another possibility to make pte-mapped cache a migratetype of its
> own.
> Creating a new migratetype would allow higher order allocations of
> pte-mapped pages, but I don't have enough understanding of page
> allocator
> and reclaim internals to estimate the complexity associated with this
> approach. 
> 
I've been thinking about two categories of direct map permission
usages.

One is limiting the use of the direct map alias when it's not in use
and the primary alias is getting some other permission. Examples are
modules, secretmem, xpfo, KVM guest memory unmapping stuff, etc. In
this case re-allocations can share unmapped pages without doing any
expensive maintenance and it helps to have one big cache. If you are
going to convert pages to 4k and cache them, you might as well convert
them to NP at the time, since it's cheap to restore them or set their
permission from that state.

Two is setting permissions on the direct map as the only alias to be
used. This includes this rfc, some PKS usages, but also possibly some
set_pages_uc() callers and the like. It seems that this category could
still make use of a big unmapped cache of pages. Just ask for unmapped
pages and convert them without a flush.

So something would maintain a big cache of grouped unmapped pages
that category one usages could share, and little category two
allocators could have their own caches that feed on it too. What do you
think? This applies regardless of whether they live in the page
allocator or not.
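
Roughly, as a user-space sketch with made-up names (struct unmapped_pool,
struct user_cache, pool_take(), user_cache_alloc() and user_cache_drain()
are all hypothetical, and pages are reduced to counters):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared pool of grouped, unmapped 4K pages. */
struct unmapped_pool {
	size_t nr_free;
};

/* Hypothetical small per-user cache that feeds on the shared pool. */
struct user_cache {
	struct unmapped_pool *pool;
	size_t nr_cached;
	size_t batch;		/* pages pulled from the pool per refill */
};

/* Take up to @want pages from the shared pool, return how many we got. */
static size_t pool_take(struct unmapped_pool *pool, size_t want)
{
	size_t got = want < pool->nr_free ? want : pool->nr_free;

	pool->nr_free -= got;
	return got;
}

/* Return 1 if a page was handed out, 0 if both tiers are empty. */
static int user_cache_alloc(struct user_cache *uc)
{
	assert(uc->batch > 0);

	if (!uc->nr_cached)
		uc->nr_cached = pool_take(uc->pool, uc->batch);
	if (!uc->nr_cached)
		return 0;

	uc->nr_cached--;
	return 1;
}

/* Give all cached pages back to the shared pool, e.g. from a shrinker. */
static void user_cache_drain(struct user_cache *uc)
{
	uc->pool->nr_free += uc->nr_cached;
	uc->nr_cached = 0;
}
```

The sketch leaves out the part that matters most in practice, namely
setting and resetting permissions as pages move between the tiers.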


> [1] 
> https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
> [2] 
> https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
> [3] 
> https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org/
> 
> Mike Rapoport (2):
>   mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-
> mapped pages
>   x86/mm: write protect (most) page tables
> 
> Rick Edgecombe (2):
>   list: Support getting most recent element in list_lru
>   list: Support list head not in object for list_lru
> 
>  arch/Kconfig                            |   8 +
>  arch/x86/Kconfig                        |   1 +
>  arch/x86/boot/compressed/ident_map_64.c |   3 +
>  arch/x86/include/asm/pgalloc.h          |   2 +
>  arch/x86/include/asm/pgtable.h          |  21 +-
>  arch/x86/include/asm/pgtable_64.h       |  33 ++-
>  arch/x86/mm/init.c                      |   2 +-
>  arch/x86/mm/pgtable.c                   |  72 ++++++-
>  include/asm-generic/pgalloc.h           |   2 +-
>  include/linux/gfp.h                     |  11 +-
>  include/linux/list_lru.h                |  26 +++
>  include/linux/mm.h                      |   2 +
>  include/linux/pageblock-flags.h         |  26 +++
>  init/main.c                             |   1 +
>  mm/internal.h                           |   3 +-
>  mm/list_lru.c                           |  38 +++-
>  mm/page_alloc.c                         | 261
> +++++++++++++++++++++++-
>  17 files changed, 496 insertions(+), 16 deletions(-)
> 


* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 13:25 ` [RFC PATCH 4/4] x86/mm: write protect (most) page tables Mike Rapoport
@ 2021-08-23 20:08   ` Edgecombe, Rick P
  2021-08-23 23:50   ` Dave Hansen
       [not found]   ` <FB6C09CD-9CEA-4FE8-B179-98DB63EBDD68@gmail.com>
  2 siblings, 0 replies; 27+ messages in thread
From: Edgecombe, Rick P @ 2021-08-23 20:08 UTC (permalink / raw)
  To: linux-mm, rppt
  Cc: linux-kernel, peterz, keescook, Weiny, Ira, dave.hansen, vbabka,
	x86, akpm, rppt, Lutomirski, Andy

On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Allocate page table using __GFP_PTE_MAPPED so that they will have 4K
> PTEs
> in the direct map. This allows switching the _PAGE_RW bit each time a
> page
> table page needs to be made writable or read-only.
> 
> The writability of the page tables is toggled only in the lowest
> level page
> table modification functions and immediately switched off.
> 
> The page tables created early in the boot (including the direct map
> page
> table) are not write protected.
> 
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
I have a second version of the PKS tables series that I think gets all
of them.

Also, I didn't see any flush anywhere when toggling. I guess the
spurious kernel fault handler is doing the work? It might be better to
just do a local flush of the address.
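
As a toy model of why the flush matters when write-protecting: the sketch
below keeps one PTE and the locally cached copy of its RW bit. Everything
in it (struct sim_mapping and the sim_*() helpers) is hypothetical, and it
only models the protect direction. On real x86 the enable direction with a
stale read-only entry faults and gets fixed up instead. In the kernel, a
helper such as flush_tlb_one_kernel() would be a candidate for the local
flush, but that is an assumption, not something this patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry model of a PTE and the local TLB. */
struct sim_mapping {
	bool pte_rw;	/* RW bit in the page table entry */
	bool tlb_valid;	/* the local TLB holds a translation */
	bool tlb_rw;	/* RW permission cached in that translation */
};

/* Drop the cached translation, like a local single-address flush. */
static void sim_flush_local(struct sim_mapping *m)
{
	m->tlb_valid = false;
}

/* A write consults the TLB first and walks the table only on a miss. */
static bool sim_write_allowed(struct sim_mapping *m)
{
	assert(m != NULL);

	if (!m->tlb_valid) {
		m->tlb_rw = m->pte_rw;
		m->tlb_valid = true;
	}
	return m->tlb_rw;
}

/* Change the RW bit, optionally followed by a local flush. */
static void sim_set_rw(struct sim_mapping *m, bool rw, bool flush)
{
	m->pte_rw = rw;
	if (flush)
		sim_flush_local(m);
}
```

Without the flush, a write through the stale cached translation still
succeeds after the PTE was made read-only, so the protection window stays
open longer than intended on that CPU.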



* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-23 13:25 ` [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages Mike Rapoport
@ 2021-08-23 20:29   ` Edgecombe, Rick P
  2021-08-24 13:02     ` Mike Rapoport
  2021-08-24 16:12   ` Vlastimil Babka
  2021-08-25  8:43   ` David Hildenbrand
  2 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2021-08-23 20:29 UTC (permalink / raw)
  To: linux-mm, rppt
  Cc: linux-kernel, peterz, keescook, Weiny, Ira, dave.hansen, vbabka,
	x86, akpm, rppt, Lutomirski, Andy

On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> When __GFP_PTE_MAPPED flag is passed to an allocation request of
> order 0,
> the allocated page will be mapped at PTE level in the direct map.
> 
> To reduce the direct map fragmentation, maintain a cache of 4K pages
> that
> are already mapped at PTE level in the direct map. Whenever the cache
> should be replenished, try to allocate a 2M page and split it into 4K
> pages
> to localize the shattering of the direct map. If the allocation of a
> 2M page fails,
> fall back to a single page allocation at the expense of direct map
> fragmentation.
> 
> The cache registers a shrinker that releases free pages from the
> cache to
> the page allocator.
> 
> The __GFP_PTE_MAPPED and caching of 4K pages are enabled only if an
> architecture selects ARCH_WANTS_PTE_MAPPED_CACHE in its Kconfig.
> 
> [
> cache management is mostly copied from
> https://lore.kernel.org/lkml/20210505003032.489164-4-rick.p.edgecombe@intel.com/
> ]
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/Kconfig                    |   8 +
>  arch/x86/Kconfig                |   1 +
>  include/linux/gfp.h             |  11 +-
>  include/linux/mm.h              |   2 +
>  include/linux/pageblock-flags.h |  26 ++++
>  init/main.c                     |   1 +
>  mm/internal.h                   |   3 +-
>  mm/page_alloc.c                 | 261
> +++++++++++++++++++++++++++++++-
>  8 files changed, 309 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 129df498a8e1..2db95331201b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -243,6 +243,14 @@ config ARCH_HAS_SET_MEMORY
>  config ARCH_HAS_SET_DIRECT_MAP
>  	bool
>  
> +#
> +# Select if the architecture wants to minimize fragmentation of its
> +# direct/linear map caused by set_memory and set_direct_map
> operations
> +#
> +config ARCH_WANTS_PTE_MAPPED_CACHE
> +	bool
> +	depends on ARCH_HAS_SET_MEMORY || ARCH_HAS_SET_DIRECT_MAP
> +
>  #
>  # Select if the architecture provides the arch_dma_set_uncached
> symbol to
>  # either provide an uncached segment alias for a DMA allocation, or
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 88fb922c23a0..9b4e6cf4a6aa 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -118,6 +118,7 @@ config X86
>  	select ARCH_WANTS_NO_INSTR
>  	select ARCH_WANT_HUGE_PMD_SHARE
>  	select ARCH_WANT_LD_ORPHAN_WARN
> +	select ARCH_WANTS_PTE_MAPPED_CACHE
>  	select ARCH_WANTS_THP_SWAP		if X86_64
>  	select BUILDTIME_TABLE_SORT
>  	select CLKEVT_I8253
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..c9006e3c67ad 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -55,8 +55,9 @@ struct vm_area_struct;
>  #define ___GFP_ACCOUNT		0x400000u
>  #define ___GFP_ZEROTAGS		0x800000u
>  #define ___GFP_SKIP_KASAN_POISON	0x1000000u
> +#define ___GFP_PTE_MAPPED	0x2000000u
>  #ifdef CONFIG_LOCKDEP
> -#define ___GFP_NOLOCKDEP	0x2000000u
> +#define ___GFP_NOLOCKDEP	0x4000000u
>  #else
>  #define ___GFP_NOLOCKDEP	0
>  #endif
> @@ -101,12 +102,18 @@ struct vm_area_struct;
>   * node with no fallbacks or placement policy enforcements.
>   *
>   * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
> + *
> + * %__GFP_PTE_MAPPED returns a page that is mapped with PTE in the
> + * direct map. On architectures that use higher page table levels to
> map
> + * physical memory, this flag will cause a split of large pages in the
> direct
> + * mapping. Has effect only if CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE is
> set.
>   */
>  #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
>  #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
>  #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
>  #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
>  #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
> +#define __GFP_PTE_MAPPED ((__force gfp_t)___GFP_PTE_MAPPED)
>  
>  /**
>   * DOC: Watermark modifiers
> @@ -249,7 +256,7 @@ struct vm_area_struct;
>  #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>  
>  /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
> +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) -
> 1))
>  
>  /**
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7ca22e6e694a..350ec98b82d2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3283,5 +3283,7 @@ static inline int seal_check_future_write(int
> seals, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +void pte_mapped_cache_init(void);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/pageblock-flags.h
> b/include/linux/pageblock-flags.h
> index 973fd731a520..4faf8c542b00 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -21,6 +21,8 @@ enum pageblock_bits {
>  			/* 3 bits required for migrate types */
>  	PB_migrate_skip,/* If set the block is skipped by compaction */
>  
> +	PB_pte_mapped, /* If set the block is mapped with PTEs in
> direct map */
> +
>  	/*
>  	 * Assume the bits will always align on a word. If this
> assumption
>  	 * changes then get/set pageblock needs updating.
> @@ -88,4 +90,28 @@ static inline void set_pageblock_skip(struct page
> *page)
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> +#ifdef CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE
> +#define get_pageblock_pte_mapped(page)				
> \
> +	get_pfnblock_flags_mask(page, page_to_pfn(page),	\
> +			(1 << (PB_pte_mapped)))
> +#define clear_pageblock_pte_mapped(page) \
> +	set_pfnblock_flags_mask(page, 0, page_to_pfn(page),	\
> +			(1 << PB_pte_mapped))
> +#define set_pageblock_pte_mapped(page) \
> +	set_pfnblock_flags_mask(page, (1 << PB_pte_mapped),	\
> +			page_to_pfn(page),			\
> +			(1 << PB_pte_mapped))
> +#else /* CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE */
> +static inline bool get_pageblock_pte_mapped(struct page *page)
> +{
> +	return false;
> +}
> +static inline void clear_pageblock_pte_mapped(struct page *page)
> +{
> +}
> +static inline void set_pageblock_pte_mapped(struct page *page)
> +{
> +}
> +#endif /* CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE */
> +
>  #endif	/* PAGEBLOCK_FLAGS_H */
> diff --git a/init/main.c b/init/main.c
> index f5b8246e8aa1..c0d59a183a39 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -828,6 +828,7 @@ static void __init mm_init(void)
>  	page_ext_init_flatmem_late();
>  	kmem_cache_init();
>  	kmemleak_init();
> +	pte_mapped_cache_init();
>  	pgtable_init();
>  	debug_objects_mem_init();
>  	vmalloc_init();
> diff --git a/mm/internal.h b/mm/internal.h
> index 31ff935b2547..0557ece6ebf4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -24,7 +24,8 @@
>  			__GFP_ATOMIC)
>  
>  /* The GFP flags allowed during early boot */
> -#define GFP_BOOT_MASK (__GFP_BITS_MASK &
> ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
> +#define GFP_BOOT_MASK (__GFP_BITS_MASK &
> ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS|\
> +					   __GFP_PTE_MAPPED))
>  
>  /* Control allocation cpuset and node placement constraints */
>  #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 856b175c15a4..7936d8dcb80b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -72,6 +72,7 @@
>  #include <linux/padata.h>
>  #include <linux/khugepaged.h>
>  #include <linux/buffer_head.h>
> +#include <linux/set_memory.h>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -107,6 +108,14 @@ typedef int __bitwise fpi_t;
>   */
>  #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
>  
> +/*
> + * Free page directly to the page allocator rather than check if it
> should
> + * be placed into the cache of pte_mapped pages.
> + * Used by the pte_mapped cache shrinker.
> + * Has effect only when pte-mapped cache is enabled
> + */
> +#define FPI_NO_PTE_MAP		((__force fpi_t)BIT(2))
> +
>  /*
>   * Don't poison memory with KASAN (only for the tag-based modes).
>   * During boot, all non-reserved memblock memory is exposed to
> page_alloc.
> @@ -225,6 +234,19 @@ static inline void
> set_pcppage_migratetype(struct page *page, int migratetype)
>  	page->index = migratetype;
>  }
>  
> +#ifdef CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE
> +static struct page *alloc_page_pte_mapped(gfp_t gfp);
> +static void free_page_pte_mapped(struct page *page);
> +#else
> +static inline struct page *alloc_page_pte_mapped(gfp_t gfp)
> +{
> +	return NULL;
> +}
> +static void free_page_pte_mapped(struct page *page)
> +{
> +}
> +#endif
> +
>  #ifdef CONFIG_PM_SLEEP
>  /*
>   * The following functions are used by the suspend/hibernate code to
> temporarily
> @@ -536,7 +558,7 @@ void set_pfnblock_flags_mask(struct page *page,
> unsigned long flags,
>  	unsigned long bitidx, word_bitidx;
>  	unsigned long old_word, word;
>  
> -	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
> +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 5);
>  	BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
>  
>  	bitmap = get_pageblock_bitmap(page, pfn);
> @@ -1352,6 +1374,16 @@ static __always_inline bool
> free_pages_prepare(struct page *page,
>  					   PAGE_SIZE << order);
>  	}
>  
> +	/*
> +	 * Unless we are shrinking the pte_mapped cache, return a page from
> +	 * a pageblock mapped with PTEs to that cache.
> +	 */
> +	if (!order && !(fpi_flags & FPI_NO_PTE_MAP) &&
> +	    get_pageblock_pte_mapped(page)) {
> +		free_page_pte_mapped(page);
> +		return false;
> +	}
> +
>  	kernel_poison_pages(page, 1 << order);
>  
>  	/*
> @@ -3445,6 +3477,13 @@ void free_unref_page_list(struct list_head
> *list)
>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		pfn = page_to_pfn(page);
> +
> +		if (get_pageblock_pte_mapped(page)) {
> +			list_del(&page->lru);
> +			free_page_pte_mapped(page);
> +			continue;
> +		}
> +
>  		if (!free_unref_page_prepare(page, pfn, 0))
>  			list_del(&page->lru);
>  
> @@ -5381,6 +5420,12 @@ struct page *__alloc_pages(gfp_t gfp, unsigned
> int order, int preferred_nid,
>  			&alloc_gfp, &alloc_flags))
>  		return NULL;
>  
> +	if ((alloc_gfp & __GFP_PTE_MAPPED) && order == 0) {
> +		page = alloc_page_pte_mapped(alloc_gfp);
> +		if (page)
> +			goto out;
> +	}
> +
>  	/*
>  	 * Forbid the first pass from falling back to types that
> fragment
>  	 * memory until all local zones are considered.
> @@ -5681,6 +5726,220 @@ void free_pages_exact(void *virt, size_t
> size)
>  }
>  EXPORT_SYMBOL(free_pages_exact);
>  
> +#ifdef CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE
> +
> +struct pte_mapped_cache {
> +	struct shrinker shrinker;
> +	struct list_lru lru;
> +	atomic_t nid_round_robin;
> +	unsigned long free_cnt;
> +};
> +
> +static struct pte_mapped_cache pte_mapped_cache;
> +
> +static struct page *pte_mapped_cache_get(struct pte_mapped_cache
> *cache)
> +{
> +	unsigned long start_nid, i;
> +	struct list_head *head;
> +	struct page *page = NULL;
> +
> +	start_nid = atomic_fetch_inc(&cache->nid_round_robin) %
> nr_node_ids;
> +	for (i = 0; i < nr_node_ids; i++) {
> +		int cur_nid = (start_nid + i) % nr_node_ids;
> +
> +		head = list_lru_get_mru(&cache->lru, cur_nid);
> +		if (head) {
> +			page = list_entry(head, struct page, lru);
> +			break;
> +		}
> +	}
> +
> +	return page;
> +}
> +
> +static void pte_mapped_cache_add(struct pte_mapped_cache *cache,
> +				 struct page *page)
> +{
> +	INIT_LIST_HEAD(&page->lru);
> +	list_lru_add_node(&cache->lru, &page->lru, page_to_nid(page));
> +	set_page_count(page, 0);
> +}
> +
> +static void pte_mapped_cache_add_neighbour_pages(struct page *page)
> +{
> +#if 0
> +	/*
> +	 * TODO: if pte_mapped_cache_replenish() had to fallback to
> order-0
> +	 * allocation, the large page in the direct map will be split
> +	 * anyway and if there are free pages in the same pageblock
> they
> +	 * can be added to pte_mapped cache.
> +	 */
> +	unsigned int order = (1 << HUGETLB_PAGE_ORDER);
> +	unsigned int nr_pages = (1 << order);
> +	unsigned long pfn = page_to_pfn(page);
> +	struct page *page_head = page - (pfn & (order - 1));
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = page_head + i;
> +		if (is_free_buddy_page(page)) {
> +			take_page_off_buddy(page);
> +			pte_mapped_cache_add(&pte_mapped_cache, page);
> +		}
> +	}
> +#endif
> +}
> 
This seems a nice benefit of doing this sort of stuff in the page
allocator if it can work.

> +static struct page *pte_mapped_cache_replenish(struct
> pte_mapped_cache *cache,
> +					       gfp_t gfp)
> +{
> +	unsigned int order = HUGETLB_PAGE_ORDER;
> +	unsigned int nr_pages;
> +	struct page *page;
> +	int i, err;
> +
> +	gfp &= ~__GFP_PTE_MAPPED;
> +
> +	page = alloc_pages(gfp, order);
> +	if (!page) {
> +		order = 0;
> +		page = alloc_pages(gfp, order);
> +		if (!page)
> +			return NULL;
> +	}
> +
> +	nr_pages = (1 << order);
> +	err = set_memory_4k((unsigned long)page_address(page),
> nr_pages);
> +	if (err)
> +		goto err_free_pages;
> +
> +	if (order)
> +		split_page(page, order);
> +	else
> +		pte_mapped_cache_add_neighbour_pages(page);
> +
> +	for (i = 1; i < nr_pages; i++)
> +		pte_mapped_cache_add(cache, page + i);
> +
> +	set_pageblock_pte_mapped(page);
> +
> +	return page;
> +
> +err_free_pages:
> +	__free_pages(page, order);
> +	return NULL;
> +}
> +
> +static struct page *alloc_page_pte_mapped(gfp_t gfp)
I'm a little disappointed building into the page allocator didn't
automatically make higher order allocations easy. It seems this mostly
bolts the grouped pages code on to the page allocator and splits out of
the allocation/free paths to call into it?

I was thinking the main benefit of handling direct map permissions in
the page allocator would be re-using the buddy part to support high
order pages, etc. Did you try to build it in like that? If we can't get
that, what is the benefit of doing permission stuff in the page allocator?

> +{
> +	struct pte_mapped_cache *cache = &pte_mapped_cache;
> +	struct page *page;
> +
> +	page = pte_mapped_cache_get(cache);
> +	if (page) {
> +		prep_new_page(page, 0, gfp, 0);
> +		goto out;
> +	}
> +
> +	page = pte_mapped_cache_replenish(cache, gfp);
> +
> +out:
> +	return page;
> +}
> +
We probably want to exclude GFP_ATOMIC before calling into CPA unless
debug page alloc is on, because it may need to split and sleep for the
allocation. There is a page table allocation with GFP_ATOMIC passed
actually.

In my next series of this I added support for GFP_ATOMIC to this code,
but that solution should only work for permission changing grouped page
allocators in the protected page tables case where the direct map
tables are handled differently. As a general solution though (that's
the long term intention right?), GFP_ATOMIC might deserve some
consideration.

The other thing is we probably don't want to clean out the atomic
reserves and add them to a cache just for one page. I opted to just
convert one page in the GFP_ATOMIC case.

> +static void free_page_pte_mapped(struct page *page)
> +{
> +	pte_mapped_cache_add(&pte_mapped_cache, page);
> +}
> +
> +static struct pte_mapped_cache *pte_mapped_cache_from_sc(struct
> shrinker *sh)
> +{
> +	return container_of(sh, struct pte_mapped_cache, shrinker);
> +}
> +
> +static unsigned long pte_mapped_cache_shrink_count(struct shrinker
> *shrinker,
> +						   struct
> shrink_control *sc)
> +{
> +	struct pte_mapped_cache *cache =
> pte_mapped_cache_from_sc(shrinker);
> +	unsigned long count = list_lru_shrink_count(&cache->lru, sc);
> +
> +	return count ? count : SHRINK_EMPTY;
> +}
> +
> +static enum lru_status pte_mapped_cache_shrink_isolate(struct
> list_head *item,
> +						       struct
> list_lru_one *lst,
> +						       spinlock_t
> *lock,
> +						       void *cb_arg)
> +{
> +	struct list_head *freeable = cb_arg;
> +
> +	list_lru_isolate_move(lst, item, freeable);
> +
> +	return LRU_REMOVED;
> +}
> +
> +static unsigned long pte_mapped_cache_shrink_scan(struct shrinker
> *shrinker,
> +						  struct shrink_control
> *sc)
> +{
> +	struct pte_mapped_cache *cache =
> pte_mapped_cache_from_sc(shrinker);
> +	struct list_head *cur, *next;
> +	unsigned long isolated;
> +	LIST_HEAD(freeable);
> +
> +	isolated = list_lru_shrink_walk(&cache->lru, sc,
> +					pte_mapped_cache_shrink_isolate
> ,
> +					&freeable);
> +
> +	list_for_each_safe(cur, next, &freeable) {
> +		struct page *page = list_entry(cur, struct page, lru);
> +
> +		list_del(cur);
> +		__free_pages_ok(page, 0, FPI_NO_PTE_MAP);
> +	}
> +
> +	/* Every item walked gets isolated */
> +	sc->nr_scanned += isolated;
> +
> +	return isolated;
> +}
> +
> +static int __pte_mapped_cache_init(struct pte_mapped_cache *cache)
> +{
> +	int err;
> +
> +	err = list_lru_init(&cache->lru);
> +	if (err)
> +		return err;
> +
> +	cache->shrinker.count_objects = pte_mapped_cache_shrink_count;
> +	cache->shrinker.scan_objects = pte_mapped_cache_shrink_scan;
> +	cache->shrinker.seeks = DEFAULT_SEEKS;
> +	cache->shrinker.flags = SHRINKER_NUMA_AWARE;
> +
> +	err = register_shrinker(&cache->shrinker);
> +	if (err)
> +		goto err_list_lru_destroy;
> +
> +	return 0;
> +
> +err_list_lru_destroy:
> +	list_lru_destroy(&cache->lru);
> +	return err;
> +}
> +
> +void __init pte_mapped_cache_init(void)
> +{
> +	if (gfp_allowed_mask & __GFP_PTE_MAPPED)
> +		return;
> +
> +	if (!__pte_mapped_cache_init(&pte_mapped_cache))
> +		gfp_allowed_mask |= __GFP_PTE_MAPPED;
> +}
> +#else
> +void __init pte_mapped_cache_init(void)
> +{
> +}
> +#endif /* CONFIG_ARCH_WANTS_PTE_MAPPED_CACHE */
> +
>  /**
>   * nr_free_zone_pages - count number of pages beyond high watermark
>   * @offset: The zone index of the highest zone


* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 13:25 ` [RFC PATCH 4/4] x86/mm: write protect (most) page tables Mike Rapoport
  2021-08-23 20:08   ` Edgecombe, Rick P
@ 2021-08-23 23:50   ` Dave Hansen
  2021-08-24  3:34     ` Andy Lutomirski
                       ` (3 more replies)
       [not found]   ` <FB6C09CD-9CEA-4FE8-B179-98DB63EBDD68@gmail.com>
  2 siblings, 4 replies; 27+ messages in thread
From: Dave Hansen @ 2021-08-23 23:50 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

On 8/23/21 6:25 AM, Mike Rapoport wrote:
>  void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
>  {
> +	enable_pgtable_write(page_address(pte));
>  	pgtable_pte_page_dtor(pte);
>  	paravirt_release_pte(page_to_pfn(pte));
>  	paravirt_tlb_remove_table(tlb, pte);
> @@ -69,6 +73,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>  #ifdef CONFIG_X86_PAE
>  	tlb->need_flush_all = 1;
>  #endif
> +	enable_pgtable_write(pmd);
>  	pgtable_pmd_page_dtor(page);
>  	paravirt_tlb_remove_table(tlb, page);
>  }

I would have expected this to leverage the pte_offset_map/unmap() code
to enable/disable write access.  Granted, it would enable write access
even when only a read is needed, but that could be trivially fixed with
having a variant like:

	pte_offset_map_write()
	pte_offset_unmap_write()

in addition to the existing (presumably read-only) versions:

	pte_offset_map()
	pte_offset_unmap()

Although those only work for the leaf levels, it seems a shame not to
use them.
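
A rough userspace sketch of that split, not kernel code: struct pt_page,
its 'writable' flag (standing in for _PAGE_RW on the direct-map alias) and
the model_* names are all made up for illustration. The only point is that
the write variants bracket the window in which the page is writable, while
the plain variant never touches the permission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTRS_PER_PTE 512

/* Hypothetical stand-in for a page-table page and its direct-map permission. */
struct pt_page {
	unsigned long entries[PTRS_PER_PTE];
	bool writable;		/* models _PAGE_RW on the direct-map alias */
};

/* Read-only lookup: returns a const pointer and leaves the permission alone. */
static const unsigned long *model_pte_offset_map(const struct pt_page *pt,
						 size_t idx)
{
	return &pt->entries[idx % PTRS_PER_PTE];
}

/* Write variant: open the write window, then hand out a mutable pointer. */
static unsigned long *model_pte_offset_map_write(struct pt_page *pt, size_t idx)
{
	pt->writable = true;
	return &pt->entries[idx % PTRS_PER_PTE];
}

/* Close the write window opened by model_pte_offset_map_write(). */
static void model_pte_offset_unmap_write(struct pt_page *pt)
{
	pt->writable = false;
}
```

A caller that only reads would use model_pte_offset_map(); a caller that
modifies an entry would pair the _write map and unmap around the store.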

I'm also cringing a bit at hacking this into the page allocator.   A
*lot* of what you're trying to do with getting large allocations out and
splitting them up is done very well today by the slab allocators.  It
might take some rearrangement of 'struct page' metadata to be more slab
friendly, but it does seem like a close enough fit to warrant investigating.


* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 23:50   ` Dave Hansen
@ 2021-08-24  3:34     ` Andy Lutomirski
  2021-08-25 14:59       ` Dave Hansen
  2021-08-24 13:32     ` Mike Rapoport
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2021-08-24  3:34 UTC (permalink / raw)
  To: Dave Hansen, Mike Rapoport, linux-mm
  Cc: Andrew Morton, Dave Hansen, Ira Weiny, Kees Cook, Mike Rapoport,
	Peter Zijlstra (Intel),
	Rick P Edgecombe, Vlastimil Babka, the arch/x86 maintainers,
	Linux Kernel Mailing List



On Mon, Aug 23, 2021, at 4:50 PM, Dave Hansen wrote:
> On 8/23/21 6:25 AM, Mike Rapoport wrote:
> >  void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
> >  {
> > +	enable_pgtable_write(page_address(pte));
> >  	pgtable_pte_page_dtor(pte);
> >  	paravirt_release_pte(page_to_pfn(pte));
> >  	paravirt_tlb_remove_table(tlb, pte);
> > @@ -69,6 +73,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
> >  #ifdef CONFIG_X86_PAE
> >  	tlb->need_flush_all = 1;
> >  #endif
> > +	enable_pgtable_write(pmd);
> >  	pgtable_pmd_page_dtor(page);
> >  	paravirt_tlb_remove_table(tlb, page);
> >  }
> 
> I would expected this to have leveraged the pte_offset_map/unmap() code
> to enable/disable write access.  Granted, it would enable write access
> even when only a read is needed, but that could be trivially fixed with
> having a variant like:
> 
> 	pte_offset_map_write()
> 	pte_offset_unmap_write()

I would also like to see a discussion of how races are handled in which multiple threads or CPUs access ptes in the same page at the same time.

> 
> in addition to the existing (presumably read-only) versions:
> 
> 	pte_offset_map()
> 	pte_offset_unmap()
> 
> Although those only work for the leaf levels, it seems a shame not to to
> use them.
> 
> I'm also cringing a bit at hacking this into the page allocator.   A
> *lot* of what you're trying to do with getting large allocations out and
> splitting them up is done very well today by the slab allocators.  It
> might take some rearrangement of 'struct page' metadata to be more slab
> friendly, but it does seem like a close enough fit to warrant investigating.
> 


* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
       [not found]   ` <FB6C09CD-9CEA-4FE8-B179-98DB63EBDD68@gmail.com>
@ 2021-08-24  5:34     ` Nadav Amit
  2021-08-24 13:36       ` Mike Rapoport
  0 siblings, 1 reply; 27+ messages in thread
From: Nadav Amit @ 2021-08-24  5:34 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Linux-MM, Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

Sorry for sending twice. The mail app decided to use HTML for some
reason.

On Aug 23, 2021, at 10:32 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> 
> On Aug 23, 2021, at 6:25 AM, Mike Rapoport <rppt@kernel.org> wrote:
> 
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Allocate page table using __GFP_PTE_MAPPED so that they will have 4K PTEs
> in the direct map. This allows to switch _PAGE_RW bit each time a page
> table page needs to be made writable or read-only.
> 
> The writability of the page tables is toggled only in the lowest level page
> table modifiction functions and immediately switched off.
> 
> The page tables created early in the boot (including the direct map page
> table) are not write protected.
> 
> 

[ snip ]

> +static void pgtable_write_set(void *pg_table, bool set)
> +{
> +	int level = 0;
> +	pte_t *pte;
> +
> +	/*
> +	 * Skip the page tables allocated from pgt_buf break area and from
> +	 * memblock
> +	 */
> +	if (!after_bootmem)
> +		return;
> +	if (!PageTable(virt_to_page(pg_table)))
> +		return;
> +
> +	pte = lookup_address((unsigned long)pg_table, &level);
> +	if (!pte || level != PG_LEVEL_4K)
> +		return;
> +
> +	if (set) {
> +		if (pte_write(*pte))
> +			return;
> +
> +		WRITE_ONCE(*pte, pte_mkwrite(*pte));

I think that the pte_write() test (and the following one) might hide
latent bugs. Either you know whether the PTE is write-protected or you
need to protect against nested/concurrent calls to pgtable_write_set()
by disabling preemption/IRQs.

Otherwise, you risk having someone else write-protecting the PTE
after it is write-unprotected and before it is written - causing a crash,
or write-unprotecting it after it is protected - which circumvents the
protection.

Therefore, I would think that instead you should have:

	VM_BUG_ON(pte_write(*pte));  // (or WARN_ON_ONCE())

In addition, if there are assumptions on the preemptability of the code,
it would be nice to have some assertions. I think that the code assumes
that all calls to pgtable_write_set() are done while holding the
page-table lock. If that is the case, perhaps adding some lockdep
assertion would also help to confirm the correctness.
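
To make the difference concrete, here is a small userspace model (not the
patch itself; struct model_pte and model_pgtable_write_set() are
hypothetical names). Instead of silently returning when the PTE is already
in the requested state, it reports that case to the caller, which is where
the kernel version would put the VM_BUG_ON() or WARN_ON_ONCE():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one direct-map PTE: only the RW bit matters here. */
struct model_pte {
	bool rw;
};

/*
 * Flip the RW bit to @set and return true. Return false, without touching
 * the PTE, if it is already in the requested state: that means the calls
 * are nested or racing, and the caller should complain rather than hide it.
 */
static bool model_pgtable_write_set(struct model_pte *pte, bool set)
{
	if (pte->rw == set)
		return false;	/* unexpected state: report it */
	pte->rw = set;
	return true;
}
```

In this model a second "make writable" call on an already writable PTE
returns false, whereas the early return in the patch would accept it
silently.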

[ I put aside the lack of TLB flushes, which makes the whole matter of
delivered protection questionable. I presume that once PKS is used, 
this is not an issue. ]





* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-23 20:29   ` Edgecombe, Rick P
@ 2021-08-24 13:02     ` Mike Rapoport
  2021-08-24 16:38       ` Edgecombe, Rick P
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Rapoport @ 2021-08-24 13:02 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-mm, linux-kernel, peterz, keescook, Weiny, Ira,
	dave.hansen, vbabka, x86, akpm, rppt, Lutomirski, Andy

On Mon, Aug 23, 2021 at 08:29:49PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > When __GFP_PTE_MAPPED flag is passed to an allocation request of
> > order 0,
> > the allocated page will be mapped at PTE level in the direct map.
> > 
> > To reduce the direct map fragmentation, maintain a cache of 4K pages
> > that
> > are already mapped at PTE level in the direct map. Whenever the cache
> > should be replenished, try to allocate 2M page and split it to 4K
> > pages
> > to localize shutter of the direct map. If the allocation of 2M page
> > fails,
> > fallback to a single page allocation at expense of the direct map
> > fragmentation.
> > 
> > The cache registers a shrinker that releases free pages from the
> > cache to
> > the page allocator.
> > 
> > The __GFP_PTE_MAPPED and caching of 4K pages are enabled only if an
> > architecture selects ARCH_WANTS_PTE_MAPPED_CACHE in its Kconfig.
> > 
> > [
> > cache management are mostly copied from 
> > 
> https://lore.kernel.org/lkml/20210505003032.489164-4-rick.p.edgecombe@intel.com/
> > ]
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/Kconfig                    |   8 +
> >  arch/x86/Kconfig                |   1 +
> >  include/linux/gfp.h             |  11 +-
> >  include/linux/mm.h              |   2 +
> >  include/linux/pageblock-flags.h |  26 ++++
> >  init/main.c                     |   1 +
> >  mm/internal.h                   |   3 +-
> >  mm/page_alloc.c                 | 261 +++++++++++++++++++++++++++++++-
> >  8 files changed, 309 insertions(+), 4 deletions(-)
 
...

> > +static void pte_mapped_cache_add_neighbour_pages(struct page *page)
> > +{
> > +#if 0
> > +	/*
> > +	 * TODO: if pte_mapped_cache_replenish() had to fallback to
> > order-0
> > +	 * allocation, the large page in the direct map will be split
> > +	 * anyway and if there are free pages in the same pageblock
> > they
> > +	 * can be added to pte_mapped cache.
> > +	 */
> > +	unsigned int order = (1 << HUGETLB_PAGE_ORDER);
> > +	unsigned int nr_pages = (1 << order);
> > +	unsigned long pfn = page_to_pfn(page);
> > +	struct page *page_head = page - (pfn & (order - 1));
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		page = page_head + i;
> > +		if (is_free_buddy_page(page)) {
> > +			take_page_off_buddy(page);
> > +			pte_mapped_cache_add(&pte_mapped_cache, page);
> > +		}
> > +	}
> > +#endif
> > +}
> > 
> This seems a nice benefit of doing this sort of stuff in the page
> allocator if it can work.

I haven't tried enabling it yet, but I don't see a fundamental reason why this
won't work.
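
One detail to sort out before enabling it: as posted, the disabled block
assigns (1 << HUGETLB_PAGE_ORDER) to 'order', which is already a page count,
then shifts by it again to get nr_pages, and it never declares 'i'. A
userspace sketch of the intended arithmetic, assuming HUGETLB_PAGE_ORDER is
9 (x86-64 with 4K base pages); the MODEL_/model_ names are made up for
illustration:

```c
#include <assert.h>

/* Assumption: x86-64 with 4K base pages, where a 2M PMD covers 2^9 pages. */
#define MODEL_PMD_ORDER 9U

/* Number of 4K pages covered by one PMD-sized mapping. */
static unsigned long model_pmd_nr_pages(void)
{
	return 1UL << MODEL_PMD_ORDER;
}

/* First pfn of the PMD-sized block that contains @pfn. */
static unsigned long model_block_head_pfn(unsigned long pfn)
{
	return pfn & ~(model_pmd_nr_pages() - 1);
}
```

The loop would then walk model_pmd_nr_pages() pages starting at the head
pfn and move the free buddy pages it finds into the cache.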
 
> > +static struct page *alloc_page_pte_mapped(gfp_t gfp)
> >
> I'm a little disappointed building into the page allocator didn't
> automatically make higher order allocations easy. It seems this mostly
> bolts the grouped pages code on to the page allocator and splits out of
> the allocation/free paths to call into it?
>
> I was thinking the main benefit of handling direct map permissions in
> the page allocator would be re-using the buddy part to support high
> order pages, etc. Did you try to build it in like that? If we can't get
> that, what is the benefit to doing permission stuff in the pageallocator? 

The way I added grouped pages to the page allocator is somewhat of an
intermediate solution between keeping such a cache entirely separate from
the page allocator and making it really tightly integrated, e.g. using a new
migratetype or doing more intrusive changes to the page allocator. One of the
reasons I did it this way is to present various trade-offs because, tbh,
I'm not yet sure what's the best way to move forward. [The other reason
being my laziness, dropping your grouped pages code into the page allocator
was the simplest thing to do ;-)].

The immediate benefit of having this code close to the page allocator is
the simplification of the free path. Otherwise we'd need a cache-specific
free method or some information in struct page about how to free a grouped
page. Besides, it is possible to put pages mapped as 4k into such cache at
boot time when page allocator is initialized.

Also, keeping a central cache for multiple users will improve memory
utilization and I believe it would require fewer splits of the direct map.
OTOH, keeping such caches per-user allows managing access policy per cache
which could be better from the security POV.

I'm also going to explore the possibilities of using a new migratetype or
SL*B as Dave suggested.
 
> > +{
> > +	struct pte_mapped_cache *cache = &pte_mapped_cache;
> > +	struct page *page;
> > +
> > +	page = pte_mapped_cache_get(cache);
> > +	if (page) {
> > +		prep_new_page(page, 0, gfp, 0);
> > +		goto out;
> > +	}
> > +
> > +	page = pte_mapped_cache_replenish(cache, gfp);
> > +
> > +out:
> > +	return page;
> > +}
> > +
> We probably want to exclude GFP_ATOMIC before calling into CPA unless
> debug page alloc is on, because it may need to split and sleep for the
> allocation. There is a page table allocation with GFP_ATOMIC passed actually.

Looking at the callers of alloc_low_pages() it seems that GFP_ATOMIC there
is stale...
 
> In my next series of this I added support for GFP_ATOMIC to this code,
> but that solution should only work for permission changing grouped page
> allocators in the protected page tables case where the direct map
> tables are handled differently. As a general solution though (that's
> the long term intention right?), GFP_ATOMIC might deserve some
> consideration.

... but for the general solution GFP_ATOMIC indeed deserves some
consideration.
 
> The other thing is we probably don't want to clean out the atomic
> reserves and add them to a cache just for one page. I opted to just
> convert one page in the GFP_ATOMIC case.
 
Do you mean to allocate one page in GFP_ATOMIC case and bypass high order
allocation?
But the CPA split is still necessary here, isn't it?

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
  2021-08-23 20:02 ` [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Edgecombe, Rick P
@ 2021-08-24 13:03   ` Mike Rapoport
  0 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-24 13:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-mm, linux-kernel, peterz, keescook, Weiny, Ira,
	dave.hansen, vbabka, x86, akpm, rppt, Lutomirski, Andy

On Mon, Aug 23, 2021 at 08:02:55PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> > 
> > There are usecases that need to remove pages from the direct map or
> > at
> > least map them with 4K granularity. Whenever this is done e.g. with
> > set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
> > direct map are split into smaller pages.
> > 
> > To reduce the performance hit caused by the fragmentation of the
> > direct map
> > it make sense to group and/or cache the 4K pages removed from the
> > direct
> > map so that the split large pages won't be all over the place. 
> > 
> If you tied this into debug page alloc, you shouldn't need to group the
> pages. Are you thinking this PKS-less page table usage would be a
> security feature or debug time thing?

I consider the PKS-less page table protection as an example user of the
grouped pages/pte-mapped cache rather than an actual security feature or
even a debug thing.

With PKS we still have the same trade-off of allocation flexibility vs
direct map fragmentation and I hoped to focus the discussion on the mm part
of the series rather than on page table protection. Apparently it didn't
work :)
 
> > == TODOs ==
> > 
> > Whenever pte-mapped cache is being shrunk, it is possible to add some
> > kind
> > of compaction to move all the free pages into PMD-sized chunks, free
> > these
> > chunks at once and restore large page in the direct map.
>
> I had made a POC to do this a while back that hooked into the buddy
> code in the page allocator where this coalescing is already happening
> for freed pages. The problem was that most pages that get their direct
> map alias broken, end up using a page from the same 2MB page for the
> page table in the split. But then the direct map page table never gets
> freed so it never can restore the large page when checking the the
> allocation page getting freed. Grouping permissioned pages OR page
> tables would resolve that and it was my plan to try again after
> something like this happened. 

This suggests that one global cache won't be good enough, at least for the
case when page tables are taken from that cache.

> Was just an experiment, but can share if you are interested.
 
Yes, please.

> > == Alternatives ==
> > 
> > Current implementation uses a single global cache.
> > 
> > Another option is to have per-user caches, e.g one for the page
> > tables,
> > another for vmalloc etc.  This approach provides better control of
> > the
> > permissions of the pages allocated from these caches and allows the
> > user to
> > decide when (if at all) the pages can be accessed, e.g. for cache
> > compaction. The down side of this approach is that it complicates the
> > freeing path. A page allocated from a dedicated cache cannot be freed
> > with
> > put_page()/free_page() etc but it has to be freed with a dedicated
> > API or
> > there should be some back pointer in struct page so that page
> > allocator
> > will know what cache this page came from.
>
> This needs to reset the permissions before freeing, so doesn't seem too
> different than freeing them a special way.

Not quite. For instance, when freeing page table pages with mmu_gather, we
can reset the permission at or near pgtable_pxy_page_dtor() and continue to
the batched free.
 
> > Yet another possibility to make pte-mapped cache a migratetype of its
> > own.
> > Creating a new migratetype would allow higher order allocations of
> > pte-mapped pages, but I don't have enough understanding of page
> > allocator
> > and reclaim internals to estimate the complexity associated with this
> > approach. 
> > 
> I've been thinking about two categories of direct map permission
> usages.
> 
> One is limiting the use of the direct map alias when it's not in use
> and the primary alias is getting some other permission. Examples are
> modules, secretmem, xpfo, KVM guest memory unmapping stuff, etc. In
> this case re-allocations can share unmapped pages without doing any
> expensive maintenance and it helps to have one big cache. If you are
> going to convert pages to 4k and cache them, you might as well convert
> them to NP at the time, since it's cheap to restore them or set their
> permission from that state.
>
> Two is setting permissions on the direct map as the only alias to be
> used. This includes this rfc, some PKS usages, but also possibly some
> set_pages_uc() callers and the like. It seems that this category could
> still make use of a big unmapped cache of pages. Just ask for unmapped
> pages and convert them without a flush.
> 
> So like something would have a big cache of grouped unmapped pages 
> that category one usages could share. And then little category two
> allocators could have their own caches that feed on it too. What do you
> think? This is regardless if they live in the page allocator or not.

I can say I see how category two cache would use a global unmapped cache.
I would envision that these caches can share the implementation, but there
will be different instances - one for the global cache and another one (or
several) for users that cannot share the global cache for some reason.

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 23:50   ` Dave Hansen
  2021-08-24  3:34     ` Andy Lutomirski
@ 2021-08-24 13:32     ` Mike Rapoport
  2021-08-25  8:38     ` David Hildenbrand
  2021-08-26  8:02     ` Mike Rapoport
  3 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-24 13:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

On Mon, Aug 23, 2021 at 04:50:10PM -0700, Dave Hansen wrote:
> On 8/23/21 6:25 AM, Mike Rapoport wrote:
> 
> I'm also cringing a bit at hacking this into the page allocator.   A
> *lot* of what you're trying to do with getting large allocations out and
> splitting them up is done very well today by the slab allocators.  It
> might take some rearrangement of 'struct page' metadata to be more slab
> friendly, but it does seem like a close enough fit to warrant investigating.

I did this at the page allocator level in the hope that (1) it would be
possible to use such a cache for allocations of different orders and (2)
having a global cache of unmapped pages will utilize memory more
efficiently and will reduce direct map fragmentation.
And slab allocators may be the users of the cache at page allocator level.

For the single use case of page tables, slabs may work, but in more general
case I don't see them as a good fit.

I'll take a closer look at using slab anyway, maybe it'll work out.

-- 
Sincerely yours,
Mike.


* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-24  5:34     ` Nadav Amit
@ 2021-08-24 13:36       ` Mike Rapoport
  0 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-24 13:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

On Mon, Aug 23, 2021 at 10:34:42PM -0700, Nadav Amit wrote:
> On Aug 23, 2021, at 10:32 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> > 
> > On Aug 23, 2021, at 6:25 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Allocate page table using __GFP_PTE_MAPPED so that they will have 4K PTEs
> > in the direct map. This allows to switch _PAGE_RW bit each time a page
> > table page needs to be made writable or read-only.
> > 
> > The writability of the page tables is toggled only in the lowest level page
> > table modifiction functions and immediately switched off.
> > 
> > The page tables created early in the boot (including the direct map page
> > table) are not write protected.
> > 
> > 
> 
> [ snip ]
> 
> > +static void pgtable_write_set(void *pg_table, bool set)
> > +{
> > +	int level = 0;
> > +	pte_t *pte;
> > +
> > +	/*
> > +	 * Skip the page tables allocated from pgt_buf break area and from
> > +	 * memblock
> > +	 */
> > +	if (!after_bootmem)
> > +		return;
> > +	if (!PageTable(virt_to_page(pg_table)))
> > +		return;
> > +
> > +	pte = lookup_address((unsigned long)pg_table, &level);
> > +	if (!pte || level != PG_LEVEL_4K)
> > +		return;
> > +
> > +	if (set) {
> > +		if (pte_write(*pte))
> > +			return;
> > +
> > +		WRITE_ONCE(*pte, pte_mkwrite(*pte));
> 
> I think that the pte_write() test (and the following one) might hide
> latent bugs. Either you know whether the PTE is write-protected or you
> need to protect against nested/concurrent calls to pgtable_write_set()
> by disabling preemption/IRQs.
> 
> Otherwise, you risk in having someone else write-protecting the PTE
> after it is write-unprotected and before it is written - causing a crash,
> or write-unprotecting it after it is protected - which circumvents the
> protection.
> 
> Therefore, I would think that instead you should have:
> 
> 	VM_BUG_ON(pte_write(*pte));  // (or WARN_ON_ONCE())
> 
> In addition, if there are assumptions on the preemptability of the code,
> it would be nice to have some assertions. I think that the code assumes
> that all calls to pgtable_write_set() are done while holding the
> page-table lock. If that is the case, perhaps adding some lockdep
> assertion would also help to confirm the correctness.
> 
> [ I put aside the lack of TLB flushes, which make the whole matter of
> delivered protection questionable. I presume that once PKS is used, 
> this is not an issue. ]

As I said in another reply, the actual page table protection is merely to
exercise the allocator. I'll consider actually using PKS in the next
versions (unless Rick beats me to it).

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
  2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
                   ` (4 preceding siblings ...)
  2021-08-23 20:02 ` [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Edgecombe, Rick P
@ 2021-08-24 16:09 ` Vlastimil Babka
  2021-08-29  7:06   ` Mike Rapoport
  5 siblings, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2021-08-24 16:09 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe, x86,
	linux-kernel, Brijesh Singh

On 8/23/21 15:25, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> This is early prototype for addition of cache of pte-mapped pages to the
> page allocator. It survives boot and some cache shrinking, but it's still
> a long way to go for it to be ready for non-RFC posting.
> 
> The example use-case for pte-mapped cache is protection of page tables and
> keeping them read-only except for the designated code that is allowed to
> modify the page tables.
> 
> I'd like to get an early feedback for the approach to see what would be
> the best way to move forward with something like this.
> 
> This set is x86 specific at the moment because other architectures either
> do not support set_memory APIs that split the direct^w linear map (e.g.
> PowerPC) or only enable set_memory APIs when the linear map uses basic page
> size (like arm64).
> 
> == Motivation ==
> 
> There are usecases that need to remove pages from the direct map or at

One such use case is the SEV-SNP support proposed here:
https://lore.kernel.org/linux-coco/20210820155918.7518-7-brijesh.singh@amd.com/

> least map them with 4K granularity. Whenever this is done e.g. with
> set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
> direct map are split into smaller pages.
> 
> To reduce the performance hit caused by the fragmentation of the direct map
> it make sense to group and/or cache the 4K pages removed from the direct
> map so that the split large pages won't be all over the place. 
> 
> There were RFCs for grouped page allocations for vmalloc permissions [1]
> and for using PKS to protect page tables [2] as well as an attempt to use a
> pool of large pages in secretmtm [3].
> 
> == Implementation overview ==
> 
> This set leverages ideas from the patches that added PKS protection to page
> tables, but instead of adding per-user grouped allocations it tries to move
> the cache of pte-mapped pages closer to the page allocator.
> 
> The idea is to use a gfp flag that will instruct the page allocator to use
> the cache of pte-mapped pages because the caller needs to remove them from
> the direct map or change their attributes. 

Like Dave, I don't like much the idea of a new GFP flag that all page
allocations now have to check, and freeing that has to check a new pageblock
flag, although I can see some of the benefits this brings...

> When the cache is empty there is an attempt to refill it using PMD-sized
> allocation so that once the direct map is split we'll be able to use all 4K
> pages made available by the split. 
> 
> If the high order allocation fails, we fall back to order-0 and mark the

Yeah, this fallback is where we benefit from the page allocator implementation,
because of the page freeing hook that will recognize page from such fallback
blocks and free them to the cache. But does that prevent so much fragmentation
to be worth it? I'd see first if we can do without it.

> entire pageblock as pte-mapped. When pages from that pageblock are freed to
> the page allocator they are put into the pte-mapped cache. There is also
> unimplemented provision to add free pages from such pageblock to the
> pte-mapped cache along with the page that was allocated and cause the split
> of the pageblock.
> 
> For now only order-0 allocations of pte-mapped pages are supported, which
> prevents, for instance, allocation of PGD with PTI enabled.
> 
> The free pages in the cache may be reclaimed using a shrinker, but for now
> they will remain mapped with PTEs in the direct map.
> 
> == TODOs ==
> 
> Whenever pte-mapped cache is being shrunk, it is possible to add some kind
> of compaction to move all the free pages into PMD-sized chunks, free these
> chunks at once and restore large page in the direct map.
> 
> There is also a possibility to add heuristics and knobs to control
> greediness of the cache vs memory pressure so that freed pte-mapped cache
> won't be necessarily put into the pte-mapped cache.
> 
> Another thing that can be implemented is pre-populating the pte-cache at
> boot time to include the free pages that are anyway mapped by PTEs.
> 
> == Alternatives ==
> 
> Current implementation uses a single global cache.
> 
> Another option is to have per-user caches, e.g one for the page tables,
> another for vmalloc etc.  This approach provides better control of the
> permissions of the pages allocated from these caches and allows the user to
> decide when (if at all) the pages can be accessed, e.g. for cache
> compaction. The down side of this approach is that it complicates the
> freeing path. A page allocated from a dedicated cache cannot be freed with
> put_page()/free_page() etc but it has to be freed with a dedicated API or
> there should be some back pointer in struct page so that page allocator
> will know what cache this page came from.
> 
> Yet another possibility to make pte-mapped cache a migratetype of its own.
> Creating a new migratetype would allow higher order allocations of
> pte-mapped pages, but I don't have enough understanding of page allocator
> and reclaim internals to estimate the complexity associated with this
> approach. 
> 
> [1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
> [2] https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
> [3] https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org/
> 
> Mike Rapoport (2):
>   mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
>   x86/mm: write protect (most) page tables
> 
> Rick Edgecombe (2):
>   list: Support getting most recent element in list_lru
>   list: Support list head not in object for list_lru
> 
>  arch/Kconfig                            |   8 +
>  arch/x86/Kconfig                        |   1 +
>  arch/x86/boot/compressed/ident_map_64.c |   3 +
>  arch/x86/include/asm/pgalloc.h          |   2 +
>  arch/x86/include/asm/pgtable.h          |  21 +-
>  arch/x86/include/asm/pgtable_64.h       |  33 ++-
>  arch/x86/mm/init.c                      |   2 +-
>  arch/x86/mm/pgtable.c                   |  72 ++++++-
>  include/asm-generic/pgalloc.h           |   2 +-
>  include/linux/gfp.h                     |  11 +-
>  include/linux/list_lru.h                |  26 +++
>  include/linux/mm.h                      |   2 +
>  include/linux/pageblock-flags.h         |  26 +++
>  init/main.c                             |   1 +
>  mm/internal.h                           |   3 +-
>  mm/list_lru.c                           |  38 +++-
>  mm/page_alloc.c                         | 261 +++++++++++++++++++++++-
>  17 files changed, 496 insertions(+), 16 deletions(-)
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-23 13:25 ` [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages Mike Rapoport
  2021-08-23 20:29   ` Edgecombe, Rick P
@ 2021-08-24 16:12   ` Vlastimil Babka
  2021-08-25  8:43   ` David Hildenbrand
  2 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2021-08-24 16:12 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe, x86,
	linux-kernel

On 8/23/21 15:25, Mike Rapoport wrote:
> @@ -3283,5 +3283,7 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +void pte_mapped_cache_init(void);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 973fd731a520..4faf8c542b00 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -21,6 +21,8 @@ enum pageblock_bits {
>  			/* 3 bits required for migrate types */
>  	PB_migrate_skip,/* If set the block is skipped by compaction */
>  
> +	PB_pte_mapped, /* If set the block is mapped with PTEs in direct map */
> +
>  	/*
>  	 * Assume the bits will always align on a word. If this assumption
>  	 * changes then get/set pageblock needs updating.

You have broken this assumption :)
> @@ -536,7 +558,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
>  	unsigned long bitidx, word_bitidx;
>  	unsigned long old_word, word;
>  
> -	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
> +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 5);

This is not sufficient to satisfy the "needs updating" part. We would now need
NR_PAGEBLOCK_BITS == 8.
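
To illustrate the alignment problem, here is a user-space sketch, not the
kernel code. It assumes 64-bit bitmap words, and WORD_BITS,
block_fits_in_word() and first_straddling_block() are names invented for this
example. It checks whether the flag bits of a pageblock can end up split
across two words of the bitmap:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for this sketch: the pageblock bitmap uses 64-bit words. */
#define WORD_BITS 64UL

/* Return true if all flag bits of pageblock 'block' live in one word. */
bool block_fits_in_word(unsigned long block, unsigned long bits_per_block)
{
	unsigned long first, last;

	assert(bits_per_block > 0 && bits_per_block <= WORD_BITS);

	first = block * bits_per_block;		/* index of the first flag bit */
	last = first + bits_per_block - 1;	/* index of the last flag bit */

	return first / WORD_BITS == last / WORD_BITS;
}

/*
 * Return the index of the first of 'nr_blocks' pageblocks whose bits
 * straddle a word boundary, or -1 if every block fits in a single word.
 */
long first_straddling_block(unsigned long nr_blocks,
			    unsigned long bits_per_block)
{
	unsigned long i;

	for (i = 0; i < nr_blocks; i++)
		if (!block_fits_in_word(i, bits_per_block))
			return (long)i;

	return -1;
}
```

With 4 or 8 bits per block, first_straddling_block(64, ...) finds nothing.
With 5 bits per block, block 12 would occupy bits 60..64 and no longer fits
in a single word.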

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-24 13:02     ` Mike Rapoport
@ 2021-08-24 16:38       ` Edgecombe, Rick P
  2021-08-24 16:54         ` Mike Rapoport
  0 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2021-08-24 16:38 UTC (permalink / raw)
  To: rppt
  Cc: linux-kernel, peterz, keescook, Weiny, Ira, linux-mm,
	dave.hansen, vbabka, x86, akpm, rppt, Lutomirski, Andy

On Tue, 2021-08-24 at 16:02 +0300, Mike Rapoport wrote:
> > We probably want to exclude GFP_ATOMIC before calling into CPA
> > unless
> > debug page alloc is on, because it may need to split and sleep for
> > the
> > allocation. There is a page table allocation with GFP_ATOMIC passed
> > actually.
> 
> Looking at the callers of alloc_low_pages() it seems that GFP_ATOMIC
> there
> is stale...
Well two actually, there is also spp_getpage(). I tried to determine if
that was also stale but wasn't confident. There were a lot of paths in.
>  
> > In my next series of this I added support for GFP_ATOMIC to this
> > code,
> > but that solution should only work for permission changing grouped
> > page
> > allocators in the protected page tables case where the direct map
> > tables are handled differently. As a general solution though
> > (that's
> > the long term intention right?), GFP_ATOMIC might deserve some
> > consideration.
> 
> ... but for the general solution GFP_ATOMIC indeed deserves some
> consideration.
>  
> > The other thing is we probably don't want to clean out the atomic
> > reserves and add them to a cache just for one page. I opted to just
> > convert one page in the GFP_ATOMIC case.
> 
>  
> Do you mean to allocate one page in GFP_ATOMIC case and bypass high
> order
> allocation?
> But the CPA split is still necessary here, isn't it?
Yes, grabs one atomic page and fragments it in the case of no pages in
the grouped page cache. The CPA split is necessary still, but it should
be ok because of the special way direct map page table allocations are
handled for pks tables. Has not been reviewed by anyone yet, and
wouldn't work as a general solution anyway.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-24 16:38       ` Edgecombe, Rick P
@ 2021-08-24 16:54         ` Mike Rapoport
  2021-08-24 17:23           ` Edgecombe, Rick P
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Rapoport @ 2021-08-24 16:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rppt, linux-kernel, peterz, keescook, Weiny, Ira, linux-mm,
	dave.hansen, vbabka, x86, akpm, Lutomirski, Andy

On Tue, Aug 24, 2021 at 04:38:03PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2021-08-24 at 16:02 +0300, Mike Rapoport wrote:
> > > We probably want to exclude GFP_ATOMIC before calling into CPA
> > > unless
> > > debug page alloc is on, because it may need to split and sleep for
> > > the
> > > allocation. There is a page table allocation with GFP_ATOMIC passed
> > > actually.
> > 
> > Looking at the callers of alloc_low_pages() it seems that GFP_ATOMIC
> > there
> > is stale...
>
> Well two actually, there is also spp_getpage(). I tried to determine if
> that was also stale but wasn't confident. There were a lot of paths in.
  
It's also used at init and during memory hotplug, so I really doubt it
needs GFP_ATOMIC.

> > > In my next series of this I added support for GFP_ATOMIC to this
> > > code,
> > > but that solution should only work for permission changing grouped
> > > page
> > > allocators in the protected page tables case where the direct map
> > > tables are handled differently. As a general solution though
> > > (that's
> > > the long term intention right?), GFP_ATOMIC might deserve some
> > > consideration.
> > 
> > ... but for the general solution GFP_ATOMIC indeed deserves some
> > consideration.
> >  
> > > The other thing is we probably don't want to clean out the atomic
> > > reserves and add them to a cache just for one page. I opted to just
> > > convert one page in the GFP_ATOMIC case.
> >  
> > Do you mean to allocate one page in GFP_ATOMIC case and bypass high
> > order
> > allocation?
> > But the CPA split is still necessary here, isn't it?
>
> Yes, grabs one atomic page and fragments it in the case of no pages in
> the grouped page cache. The CPA split is necessary still, but it should
> be ok because of the special way direct map page table allocations are
> handled for pks tables. Has not been reviewed by anyone yet, and
> wouldn't work as a general solution anyway.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-24 16:54         ` Mike Rapoport
@ 2021-08-24 17:23           ` Edgecombe, Rick P
  2021-08-24 17:37             ` Mike Rapoport
  0 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2021-08-24 17:23 UTC (permalink / raw)
  To: rppt
  Cc: linux-kernel, peterz, keescook, rppt, Weiny, Ira, linux-mm,
	dave.hansen, vbabka, x86, akpm, Lutomirski, Andy

On Tue, 2021-08-24 at 19:54 +0300, Mike Rapoport wrote:
> On Tue, Aug 24, 2021 at 04:38:03PM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2021-08-24 at 16:02 +0300, Mike Rapoport wrote:
> > > > We probably want to exclude GFP_ATOMIC before calling into CPA
> > > > unless
> > > > debug page alloc is on, because it may need to split and sleep
> > > > for
> > > > the
> > > > allocation. There is a page table allocation with GFP_ATOMIC
> > > > passed
> > > > actually.
> > > 
> > > Looking at the callers of alloc_low_pages() it seems that
> > > GFP_ATOMIC
> > > there
> > > is stale...
> > 
> > Well two actually, there is also spp_getpage(). I tried to
> > determine if
> > that was also stale but wasn't confident. There were a lot of paths
> > in.
> 
>   
> It's also used at init and during memory hotplug, so I really doubt
> it
> needs GFP_ATOMIC.

Pretty sure it gets called after init by at least something besides
hotplug. I saw it during debugging with a little sanitizer I built to
find any unprotected page tables missed. Something tweaking the fixmap
IIRC. Did you look at the set_fixmap_() and set_pte_vaddr() family of
functions? Now whether any of them actually need GFP_ATOMIC, I am less
sure. There were a fair amount of drivers to analyze.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-24 17:23           ` Edgecombe, Rick P
@ 2021-08-24 17:37             ` Mike Rapoport
  0 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-24 17:37 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel, peterz, keescook, rppt, Weiny, Ira, linux-mm,
	dave.hansen, vbabka, x86, akpm, Lutomirski, Andy

On Tue, Aug 24, 2021 at 05:23:04PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2021-08-24 at 19:54 +0300, Mike Rapoport wrote:
> > On Tue, Aug 24, 2021 at 04:38:03PM +0000, Edgecombe, Rick P wrote:
> > > On Tue, 2021-08-24 at 16:02 +0300, Mike Rapoport wrote:
> > > > > We probably want to exclude GFP_ATOMIC before calling into CPA
> > > > > unless
> > > > > debug page alloc is on, because it may need to split and sleep
> > > > > for
> > > > > the
> > > > > allocation. There is a page table allocation with GFP_ATOMIC
> > > > > passed
> > > > > actually.
> > > > 
> > > > Looking at the callers of alloc_low_pages() it seems that
> > > > GFP_ATOMIC
> > > > there
> > > > is stale...
> > > 
> > > Well two actually, there is also spp_getpage(). I tried to
> > > determine if
> > > that was also stale but wasn't confident. There were a lot of paths
> > > in.
> > 
> >   
> > It's also used at init and during memory hotplug, so I really doubt
> > it
> > needs GFP_ATOMIC.
> 
> Pretty sure it gets called after init by at least something besides
> hotplug. I saw it during debugging with a little sanitizer I built to
> find any unprotected page tables missed. Something tweaking the fixmap
> IIRC. Did you look at the set_fixmap_() and set_pte_vaddr() family of
> functions? Now whether any of them actually need GFP_ATOMIC, I am less
> sure. There were a fair amount of drivers to analyze.

Oh, I've missed set_pte_vaddr(). I still doubt anything that uses those two
would need GFP_ATOMIC, but it's surely way harder to analyze.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 23:50   ` Dave Hansen
  2021-08-24  3:34     ` Andy Lutomirski
  2021-08-24 13:32     ` Mike Rapoport
@ 2021-08-25  8:38     ` David Hildenbrand
  2021-08-26  8:02     ` Mike Rapoport
  3 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2021-08-25  8:38 UTC (permalink / raw)
  To: Dave Hansen, Mike Rapoport, linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

On 24.08.21 01:50, Dave Hansen wrote:
> On 8/23/21 6:25 AM, Mike Rapoport wrote:
>>   void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
>>   {
>> +	enable_pgtable_write(page_address(pte));
>>   	pgtable_pte_page_dtor(pte);
>>   	paravirt_release_pte(page_to_pfn(pte));
>>   	paravirt_tlb_remove_table(tlb, pte);
>> @@ -69,6 +73,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>>   #ifdef CONFIG_X86_PAE
>>   	tlb->need_flush_all = 1;
>>   #endif
>> +	enable_pgtable_write(pmd);
>>   	pgtable_pmd_page_dtor(page);
>>   	paravirt_tlb_remove_table(tlb, page);
>>   }
> 
> I would expected this to have leveraged the pte_offset_map/unmap() code
> to enable/disable write access.  Granted, it would enable write access
> even when only a read is needed, but that could be trivially fixed with
> having a variant like:

For write access you actually want pte_offset_map_locked(), but it's 
also used for stable read access sometimes (exclude any writers).

> 
> 	pte_offset_map_write()
> 	pte_offset_unmap_write()
> 
> in addition to the existing (presumably read-only) versions:
> 
> 	pte_offset_map()
> 	pte_offset_unmap()

These should mostly only be read access, you'd need other ways of making 
sure nobody else messes with that entry. I think it even holds for 
khugepaged collapsing code.


I find these hidden PMD entry modifications (e.g., without holding the 
PMD lock) deep down in arch code quite concerning. Read: horribly ugly 
and a nightmare to debug.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
  2021-08-23 13:25 ` [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages Mike Rapoport
  2021-08-23 20:29   ` Edgecombe, Rick P
  2021-08-24 16:12   ` Vlastimil Babka
@ 2021-08-25  8:43   ` David Hildenbrand
  2 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2021-08-25  8:43 UTC (permalink / raw)
  To: Mike Rapoport, linux-mm
  Cc: Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

On 23.08.21 15:25, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> When __GFP_PTE_MAPPED flag is passed to an allocation request of order 0,
> the allocated page will be mapped at PTE level in the direct map.
> 
> To reduce the direct map fragmentation, maintain a cache of 4K pages that
> are already mapped at PTE level in the direct map. Whenever the cache
> should be replenished, try to allocate 2M page and split it to 4K pages
> to localize shutter of the direct map. If the allocation of 2M page fails,
> fallback to a single page allocation at expense of the direct map
> fragmentation.
> 
> The cache registers a shrinker that releases free pages from the cache to
> the page allocator.
> 
> The __GFP_PTE_MAPPED and caching of 4K pages are enabled only if an
> architecture selects ARCH_WANTS_PTE_MAPPED_CACHE in its Kconfig.
> 
> [
> cache management are mostly copied from
> https://lore.kernel.org/lkml/20210505003032.489164-4-rick.p.edgecombe@intel.com/
> ]
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>   arch/Kconfig                    |   8 +
>   arch/x86/Kconfig                |   1 +
>   include/linux/gfp.h             |  11 +-
>   include/linux/mm.h              |   2 +
>   include/linux/pageblock-flags.h |  26 ++++
>   init/main.c                     |   1 +
>   mm/internal.h                   |   3 +-
>   mm/page_alloc.c                 | 261 +++++++++++++++++++++++++++++++-
>   8 files changed, 309 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 129df498a8e1..2db95331201b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -243,6 +243,14 @@ config ARCH_HAS_SET_MEMORY
>   config ARCH_HAS_SET_DIRECT_MAP
>   	bool

[...]

> +static int __pte_mapped_cache_init(struct pte_mapped_cache *cache)
> +{
> +	int err;
> +
> +	err = list_lru_init(&cache->lru);
> +	if (err)
> +		return err;
> +
> +	cache->shrinker.count_objects = pte_mapped_cache_shrink_count;
> +	cache->shrinker.scan_objects = pte_mapped_cache_shrink_scan;
> +	cache->shrinker.seeks = DEFAULT_SEEKS;
> +	cache->shrinker.flags = SHRINKER_NUMA_AWARE;
> +
> +	err = register_shrinker(&cache->shrinker);
> +	if (err)
> +		goto err_list_lru_destroy;

With a shrinker in place, it really does somewhat feel like this should 
be a cache outside of the buddy. Or at least moved outside of 
page_alloc.c with a clean interface to work with the buddy.

But I only had a quick glance at this patch.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-24  3:34     ` Andy Lutomirski
@ 2021-08-25 14:59       ` Dave Hansen
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2021-08-25 14:59 UTC (permalink / raw)
  To: Andy Lutomirski, Mike Rapoport, linux-mm
  Cc: Andrew Morton, Dave Hansen, Ira Weiny, Kees Cook, Mike Rapoport,
	Peter Zijlstra (Intel),
	Rick P Edgecombe, Vlastimil Babka, the arch/x86 maintainers,
	Linux Kernel Mailing List

On 8/23/21 8:34 PM, Andy Lutomirski wrote:
>> I would expected this to have leveraged the pte_offset_map/unmap() code
>> to enable/disable write access.  Granted, it would enable write access
>> even when only a read is needed, but that could be trivially fixed with
>> having a variant like:
>>
>> 	pte_offset_map_write()
>> 	pte_offset_unmap_write()
> I would also like to see a discussion of how races in which multiple
> threads or CPUs access ptes in the same page at the same time.

Yeah, the 32-bit highmem code has a per-cpu mapping area for
kmap_atomic() that lies underneath the pte_offset_map().  Although it's
in the shared kernel mapping, only one CPU uses it.

I didn't look at it until you mentioned it, but the code in this set is
just plain buggy if two CPUs each do an
enable_pgtable_write()/disable_pgtable_write() pair.  They'll clobber each other:

> +static void pgtable_write_set(void *pg_table, bool set)
> +{
> +	int level = 0;
> +	pte_t *pte;
> +
> +	/*
> +	 * Skip the page tables allocated from pgt_buf break area and from
> +	 * memblock
> +	 */
> +	if (!after_bootmem)
> +		return;
> +	if (!PageTable(virt_to_page(pg_table)))
> +		return;
> +
> +	pte = lookup_address((unsigned long)pg_table, &level);
> +	if (!pte || level != PG_LEVEL_4K)
> +		return;
> +
> +	if (set) {
> +		if (pte_write(*pte))
> +			return;
> +
> +		WRITE_ONCE(*pte, pte_mkwrite(*pte));
> +	} else {
> +		if (!pte_write(*pte))
> +			return;
> +
> +		WRITE_ONCE(*pte, pte_wrprotect(*pte));
> +	}
> +}
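
To make the clobbering concrete, here is a user-space model of one bad
interleaving. This is only a sketch: struct fake_pte, model_write_set() and
cpu1_write_would_succeed() are made up for illustration, and a plain bool
stands in for the write bit of the direct map PTE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the direct map PTE of one page table page. */
struct fake_pte {
	bool writable;
};

/* Same check-then-set logic as pgtable_write_set(), minus the kernel parts. */
void model_write_set(struct fake_pte *pte, bool set)
{
	if (set) {
		if (pte->writable)
			return;
		pte->writable = true;
	} else {
		if (!pte->writable)
			return;
		pte->writable = false;
	}
}

/*
 * Replay one interleaving of two CPUs that both update entries in the same
 * page table page. Returns whether the page is still writable at the point
 * where CPU1 is about to store to it.
 */
bool cpu1_write_would_succeed(void)
{
	struct fake_pte pte = { .writable = false };

	model_write_set(&pte, true);	/* CPU0: enable_pgtable_write() */
	model_write_set(&pte, true);	/* CPU1: enable, a no-op here */
	assert(pte.writable);		/* both CPUs now expect a writable page */
	model_write_set(&pte, false);	/* CPU0: disable_pgtable_write() */

	return pte.writable;		/* CPU1: about to update its entry */
}
```

In this model cpu1_write_would_succeed() returns false: CPU0's disable takes
the write permission away while CPU1 still believes it holds it.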


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-23 23:50   ` Dave Hansen
                       ` (2 preceding siblings ...)
  2021-08-25  8:38     ` David Hildenbrand
@ 2021-08-26  8:02     ` Mike Rapoport
  2021-08-26  9:01       ` Vlastimil Babka
  3 siblings, 1 reply; 27+ messages in thread
From: Mike Rapoport @ 2021-08-26  8:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe,
	Vlastimil Babka, x86, linux-kernel

On Mon, Aug 23, 2021 at 04:50:10PM -0700, Dave Hansen wrote:
> On 8/23/21 6:25 AM, Mike Rapoport wrote:
> >  void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
> >  {
> > +	enable_pgtable_write(page_address(pte));
> >  	pgtable_pte_page_dtor(pte);
> >  	paravirt_release_pte(page_to_pfn(pte));
> >  	paravirt_tlb_remove_table(tlb, pte);
> > @@ -69,6 +73,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
> >  #ifdef CONFIG_X86_PAE
> >  	tlb->need_flush_all = 1;
> >  #endif
> > +	enable_pgtable_write(pmd);
> >  	pgtable_pmd_page_dtor(page);
> >  	paravirt_tlb_remove_table(tlb, page);
> >  }
> 
> I'm also cringing a bit at hacking this into the page allocator.   A
> *lot* of what you're trying to do with getting large allocations out and
> splitting them up is done very well today by the slab allocators.  It
> might take some rearrangement of 'struct page' metadata to be more slab
> friendly, but it does seem like a close enough fit to warrant investigating.

I thought more about using slab, but it seems to me the least suitable
option. The usecases at hand (page tables, secretmem, SEV/TDX) allocate in
page granularity and some of them use struct page metadata, so even its
rearrangement won't help. And adding support for 2M slabs to SLUB would be
quite intrusive.

I think that better options are moving such cache deeper into buddy or
using e.g. genalloc instead of a list to deal with higher order allocations. 

The choice between these two will mostly depend of the API selection, i.e.
a GFP flag or a dedicated alloc/free.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 4/4] x86/mm: write protect (most) page tables
  2021-08-26  8:02     ` Mike Rapoport
@ 2021-08-26  9:01       ` Vlastimil Babka
  0 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2021-08-26  9:01 UTC (permalink / raw)
  To: Mike Rapoport, Dave Hansen
  Cc: linux-mm, Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe, x86,
	linux-kernel

On 8/26/21 10:02, Mike Rapoport wrote:
> On Mon, Aug 23, 2021 at 04:50:10PM -0700, Dave Hansen wrote:
>> On 8/23/21 6:25 AM, Mike Rapoport wrote:
>> >  void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
>> >  {
>> > +	enable_pgtable_write(page_address(pte));
>> >  	pgtable_pte_page_dtor(pte);
>> >  	paravirt_release_pte(page_to_pfn(pte));
>> >  	paravirt_tlb_remove_table(tlb, pte);
>> > @@ -69,6 +73,7 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
>> >  #ifdef CONFIG_X86_PAE
>> >  	tlb->need_flush_all = 1;
>> >  #endif
>> > +	enable_pgtable_write(pmd);
>> >  	pgtable_pmd_page_dtor(page);
>> >  	paravirt_tlb_remove_table(tlb, page);
>> >  }
>> 
>> I'm also cringing a bit at hacking this into the page allocator.   A
>> *lot* of what you're trying to do with getting large allocations out and
>> splitting them up is done very well today by the slab allocators.  It
>> might take some rearrangement of 'struct page' metadata to be more slab
>> friendly, but it does seem like a close enough fit to warrant investigating.
> 
> I thought more about using slab, but it seems to me the least suitable
> option. The usecases at hand (page tables, secretmem, SEV/TDX) allocate in
> page granularity and some of them use struct page metadata, so even its
> rearrangement won't help. And adding support for 2M slabs to SLUB would be
> quite intrusive.

Agreed, and there would be unnecessary memory overhead too: SLUB would happily
cache a 2MB block on each CPU, etc.

> I think that better options are moving such cache deeper into buddy or
> using e.g. genalloc instead of a list to deal with higher order allocations. 
> 
> The choice between these two will mostly depend of the API selection, i.e.
> a GFP flag or a dedicated alloc/free.

Implementing on top of buddy still seems like the better option to me.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
  2021-08-24 16:09 ` Vlastimil Babka
@ 2021-08-29  7:06   ` Mike Rapoport
  0 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2021-08-29  7:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Andy Lutomirski, Dave Hansen, Ira Weiny,
	Kees Cook, Mike Rapoport, Peter Zijlstra, Rick Edgecombe, x86,
	linux-kernel, Brijesh Singh

On Tue, Aug 24, 2021 at 06:09:44PM +0200, Vlastimil Babka wrote:
> On 8/23/21 15:25, Mike Rapoport wrote:
> >
> > The idea is to use a gfp flag that will instruct the page allocator to use
> > the cache of pte-mapped pages because the caller needs to remove them from
> > the direct map or change their attributes. 
> 
> Like Dave, I don't like much the idea of a new GFP flag that all page
> allocations now have to check, and freeing that has to check a new pageblock
> flag, although I can see some of the benefits this brings...
> 
> > When the cache is empty there is an attempt to refill it using PMD-sized
> > allocation so that once the direct map is split we'll be able to use all 4K
> > pages made available by the split. 
> > 
> > If the high order allocation fails, we fall back to order-0 and mark the
> 
> Yeah, this fallback is where we benefit from the page allocator implementation,
> because of the page freeing hook that will recognize page from such fallback
> blocks and free them to the cache. But does that prevent so much fragmentation
> to be worth it? I'd see first if we can do without it.

I've run 'stress-ng --mmapfork 20 -t 30' in a VM with 4G of RAM and then
checked the splits reported in /proc/vmstat to get some idea of what the
benefit may be.

I've compared Rick's implementation of grouped alloc (rebased on v5.14-rc6)
with this set. For that simple test there were ~30% fewer splits.

                      | grouped alloc | pte-mapped
----------------------+---------------+------------
PMD splits after boot |       16      |     14
PMD splits after test |       49      |     34

(there were no PUD splits at all).
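
As a quick check of the arithmetic behind the ~30% figure, using only the
numbers from the table above (the second value, which subtracts the
after-boot baseline, is an additional way to look at the same data and not
a separate measurement):

```python
def reduction(baseline, candidate):
    """Relative reduction of candidate versus baseline, as a fraction."""
    return (baseline - candidate) / baseline


# Totals after the test: grouped alloc 49, pte-mapped 34.
after_test = reduction(49, 34)

# Splits that happened during the test only (after test minus after boot).
during_test = reduction(49 - 16, 34 - 14)

print(f"after test:  {after_test:.1%}")
print(f"during test: {during_test:.1%}")
```

The totals after the test differ by about 31%; counting only the splits that
occurred during the test, the difference is about 39%.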

I think the closer such a cache is to the buddy allocator, the better the
memory utilization will be. The downside is that it will be harder to reclaim
2M blocks than with separate caches, because at the page allocator level we
don't have enough information to make the pages allocated from the cache
movable.

-- 
Sincerely yours,
Mike.

Thread overview: 27+ messages
2021-08-23 13:25 [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Mike Rapoport
2021-08-23 13:25 ` [RFC PATCH 1/4] list: Support getting most recent element in list_lru Mike Rapoport
2021-08-23 13:25 ` [RFC PATCH 2/4] list: Support list head not in object for list_lru Mike Rapoport
2021-08-23 13:25 ` [RFC PATCH 3/4] mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages Mike Rapoport
2021-08-23 20:29   ` Edgecombe, Rick P
2021-08-24 13:02     ` Mike Rapoport
2021-08-24 16:38       ` Edgecombe, Rick P
2021-08-24 16:54         ` Mike Rapoport
2021-08-24 17:23           ` Edgecombe, Rick P
2021-08-24 17:37             ` Mike Rapoport
2021-08-24 16:12   ` Vlastimil Babka
2021-08-25  8:43   ` David Hildenbrand
2021-08-23 13:25 ` [RFC PATCH 4/4] x86/mm: write protect (most) page tables Mike Rapoport
2021-08-23 20:08   ` Edgecombe, Rick P
2021-08-23 23:50   ` Dave Hansen
2021-08-24  3:34     ` Andy Lutomirski
2021-08-25 14:59       ` Dave Hansen
2021-08-24 13:32     ` Mike Rapoport
2021-08-25  8:38     ` David Hildenbrand
2021-08-26  8:02     ` Mike Rapoport
2021-08-26  9:01       ` Vlastimil Babka
     [not found]   ` <FB6C09CD-9CEA-4FE8-B179-98DB63EBDD68@gmail.com>
2021-08-24  5:34     ` Nadav Amit
2021-08-24 13:36       ` Mike Rapoport
2021-08-23 20:02 ` [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations Edgecombe, Rick P
2021-08-24 13:03   ` Mike Rapoport
2021-08-24 16:09 ` Vlastimil Babka
2021-08-29  7:06   ` Mike Rapoport