LKML Archive on lore.kernel.org
* [PATCH 0/5] x86: Implement support for unaccepted memory
@ 2021-08-10  6:26 Kirill A. Shutemov
  2021-08-10  6:26 ` [PATCH 1/5] mm: Add " Kirill A. Shutemov
                   ` (6 more replies)
  0 siblings, 7 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10  6:26 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it forces the VMM to allocate memory for
the accepted guest physical address range. We don't want to accept all
memory upfront.

The patchset implements on-demand memory acceptance for TDX.

Please review. Any feedback is welcome.

Kirill A. Shutemov (5):
  mm: Add support for unaccepted memory
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Provide helpers for unaccepted memory
  x86/tdx: Unaccepted memory support

 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 86 ++++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 +++-
 arch/x86/boot/compressed/misc.c              |  9 ++
 arch/x86/boot/compressed/tdx.c               | 29 +++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 51 ++++++++++++
 arch/x86/include/asm/page.h                  |  5 ++
 arch/x86/include/asm/tdx.h                   |  2 +
 arch/x86/include/asm/unaccepted_memory.h     | 17 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 arch/x86/kernel/tdx.c                        |  8 ++
 arch/x86/mm/Makefile                         |  2 +
 arch/x86/mm/unaccepted_memory.c              | 84 +++++++++++++++++++
 drivers/firmware/efi/Kconfig                 | 12 +++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 75 ++++++++++++++---
 include/linux/efi.h                          |  3 +-
 mm/internal.h                                | 14 ++++
 mm/memblock.c                                |  1 +
 mm/page_alloc.c                              | 13 ++-
 22 files changed, 414 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
@ 2021-08-10  6:26 ` Kirill A. Shutemov
  2021-08-10  7:48   ` David Hildenbrand
  2021-08-10 18:13   ` Dave Hansen
  2021-08-10  6:26 ` [PATCH 2/5] efi/x86: Implement " Kirill A. Shutemov
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10  6:26 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it forces the VMM to allocate memory for
the accepted guest physical address range. It's better to postpone
memory acceptance until the memory is needed: it lowers boot time and
reduces memory overhead.

Support for such memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

The memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
PageOffline() is used to indicate that a page requires acceptance. The
flag is currently used by memory hotplug and ballooning; such pages are
not available to the page allocator.

An architecture has to provide three helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - maybe_set_page_offline() marks a page PageOffline() if it requires
   acceptance. Used during boot to put pages on free lists.

 - clear_page_offline() makes a page accepted and clears
   PageOffline().
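
As a rough userspace sketch of this contract (not kernel code; the
names mirror the patch, but the types are simplified and the platform
protocol is stubbed out), the three helpers reduce to bitmap operations
at 2MiB granularity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model: one bit per 2MiB range; a set bit means "not yet
 * accepted". In the real patch the bitmap lives behind
 * boot_params.unaccepted_memory and acceptance invokes a
 * platform-specific protocol (TDX/SEV-SNP); here "accepting" just
 * clears the bit.
 */

#define TOY_PMD_SIZE (2ULL << 20)

static uint64_t toy_unaccepted;	/* covers the first 128MiB (64 bits) */

/* accept_memory(): accept every 2MiB range touching [start, end) */
static void toy_accept_memory(uint64_t start, uint64_t end)
{
	for (uint64_t bit = start / TOY_PMD_SIZE;
	     bit * TOY_PMD_SIZE < end; bit++)
		toy_unaccepted &= ~(1ULL << bit);
}

/* maybe_set_page_offline(): does this page still need acceptance,
 * i.e. should it be parked as PageOffline() on the free lists? */
static bool toy_page_needs_acceptance(uint64_t paddr)
{
	return toy_unaccepted & (1ULL << (paddr / TOY_PMD_SIZE));
}

/* clear_page_offline(): accept the backing range on first allocation */
static void toy_clear_page_offline(uint64_t paddr)
{
	toy_accept_memory(paddr, paddr + TOY_PMD_SIZE);
}
```

In the real kernel the last two helpers operate on struct page and take
an order, and the bitmap is protected by a spinlock (see patch 4).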

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/internal.h   | 14 ++++++++++++++
 mm/memblock.c   |  1 +
 mm/page_alloc.c | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 31ff935b2547..d2fc8a17fbe0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline void maybe_set_page_offline(struct page *page, unsigned int order)
+{
+}
+
+static inline void clear_page_offline(struct page *page, unsigned int order)
+{
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 28a813d9e955..8c1bf08f2b0b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1370,6 +1370,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0, 0);
 
+	accept_memory(found, found + size);
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 856b175c15a4..892347d9a507 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -899,6 +899,9 @@ static inline bool page_is_buddy(struct page *page, struct page *buddy,
 	if (buddy_order(buddy) != order)
 		return false;
 
+	if (PageOffline(buddy) || PageOffline(page))
+		return false;
+
 	/*
 	 * zone check is done late to avoid uselessly calculating
 	 * zone/node ids for pages that could never merge.
@@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	if (page_reported(page))
 		__ClearPageReported(page);
 
+	if (PageOffline(page))
+		clear_page_offline(page, order);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -1165,7 +1171,8 @@ static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageOffline(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1748,6 +1755,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
+
+	maybe_set_page_offline(page, order);
 	__free_pages_core(page, order);
 }
 
@@ -1839,10 +1848,12 @@ static void __init deferred_free_range(unsigned long pfn,
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		maybe_set_page_offline(page, pageblock_order);
 		__free_pages_core(page, pageblock_order);
 		return;
 	}
 
+	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-- 
2.31.1



* [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
  2021-08-10  6:26 ` [PATCH 1/5] mm: Add " Kirill A. Shutemov
@ 2021-08-10  6:26 ` Kirill A. Shutemov
  2021-08-10 17:50   ` Dave Hansen
  2021-08-10 18:30   ` Dave Hansen
  2021-08-10  6:26 ` [PATCH 3/5] x86/boot/compressed: Handle " Kirill A. Shutemov
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10  6:26 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it forces the VMM to allocate memory for
the accepted guest physical address range. It's better to postpone
memory acceptance until the memory is needed: it lowers boot time and
reduces memory overhead.

The kernel needs to know which memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for the
kernel: e820 would have to be modified on every page acceptance, which
leads to table fragmentation, and there's a limited number of entries
in the e820 table.

The other option is to mark such memory as usable in e820 and track
whether a range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst case scenario -- a huge hole in the middle of the
address space -- we would need 256MiB to handle 4PiB of the address
space.
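
The sizing math can be checked directly. A standalone sketch (the
helper name is illustrative, not from the patch; the numbers come from
the two paragraphs above):

```c
#include <assert.h>
#include <stdint.h>

/*
 * One bit of bitmap covers one 2MiB range, so:
 *  - a single 4k page of bitmap (4096 * 8 bits) covers 64GiB;
 *  - covering a 4PiB address space takes 256MiB of bitmap.
 */

#define BITMAP_PMD_SIZE (2ULL << 20)	/* 2MiB per bit */

static uint64_t bitmap_bytes_for(uint64_t max_addr)
{
	/* round covered ranges up, then pack 8 bits per byte */
	uint64_t bits = (max_addr + BITMAP_PMD_SIZE - 1) / BITMAP_PMD_SIZE;

	return (bits + 7) / 8;
}
```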

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params. allocate_e820() allocates the bitmap, if
unaccepted memory is present, sized according to the maximum address in
the memory map.

The same boot_params.unaccepted_memory can be used to pass the bitmap
between two kernels on kexec, but that use-case is not yet implemented.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 24 +++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 36 ++++++++++
 arch/x86/include/asm/unaccepted_memory.h     | 12 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 drivers/firmware/efi/Kconfig                 | 12 ++++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 75 ++++++++++++++++----
 include/linux/efi.h                          |  3 +-
 10 files changed, 153 insertions(+), 15 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
index f088f5881666..8e3447a4b373 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/x86/zero-page.rst
@@ -42,4 +42,5 @@ Offset/Size	Proto	Name			Meaning
 2D0/A00		ALL	e820_table		E820 memory map table
 						(array of struct e820_entry)
 D00/1EC		ALL	eddbuf			EDD data (array of struct edd_info)
+ECC/008		ALL	unaccepted_memory	Bitmap of unaccepted memory (1bit == 2M)
 ===========	=====	=======================	=================================================
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 1bfe30ebadbe..f5b49e74d728 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -100,6 +100,7 @@ endif
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o
 vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/unaccepted_memory.o
 
 vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_thunk_$(BITS).o
 efi-obj-$(CONFIG_EFI_STUB) = $(objtree)/drivers/firmware/efi/libstub/lib.a
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..bf58b259380a
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Taken from lib/string.c */
+
+#include <linux/bitmap.h>
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_set >= 0) {
+		*p |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_LONG;
+		mask_to_set = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+		*p |= mask_to_set;
+	}
+}
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
new file mode 100644
index 000000000000..c2eca85b5073
--- /dev/null
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -0,0 +1,36 @@
+#include "error.h"
+#include "misc.h"
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	/* Platform-specific memory-acceptance call goes here */
+	error("Cannot accept memory");
+}
+
+void mark_unaccepted(struct boot_params *params, u64 start, u64 num)
+{
+	u64 end = start + num * PAGE_SIZE;
+	unsigned int npages;
+
+	if ((start & PMD_MASK) == (end & PMD_MASK)) {
+		npages = (end - start) / PAGE_SIZE;
+		__accept_memory(start, start + npages * PAGE_SIZE);
+		return;
+	}
+
+	if (start & ~PMD_MASK) {
+		npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
+		__accept_memory(start, start + npages * PAGE_SIZE);
+		start = round_up(start, PMD_SIZE);
+	}
+
+	if (end & ~PMD_MASK) {
+		npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
+		end = round_down(end, PMD_SIZE);
+		__accept_memory(end, end + npages * PAGE_SIZE);
+	}
+
+	npages = (end - start) / PMD_SIZE;
+	bitmap_set((unsigned long *)params->unaccepted_memory,
+		   start / PMD_SIZE, npages);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..cbc24040b853
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+#include <linux/types.h>
+
+struct boot_params;
+
+void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index b25d3f82c2f3..16bc686a198d 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -217,7 +217,8 @@ struct boot_params {
 	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
 	__u8  _pad8[48];				/* 0xcd0 */
 	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
-	__u8  _pad9[276];				/* 0xeec */
+	__u64 unaccepted_memory;			/* 0xeec */
+	__u8  _pad9[268];				/* 0xef4 */
 } __attribute__((packed));
 
 /**
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 2c3dac5ecb36..e13b584cdd80 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -243,6 +243,18 @@ config EFI_DISABLE_PCI_DMA
 	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
 	  may be used to override this option.
 
+config UNACCEPTED_MEMORY
+	bool
+	depends on EFI_STUB
+	help
+	   Some Virtual Machine platforms, such as Intel TDX, introduce
+	   the concept of memory acceptance, requiring memory to be accepted
+	   before it can be used by the guest. This protects against a class of
+	   attacks by the virtual machine platform.
+
+	   This option adds support for unaccepted memory and makes such memory
+	   usable by the kernel.
+
 endmenu
 
 config EFI_EMBEDDED_FIRMWARE
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 847f33ffc4ae..c6b8a1c5a87f 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -737,6 +737,7 @@ static __initdata char memory_type_name[][13] = {
 	"MMIO Port",
 	"PAL Code",
 	"Persistent",
+	"Unaccepted",
 };
 
 char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index f14c4ff5839f..e67ec1245f10 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -9,12 +9,14 @@
 #include <linux/efi.h>
 #include <linux/pci.h>
 #include <linux/stddef.h>
+#include <linux/bitmap.h>
 
 #include <asm/efi.h>
 #include <asm/e820/types.h>
 #include <asm/setup.h>
 #include <asm/desc.h>
 #include <asm/boot.h>
+#include <asm/unaccepted_memory.h>
 
 #include "efistub.h"
 
@@ -504,6 +506,12 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
 			e820_type = E820_TYPE_PMEM;
 			break;
 
+		case EFI_UNACCEPTED_MEMORY:
+			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+				continue;
+			e820_type = E820_TYPE_RAM;
+			mark_unaccepted(params, d->phys_addr, d->num_pages);
+			break;
 		default:
 			continue;
 		}
@@ -569,30 +577,71 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
 }
 
 static efi_status_t allocate_e820(struct boot_params *params,
+				  struct efi_boot_memmap *map,
 				  struct setup_data **e820ext,
 				  u32 *e820ext_size)
 {
-	unsigned long map_size, desc_size, map_key;
 	efi_status_t status;
-	__u32 nr_desc, desc_version;
-
-	/* Only need the size of the mem map and size of each mem descriptor */
-	map_size = 0;
-	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
-			     &desc_size, &desc_version);
-	if (status != EFI_BUFFER_TOO_SMALL)
-		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
+	__u32 nr_desc;
+	bool unaccepted_memory_present = false;
+	u64 max_addr = 0;
+	int i;
 
-	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+	status = efi_get_memory_map(map);
+	if (status != EFI_SUCCESS)
+		return status;
 
-	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
-		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+	nr_desc = *map->map_size / *map->desc_size;
+	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) -
+			EFI_MMAP_NR_SLACK_SLOTS;
 
 		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
 		if (status != EFI_SUCCESS)
 			return status;
 	}
 
+	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+		return EFI_SUCCESS;
+
+	/* Check if there's any unaccepted memory and find the max address */
+	for (i = 0; i < nr_desc; i++) {
+		efi_memory_desc_t *d;
+
+		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
+		if (d->type == EFI_UNACCEPTED_MEMORY)
+			unaccepted_memory_present = true;
+		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+	}
+
+	/*
+	 * If unaccepted memory is present, allocate a bitmap to track what
+	 * memory has to be accepted before access.
+	 *
+	 * One bit in the bitmap represents 2MiB in the address space: one 4k
+	 * page is enough to track 64GiB of physical address space.
+	 *
+	 * In the worst case scenario -- a huge hole in the middle of the
+	 * address space -- we would need 256MiB to handle 4PiB of the address
+	 * space.
+	 *
+	 * TODO: handle the situation when params->unaccepted_memory has
+	 * already been set. It's required to deal with kexec.
+	 */
+	if (unaccepted_memory_present) {
+		unsigned long *unaccepted_memory = NULL;
+		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+
+		status = efi_allocate_pages(size,
+					    (unsigned long *)&unaccepted_memory,
+					    ULONG_MAX);
+		if (status != EFI_SUCCESS)
+			return status;
+		memset(unaccepted_memory, 0, size);
+		params->unaccepted_memory = (u64)unaccepted_memory;
+	}
+
 	return EFI_SUCCESS;
 }
 
@@ -642,7 +691,7 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
 	priv.boot_params	= boot_params;
 	priv.efi		= &boot_params->efi_info;
 
-	status = allocate_e820(boot_params, &e820ext, &e820ext_size);
+	status = allocate_e820(boot_params, &map, &e820ext, &e820ext_size);
 	if (status != EFI_SUCCESS)
 		return status;
 
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 6b5d36babfcc..d43cc872b582 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef	struct {
 #define EFI_MEMORY_MAPPED_IO_PORT_SPACE	12
 #define EFI_PAL_CODE			13
 #define EFI_PERSISTENT_MEMORY		14
-#define EFI_MAX_MEMORY_TYPE		15
+#define EFI_UNACCEPTED_MEMORY		15
+#define EFI_MAX_MEMORY_TYPE		16
 
 /* Attribute values: */
 #define EFI_MEMORY_UC		((u64)0x0000000000000001ULL)	/* uncached */
-- 
2.31.1



* [PATCH 3/5] x86/boot/compressed: Handle unaccepted memory
  2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
  2021-08-10  6:26 ` [PATCH 1/5] mm: Add " Kirill A. Shutemov
  2021-08-10  6:26 ` [PATCH 2/5] efi/x86: Implement " Kirill A. Shutemov
@ 2021-08-10  6:26 ` Kirill A. Shutemov
  2021-08-10  6:26 ` [PATCH 4/5] x86/mm: Provide helpers for " Kirill A. Shutemov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10  6:26 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

Firmware is responsible for accepting the memory where the compressed
kernel image and initrd land. But the kernel has to accept memory for
the decompression buffer itself: that memory gets accepted just before
decompression starts.

KASLR is allowed to use unaccepted memory for the output buffer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/bitmap.c            | 62 ++++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 ++++-
 arch/x86/boot/compressed/misc.c              |  9 +++
 arch/x86/boot/compressed/unaccepted_memory.c | 13 ++++
 arch/x86/include/asm/unaccepted_memory.h     |  2 +
 5 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
index bf58b259380a..ba2de61c0823 100644
--- a/arch/x86/boot/compressed/bitmap.c
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -2,6 +2,48 @@
 /* Taken from lib/string.c */
 
 #include <linux/bitmap.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long nbits,
+		unsigned long start, unsigned long invert, unsigned long le)
+{
+	unsigned long tmp, mask;
+
+	if (unlikely(start >= nbits))
+		return nbits;
+
+	tmp = addr1[start / BITS_PER_LONG];
+	if (addr2)
+		tmp &= addr2[start / BITS_PER_LONG];
+	tmp ^= invert;
+
+	/* Handle 1st word. */
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
+	start = round_down(start, BITS_PER_LONG);
+
+	while (!tmp) {
+		start += BITS_PER_LONG;
+		if (start >= nbits)
+			return nbits;
+
+		tmp = addr1[start / BITS_PER_LONG];
+		if (addr2)
+			tmp &= addr2[start / BITS_PER_LONG];
+		tmp ^= invert;
+	}
+
+	if (le)
+		tmp = swab(tmp);
+
+	return min(start + __ffs(tmp), nbits);
+}
 
 void __bitmap_set(unsigned long *map, unsigned int start, int len)
 {
@@ -22,3 +64,23 @@ void __bitmap_set(unsigned long *map, unsigned int start, int len)
 		*p |= mask_to_set;
 	}
 }
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+	unsigned long *p = map + BIT_WORD(start);
+	const unsigned int size = start + len;
+	int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+	unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+	while (len - bits_to_clear >= 0) {
+		*p &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_LONG;
+		mask_to_clear = ~0UL;
+		p++;
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+		*p &= ~mask_to_clear;
+	}
+}
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index e36690778497..89cdc17dcbf1 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -729,10 +729,20 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
 		 * but in practice there's firmware where using that memory leads
 		 * to crashes.
 		 *
-		 * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
+		 * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if
+		 * supported) are guaranteed to be free.
 		 */
-		if (md->type != EFI_CONVENTIONAL_MEMORY)
+
+		switch (md->type) {
+		case EFI_CONVENTIONAL_MEMORY:
+			break;
+		case EFI_UNACCEPTED_MEMORY:
+			if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+				break;
 			continue;
+		default:
+			continue;
+		}
 
 		if (efi_soft_reserve_enabled() &&
 		    (md->attribute & EFI_MEMORY_SP))
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 743f13ea25c1..eeefcde8394d 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -18,6 +18,7 @@
 #include "../string.h"
 #include "../voffset.h"
 #include <asm/bootparam_utils.h>
+#include <asm/unaccepted_memory.h>
 
 /*
  * WARNING!!
@@ -435,6 +436,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
 #endif
 
 	debug_putstr("\nDecompressing Linux... ");
+
+	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+	    boot_params->unaccepted_memory) {
+		debug_putstr("Accepting memory... ");
+		accept_memory((phys_addr_t)output,
+			      (phys_addr_t)output + needed_size);
+	}
+
 	__decompress(input_data, input_len, NULL, NULL, output, output_len,
 			NULL, error);
 	parse_elf(output);
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index c2eca85b5073..17b70627b0cd 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -34,3 +34,16 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 num)
 	bitmap_set((unsigned long *)params->unaccepted_memory,
 		   start / PMD_SIZE, npages);
 }
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory;
+	unsigned int rs, re;
+
+	unaccepted_memory = (unsigned long *)boot_params->unaccepted_memory;
+	bitmap_for_each_set_region(unaccepted_memory, rs, re,
+				   start / PMD_SIZE, end / PMD_SIZE) {
+		__accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index cbc24040b853..f1f835d3cd78 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -9,4 +9,6 @@ struct boot_params;
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
 
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
 #endif
-- 
2.31.1



* [PATCH 4/5] x86/mm: Provide helpers for unaccepted memory
  2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2021-08-10  6:26 ` [PATCH 3/5] x86/boot/compressed: Handle " Kirill A. Shutemov
@ 2021-08-10  6:26 ` Kirill A. Shutemov
  2021-08-10 18:16   ` Dave Hansen
  2021-08-10  6:26 ` [PATCH 5/5] x86/tdx: Unaccepted memory support Kirill A. Shutemov
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10  6:26 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

Core-mm requires a few helpers to support unaccepted memory:

 - accept_memory() checks the range of addresses against the bitmap and
   accepts memory if needed;

 - maybe_set_page_offline() checks the bitmap and marks a page with
   PageOffline() if memory acceptance is required on the first
   allocation of the page;

 - clear_page_offline() accepts memory for the page and clears
   PageOffline().
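
The accept_memory() flow above amounts to "walk the set regions of the
bitmap inside [start, end), accept each run with one platform call,
then clear it". A userspace sketch of that loop (names are illustrative;
the platform protocol and the spinlock from the patch are stubbed out):

```c
#include <assert.h>
#include <stdint.h>

#define RGN_PMD_SIZE (2ULL << 20)	/* 2MiB per bit */
#define RGN_NBITS 64			/* toy bitmap: first 128MiB */

static uint64_t rgn_bitmap;	/* set bit => 2MiB range unaccepted */
static unsigned int rgn_calls;	/* platform calls made, for inspection */

/* Stand-in for the platform-specific acceptance protocol */
static void rgn_platform_accept(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
	rgn_calls++;
}

/*
 * Sketch of accept_memory(): find each contiguous run of set bits in
 * [first, last), accept it with one call, then clear it -- mirroring
 * the bitmap_for_each_set_region() + bitmap_clear() pair in the patch.
 */
static void rgn_accept_memory(uint64_t start, uint64_t end)
{
	uint64_t first = start / RGN_PMD_SIZE;
	uint64_t last = (end + RGN_PMD_SIZE - 1) / RGN_PMD_SIZE;

	for (uint64_t bit = first; bit < last && bit < RGN_NBITS; ) {
		if (!(rgn_bitmap & (1ULL << bit))) {
			bit++;
			continue;
		}
		uint64_t rs = bit;	/* start of a set-bit run */

		while (bit < last && bit < RGN_NBITS &&
		       (rgn_bitmap & (1ULL << bit)))
			bit++;
		rgn_platform_accept(rs * RGN_PMD_SIZE, bit * RGN_PMD_SIZE);
		for (uint64_t b = rs; b < bit; b++)
			rgn_bitmap &= ~(1ULL << b);
	}
}
```

Batching per contiguous run, rather than per 2MiB bit, keeps the number
of (expensive) platform calls down.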

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/unaccepted_memory.c |  3 +-
 arch/x86/include/asm/page.h                  |  5 ++
 arch/x86/include/asm/unaccepted_memory.h     |  3 +
 arch/x86/mm/Makefile                         |  2 +
 arch/x86/mm/unaccepted_memory.c              | 80 ++++++++++++++++++++
 5 files changed, 91 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index 17b70627b0cd..818d32169eef 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -13,8 +13,7 @@ void mark_unaccepted(struct boot_params *params, u64 start, u64 num)
 	unsigned int npages;
 
 	if ((start & PMD_MASK) == (end & PMD_MASK)) {
-		npages = (end - start) / PAGE_SIZE;
-		__accept_memory(start, start + npages * PAGE_SIZE);
+		__accept_memory(start, end);
 		return;
 	}
 
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 4d5810c8fab7..1e56d76ca474 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,11 @@
 struct page;
 
 #include <linux/range.h>
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+#include <asm/unaccepted_memory.h>
+#endif
+
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index f1f835d3cd78..712128760131 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -6,9 +6,12 @@
 #include <linux/types.h>
 
 struct boot_params;
+struct page;
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
 
 void accept_memory(phys_addr_t start, phys_addr_t end);
 
+void maybe_set_page_offline(struct page *page, unsigned int order);
+void clear_page_offline(struct page *page, unsigned int order);
 #endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index b31cb52bf1bd..fe4e16322868 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -57,3 +57,5 @@ obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON)	+= mem_encrypt_common.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY)	+= unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..e11933f62ead
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,80 @@
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
+static void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long *unaccepted_memory;
+	unsigned int rs, re;
+
+	unaccepted_memory = __va(boot_params.unaccepted_memory);
+	bitmap_for_each_set_region(unaccepted_memory, rs, re,
+				   start / PMD_SIZE,
+				   DIV_ROUND_UP(end, PMD_SIZE)) {
+		/* Platform-specific memory-acceptance call goes here */
+		panic("Cannot accept memory");
+		bitmap_clear(unaccepted_memory, rs, re - rs);
+	}
+}
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	spin_lock(&unaccepted_memory_lock);
+	__accept_memory(start, end);
+	spin_unlock(&unaccepted_memory_lock);
+}
+
+void __init maybe_set_page_offline(struct page *page, unsigned int order)
+{
+	unsigned long *unaccepted_memory;
+	phys_addr_t addr = page_to_phys(page);
+	bool unaccepted = true;
+	unsigned int i;
+
+	if (!boot_params.unaccepted_memory)
+		return;
+
+	unaccepted_memory = __va(boot_params.unaccepted_memory);
+	spin_lock(&unaccepted_memory_lock);
+	if (order < PMD_ORDER) {
+		BUG_ON(test_bit(addr / PMD_SIZE, unaccepted_memory));
+		goto out;
+	}
+
+	for (i = 0; i < (1 << (order - PMD_ORDER)); i++) {
+		if (!test_bit(addr / PMD_SIZE + i, unaccepted_memory)) {
+			unaccepted = false;
+			break;
+		}
+	}
+
+	if (unaccepted)
+		__SetPageOffline(page);
+	else
+		__accept_memory(addr, addr + (PAGE_SIZE << order));
+out:
+	spin_unlock(&unaccepted_memory_lock);
+}
+
+void clear_page_offline(struct page *page, unsigned int order)
+{
+	phys_addr_t addr = page_to_phys(page);
+
+	/* PageOffline() page on a free list, but no unaccepted memory? Hm. */
+	WARN_ON_ONCE(!boot_params.unaccepted_memory);
+
+	accept_memory(addr, addr + (PAGE_SIZE << order));
+	__ClearPageOffline(page);
+}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread
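
[Editorial illustration] The on-demand acceptance flow in the patch above boils down to a bitmap walk: bits mark 2M chunks that still need acceptance, and a range is accepted by clearing its bits and invoking the platform protocol once per still-set chunk. A minimal userspace sketch of that logic (hypothetical names, chunk-aligned ranges assumed; a counter stands in for the platform-specific call):

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SHIFT 21			/* 2M chunks, like PMD_SIZE */
#define CHUNK_SIZE  (1ULL << CHUNK_SHIFT)

static uint64_t unaccepted_bitmap;	/* bit i set: chunk i is still unaccepted */
static int accept_calls;		/* counts platform "accept" operations */

/* Stand-in for the costly platform-specific acceptance protocol. */
static void platform_accept_chunk(uint64_t chunk)
{
	accept_calls++;
}

/* Record [start, end) as unaccepted; assumes chunk-aligned start/end. */
static void mark_unaccepted(uint64_t start, uint64_t end)
{
	for (uint64_t c = start >> CHUNK_SHIFT; c < (end >> CHUNK_SHIFT); c++)
		unaccepted_bitmap |= 1ULL << c;
}

/* Accept on demand: only chunks still marked unaccepted cost anything. */
static void accept_memory(uint64_t start, uint64_t end)
{
	for (uint64_t c = start >> CHUNK_SHIFT; c < (end >> CHUNK_SHIFT); c++) {
		if (unaccepted_bitmap & (1ULL << c)) {
			platform_accept_chunk(c);
			unaccepted_bitmap &= ~(1ULL << c);
		}
	}
}
```

Re-accepting an already-accepted range is a no-op, which is why the real kernel can call accept_memory() unconditionally on allocation paths.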

* [PATCH 5/5] x86/tdx: Unaccepted memory support
  2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2021-08-10  6:26 ` [PATCH 4/5] x86/mm: Provide helpers for " Kirill A. Shutemov
@ 2021-08-10  6:26 ` Kirill A. Shutemov
  2021-08-10 14:08 ` [PATCH 0/5] x86: Implement support for unaccepted memory Dave Hansen
  2021-08-12  8:23 ` Joerg Roedel
  6 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10  6:26 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

All preparation is complete. Hook up the TDX-specific code to accept
memory.

There are two tdg_accept_memory() implementations: one in the main kernel
and one in the decompressor.

The implementation in the core kernel uses tdx_hcall_gpa_intent(). That
helper is not available in the decompressor, so a self-contained
implementation is added there instead.

Note that tdx_hcall_gpa_intent() is going to be more complex once we
teach it to accept in 1G and 2M chunks.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/tdx.c               | 29 ++++++++++++++++++++
 arch/x86/boot/compressed/unaccepted_memory.c |  5 +++-
 arch/x86/include/asm/tdx.h                   |  2 ++
 arch/x86/kernel/tdx.c                        |  8 ++++++
 arch/x86/mm/unaccepted_memory.c              |  6 +++-
 6 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c0261cc72449..5b1b3dc84c7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -874,6 +874,7 @@ config INTEL_TDX_GUEST
 	select ARCH_HAS_PROTECTED_GUEST
 	select X86_MEM_ENCRYPT_COMMON
 	select ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS
+	select UNACCEPTED_MEMORY
 	help
 	  Provide support for running in a trusted domain on Intel processors
 	  equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 88ed6465405b..6f13bdaf327f 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -5,6 +5,10 @@
 
 #include "../cpuflags.h"
 #include "../string.h"
+#include "error.h"
+
+#include <asm/page_types.h>
+#include <asm/tdx.h>
 
 #define TDX_CPUID_LEAF_ID                       0x21
 
@@ -32,3 +36,28 @@ bool early_is_tdx_guest(void)
 
 	return !!tdx_guest;
 }
+
+#define TDACCEPTPAGE		6
+#define TDVMCALL_MAP_GPA	0x10001
+
+void tdg_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	struct tdx_hypercall_output outl = {0};
+	int i;
+
+	if (__tdx_hypercall(TDX_HYPERCALL_STANDARD, TDVMCALL_MAP_GPA,
+			    start, end, 0, 0, &outl)) {
+		error("Cannot accept memory: MapGPA failed\n");
+	}
+
+	/*
+	 * For shared->private conversion, accept the page using TDACCEPTPAGE
+	 * TDX module call.
+	 */
+	for (i = 0; i < (end - start) / PAGE_SIZE; i++) {
+		if (__tdx_module_call(TDACCEPTPAGE, start + i * PAGE_SIZE,
+				      0, 0, 0, NULL)) {
+			error("Cannot accept memory: page accept failed\n");
+		}
+	}
+}
diff --git a/arch/x86/boot/compressed/unaccepted_memory.c b/arch/x86/boot/compressed/unaccepted_memory.c
index 818d32169eef..146a3a6968a8 100644
--- a/arch/x86/boot/compressed/unaccepted_memory.c
+++ b/arch/x86/boot/compressed/unaccepted_memory.c
@@ -4,7 +4,10 @@
 static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
 {
 	/* Platform-specific memory-acceptance call goes here */
-	error("Cannot accept memory");
+	if (early_is_tdx_guest())
+		tdg_accept_memory(start, end);
+	else
+		error("Cannot accept memory");
 }
 
 void mark_unaccepted(struct boot_params *params, u64 start, u64 num)
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index af6e4cd8078d..f74b6cfde205 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -97,6 +97,8 @@ extern phys_addr_t tdg_shared_mask(void);
 extern int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
 				enum tdx_map_type map_type);
 
+extern void tdg_accept_memory(phys_addr_t start, phys_addr_t end);
+
 int tdx_mcall_tdreport(u64 data, u64 reportdata);
 
 int tdx_mcall_rtmr_extend(u64 data, u64 rmtr);
diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
index dc6d9441f3be..75f3804f86da 100644
--- a/arch/x86/kernel/tdx.c
+++ b/arch/x86/kernel/tdx.c
@@ -372,6 +372,14 @@ int tdx_hcall_gpa_intent(phys_addr_t gpa, int numpages,
 	return 0;
 }
 
+void tdg_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+	if (tdx_hcall_gpa_intent(start, (end - start) / PAGE_SIZE,
+				 TDX_MAP_PRIVATE)) {
+		panic("Accepting memory failed\n");
+	}
+}
+
 static __cpuidle void tdg_halt(void)
 {
 	u64 ret;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index e11933f62ead..19e0309e128b 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -5,6 +5,7 @@
 
 #include <asm/io.h>
 #include <asm/setup.h>
+#include <asm/tdx.h>
 #include <asm/unaccepted_memory.h>
 
 static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -21,7 +22,10 @@ static void __accept_memory(phys_addr_t start, phys_addr_t end)
 				   start / PMD_SIZE,
 				   DIV_ROUND_UP(end, PMD_SIZE)) {
 		/* Platform-specific memory-acceptance call goes here */
-		panic("Cannot accept memory");
+		if (prot_guest_has(PATTR_GUEST_TDX))
+			tdg_accept_memory(rs * PMD_SIZE, re * PMD_SIZE);
+		else
+			panic("Cannot accept memory");
 		bitmap_clear(unaccepted_memory, rs, re - rs);
 	}
 }
-- 
2.31.1



* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10  6:26 ` [PATCH 1/5] mm: Add " Kirill A. Shutemov
@ 2021-08-10  7:48   ` David Hildenbrand
  2021-08-10 15:02     ` Kirill A. Shutemov
  2021-08-10 18:13   ` Dave Hansen
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2021-08-10  7:48 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

On 10.08.21 08:26, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory acceptance:
> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> require memory to be accepted before it can be used by the guest.
> Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
> 
> Support of such memory requires a few changes in core-mm code:
> 
>    - memblock has to accept memory on allocation;
> 
>    - page allocator has to accept memory on the first allocation of the
>      page;
> 
> Memblock change is trivial.
> 
> Page allocator is modified to accept pages on the first allocation.
> PageOffline() is used to indicate that the page requires acceptance.
> The flag is currently used by hotplug and balloon. Such pages are not
> available to the page allocator.
> 
> An architecture has to provide three helpers if it wants to support
> unaccepted memory:
> 
>   - accept_memory() makes a range of physical addresses accepted.
> 
>   - maybe_set_page_offline() marks a page PageOffline() if it requires
>     acceptance. Used during boot to put pages on free lists.
> 
>   - clear_page_offline() makes a page accepted and clears
>     PageOffline().
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   mm/internal.h   | 14 ++++++++++++++
>   mm/memblock.c   |  1 +
>   mm/page_alloc.c | 13 ++++++++++++-
>   3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 31ff935b2547..d2fc8a17fbe0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>   int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>   		      unsigned long addr, int page_nid, int *flags);
>   
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
> +{
> +}
> +
> +static inline void clear_page_offline(struct page *page, unsigned int order)
> +{
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}

Can we find better fitting names for the first two? The function names 
are way too generic. For example:

accept_or_set_page_offline()

accept_and_clear_page_offline()

I thought for a second if
	PAGE_TYPE_OPS(Unaccepted, offline)
makes sense as well, not sure.


Also, please update the description of PageOffline in page-flags.h to 
include the additional usage with PageBuddy set at the same time.


I assume you don't have to worry about page_offline_freeze/thaw ... as 
we only set PageOffline initially, but not later at runtime when other 
subsystems (/proc/kcore) might stumble over it.

-- 
Thanks,

David / dhildenb



* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2021-08-10  6:26 ` [PATCH 5/5] x86/tdx: Unaccepted memory support Kirill A. Shutemov
@ 2021-08-10 14:08 ` Dave Hansen
  2021-08-10 15:15   ` Kirill A. Shutemov
  2021-08-12  8:23 ` Joerg Roedel
  6 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 14:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory acceptance:
> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> require memory to be accepted before it can be used by the guest.
> Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. We don't want to accept all memory
> upfront.

This could use a bit more explanation.  Any VM is likely to *eventually*
touch all its memory, so it's not like a VMM has a long-term advantage
by delaying this.

So, it must have to do with resource use at boot.  Is this to help boot
times?

I had expected this series, but I also expected it to be connected to
CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
this problem is different and demands a totally orthogonal solution?

For instance, what prevents us from declaring: "Memory is accepted at
the time that its 'struct page' is initialized" ?  Then, we use all the
infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.

This series isn't too onerous, but I do want to make sure that we're not
reinventing the wheel.


* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10  7:48   ` David Hildenbrand
@ 2021-08-10 15:02     ` Kirill A. Shutemov
  2021-08-10 15:21       ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10 15:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 09:48:04AM +0200, David Hildenbrand wrote:
> On 10.08.21 08:26, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > require memory to be accepted before it can be used by the guest.
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> > 
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> > 
> > Support of such memory requires a few changes in core-mm code:
> > 
> >    - memblock has to accept memory on allocation;
> > 
> >    - page allocator has to accept memory on the first allocation of the
> >      page;
> > 
> > Memblock change is trivial.
> > 
> > Page allocator is modified to accept pages on the first allocation.
> > PageOffline() is used to indicate that the page requires acceptance.
> > The flag is currently used by hotplug and balloon. Such pages are not
> > available to the page allocator.
> > 
> > An architecture has to provide three helpers if it wants to support
> > unaccepted memory:
> > 
> >   - accept_memory() makes a range of physical addresses accepted.
> > 
> >   - maybe_set_page_offline() marks a page PageOffline() if it requires
> >     acceptance. Used during boot to put pages on free lists.
> > 
> >   - clear_page_offline() makes a page accepted and clears
> >     PageOffline().
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >   mm/internal.h   | 14 ++++++++++++++
> >   mm/memblock.c   |  1 +
> >   mm/page_alloc.c | 13 ++++++++++++-
> >   3 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 31ff935b2547..d2fc8a17fbe0 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
> >   int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> >   		      unsigned long addr, int page_nid, int *flags);
> > +#ifndef CONFIG_UNACCEPTED_MEMORY
> > +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
> > +{
> > +}
> > +
> > +static inline void clear_page_offline(struct page *page, unsigned int order)
> > +{
> > +}
> > +
> > +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +}
> 
> Can we find better fitting names for the first two? The function names are
> way too generic. For example:
> 
> accept_or_set_page_offline()
> 
> accept_and_clear_page_offline()

Sounds good.

> I thought for a second if
> 	PAGE_TYPE_OPS(Unaccepted, offline)
> makes sense as well, not sure.

I find Offline fitting the situation. Don't see a reason to add more
terminology here.

> Also, please update the description of PageOffline in page-flags.h to
> include the additional usage with PageBuddy set at the same time.

Okay.

> I assume you don't have to worry about page_offline_freeze/thaw ... as we
> only set PageOffline initially, but not later at runtime when other
> subsystems (/proc/kcore) might stumble over it.

I think so, but I would need to look at this code once again.

-- 
 Kirill A. Shutemov


* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 14:08 ` [PATCH 0/5] x86: Implement support for unaccepted memory Dave Hansen
@ 2021-08-10 15:15   ` Kirill A. Shutemov
  2021-08-10 15:51     ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10 15:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > require memory to be accepted before it can be used by the guest.
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> > 
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. We don't want to accept all memory
> > upfront.
> 
> This could use a bit more explanation.  Any VM is likely to *eventually*
> touch all its memory, so it's not like a VMM has a long-term advantage
> by delaying this.
> 
> So, it must have to do with resource use at boot.  Is this to help boot
> times?

Yes, boot time is the main motivation.

But I'm also going to look at long-term VM behaviour with a fixed memory
footprint. I think if a workload allocates and frees memory within the
same bounds, we can keep memory beyond that size unaccepted. A few tweaks
will likely be required, such as disabling page shuffling on free to keep
unaccepted memory at the tail of the free list. More investigation is
needed.
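
To illustrate the free-list idea (a toy sketch only, nothing like the real
buddy allocator): freed accepted pages go to the head, unaccepted pages stay
at the tail, so allocations drain accepted memory first and the acceptance
cost is paid only once the accepted pool is exhausted.

```c
#include <assert.h>
#include <stddef.h>

struct fpage {
	int accepted;		/* 0: still needs the (costly) accept protocol */
	struct fpage *next;
};

static struct fpage *head, *tail;
static int accept_cost;	/* how many times we had to accept on allocation */

/* Accepted pages are pushed at the head of the free list... */
static void free_page_head(struct fpage *p)
{
	p->next = head;
	head = p;
	if (!tail)
		tail = p;
}

/* ...unaccepted pages are appended at the tail. */
static void free_page_tail(struct fpage *p)
{
	p->next = NULL;
	if (tail)
		tail->next = p;
	else
		head = p;
	tail = p;
}

/* Pop from the head; accept lazily only when an unaccepted page is reached. */
static struct fpage *take_page(void)
{
	struct fpage *p = head;

	if (!p)
		return NULL;
	head = p->next;
	if (!head)
		tail = NULL;
	if (!p->accepted) {
		accept_cost++;
		p->accepted = 1;
	}
	return p;
}
```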

> I had expected this series, but I also expected it to be connected to
> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
> this problem is different and demands a totally orthogonal solution?
>
> For instance, what prevents us from declaring: "Memory is accepted at
> the time that its 'struct page' is initialized" ?  Then, we use all the
> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.

That was my first thought too, and I tried it just to realize that it is
not what we want. If we accepted pages at page struct init, it would mean
the host allocates all memory assigned to the guest at boot even if the
guest actually uses only a small portion of it.

Also, deferred page init only allows scaling memory acceptance across
multiple CPUs, but doesn't allow getting to userspace before we are done
with it. See wait_for_completion(&pgdat_init_all_done_comp).

-- 
 Kirill A. Shutemov


* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 15:02     ` Kirill A. Shutemov
@ 2021-08-10 15:21       ` David Hildenbrand
  2021-08-12 20:34         ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2021-08-10 15:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On 10.08.21 17:02, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 09:48:04AM +0200, David Hildenbrand wrote:
>> On 10.08.21 08:26, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory acceptance:
>>> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>>> require memory to be accepted before it can be used by the guest.
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>>>
>>> Accepting memory is costly and it makes VMM allocate memory for the
>>> accepted guest physical address range. It's better to postpone memory
>>> acceptance until memory is needed. It lowers boot time and reduces
>>> memory overhead.
>>>
>>> Support of such memory requires a few changes in core-mm code:
>>>
>>>     - memblock has to accept memory on allocation;
>>>
>>>     - page allocator has to accept memory on the first allocation of the
>>>       page;
>>>
>>> Memblock change is trivial.
>>>
>>> Page allocator is modified to accept pages on the first allocation.
>>> PageOffline() is used to indicate that the page requires acceptance.
>>> The flag is currently used by hotplug and balloon. Such pages are not
>>> available to the page allocator.
>>>
>>> An architecture has to provide three helpers if it wants to support
>>> unaccepted memory:
>>>
>>>    - accept_memory() makes a range of physical addresses accepted.
>>>
>>>    - maybe_set_page_offline() marks a page PageOffline() if it requires
>>>      acceptance. Used during boot to put pages on free lists.
>>>
>>>    - clear_page_offline() makes a page accepted and clears
>>>      PageOffline().
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>    mm/internal.h   | 14 ++++++++++++++
>>>    mm/memblock.c   |  1 +
>>>    mm/page_alloc.c | 13 ++++++++++++-
>>>    3 files changed, 27 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 31ff935b2547..d2fc8a17fbe0 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>>>    int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>>>    		      unsigned long addr, int page_nid, int *flags);
>>> +#ifndef CONFIG_UNACCEPTED_MEMORY
>>> +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
>>> +{
>>> +}
>>> +
>>> +static inline void clear_page_offline(struct page *page, unsigned int order)
>>> +{
>>> +}
>>> +
>>> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
>>> +{
>>> +}
>>
>> Can we find better fitting names for the first two? The function names are
>> way too generic. For example:
>>
>> accept_or_set_page_offline()
>>
>> accept_and_clear_page_offline()
> 
> Sounds good.
> 
>> I thought for a second if
>> 	PAGE_TYPE_OPS(Unaccepted, offline)
>> makes sense as well, not sure.
> 
> I find Offline fitting the situation. Don't see a reason to add more
> terminology here.
> 
>> Also, please update the description of PageOffline in page-flags.h to
>> include the additional usage with PageBuddy set at the same time.
> 
> Okay.
> 
>> I assume you don't have to worry about page_offline_freeze/thaw ... as we
>> only set PageOffline initially, but not later at runtime when other
>> subsystems (/proc/kcore) might stumble over it.
> 
> I think so, but I would need to look at this code once again.
> 

Another thing to look into would be teaching makedumpfile via vmcoreinfo 
about these special buddy pages:

makedumpfile will naturally skip all PageOffline pages and skip 
PageBuddy pages if requested to skip free pages. It detects these pages 
via the mapcount value. You will want makedumpfile to treat them like 
PageOffline pages: kernel/crash_core.c

#define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);

#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);

We could export PAGE_BUDDY_OFFLINE_MAPCOUNT_VALUE or just compute it 
inside makedumpfile from the other two values.
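
For illustration, the derivation could look like this (a sketch; the flag
values are taken from the kernel's include/linux/page-flags.h of that era and
may change, so treat them as assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Page type bits as in include/linux/page-flags.h (assumed values). */
#define PG_buddy   0x00000080u
#define PG_offline 0x00000100u

/* Values exported via vmcoreinfo in kernel/crash_core.c. */
#define PAGE_BUDDY_MAPCOUNT_VALUE   ((uint32_t)~PG_buddy)
#define PAGE_OFFLINE_MAPCOUNT_VALUE ((uint32_t)~PG_offline)

/*
 * The combined buddy+offline value can be derived from the two exported
 * ones: by De Morgan, ~PG_buddy & ~PG_offline == ~(PG_buddy | PG_offline).
 */
static uint32_t buddy_offline_value(void)
{
	return PAGE_BUDDY_MAPCOUNT_VALUE & PAGE_OFFLINE_MAPCOUNT_VALUE;
}
```

So makedumpfile would not strictly need a third vmcoreinfo export to match
such pages.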

-- 
Thanks,

David / dhildenb



* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 15:15   ` Kirill A. Shutemov
@ 2021-08-10 15:51     ` Dave Hansen
  2021-08-10 17:31       ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 15:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On 8/10/21 8:15 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 07:08:58AM -0700, Dave Hansen wrote:
>> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory acceptance:
>>> Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>>> require memory to be accepted before it can be used by the guest.
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>>>
>>> Accepting memory is costly and it makes VMM allocate memory for the
>>> accepted guest physical address range. We don't want to accept all memory
>>> upfront.
>>
>> This could use a bit more explanation.  Any VM is likely to *eventually*
>> touch all its memory, so it's not like a VMM has a long-term advantage
>> by delaying this.
>>
>> So, it must have to do with resource use at boot.  Is this to help boot
>> times?
> 
> Yes, boot time is the main motivation.
> 
> But I'm also going to look at long-term VM behaviour with a fixed memory
> footprint. I think if a workload allocates and frees memory within the
> same bounds, we can keep memory beyond that size unaccepted. A few tweaks
> will likely be required, such as disabling page shuffling on free to keep
> unaccepted memory at the tail of the free list. More investigation is
> needed.

OK, so this is predicated on the idea that a guest will not use all of
its assigned RAM and that the host will put that RAM to good use
elsewhere.  Right?

That's undercut by a few factors:
1. Many real-world cloud providers do not overcommit RAM.  If the guest
   does not use the RAM, it goes to waste.  (Yes, there are providers
   that overcommit, but we're talking generally about places where this
   proposal is useful).
2. Long-term, RAM fills up with page cache in many scenarios

So, this is really only beneficial for long-term host physical memory
use if:
1. The host is overcommitting
and
2. The guest never uses all of its RAM

Seeing as TDX doesn't support swap and can't coexist with persistent
memory, the only recourse for folks overcommitting TDX VMs when they run
out of RAM is to kill VMs.

I can't imagine that a lot of folks are going to do this.

In other words, I buy the boot speed argument.  But, I don't buy the
"this saves memory long term" argument at all.

>> I had expected this series, but I also expected it to be connected to
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
>> this problem is different and demands a totally orthogonal solution?
>>
>> For instance, what prevents us from declaring: "Memory is accepted at
>> the time that its 'struct page' is initialized" ?  Then, we use all the
>> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
> 
> That was my first thought too, and I tried it just to realize that it is
> not what we want. If we accepted pages at page struct init, it would mean
> the host allocates all memory assigned to the guest at boot even if the
> guest actually uses only a small portion of it.
> 
> Also, deferred page init only allows scaling memory acceptance across
> multiple CPUs, but doesn't allow getting to userspace before we are done
> with it. See wait_for_completion(&pgdat_init_all_done_comp).

That's good information.  It's a refinement of the "I want to boot
faster" requirement.  What you want is not just going _faster_, but
being able to run userspace before full acceptance has completed.

Would you be able to quantify how fast TDX page acceptance is?  Are we
talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
for a feature which is making a performance claim.

Let's say we have a 128GB VM.  How much faster does this approach
reach userspace than if all memory was accepted up front?  How much
memory _could_ have been accepted at the point userspace starts running?


* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 15:51     ` Dave Hansen
@ 2021-08-10 17:31       ` Kirill A. Shutemov
  2021-08-10 17:36         ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10 17:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 08:51:01AM -0700, Dave Hansen wrote:
> In other words, I buy the boot speed argument.  But, I don't buy the
> "this saves memory long term" argument at all.

Okay, that's fair enough. I guess there are *some* workloads that may
have their memory footprint reduced, but I agree it's a minority.

> >> I had expected this series, but I also expected it to be connected to
> >> CONFIG_DEFERRED_STRUCT_PAGE_INIT somehow.  Could you explain a bit how
> >> this problem is different and demands a totally orthogonal solution?
> >>
> >> For instance, what prevents us from declaring: "Memory is accepted at
> >> the time that its 'struct page' is initialized" ?  Then, we use all the
> >> infrastructure we already have for DEFERRED_STRUCT_PAGE_INIT.
> > 
> > That was my first thought too, and I tried it just to realize that it is
> > not what we want. If we accepted pages on page struct init, it would make
> > the host allocate all memory assigned to the guest at boot, even if the
> > guest actually uses only a small portion of it.
> > 
> > Also, deferred page init only allows scaling memory acceptance across
> > multiple CPUs, but doesn't allow getting to userspace before we are done
> > with it. See wait_for_completion(&pgdat_init_all_done_comp).
> 
> That's good information.  It's a refinement of the "I want to boot
> faster" requirement.  What you want is not just going _faster_, but
> being able to run userspace before full acceptance has completed.
> 
> Would you be able to quantify how fast TDX page acceptance is?  Are we
> talking about MB/s, GB/s, TB/s?  This series is rather bereft of numbers
> for a feature which is making a performance claim.
> 
> Let's say we have a 128GB VM.  How much faster does this approach
> reach userspace than if all memory was accepted up front?  How much
> memory _could_ have been accepted at the point userspace starts running?

Acceptance code is not optimized yet: we accept memory in 4k chunks, which
is very slow because hypercall overhead dominates the picture.

As of now, the kernel boot time of a 1-vCPU, 64TiB VM with upfront memory
acceptance is >20 times slower than with this lazy memory accept approach.

The difference is going to be substantially lower once we get it optimized
properly.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 17:31       ` Kirill A. Shutemov
@ 2021-08-10 17:36         ` Dave Hansen
  2021-08-10 17:51           ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 17:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On 8/10/21 10:31 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 08:51:01AM -0700, Dave Hansen wrote:
>> Let's say we have a 128GB VM.  How much faster does this approach
>> reach userspace than if all memory was accepted up front?  How much
>> memory _could_ have been accepted at the point userspace starts running?
> 
> Acceptance code is not optimized yet: we accept memory in 4k chunks, which
> is very slow because hypercall overhead dominates the picture.
> 
> As of now, the kernel boot time of a 1-vCPU, 64TiB VM with upfront memory
> acceptance is >20 times slower than with this lazy memory accept approach.

That's a pretty big deal.

> The difference is going to be substantially lower once we get it optimized
> properly.

What does this mean?  Is this future work in the kernel or somewhere in
the TDX hardware/firmware which will speed things up?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10  6:26 ` [PATCH 2/5] efi/x86: Implement " Kirill A. Shutemov
@ 2021-08-10 17:50   ` Dave Hansen
  2021-08-12 21:14     ` Kirill A. Shutemov
  2021-08-10 18:30   ` Dave Hansen
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 17:50 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

...
> +void mark_unaccepted(struct boot_params *params, u64 start, u64 num)
> +{

Some of these interfaces like accept_memory() take a start/end physical
address.  Having this take a "num pages" is bound to cause confusion.
Could you make these all consistently take start/end physical addresses?

> +	u64 end = start + num * PAGE_SIZE;
> +	unsigned int npages;


Could you comment those, please?

	/*
	 * The accepted memory bitmap only works at PMD_SIZE
	 * granularity.  If a request comes in to mark memory
	 * as unaccepted which is not PMD_SIZE-aligned, simply
	 * accept the memory now since it cannot be *marked* as
	 * unaccepted.
	 */

Then go on to comment the three cases:

	/* Check for ranges which do not span a whole PMD_SIZE area: */

> +	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> +		npages = (end - start) / PAGE_SIZE;
> +		__accept_memory(start, start + npages * PAGE_SIZE);
> +		return;
> +	}

Hmm, is it possible to have this case hit, but neither of the two below
cases?  This seems to be looking for a case where the range is somehow
entirely contained in one PMD_SIZE area, but where it doesn't consume a
whole area.

Wouldn't that mean that 'start' or 'end' must be unaligned?


> +	if (start & ~PMD_MASK) {
> +		npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
> +		__accept_memory(start, start + npages * PAGE_SIZE);
> +		start = round_up(start, PMD_SIZE);
> +	}
> +
> +	if (end & ~PMD_MASK) {
> +		npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
> +		end = round_down(end, PMD_SIZE);
> +		__accept_memory(end, end + npages * PAGE_SIZE);
> +	}
> +	npages = (end - start) / PMD_SIZE;
> +	bitmap_set((unsigned long *)params->unaccepted_memory,
> +		   start / PMD_SIZE, npages);
> +}

Even though it's changed right there, it's a bit cruel to change the
units of 'npages' right in the middle of a function.  It's just asking
for bugs.

It would only take a single extra variable declaration to make this
unambiguous:

	u64 nr_unaccepted_bits;

or something, then you can do:

	nr_unaccepted_bits = (end - start) / PMD_SIZE;
	bitmap_set((unsigned long *)params->unaccepted_memory,
		   start / PMD_SIZE, nr_unaccepted_bits);

...
>  static efi_status_t allocate_e820(struct boot_params *params,
> +				  struct efi_boot_memmap *map,
>  				  struct setup_data **e820ext,
>  				  u32 *e820ext_size)
>  {
> -	unsigned long map_size, desc_size, map_key;
>  	efi_status_t status;
> -	__u32 nr_desc, desc_version;
> -
> -	/* Only need the size of the mem map and size of each mem descriptor */
> -	map_size = 0;
> -	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
> -			     &desc_size, &desc_version);
> -	if (status != EFI_BUFFER_TOO_SMALL)
> -		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;

I noticed that there's no reference to EFI_BUFFER_TOO_SMALL in the hunks
you added back.  That makes me a bit nervous that this is going to
unintentionally change behavior.

It might be worth having a preparatory reorganization patch for
allocate_e820() before this new feature is added to make this more clear.

> +	__u32 nr_desc;
> +	bool unaccepted_memory_present = false;
> +	u64 max_addr = 0;
> +	int i;
>  
> -	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
> +	status = efi_get_memory_map(map);
> +	if (status != EFI_SUCCESS)
> +		return status;
>  
> -	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
> -		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
> +	nr_desc = *map->map_size / *map->desc_size;
> +	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
> +		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) -
> +			EFI_MMAP_NR_SLACK_SLOTS;
>  
>  		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
>  		if (status != EFI_SUCCESS)
>  			return status;
>  	}
>  
> +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> +		return EFI_SUCCESS;
> +
> +	/* Check if there's any unaccepted memory and find the max address */
> +	for (i = 0; i < nr_desc; i++) {
> +		efi_memory_desc_t *d;
> +
> +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> +		if (d->type == EFI_UNACCEPTED_MEMORY)
> +			unaccepted_memory_present = true;
> +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> +	}

This 'max_addr' variable looks a bit funky.

It *seems* like it's related only to EFI_UNACCEPTED_MEMORY, but it's not
underneath the EFI_UNACCEPTED_MEMORY check.  Is this somehow assuming
that once unaccepted memory has been found that *all* memory found in
later descriptors at higher addresses is also going to be unaccepted?

> +	/*
> +	 * If unaccepted memory is present, allocate a bitmap to track what
> +	 * memory has to be accepted before access.
> +	 *
> +	 * One bit in the bitmap represents 2MiB in the address space: one 4k
> +	 * page is enough to track 64GiB of physical address space.
> +	 *
> +	 * In the worst case scenario -- a huge hole in the middle of the
> +	 * address space -- we would need 256MiB to handle 4PiB of the address
> +	 * space.
> +	 *
> +	 * TODO: handle the situation when params->unaccepted_memory is already
> +	 * set. It's required to deal with kexec.
> +	 */
> +	if (unaccepted_memory_present) {
> +		unsigned long *unaccepted_memory = NULL;
> +		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);

Oh, so the bitmap has to be present for *all* memory, not just
unaccepted memory.  So, we really do need to know 'max_addr' so that
we can allocate a bitmap covering all of memory, in which accepted
memory can then be marked.

> +		status = efi_allocate_pages(size,
> +					    (unsigned long *)&unaccepted_memory,
> +					    ULONG_MAX);
> +		if (status != EFI_SUCCESS)
> +			return status;
> +		memset(unaccepted_memory, 0, size);
> +		params->unaccepted_memory = (u64)unaccepted_memory;
> +	}

It might be nice to refer to setup_e820() here to mention that it is the
thing that actually fills out the bitmap.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 17:36         ` Dave Hansen
@ 2021-08-10 17:51           ` Kirill A. Shutemov
  2021-08-10 18:19             ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10 17:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
> > The difference is going to be substantially lower once we get it optimized
> > properly.
> 
> What does this mean?  Is this future work in the kernel or somewhere in
> the TDX hardware/firmware which will speed things up?

The kernel has to be changed to accept memory in 2M and 1G chunks where
possible. The interface exists and is described in the spec, but it is not
yet used in the guest kernel.

It would cut hypercall overhead dramatically. It would make upfront memory
acceptance more bearable and lower the latency of lazy memory acceptance.
So I expect the gap to be not 20x, but more like 3-5x (which is still huge).

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10  6:26 ` [PATCH 1/5] mm: Add " Kirill A. Shutemov
  2021-08-10  7:48   ` David Hildenbrand
@ 2021-08-10 18:13   ` Dave Hansen
  2021-08-10 18:30     ` Andi Kleen
  2021-08-10 20:50     ` Dave Hansen
  1 sibling, 2 replies; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 18:13 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  	if (page_reported(page))
>  		__ClearPageReported(page);
>  
> +	if (PageOffline(page))
> +		clear_page_offline(page, order);
> +
>  	list_del(&page->lru);
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);

So, this is right in the fast path of the page allocator.  It's a
one-time thing per 2M page, so it's not permanent.

*But* there's both a global spinlock and a firmware call hidden in
clear_page_offline().  That's *GOT* to hurt if you were, for instance,
running a benchmark while this code path is being tickled.

That could be just downright catastrophic for scalability, albeit
temporarily.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] x86/mm: Provide helpers for unaccepted memory
  2021-08-10  6:26 ` [PATCH 4/5] x86/mm: Provide helpers for " Kirill A. Shutemov
@ 2021-08-10 18:16   ` Dave Hansen
  2021-08-12 20:31     ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 18:16 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +	if (!boot_params.unaccepted_memory)
> +		return;
> +
> +	spin_lock(&unaccepted_memory_lock);
> +	__accept_memory(start, end);
> +	spin_unlock(&unaccepted_memory_lock);
> +}

Isn't this taken in the:

	del_page_from_free_list()->
	clear_page_offline()->
	accept_memory()

call path?

That's underneath:

	spin_lock_irqsave(&zone->lock, flags);

Which means that accept_memory() can happen from interrupt context.  Is
it always covered by another spin_lock_irqsave() which means that it can
use a plain spin_lock()?

If so, it would be nice to call out that logic.  It *looks* like a
spinlock that we would want to be spin_lock_irqsave().

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 17:51           ` Kirill A. Shutemov
@ 2021-08-10 18:19             ` Dave Hansen
  2021-08-10 18:39               ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 18:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On 8/10/21 10:51 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
>>> The difference is going to be substantially lower once we get it optimized
>>> properly.
>> What does this mean?  Is this future work in the kernel or somewhere in
>> the TDX hardware/firmware which will speed things up?
> The kernel has to be changed to accept memory in 2M and 1G chunks where
> possible. The interface exists and is described in the spec, but it is not
> yet used in the guest kernel.

From a quick scan of the spec, I only see:

> 7.9.3. Page Acceptance by the Guest TD: TDG.MEM.PAGE.ACCEPT ... The guest
> TD can accept a dynamically added 4KB page using TDG.MEM.PAGE.ACCEPT
> with the page GPA as an input.
Is there some other 2M/1G page-acceptance call that I'm missing?

> It would cut hypercall overhead dramatically. It would make upfront memory
> acceptance more bearable and lower the latency of lazy memory acceptance.
> So I expect the gap to be not 20x, but more like 3-5x (which is still huge).

It would be nice to be able to judge the benefits of this series based
on the final form.  I guess we'll take what we can get, though.

Either way, I'd still like to see some *actual* numbers for at least one
configuration:

	With this series applied, userspace starts to run at X seconds
	after kernel boot.  Without this series, userspace runs at Y
	seconds.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10  6:26 ` [PATCH 2/5] efi/x86: Implement " Kirill A. Shutemov
  2021-08-10 17:50   ` Dave Hansen
@ 2021-08-10 18:30   ` Dave Hansen
  2021-08-10 19:08     ` Kirill A. Shutemov
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 18:30 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> +config UNACCEPTED_MEMORY
> +	bool
> +	depends on EFI_STUB
> +	help
> +	   Some Virtual Machine platforms, such as Intel TDX, introduce
> +	   the concept of memory acceptance, requiring memory to be accepted
> +	   before it can be used by the guest. This protects against a class of
> +	   attacks by the virtual machine platform.
> +
> +	   This option adds support for unaccepted memory and makes such memory
> +	   usable by kernel.

Do we really need a full-blown user-visible option here?  If we, for
instance, just did:

config UNACCEPTED_MEMORY
	bool
	depends on EFI_STUB

it could be 'select'ed from the TDX Kconfig and no users would ever be
bothered with it.  Would a user *ever* turn this on if they don't have
TDX (or equivalent)?
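Concretely, the prompt-less variant described here would look something like
the fragment below; the INTEL_TDX_GUEST symbol name is only a plausible
placeholder for wherever the TDX guest support config lives:

```
config UNACCEPTED_MEMORY
	bool
	depends on EFI_STUB

# Hypothetical TDX guest symbol; the real name/location may differ.
config INTEL_TDX_GUEST
	bool "Intel TDX guest support"
	select UNACCEPTED_MEMORY
```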

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 18:13   ` Dave Hansen
@ 2021-08-10 18:30     ` Andi Kleen
  2021-08-10 18:56       ` Dave Hansen
  2021-08-10 20:50     ` Dave Hansen
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2021-08-10 18:30 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel
  Cc: Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov


> So, this is right in the fast path of the page allocator.  It's a
> one-time thing per 2M page, so it's not permanent.
>
> *But* there's both a global spinlock and a firmware call hidden in
> clear_page_offline().  That's *GOT* to hurt if you were, for instance,
> running a benchmark while this code path is being tickled.
>
> That could be just downright catastrophic for scalability, albeit
> temporarily.

This would be only a short blip at initialization until the system
reaches steady state. So yes it would be temporary, but very short at that.

-Andi


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Implement support for unaccepted memory
  2021-08-10 18:19             ` Dave Hansen
@ 2021-08-10 18:39               ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10 18:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 11:19:30AM -0700, Dave Hansen wrote:
> On 8/10/21 10:51 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 10:36:21AM -0700, Dave Hansen wrote:
> >>> The difference is going to be substantially lower once we get it optimized
> >>> properly.
> >> What does this mean?  Is this future work in the kernel or somewhere in
> >> the TDX hardware/firmware which will speed things up?
> > The kernel has to be changed to accept memory in 2M and 1G chunks where
> > possible. The interface exists and is described in the spec, but it is not
> > yet used in the guest kernel.
> 
> From a quick scan of the spec, I only see:
> 
> > 7.9.3. Page Acceptance by the Guest TD: TDG.MEM.PAGE.ACCEPT ... The guest
> > TD can accept a dynamically added 4KB page using TDG.MEM.PAGE.ACCEPT
> > with the page GPA as an input.
> Is there some other 2M/1G page-acceptance call that I'm missing?

I referred to GHCI[1], section 2.4.7. RDX=0 is 4k, RDX=1 is 2M and
RDX=2 is 1G.

Public specs have mismatches. I hope it will get sorted out soon. :/

[1] https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf

> > It would cut hypercall overhead dramatically. It would make upfront memory
> > acceptance more bearable and lower the latency of lazy memory acceptance.
> > So I expect the gap to be not 20x, but more like 3-5x (which is still huge).
> 
> It would be nice to be able to judge the benefits of this series based
> on the final form.  I guess we'll take what we can get, though.
> 
> Either way, I'd still like to see some *actual* numbers for at least one
> configuration:
> 
> 	With this series applied, userspace starts to run at X seconds
> 	after kernel boot.  Without this series, userspace runs at Y
> 	seconds.

Getting absolute numbers in public for an unreleased product is tricky.
I hoped to get away with ratios or percentages of difference.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 18:30     ` Andi Kleen
@ 2021-08-10 18:56       ` Dave Hansen
  2021-08-10 19:23         ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 18:56 UTC (permalink / raw)
  To: Andi Kleen, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On 8/10/21 11:30 AM, Andi Kleen wrote:
>> So, this is right in the fast path of the page allocator.  It's a
>> one-time thing per 2M page, so it's not permanent.
>>
>> *But* there's both a global spinlock and a firmware call hidden in
>> clear_page_offline().  That's *GOT* to hurt if you were, for instance,
>> running a benchmark while this code path is being tickled.
>>
>> That could be just downright catastrophic for scalability, albeit
>> temporarily.
> 
> This would be only a short blip at initialization until the system
> reaches steady state. So yes it would be temporary, but very short at that.

But it can't be *that* short or we wouldn't be going to all this trouble
in the first place.  This can't simultaneously be both bad enough that
this series exists and minor enough that nobody will notice or care at
runtime.

In general, I'd rather have a system which is running userspace, slowly,
than one where I'm waiting for the kernel.  The trade-off being made is
a *good* trade-off for me.  But, not everyone is going to agree with me.

This also begs the question of how folks know when this "blip" is over.
 Do we have a counter for offline pages?  Is there any way to force page
acceptance?  Or, are we just stuck allocating a bunch of memory to warm
up the system?

How do folks who care about these new blips avoid them?

Again, I don't particularly care about how this affects the
benchmarkers.  But, I do care that they're going to hound us when these
blips start impacting their 99th percentile tail latencies.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10 18:30   ` Dave Hansen
@ 2021-08-10 19:08     ` Kirill A. Shutemov
  2021-08-10 19:19       ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-10 19:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 11:30:42AM -0700, Dave Hansen wrote:
> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> > +config UNACCEPTED_MEMORY
> > +	bool
> > +	depends on EFI_STUB
> > +	help
> > +	   Some Virtual Machine platforms, such as Intel TDX, introduce
> > +	   the concept of memory acceptance, requiring memory to be accepted
> > +	   before it can be used by the guest. This protects against a class of
> > +	   attacks by the virtual machine platform.
> > +
> > +	   This option adds support for unaccepted memory and makes such memory
> > +	   usable by kernel.
> 
> Do we really need a full-blown user-visible option here?  If we, for
> instance, just did:
> 
> config UNACCEPTED_MEMORY
> 	bool
> 	depends on EFI_STUB
> 
> it could be 'select'ed from the TDX Kconfig and no users would ever be
> bothered with it.  Would a user *ever* turn this on if they don't have
> TDX (or equivalent)?

But it's already not user selectable. Note that there's no prompt next to
the "bool". The "help" section is just for documentation. I think it can
be useful.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10 19:08     ` Kirill A. Shutemov
@ 2021-08-10 19:19       ` Dave Hansen
  2021-08-12 21:17         ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 19:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On 8/10/21 12:08 PM, Kirill A. Shutemov wrote:
>>> +config UNACCEPTED_MEMORY
>>> +	bool
>>> +	depends on EFI_STUB
>>> +	help
>>> +	   Some Virtual Machine platforms, such as Intel TDX, introduce
>>> +	   the concept of memory acceptance, requiring memory to be accepted
>>> +	   before it can be used by the guest. This protects against a class of
>>> +	   attacks by the virtual machine platform.
>>> +
>>> +	   This option adds support for unaccepted memory and makes such memory
>>> +	   usable by kernel.
>> Do we really need a full-blown user-visible option here?  If we, for
>> instance, just did:
>>
>> config UNACCEPTED_MEMORY
>> 	bool
>> 	depends on EFI_STUB
>>
>> it could be 'select'ed from the TDX Kconfig and no users would ever be
>> bothered with it.  Would a user *ever* turn this on if they don't have
>> TDX (or equivalent)?
> But it's already not user selectable. Note that there's no prompt next to
> the "bool". The "help" section is just for documentation. I think it can
> be useful.

Ahh, gotcha.  I misread it.  Seems like an odd thing to do, but it's
also fairly widespread in the tree.

Can you even reach that help text from any of the configuration tools?
If you're doing an 'oldconfig', you won't get a prompt to do the "?".
Even in the 'menuconfig' search results, it doesn't display "help" text,
only the "prompt".

BTW, should this text call out that this is for parsing an actual UEFI
feature along with the spec version?  It's not obvious from the text
that "unaccepted memory" really is a UEFI thing as opposed to being some
kernel-only concept.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 18:56       ` Dave Hansen
@ 2021-08-10 19:23         ` Andi Kleen
  2021-08-10 19:46           ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2021-08-10 19:23 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel
  Cc: Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov


> But, not everyone is going to agree with me.

Both the Intel TDX and AMD SEV sides independently came to the opposite 
conclusion. In general people care a lot about the boot time of VM guests.


>
> This also begs the question of how folks know when this "blip" is over.
>   Do we have a counter for offline pages?  Is there any way to force page
> acceptance?  Or, are we just stuck allocating a bunch of memory to warm
> up the system?
>
> How do folks who care about these new blips avoid them?

It's no different from any other warmup. At warmup time you always have 
lots of blips until the working set stabilizes. For example, in 
virtualization the first touch of a new page is usually an EPT violation 
handled by the host. Or in the native case you may need to do IO or free 
memory. Anybody who based their critical latency percentiles on a 
warming-up process would be foolish; the picture would be completely 
distorted.

So the basic operation is adding some overhead, but I don't think 
anything is that unusual compared to the state of the art.

Now perhaps the locking might be a problem if the other operations all 
run in parallel, causing unnecessary serialization. If that's really a 
problem, I guess we can optimize later. I don't think there's anything 
fundamental about the current locking.


-Andi



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 19:23         ` Andi Kleen
@ 2021-08-10 19:46           ` Dave Hansen
  2021-08-10 21:20             ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 19:46 UTC (permalink / raw)
  To: Andi Kleen, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On 8/10/21 12:23 PM, Andi Kleen wrote:
>> But, not everyone is going to agree with me.
> 
> Both the Intel TDX and the AMD SEV side independently came to opposite
> conclusions. In general people care a lot about boot time of VM guests.

I was also saying that getting to userspace fast is important to me.
Almost everyone agrees there.

>> This also begs the question of how folks know when this "blip" is over.
>>   Do we have a counter for offline pages?  Is there any way to force page
>> acceptance?  Or, are we just stuck allocating a bunch of memory to warm
>> up the system?
>>
>> How do folks who care about these new blips avoid them?
> 
> It's no different from any other warmup. At warmup time you always have
> lots of blips until the working set stabilizes. For example, in
> virtualization the first touch of a new page is usually an EPT violation
> handled by the host. Or in the native case you may need to do IO or free
> memory. Anybody who based their critical latency percentiles on a
> warming-up process would be foolish; the picture would be completely
> distorted.
> 
> So the basic operation is adding some overhead, but I don't think
> anything is that unusual compared to the state of the art.

Except that today, you can totally avoid the allocation latency (not
sure about the EPT violation/fill latency) from things like QEMU's
-mem-prealloc.

> Now perhaps the locking might be a problem if the other operations all
> run in parallel, causing unnecessary serialization. If that's really a
> problem I guess we can optimize later. I don't think there's anything
> fundamental about the current locking.

These boot blips are not the biggest issue in the world.  But, it is
fully under the guest's control and I think the guest has some
responsibility to provide *some* mitigation for it.

1. Do background acceptance, as opposed to relying 100% on demand-driven
   acceptance.  Guarantees a limited window in which blips can occur.
2. Do acceptance based on user input, like from sysfs.
3. Add a command-line argument to accept everything up front, or at
   least before userspace runs.
4. Add some statistic for how much unaccepted memory remains.

I can think of at least four ways we could mitigate it.  A sysfs
statistic file would probably take ~30 lines of code to loop over the
bitmap.  A command-line option would probably be <10 lines of code to
just short-circuit the bitmap and accept everything up front.  A file to
force acceptance would probably be pretty quick too.

Nothing there seems too onerous.
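For the statistic file in particular, the loop really is short. A userspace model of it follows; the bitmap layout (one bit per 2MB chunk, set = accepted) and all names are assumptions for the sketch, not the kernel's actual interface:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One bit per 2MB chunk; a set bit means the chunk was accepted. */
#define HUGE_CHUNK_SIZE	(2UL << 20)
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Walk the bitmap and report how many bytes are still unaccepted. */
static unsigned long unaccepted_bytes(const unsigned long *bitmap,
				      size_t nchunks)
{
	unsigned long bytes = 0;
	size_t i;

	for (i = 0; i < nchunks; i++)
		if (!(bitmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD))))
			bytes += HUGE_CHUNK_SIZE;

	return bytes;
}
```

A sysfs show() method would just print this number; the command-line option would be the same loop with an accept call on every clear bit.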

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 18:13   ` Dave Hansen
  2021-08-10 18:30     ` Andi Kleen
@ 2021-08-10 20:50     ` Dave Hansen
  2021-08-12 21:08       ` Kirill A. Shutemov
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-10 20:50 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel
  Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov

On 8/10/21 11:13 AM, Dave Hansen wrote:
>> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>>  	if (page_reported(page))
>>  		__ClearPageReported(page);
>>  
>> +	if (PageOffline(page))
>> +		clear_page_offline(page, order);
>> +
>>  	list_del(&page->lru);
>>  	__ClearPageBuddy(page);
>>  	set_page_private(page, 0);
> So, this is right in the fast path of the page allocator.  It's a
> one-time thing per 2M page, so it's not permanent.
> 
> *But* there's both a global spinlock and a firmware call hidden in
> clear_page_offline().  That's *GOT* to hurt if you were, for instance,
> running a benchmark while this code path is being tickled.  Not just to
> 
> That could be just downright catastrophic for scalability, albeit
> temporarily.

One more thing...

How long are these calls?  You have to make at least 512 calls into the
SEAM module.  Assuming they're syscall-ish, so ~1,000 cycles each,
that's ~500,000 cycles, even if we ignore the actual time it takes to
zero that 2MB worth of memory and all other overhead within the SEAM module.

So, we're sitting on one CPU with interrupts off, blocking all the other
CPUs from doing page allocation in this zone.  Then, we're holding a
global lock which prevents any other NUMA nodes from accepting pages.
If the other node happens to *try* to do an accept, it will sit with its
zone lock held waiting for this one.

Maybe nobody will ever notice.  But, it seems like an awfully big risk
to me.  I'd at least *try* do these calls outside of the zone lock.
Then the collateral damage will at least be limited to things doing
accepts rather than all zone->lock users.

Couldn't we delay the acceptance to, say the place where we've dropped
the zone->lock and do the __GFP_ZERO memset() like at prep_new_page()?
Or is there some concern that the page has been split at that point?

I guess that makes it more complicated because you might have a 4k page
but you need to go accept a 2M page.  You might end up having to check
the bitmap 511 more times because you might see 511 more PageOffline()
pages come through.

You shouldn't even need the bitmap lock to read since it's a one-way
trip from unaccepted->accepted.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 19:46           ` Dave Hansen
@ 2021-08-10 21:20             ` Andi Kleen
  2021-08-12  8:19               ` Joerg Roedel
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2021-08-10 21:20 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Joerg Roedel
  Cc: Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov


> These boot blips are not the biggest issue in the world.  But, it is
> fully under the guest's control and I think the guest has some
> responsibility to provide *some* mitigation for it.

It sounds more like an exercise in preliminary optimization at this 
point. If it's a real problem we can still worry about it later.


> 1. Do background acceptance, as opposed to relying 100% on demand-driven
>     acceptance.  Guarantees a limited window in which blips can occur.

Like Kirill wrote, this was abandoned because it always allocates all 
memory on the host even if the guest doesn't need it.


> 2. Do acceptance based on user input, like from sysfs.

You can easily do that by running "memhog" at boot. No need for anything 
in the kernel.

BTW I believe this is also configurable at the guest BIOS level.

> 3. Add a command-line argument to accept everything up front, or at
>     least before userspace runs.

Same.


> 4. Add some statistic for how much unaccepted memory remains.

Yes that makes sense. We should have statistic counters for all of this.

Also I agree with your suggestion that we should get the slow path out 
of the zone locks/interrupt disable region. That should be easy enough 
and is an obvious improvement.

-Andi


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 21:20             ` Andi Kleen
@ 2021-08-12  8:19               ` Joerg Roedel
  2021-08-12 14:14                 ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Joerg Roedel @ 2021-08-12  8:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On Tue, Aug 10, 2021 at 02:20:08PM -0700, Andi Kleen wrote:
> Also I agree with your suggestion that we should get the slow path out of
> the zone locks/interrupt disable region. That should be easy enough and is
> an obvious improvement.

I also agree that the slow-path needs to be outside of the memory
allocator locks. But I think this conflicts with the concept of
accepting memory in 2MB chunks even if allocation size is smaller.

If some kernel code allocates 2 pages and the allocator path starts to
validate the whole 2MB page the memory is on, there are potential races
to take into account.

Either some other code path allocates memory from that page and returns
it before validation is finished, or we end up with double validation.
Returning unvalidated memory is a guest problem and double validation
will cause security issues for SNP guests.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Impplement support for unaccepted memory
  2021-08-10  6:26 [PATCH 0/5] x86: Impplement support for unaccepted memory Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2021-08-10 14:08 ` [PATCH 0/5] x86: Impplement support for unaccepted memory Dave Hansen
@ 2021-08-12  8:23 ` Joerg Roedel
  2021-08-12 10:10   ` Kirill A. Shutemov
  6 siblings, 1 reply; 49+ messages in thread
From: Joerg Roedel @ 2021-08-12  8:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

Hi Kirill,

On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> Accepting happens via a protocol specific to the Virtual Machine
> platform.

That sentence bothers me a bit. Can you explain what is VMM specific in
the acceptance protocol?

I want to avoid having to implement VMM specific acceptance protocols.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Impplement support for unaccepted memory
  2021-08-12  8:23 ` Joerg Roedel
@ 2021-08-12 10:10   ` Kirill A. Shutemov
  2021-08-12 19:33     ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 10:10 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Andi Kleen, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
> Hi Kirill,
> 
> On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> > Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> 
> That sentence bothers me a bit. Can you explain what is VMM specific in
> the acceptance protocol?

For TDX we have a single MapGPA hypercall to the VMM plus TDAcceptPage for
every accepted page to the TDX Module. SEV-SNP has to do something similar.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-12  8:19               ` Joerg Roedel
@ 2021-08-12 14:14                 ` Dave Hansen
  2021-08-12 20:49                   ` Kirill A. Shutemov
  2021-08-13 14:49                   ` Joerg Roedel
  0 siblings, 2 replies; 49+ messages in thread
From: Dave Hansen @ 2021-08-12 14:14 UTC (permalink / raw)
  To: Joerg Roedel, Andi Kleen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On 8/12/21 1:19 AM, Joerg Roedel wrote:
> On Tue, Aug 10, 2021 at 02:20:08PM -0700, Andi Kleen wrote:
>> Also I agree with your suggestion that we should get the slow path out of
>> the zone locks/interrupt disable region. That should be easy enough and is
>> an obvious improvement.
> 
> I also agree that the slow-path needs to be outside of the memory
> allocator locks. But I think this conflicts with the concept of
> accepting memory in 2MB chunks even if allocation size is smaller.
> 
> Given some kernel code allocated 2 pages and the allocator path starts
> to validate the whole 2MB page the memory is on, then there are
> potential races to take into account.

Yeah, the PageOffline()+PageBuddy() trick breaks down as soon as
PageBuddy() gets cleared.

I'm not 100% sure we need a page flag, though.  Imagine if we just did a
static key check in prep_new_page():

	if (static_key_whatever(tdx_accept_ongoing))
		maybe_accept_page(page, order);

maybe_accept_page() could just check the acceptance bitmap and see if
the 2MB page has been accepted.  If so, just return.  If not, take the
bitmap lock, accept the 2MB page, then mark the bitmap.

maybe_accept_page()
{
	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;

	/* Test the bit before taking any locks: */
	if (test_bit(huge_pfn, &accepted_bitmap))
		return;

	spin_lock_irq();
	/* Retest inside the lock; accept and mark only if still clear: */
	if (!test_bit(huge_pfn, &accepted_bitmap)) {
		tdx_accept_page(page, PMD_SIZE);
		set_bit(huge_pfn, &accepted_bitmap);
	}
	spin_unlock_irq();
}

That's still not great.  It's still a global lock and the lock is still
held for quite a while because that accept isn't going to be lightning
fast.  But, at least it's not holding any *other* locks.  It also
doesn't take any locks in the fast path where the 2MB page was already
accepted.

The locking could be more fine-grained, for sure.  The bitmap could, for
instance, have a lock bit too.  Or we could just have an array of locks
and hash the huge_pfn to find a lock given a huge_pfn.  But, for now, I
think it's fine to just keep the global lock.
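The hashed-lock variant is easy to sketch in userspace. In the model below, the slot count and hash constant are arbitrary choices for illustration, not kernel values:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ACCEPT_NR_LOCKS 64

/* Zero-initialized mutexes; fine as a model, glibc treats this as unlocked. */
static pthread_mutex_t accept_locks[ACCEPT_NR_LOCKS];

/* Hash a 2MB-granule pfn to a lock slot; equal pfns always collide. */
static size_t accept_lock_index(unsigned long huge_pfn)
{
	/* Fibonacci hashing spreads adjacent pfns across the slots. */
	unsigned long long h = (unsigned long long)huge_pfn * 0x9E3779B97F4A7C15ULL;

	return (size_t)(h >> 58);	/* top 6 bits: 0..63 */
}

static pthread_mutex_t *accept_lock_for(unsigned long huge_pfn)
{
	return &accept_locks[accept_lock_index(huge_pfn)];
}
```

Acceptors of different 2MB chunks would then usually take different locks, while two acceptors racing on the same chunk always serialize on the same one.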

> Either some other code path allocates memory from that page and returns
> it before validation is finished or we end up with double validation.
> Returning unvalidated memory is a guest-problem and double validation
> will cause security issues for SNP guests.

Yeah, I think the *canonical* source of information for accepts is the
bitmap.  The page flags and any static keys or whatever are
less-canonical sources that tell you when you _might_ need to consult
the bitmap.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Impplement support for unaccepted memory
  2021-08-12 10:10   ` Kirill A. Shutemov
@ 2021-08-12 19:33     ` Andi Kleen
  2021-08-12 20:22       ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2021-08-12 19:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Joerg Roedel
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Kuppuswamy Sathyanarayanan, David Rientjes,
	Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
	Paolo Bonzini, Ingo Molnar, Varad Gautam, Dario Faggioli, x86,
	linux-mm, linux-coco, linux-kernel, Kirill A. Shutemov


On 8/12/2021 3:10 AM, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
>> Hi Kirill,
>>
>> On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
>>> Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>> That sentence bothers me a bit. Can you explain what is VMM specific in
>> the acceptance protocol?
> For TDX we have a single MapGPA hypercall to the VMM plus TDAcceptPage for
> every accepted page to the TDX Module. SEV-SNP has to do something similar.


I think Joerg's question was if TDX has a single ABI for all 
hypervisors. The GHCI specification supports both hypervisor specific 
and hypervisor agnostic calls. But these basic operations like MapGPA 
are all hypervisor agnostic. The only differences would be in the 
existing hypervisor specific PV code.


-Andi


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Impplement support for unaccepted memory
  2021-08-12 19:33     ` Andi Kleen
@ 2021-08-12 20:22       ` Kirill A. Shutemov
  2021-08-13 14:56         ` Joerg Roedel
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 20:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Joerg Roedel, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Aug 12, 2021 at 12:33:11PM -0700, Andi Kleen wrote:
> 
> On 8/12/2021 3:10 AM, Kirill A. Shutemov wrote:
> > On Thu, Aug 12, 2021 at 10:23:24AM +0200, Joerg Roedel wrote:
> > > Hi Kirill,
> > > 
> > > On Tue, Aug 10, 2021 at 09:26:21AM +0300, Kirill A. Shutemov wrote:
> > > > Accepting happens via a protocol specific to the Virtual Machine
> > > > platform.
> > > That sentence bothers me a bit. Can you explain what is VMM specific in
> > > the acceptance protocol?
> > For TDX we have a single MapGPA hypercall to the VMM plus TDAcceptPage for
> > every accepted page to the TDX Module. SEV-SNP has to do something similar.
> 
> 
> I think Joerg's question was if TDX has a single ABI for all hypervisors.
> The GHCI specification supports both hypervisor specific and hypervisor
> agnostic calls. But these basic operations like MapGPA are all hypervisor
> agnostic. The only differences would be in the existing hypervisor specific
> PV code.

My point was that TDX and SEV-SNP are going to be different and we need a way
to hook the right protocol for each.
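One plausible shape for such a hook is the usual struct-of-function-pointers pattern. Everything below (type names, signatures, the idea that selection happens once at boot) is a hypothetical sketch, not the actual patchset interface:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long phys_addr_t;

/* Per-platform acceptance protocol behind one ops table. */
struct accept_ops {
	const char *name;
	int (*accept_memory)(phys_addr_t start, phys_addr_t end);
};

static int tdx_accept(phys_addr_t start, phys_addr_t end)
{
	(void)start; (void)end;
	/* Would issue MapGPA plus TDAcceptPage per accepted page. */
	return 0;
}

static int snp_accept(phys_addr_t start, phys_addr_t end)
{
	(void)start; (void)end;
	/* Would issue the SEV-SNP page-validation sequence. */
	return 0;
}

static const struct accept_ops tdx_ops = { "tdx", tdx_accept };
static const struct accept_ops snp_ops = { "sev-snp", snp_accept };

/* Chosen once at boot, based on the detected platform. */
static const struct accept_ops *accept_ops;

static int accept_memory_range(phys_addr_t start, phys_addr_t end)
{
	return accept_ops ? accept_ops->accept_memory(start, end) : -1;
}
```

Core mm code would then call accept_memory_range() without knowing which protocol is behind it.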

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 4/5] x86/mm: Provide helpers for unaccepted memory
  2021-08-10 18:16   ` Dave Hansen
@ 2021-08-12 20:31     ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 20:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On Tue, Aug 10, 2021 at 11:16:26AM -0700, Dave Hansen wrote:
> On 8/9/21 11:26 PM, Kirill A. Shutemov wrote:
> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	if (!boot_params.unaccepted_memory)
> > +		return;
> > +
> > +	spin_lock(&unaccepted_memory_lock);
> > +	__accept_memory(start, end);
> > +	spin_unlock(&unaccepted_memory_lock);
> > +}
> 
> Isn't this taken in the:
> 
> 	del_page_from_free_list()->
> 	clear_page_offline()->
> 	accept_memory()
> 
> call path?
> 
> That's underneath:
> 
> 	spin_lock_irqsave(&zone->lock, flags);
> 
> Which means that accept_memory() can happen from interrupt context.  Is
> it always covered by another spin_lock_irqsave() which means that it can
> use a plain spin_lock()?

I didn't give it enough thought yet, but we always run under the zone lock,
which has to use spin_lock_irqsave() if it is called from interrupt context.

Having said that, I think it is a good idea to move clear_page_offline() out
of the zone lock. It should help with allocation latency. Not sure how messy
it gets. The merging/splitting path looks complex and I'm not an expert in
the page allocator.

> If so, it would be nice to call out that logic.  It *looks* like a
> spinlock that we would want to be spin_lock_irqsave().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 15:21       ` David Hildenbrand
@ 2021-08-12 20:34         ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 20:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 05:21:48PM +0200, David Hildenbrand wrote:
> On 10.08.21 17:02, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 09:48:04AM +0200, David Hildenbrand wrote:
> > > On 10.08.21 08:26, Kirill A. Shutemov wrote:
> > > > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > > > some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > > > require memory to be accepted before it can be used by the guest.
> > > > Accepting happens via a protocol specific to the Virtual Machine
> > > > platform.
> > > > 
> > > > Accepting memory is costly and it makes the VMM allocate memory for the
> > > > accepted guest physical address range. It's better to postpone memory
> > > > acceptance until the memory is needed. It lowers boot time and reduces
> > > > memory overhead.
> > > > 
> > > > Support of such memory requires a few changes in core-mm code:
> > > > 
> > > >     - memblock has to accept memory on allocation;
> > > > 
> > > >     - page allocator has to accept memory on the first allocation of the
> > > >       page;
> > > > 
> > > > Memblock change is trivial.
> > > > 
> > > > Page allocator is modified to accept pages on the first allocation.
> > > > PageOffline() is used to indicate that the page requires acceptance.
> > > > The flag is currently used by hotplug and balloon. Such pages are not
> > > > available to the page allocator.
> > > > 
> > > > An architecture has to provide three helpers if it wants to support
> > > > unaccepted memory:
> > > > 
> > > >    - accept_memory() makes a range of physical addresses accepted.
> > > > 
> > > >    - maybe_set_page_offline() marks a page PageOffline() if it requires
> > > >      acceptance. Used during boot to put pages on free lists.
> > > > 
> > > >    - clear_page_offline() makes a page accepted and clears
> > > >      PageOffline().
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > ---
> > > >    mm/internal.h   | 14 ++++++++++++++
> > > >    mm/memblock.c   |  1 +
> > > >    mm/page_alloc.c | 13 ++++++++++++-
> > > >    3 files changed, 27 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 31ff935b2547..d2fc8a17fbe0 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
> > > >    int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> > > >    		      unsigned long addr, int page_nid, int *flags);
> > > > +#ifndef CONFIG_UNACCEPTED_MEMORY
> > > > +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
> > > > +{
> > > > +}
> > > > +
> > > > +static inline void clear_page_offline(struct page *page, unsigned int order)
> > > > +{
> > > > +}
> > > > +
> > > > +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> > > > +{
> > > > +}
> > > 
> > > Can we find better fitting names for the first two? The function names are
> > > way too generic. For example:
> > > 
> > > accept_or_set_page_offline()
> > > 
> > > accept_and_clear_page_offline()
> > 
> > Sounds good.
> > 
> > > I thought for a second if
> > > 	PAGE_TYPE_OPS(Unaccepted, offline)
> > > makes sense as well, not sure.
> > 
> > I find Offline fitting the situation. Don't see a reason to add more
> > terminology here.
> > 
> > > Also, please update the description of PageOffline in page-flags.h to
> > > include the additional usage with PageBuddy set at the same time.
> > 
> > Okay.
> > 
> > > I assume you don't have to worry about page_offline_freeze/thaw ... as we
> > > only set PageOffline initially, but not later at runtime when other
> > > subsystems (/proc/kcore) might stumble over it.
> > 
> > I think so, but I would need to look at this code once again.
> > 
> 
> Another thing to look into would be teaching makedumpfile via vmcoreinfo
> about these special buddy pages:
> 
> makedumpfile will naturally skip all PageOffline pages and skip PageBuddy
> pages if requested to skip free pages. It detects these pages via the
> mapcount value. You will want makedumpfile to treat them like PageOffline
> pages: kernel/crash_core.c
> 
> #define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
> VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
> 
> #define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
> VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
> 
> We could export PAGE_BUDDY_OFFLINE_MAPCOUNT_VALUE or just compute it inside
> makedumpfile from the other two values.

Thanks for digging it up. I'll look into makedumpfile, but it's not on
top of my todo list, so may take a while.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-12 14:14                 ` Dave Hansen
@ 2021-08-12 20:49                   ` Kirill A. Shutemov
  2021-08-12 20:59                     ` Dave Hansen
  2021-08-13 14:49                   ` Joerg Roedel
  1 sibling, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 20:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Joerg Roedel, Andi Kleen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
> On 8/12/21 1:19 AM, Joerg Roedel wrote:
> > On Tue, Aug 10, 2021 at 02:20:08PM -0700, Andi Kleen wrote:
> >> Also I agree with your suggestion that we should get the slow path out of
> >> the zone locks/interrupt disable region. That should be easy enough and is
> >> an obvious improvement.
> > 
> > I also agree that the slow-path needs to be outside of the memory
> > allocator locks. But I think this conflicts with the concept of
> > accepting memory in 2MB chunks even if allocation size is smaller.
> > 
> > Given some kernel code allocated 2 pages and the allocator path starts
> > to validate the whole 2MB page the memory is on, then there are
> > potential races to take into account.
> 
> Yeah, the PageOffline()+PageBuddy() trick breaks down as soon as
> PageBuddy() gets cleared.
> 
> I'm not 100% sure we need a page flag, though.  Imagine if we just did a
> static key check in prep_new_page():
> 
> 	if (static_key_whatever(tdx_accept_ongoing))
> 		maybe_accept_page(page, order);
> 
> maybe_accept_page() could just check the acceptance bitmap and see if
> the 2MB page has been accepted.  If so, just return.  If not, take the
> bitmap lock, accept the 2MB page, then mark the bitmap.
> 
> maybe_accept_page()
> {
> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
> 
> 	/* Test the bit before taking any locks: */
> 	if (test_bit(huge_pfn, &accepted_bitmap))
> 		return;
> 
> 	spin_lock_irq();
> 	/* Retest inside the lock; accept and mark only if still clear: */
> 	if (!test_bit(huge_pfn, &accepted_bitmap)) {
> 		tdx_accept_page(page, PMD_SIZE);
> 		set_bit(huge_pfn, &accepted_bitmap);
> 	}
> 	spin_unlock_irq();
> }
> 
> That's still not great.  It's still a global lock and the lock is still
> held for quite a while because that accept isn't going to be lightning
> fast.  But, at least it's not holding any *other* locks.  It also
> doesn't take any locks in the fast path where the 2MB page was already
> accepted.

I expect the cache line with the bitmap to bounce around during warm-up. One
cache line covers a gig of RAM.

It's also not clear at all at what point the static key has to be
switched. We don't have any obvious point where we can say we are done
with accepting (unless you advocate for proactive accepting).

I like PageOffline() because we only need to consult the cache line the
page allocator already has in hand before looking into the bitmap.

> The locking could be more fine-grained, for sure.  The bitmap could, for
> instance, have a lock bit too.  Or we could just have an array of locks
> and hash the huge_pfn to find a lock given a huge_pfn.  But, for now, I
> think it's fine to just keep the global lock.
> 
> > Either some other code path allocates memory from that page and returns
> > it before validation is finished or we end up with double validation.
> > Returning unvalidated memory is a guest-problem and double validation
> > will cause security issues for SNP guests.
> 
> Yeah, I think the *canonical* source of information for accepts is the
> bitmap.  The page flags and any static keys or whatever are
> less-canonical sources that tell you when you _might_ need to consult
> the bitmap.

Right.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-12 20:49                   ` Kirill A. Shutemov
@ 2021-08-12 20:59                     ` Dave Hansen
  2021-08-12 21:23                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Hansen @ 2021-08-12 20:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Joerg Roedel, Andi Kleen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On 8/12/21 1:49 PM, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
>> On 8/12/21 1:19 AM, Joerg Roedel wrote:
>> maybe_accept_page()
>> {
>> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
>>
>> 	/* Test the bit before taking any locks: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap))
>> 		return;
>>
>> 	spin_lock_irq();
>> 	/* Retest inside the lock; accept and mark only if still clear: */
>> 	if (!test_bit(huge_pfn, &accepted_bitmap)) {
>> 		tdx_accept_page(page, PMD_SIZE);
>> 		set_bit(huge_pfn, &accepted_bitmap);
>> 	}
>> 	spin_unlock_irq();
>> }
>>
>> That's still not great.  It's still a global lock and the lock is still
>> held for quite a while because that accept isn't going to be lightning
>> fast.  But, at least it's not holding any *other* locks.  It also
>> doesn't take any locks in the fast path where the 2MB page was already
>> accepted.
> 
> I expect a cache line with bitmap to bounce around during warm up. One
> cache line covers a gig of RAM.

The bitmap bouncing around isn't going to really matter when you have a
global lock protecting it from writes.

> It's also not clear at all at what point the static key has to be
> switched. We don't have any obvious point where we can say we are done
> with accepting (unless you advocate for proactive accepting).

Two easy options:
1. Run over the bitmap and count the bits left.  That can be done
   outside the lock even.
2. Keep a counter of the number of bits set in the bitmap.

> I like PageOffline() because we only need to consult the cache line the
> page allocator already has in hand before looking into the bitmap.

I like it too.  But, it's really nasty if the value is only valid deep
in the allocator.

We could keep the PageOffline() thing and then move it to some other
field in 'struct page' that's only valid between ClearPageOffline() and
prep_new_page().   Some magic value that says: "This page has not yet
been accepted, you better check the bitmap."  Something like:

	if (TestClearPageOffline(page))
		page->private = 0xdeadbeef;

and then check page->private in prep_new_page(). There should be plenty
of 'struct page' space to hijack in that small window.
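A userspace model of that handoff, with the magic value and the helper names purely illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define UNACCEPTED_MAGIC 0xdeadbeefUL

/* Minimal stand-in for struct page: only the fields the trick uses. */
struct page_model {
	bool offline;		/* stands in for PageOffline() */
	unsigned long private;	/* stands in for page->private */
};

/* Free-list side: clear "offline" and stash the magic for later. */
static void mark_needs_accept(struct page_model *page)
{
	if (page->offline) {		/* TestClearPageOffline() */
		page->offline = false;
		page->private = UNACCEPTED_MAGIC;
	}
}

/* prep_new_page() side: consult the bitmap only when the magic is set. */
static bool needs_accept_check(struct page_model *page)
{
	if (page->private != UNACCEPTED_MAGIC)
		return false;

	page->private = 0;
	return true;	/* caller would now check the acceptance bitmap */
}
```

The point of the model: the information survives only across the short window between the free list and prep_new_page(), so nothing else should ever see the magic value.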

BTW, I was going to actually try and hack something up, but I got
annoyed that your patches don't apply upstream and gave up.  A git tree
with all of the dependencies would be nice. <hint, hint>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-10 20:50     ` Dave Hansen
@ 2021-08-12 21:08       ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 21:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On Tue, Aug 10, 2021 at 01:50:57PM -0700, Dave Hansen wrote:
> On 8/10/21 11:13 AM, Dave Hansen wrote:
> >> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> >>  	if (page_reported(page))
> >>  		__ClearPageReported(page);
> >>  
> >> +	if (PageOffline(page))
> >> +		clear_page_offline(page, order);
> >> +
> >>  	list_del(&page->lru);
> >>  	__ClearPageBuddy(page);
> >>  	set_page_private(page, 0);
> > So, this is right in the fast path of the page allocator.  It's a
> > one-time thing per 2M page, so it's not permanent.
> > 
> > *But* there's both a global spinlock and a firmware call hidden in
> > clear_page_offline().  That's *GOT* to hurt if you were, for instance,
> > running a benchmark while this code path is being tickled.  Not just to
> > 
> > That could be just downright catastrophic for scalability, albeit
> > temporarily.
> 
> One more thing...
> 
> How long are these calls?  You have to make at least 512 calls into the
> SEAM module.  Assuming they're syscall-ish, so ~1,000 cycles each,
> that's ~500,000 cycles, even if we ignore the actual time it takes to
> zero that 2MB worth of memory and all other overhead within the SEAM module.

I hope to get away with 2 calls per 2M: one MapGPA and one TDACCEPTPAGE
(or 3 for MAX_ORDER -- 4M -- pages). I don't have any numbers yet.

> So, we're sitting on one CPU with interrupts off, blocking all the other
> CPUs from doing page allocation in this zone. 

I agree that's not good. Let's see if it's going to be okay with accepting
in 2M chunks.

> Then, we're holding a global lock which prevents any other NUMA nodes
> from accepting pages.

Looking at this again, the global lock is avoidable: the caller owns the
pfn range, so nobody else can touch these bits in the bitmap. We can replace
bitmap_clear() with an atomic clear_bit() loop and drop the lock completely.
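A rough sketch of that lock-free variant -- userspace code using compiler atomics as a stand-in for the kernel's atomic clear_bit(); names are illustrative:

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Stand-in for the kernel's clear_bit(), which is likewise atomic. */
static void clear_bit_atomic(unsigned long nr, unsigned long *bitmap)
{
	__atomic_fetch_and(&bitmap[nr / BITS_PER_LONG],
			   ~(1UL << (nr % BITS_PER_LONG)), __ATOMIC_RELAXED);
}

/* The caller owns the pfn range, so no other CPU touches these bits:
 * a loop of atomic per-bit clears needs no global lock at all. */
static void clear_unaccepted_range(unsigned long *bitmap,
				   unsigned long start_bit, unsigned long nbits)
{
	unsigned long i;

	for (i = 0; i < nbits; i++)
		clear_bit_atomic(start_bit + i, bitmap);
}
```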

> If the other node happens to *try* to do an
> accept, it will sit with its zone lock held waiting for this one.

> Maybe nobody will ever notice.  But, it seems like an awfully big risk
> to me.  I'd at least *try* do these calls outside of the zone lock.
> Then the collateral damage will at least be limited to things doing
> accepts rather than all zone->lock users.
> 
> Couldn't we delay the acceptance to, say the place where we've dropped
> the zone->lock and do the __GFP_ZERO memset() like at prep_new_page()?
> Or is there some concern that the page has been split at that point?

It *will* be split by that point. For example, if you ask for an order-0
page and there are none left, the page allocator will try higher orders
until it finds something. At order-9 it would hit the unaccepted page. At
that point the page gets split and put on the free lists accordingly.
That all happens under the zone lock:

  __rmqueue_smallest ->
    del_page_from_free_list()
    expand()

> I guess that makes it more complicated because you might have a 4k page
> but you need to go accept a 2M page.  You might end up having to check
> the bitmap 511 more times because you might see 511 more PageOffline()
> pages come through.
> 
> You shouldn't even need the bitmap lock to read since it's a one-way
> trip from unaccepted->accepted.

Yeah. Unless we want to flip it back when making the range shared.
I think we do. Otherwise it will cause problems for kexec.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10 17:50   ` Dave Hansen
@ 2021-08-12 21:14     ` Kirill A. Shutemov
  2021-08-12 21:43       ` Dave Hansen
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 21:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On Tue, Aug 10, 2021 at 10:50:33AM -0700, Dave Hansen wrote:
> ...
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 num)
> > +{
> 
> Some of these interfaces like accept_memory() take a start/end physical
> address.  Having this take a "num pages" is bound to cause confusion.
> Could you make these all consistently take start/end physical addresses?

Okay.

> 
> > +	u64 end = start + num * PAGE_SIZE;
> > +	unsigned int npages;
> 
> 
> Could you comment those, please?
> 
> 	/*
> 	 * The accepted memory bitmap only works at PMD_SIZE
> 	 * granularity.  If a request comes in to mark memory
> 	 * as unaccepted which is not PMD_SIZE-aligned, simply
> 	 * accept the memory now since it can not be *marked* as
> 	 * unaccepted.
> 	 */
> 
> Then go on to comment the three cases:
> 
> 	/* Check for ranges which do not span a whole PMD_SIZE area: */

Okay.

> > +	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> > +		npages = (end - start) / PAGE_SIZE;
> > +		__accept_memory(start, start + npages * PAGE_SIZE);
> > +		return;
> > +	}
> 
> Hmm, is it possible to have this case hit, but neither of the two below
> cases?  This seems to be looking for a case where the range is somehow
> entirely contained in one PMD_SIZE area, but where it doesn't consume a
> whole area.
> 
> Wouldn't that mean that 'start' or 'end' must be unaligned?

The problem is that if both of them are unaligned, the round_up() and
round_down() in the cases below would step outside the requested range.

> > +	if (start & ~PMD_MASK) {
> > +		npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
> > +		__accept_memory(start, start + npages * PAGE_SIZE);
> > +		start = round_up(start, PMD_SIZE);
> > +	}
> > +
> > +	if (end & ~PMD_MASK) {
> > +		npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
> > +		end = round_down(end, PMD_SIZE);
> > +		__accept_memory(end, end + npages * PAGE_SIZE);
> > +	}
> > +	npages = (end - start) / PMD_SIZE;
> > +	bitmap_set((unsigned long *)params->unaccepted_memory,
> > +		   start / PMD_SIZE, npages);
> > +}
> 
> Even though it's changed right there, it's a bit cruel to change the
> units of 'npages' right in the middle of a function.  It's just asking
> for bugs.
> 
> It would only take a single extra variable declaration to make this
> unambiguous:
> 
> 	u64 nr_unaccepted_bits;
> 
> or something, then you can do:
> 
> 	nr_unaccepted_bits = (end - start) / PMD_SIZE;
> 	bitmap_set((unsigned long *)params->unaccepted_memory,
> 		   start / PMD_SIZE, nr_unaccepted_bits);

Okay.

> 
> ...
> >  static efi_status_t allocate_e820(struct boot_params *params,
> > +				  struct efi_boot_memmap *map,
> >  				  struct setup_data **e820ext,
> >  				  u32 *e820ext_size)
> >  {
> > -	unsigned long map_size, desc_size, map_key;
> >  	efi_status_t status;
> > -	__u32 nr_desc, desc_version;
> > -
> > -	/* Only need the size of the mem map and size of each mem descriptor */
> > -	map_size = 0;
> > -	status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
> > -			     &desc_size, &desc_version);
> > -	if (status != EFI_BUFFER_TOO_SMALL)
> > -		return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
> 
> I noticed that there's no reference to EFI_BUFFER_TOO_SMALL in the hunks
> you added back.  That makes me a bit nervous that this is going to
> unintentionally change behavior.
> 
> It might be worth having a preparatory reorganization patch for
> allocate_e820() before this new feature is added to make this more clear.

Okay. Will do.
> 
> > +	__u32 nr_desc;
> > +	bool unaccepted_memory_present = false;
> > +	u64 max_addr = 0;
> > +	int i;
> >  
> > -	nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
> > +	status = efi_get_memory_map(map);
> > +	if (status != EFI_SUCCESS)
> > +		return status;
> >  
> > -	if (nr_desc > ARRAY_SIZE(params->e820_table)) {
> > -		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
> > +	nr_desc = *map->map_size / *map->desc_size;
> > +	if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
> > +		u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) -
> > +			EFI_MMAP_NR_SLACK_SLOTS;
> >  
> >  		status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
> >  		if (status != EFI_SUCCESS)
> >  			return status;
> >  	}
> >  
> > +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +		return EFI_SUCCESS;
> > +
> > +	/* Check if there's any unaccepted memory and find the max address */
> > +	for (i = 0; i < nr_desc; i++) {
> > +		efi_memory_desc_t *d;
> > +
> > +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> > +		if (d->type == EFI_UNACCEPTED_MEMORY)
> > +			unaccepted_memory_present = true;
> > +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> > +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> > +	}
> 
> This 'max_addr' variable looks a bit funky.
> 
> It *seems* like it's related only to EFI_UNACCEPTED_MEMORY, but it's not
> underneath the EFI_UNACCEPTED_MEMORY check.  Is this somehow assuming
> that once unaccepted memory as been found that *all* memory found in
> later descriptors at higher addresses is also going to be unaccepted?

You got it right below :P

> > +	/*
> > +	 * If unaccepted memory present allocate a bitmap to track what memory
> > +	 * has to be accepted before access.
> > +	 *
> > +	 * One bit in the bitmap represents 2MiB in the address space: one 4k
> > +	 * page is enough to track 64GiB or physical address space.
> > +	 *
> > +	 * In the worst case scenario -- a huge hole in the middle of the
> > +	 * address space -- we would need 256MiB to handle 4PiB of the address
> > +	 * space.
> > +	 *
> > +	 * TODO: handle situation if params->unaccepted_memory has already set.
> > +	 * It's required to deal with kexec.
> > +	 */
> > +	if (unaccepted_memory_present) {
> > +		unsigned long *unaccepted_memory = NULL;
> > +		u64 size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> 
> Oh, so the bitmap has to be present for *all* memory, not just
> unaccepted memory.  So, we really do need to know the 'max_addr' so that
> we can allocate the bitmap for so that can be marked in the bitmap has
> having been accepted.

Right, we need a bit for every 2M, accepted or not.
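The sizing arithmetic from the quoted comment can be checked quickly -- a standalone sketch assuming PMD_SIZE is 2MiB, mirroring the DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE) expression above:

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE	(2ULL << 20)	/* 2 MiB covered per bitmap bit */
#define BITS_PER_BYTE	8

/* Bytes of bitmap needed to cover [0, max_addr):
 * one byte covers PMD_SIZE * 8 = 16 MiB of address space. */
static uint64_t bitmap_bytes(uint64_t max_addr)
{
	uint64_t per_byte = PMD_SIZE * BITS_PER_BYTE;

	return (max_addr + per_byte - 1) / per_byte;	/* DIV_ROUND_UP */
}
```

This confirms the two figures in the comment: one 4k page of bitmap per 64GiB of address space, and 256MiB of bitmap for a 4PiB address space.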

> > +		status = efi_allocate_pages(size,
> > +					    (unsigned long *)&unaccepted_memory,
> > +					    ULONG_MAX);
> > +		if (status != EFI_SUCCESS)
> > +			return status;
> > +		memset(unaccepted_memory, 0, size);
> > +		params->unaccepted_memory = (u64)unaccepted_memory;
> > +	}
> 
> It might be nice to refer to setup_e820() here to mention that it is the
> thing that actually fills out the bitmap.

Okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-10 19:19       ` Dave Hansen
@ 2021-08-12 21:17         ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 21:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel

On Tue, Aug 10, 2021 at 12:19:22PM -0700, Dave Hansen wrote:
> On 8/10/21 12:08 PM, Kirill A. Shutemov wrote:
> >>> +config UNACCEPTED_MEMORY
> >>> +	bool
> >>> +	depends on EFI_STUB
> >>> +	help
> >>> +	   Some Virtual Machine platforms, such as Intel TDX, introduce
> >>> +	   the concept of memory acceptance, requiring memory to be accepted
> >>> +	   before it can be used by the guest. This protects against a class of
> >>> +	   attacks by the virtual machine platform.
> >>> +
> >>> +	   This option adds support for unaccepted memory and makes such memory
> >>> +	   usable by kernel.
> >> Do we really need a full-blown user-visible option here?  If we, for
> >> instance, just did:
> >>
> >> config UNACCEPTED_MEMORY
> >> 	bool
> >> 	depends on EFI_STUB
> >>
> >> it could be 'select'ed from the TDX Kconfig and no users would ever be
> >> bothered with it.  Would a user *ever* turn this on if they don't have
> >> TDX (or equivalent)?
> > But it's already not user selectable. Note that there's no prompt next to
> > the "bool". The "help" section is just for documentation. I think it can
> > be useful.
> 
> Ahh, gotcha.  I misread it.  Seems like an odd thing to do, but it's
> also fairly widespread in the tree.
> 
> Can you even reach that help text from any of the configuration tools?
> If you're doing an 'oldconfig', you won't get a prompt to do the "?".
> Even in the 'menuconfig' search results, it doesn't display "help" text,
> only the "prompt".

I don't know how to get a tool to show the text, but my vim sees it just fine :P

> BTW, should this text call out that this is for parsing an actual UEFI
> feature along with the spec version?  It's not obvious from the text
> that "unaccepted memory" really is a UEFI thing as opposed to being some
> kernel-only concept.

Okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-12 20:59                     ` Dave Hansen
@ 2021-08-12 21:23                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2021-08-12 21:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Joerg Roedel, Andi Kleen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Aug 12, 2021 at 01:59:01PM -0700, Dave Hansen wrote:
> On 8/12/21 1:49 PM, Kirill A. Shutemov wrote:
> > On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
> >> On 8/12/21 1:19 AM, Joerg Roedel wrote:
> >> maybe_accept_page()
> >> {
> >> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
> >>
> >> 	/* Test the bit before taking any locks: */
> >> 	if (test_bit(huge_pfn, &accepted_bitmap))
> >> 		return;
> >>
> >> 	spin_lock_irq();
> >> 	/* Retest inside the lock: */
> >> 	if (test_bit(huge_pfn, &accepted_bitmap))
> >> 		return;
> >> 	tdx_accept_page(page, PMD_SIZE);
> >> 	set_bit(huge_pfn, &accepted_bitmap));
> >> 	spin_unlock_irq();
> >> }
> >>
> >> That's still not great.  It's still a global lock and the lock is still
> >> held for quite a while because that accept isn't going to be lightning
> >> fast.  But, at least it's not holding any *other* locks.  It also
> >> doesn't take any locks in the fast path where the 2MB page was already
> >> accepted.
> > 
> > I expect a cache line with bitmap to bounce around during warm up. One
> > cache line covers a gig of RAM.
> 
> The bitmap bouncing around isn't going to really matter when you have a
> global lock protecting it from writes.

The idea with a static key would not work if we mark shared memory as
unaccepted there.

> > It's also not clear at all at what point the static key has to be
> > switched. We don't have any obvious point where we can say we are done
> > with accepting (unless you advocate for proactive accepting).
> 
> Two easy options:
> 1. Run over the bitmap and count the bits left.  That can be done
>    outside the lock, even.
> 2. Keep a counter of the number of bits set in the bitmap.
> 
> > I like PageOffline() because we only need to consult the cache page
> > allocator already have in hands before looking into bitmap.
> 
> I like it too.  But, it's really nasty if the value is only valid deep
> in the allocator.
> 
> We could keep the PageOffline() thing and then move it to some other
> field in 'struct page' that's only valid between ClearPageOffline() and
> prep_new_page().   Some magic value that says: "This page has not yet
> been accepted, you better check the bitmap."  Something like:
> 
> 	if (TestClearPageOffline(page))
> 		page->private = 0xdeadbeef;
> 
> and then check page->private in prep_new_page(). There should be plenty
> of 'struct page' space to hijack in that small window.

PageOffline() is encoded in mapcount, and check_new_page_bad() would
complain if mapcount is not -1.

> BTW, I was going to actually try and hack something up, but I got
> annoyed that your patches don't apply upstream and gave up.  A git tree
> with all of the dependencies would be nice. <hint, hint>

Okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] efi/x86: Implement support for unaccepted memory
  2021-08-12 21:14     ` Kirill A. Shutemov
@ 2021-08-12 21:43       ` Dave Hansen
  0 siblings, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2021-08-12 21:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Borislav Petkov, Andy Lutomirski, Sean Christopherson,
	Andrew Morton, Joerg Roedel, Andi Kleen,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On 8/12/21 2:14 PM, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 10:50:33AM -0700, Dave Hansen wrote:
>>> +	if ((start & PMD_MASK) == (end & PMD_MASK)) {
>>> +		npages = (end - start) / PAGE_SIZE;
>>> +		__accept_memory(start, start + npages * PAGE_SIZE);
>>> +		return;
>>> +	}
>>
>> Hmm, is it possible to have this case hit, but neither of the two below
>> cases?  This seems to be looking for a case where the range is somehow
>> entirely contained in one PMD_SIZE area, but where it doesn't consume a
>> whole area.
>>
>> Wouldn't that mean that 'start' or 'end' must be unaligned?
> 
> The problem is that if both of them are unaligned, the round_up() and
> round_down() in the cases below would step outside the requested range.

Ahh, got it.

You might want to add some comments like:

	/* Immediately accept whole thing if within a PMD_SIZE block: */

	/* Immediately accept a <PMD_SIZE piece at the start: */

	/* Immediately accept a <PMD_SIZE piece at the end: */

	/* Mark full PMD_SIZE areas so they can be accepted later */

To the three if statements and the bitmap_set().

After looking at this, I do think you probably did this the simplest way
possible.  It just needs a little help.

>>> +	if (start & ~PMD_MASK) {
>>> +		npages = (round_up(start, PMD_SIZE) - start) / PAGE_SIZE;
>>> +		__accept_memory(start, start + npages * PAGE_SIZE);
>>> +		start = round_up(start, PMD_SIZE);
>>> +	}
>>> +
>>> +	if (end & ~PMD_MASK) {
>>> +		npages = (end - round_down(end, PMD_SIZE)) / PAGE_SIZE;
>>> +		end = round_down(end, PMD_SIZE);
>>> +		__accept_memory(end, end + npages * PAGE_SIZE);
>>> +	}
>>> +	npages = (end - start) / PMD_SIZE;
>>> +	bitmap_set((unsigned long *)params->unaccepted_memory,
>>> +		   start / PMD_SIZE, npages);
>>> +}

One note as I'm looking at this again: 'npages' can be 0.  Imagine if
you had an 8k region that started with the last 4k page of a 2M area and
ended on the first 4k page of the next 2M area, like 0x1ff000->0x201000.

I think it's harmless and bitmap_set() seems to handle it correctly.
But, it's probably worth a comment because it's not obvious.
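That edge case is easy to check with a standalone model of the rounding logic from the quoted mark_unaccepted() -- this is a userspace sketch that only returns the number of whole-PMD bitmap bits that would be set (the partial head/tail pieces would be accepted immediately instead); names and macros are reconstructions, not the real kernel code:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	0x1000ULL
#define PMD_SIZE	0x200000ULL		/* 2 MiB */
#define PMD_MASK	(~(PMD_SIZE - 1))
#define round_up(x, a)	(((x) + (a) - 1) & ~((a) - 1))
#define round_down(x, a) ((x) & ~((a) - 1))

/* Number of whole-PMD_SIZE areas in [start, end) that would be marked
 * in the bitmap; sub-PMD pieces get accepted right away instead. */
static uint64_t whole_pmd_bits(uint64_t start, uint64_t end)
{
	if ((start & PMD_MASK) == (end & PMD_MASK))
		return 0;	/* entirely within one PMD area: accept now */
	if (start & ~PMD_MASK)
		start = round_up(start, PMD_SIZE);	/* accept head piece */
	if (end & ~PMD_MASK)
		end = round_down(end, PMD_SIZE);		/* accept tail piece */
	return (end - start) / PMD_SIZE;
}
```

For the 0x1ff000->0x201000 case, both rounded ends meet at 0x200000, so zero bits get set -- harmless, as noted, but worth the comment.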


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-12 14:14                 ` Dave Hansen
  2021-08-12 20:49                   ` Kirill A. Shutemov
@ 2021-08-13 14:49                   ` Joerg Roedel
  2021-08-17 15:00                     ` David Hildenbrand
  1 sibling, 1 reply; 49+ messages in thread
From: Joerg Roedel @ 2021-08-13 14:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

Hi Dave,

On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
> maybe_accept_page()
> {
> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
> 
> 	/* Test the bit before taking any locks: */
> 	if (test_bit(huge_pfn, &accepted_bitmap))
> 		return;
> 
> 	spin_lock_irq();
> 	/* Retest inside the lock: */
> 	if (test_bit(huge_pfn, &accepted_bitmap))
> 		return;
> 	tdx_accept_page(page, PMD_SIZE);
> 	set_bit(huge_pfn, &accepted_bitmap));
> 	spin_unlock_irq();
> }

Yeah, this could work, but the global lock is likely the show-stopper
here. For SNP we are also not allowed to double-validate, so we need
something that basically indicates 'validation-is-ongoing' on a per-2MB
basis.

I am not an mm expert, but a page flag probably doesn't work. The flag
would be on the head of the 2MB range and when that page is already used
somewhere else there is no guarantee that the flag will survive. But
correct me if I am wrong here :)

The other options I can come up with are not great either:

	1) using an AVL bit in the direct-mapping PMD of that page. The
	   page-table would only be walked if the bit in the
	   accept_bitmap is clear. But I am not sure that all memory
	   which needs to be validated is in the direct-map.

	2) Use another page-sized bitmap. If the machine has more than
	   64GB of memory the bit index is wrapped around. This
	   shouldn't be a performance problem at runtime, if this page
	   is only consulted when the valid bit is clear in the
	   accept_bitmap.

MM experts could certainly come up with better ideas :)

> Yeah, I think the *canonical* source of information for accepts is the
> bitmap.  The page flags and any static keys or whatever are
> less-canonical sources that tell you when you _might_ need to consult
> the bitmap.

Right, it also helps the kexec case. The only problem left is how to
track 4kb shared pages for things like the GHCB.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 0/5] x86: Impplement support for unaccepted memory
  2021-08-12 20:22       ` Kirill A. Shutemov
@ 2021-08-13 14:56         ` Joerg Roedel
  0 siblings, 0 replies; 49+ messages in thread
From: Joerg Roedel @ 2021-08-13 14:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andi Kleen, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On Thu, Aug 12, 2021 at 11:22:51PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 12:33:11PM -0700, Andi Kleen wrote:
> > I think Joerg's question was if TDX has a single ABI for all hypervisors.
> > The GHCI specification supports both hypervisor specific and hypervisor
> > agnostic calls. But these basic operations like MapGPA are all hypervisor
> > agnostic. The only differences would be in the existing hypervisor specific
> > PV code.
> 
> My point was that TDX and SEV-SNP are going to be different and we need a
> way to hook the right protocol for each.

Yeah, okay, thanks for the clarification. My misunderstanding was that
there could be a per-hypervisor contract on what memory is pre-accepted
and what Linux is responsible for.

Thanks,

	Joerg


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-13 14:49                   ` Joerg Roedel
@ 2021-08-17 15:00                     ` David Hildenbrand
  2021-08-19  9:55                       ` Joerg Roedel
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2021-08-17 15:00 UTC (permalink / raw)
  To: Joerg Roedel, Dave Hansen
  Cc: Andi Kleen, Kirill A. Shutemov, Borislav Petkov, Andy Lutomirski,
	Sean Christopherson, Andrew Morton, Kuppuswamy Sathyanarayanan,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Paolo Bonzini, Ingo Molnar, Varad Gautam,
	Dario Faggioli, x86, linux-mm, linux-coco, linux-kernel,
	Kirill A. Shutemov

On 13.08.21 16:49, Joerg Roedel wrote:
> Hi Dave,
> 
> On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
>> maybe_accept_page()
>> {
>> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
>>
>> 	/* Test the bit before taking any locks: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap))
>> 		return;
>>
>> 	spin_lock_irq();
>> 	/* Retest inside the lock: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap))
>> 		return;
>> 	tdx_accept_page(page, PMD_SIZE);
>> 	set_bit(huge_pfn, &accepted_bitmap));
>> 	spin_unlock_irq();
>> }
> 
> Yeah, this could work, but the global lock is likely the show-stopper
> here. For SNP we are also not allowed to double-validate, so we need
> something that basically indicates 'validation-is-ongoing' on a per-2MB
> basis.
> 
> I am not an mm expert, but a page flag probably doesn't work. The flag
> would be on the head of the 2MB range and when that page is already used
> somewhere else there is no guarantee that the flag will survive. But
> correct me if I am wrong here :)
> 
> The other options I can come up with are not great either:
> 
> 	1) using an AVL bit in the direct-mapping PMD of that page. The
> 	   page-table would only be walked if the bit in the
> 	   accept_bitmap is clear. But I am not sure that all memory
> 	   which needs to be validated is in the direct-map.
> 
> 	2) Use another page-sized bitmap. If the machine has more than
> 	   64GB of memory the bit index is wrapped around. This
> 	   shouldn't be a performance problem at runtime, if this page
> 	   is only consulted when the valid bit is clear in the
> 	   accept_bitmap.
> 
> MM experts could certainly come up with better ideas :)


Not sure if already discussed, but what about making sure that free 
pages are not a mixture (partially unaccepted, partially accepted).

You'd have to expose the pages in that granularity to the buddy 
(__free_pages_core), indicating the state. You'd have to reject merging 
pages of differing acceptance state.

Accepting a page would then be handled outside of the zone lock, 
completely controlled by the state.

So a page in the buddy would either be completely accepted or completely 
unaccepted, signaled e.g., by PageOffline().

Consequently, when allocating a 4KiB page, you'd split an unaccepted 
2MiB page into separate unaccepted pages. You'd grab one of the 
unaccepted 4KiB pages and accept it before initializing it and handing 
it out.
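A toy model of the invariant being proposed -- pure userspace illustration, with made-up names; the real implementation would live in the buddy allocator's split/merge paths:

```c
#include <assert.h>
#include <stdbool.h>

/* A free page is either fully accepted or fully unaccepted
 * (signaled by something like PageOffline()), never a mixture. */
struct buddy_page {
	int order;
	bool unaccepted;
};

/* Splitting preserves the acceptance state: an unaccepted order-N page
 * becomes two unaccepted order-(N-1) pages. */
static void split(const struct buddy_page *p,
		  struct buddy_page *lo, struct buddy_page *hi)
{
	lo->order = hi->order = p->order - 1;
	lo->unaccepted = hi->unaccepted = p->unaccepted;
}

/* Merging buddies of differing acceptance state must be rejected,
 * so the invariant can never be violated. */
static bool can_merge(const struct buddy_page *a, const struct buddy_page *b)
{
	return a->order == b->order && a->unaccepted == b->unaccepted;
}
```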

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-17 15:00                     ` David Hildenbrand
@ 2021-08-19  9:55                       ` Joerg Roedel
  2021-08-19 10:06                         ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Joerg Roedel @ 2021-08-19  9:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Andi Kleen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

Hi David,

On Tue, Aug 17, 2021 at 05:00:55PM +0200, David Hildenbrand wrote:
> Not sure if already discussed, but what about making sure that free pages
> are not a mixture (partially unaccepted, partially accepted).
> 
> You'd have to expose the pages in that granularity to the buddy
> (__free_pages_core), indicating the state. You'd have to reject merging
> pages of differing acceptance state.
> 
> Accepting a page would then be handled outside of the zone lock, completely
> controlled by the state.
> 
> So a page in the buddy would either be completely accepted or completely
> unaccepted, signaled e.g., by PageOffline().
> 
> Consequently, when allocating a 4KiB page, you'd split an unaccepted 2MiB
> page into separate unaccepted pages. You'd grab one of the unaccepted 4KiB
> pages and accept it before initializing it and handing it out.

Yes, that is the alternative to over-accepting memory on allocation. But
the problem here is that accepting/validating memory is an expensive
operation which also requires a hypercall. The hypercalls on SNP and TDX
can accept/validate multiple pages in one call. So the recommendation is
to accept memory in bigger chunks, like the 2MB that have been proposed.

Only accepting memory in allocation-size chunks might be too slow, as there
is a lot of code doing order-0 allocations. I think this approach will also
be more intrusive to the page allocator, as it needs more changes on the
free path to check for acceptance states before pages can be merged.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/5] mm: Add support for unaccepted memory
  2021-08-19  9:55                       ` Joerg Roedel
@ 2021-08-19 10:06                         ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2021-08-19 10:06 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Dave Hansen, Andi Kleen, Kirill A. Shutemov, Borislav Petkov,
	Andy Lutomirski, Sean Christopherson, Andrew Morton,
	Kuppuswamy Sathyanarayanan, David Rientjes, Vlastimil Babka,
	Tom Lendacky, Thomas Gleixner, Peter Zijlstra, Paolo Bonzini,
	Ingo Molnar, Varad Gautam, Dario Faggioli, x86, linux-mm,
	linux-coco, linux-kernel, Kirill A. Shutemov

On 19.08.21 11:55, Joerg Roedel wrote:
> Hi David,
> 
> On Tue, Aug 17, 2021 at 05:00:55PM +0200, David Hildenbrand wrote:
>> Not sure if already discussed, but what about making sure that free pages
>> are not a mixture (partially unaccepted, partially accepted).
>>
>> You'd have to expose the pages in that granularity to the buddy
>> (__free_pages_core), indicating the state. You'd have to reject merging
>> pages of differing acceptance state.
>>
>> Accepting a page would then be handled outside of the zone lock, completely
>> controlled by the state.
>>
>> So a page in the buddy would either be completely accepted or completely
>> unaccepted, signaled e.g., by PageOffline().
>>
>> Consequently, when allocating a 4KiB page, you'd split an unaccepted 2MiB
>> page into separate unaccepted pages. You'd grab one of the unaccepted 4KiB
>> pages and accept it before initializing it and handing it out.
> 
> Yes, that is the alternative to over-accepting memory on allocation. But
> the problem here is that accepting/validating memory is an expensive
> operation which also requires a hypercall. The hypercalls on SNP and TDX
> can accept/validate multiple pages in one call. So the recommendation is
> to accept memory in bigger chunks, like the 2MB that have been proposed.
> 

The general idea would be to have one thread scanning the free page list 
and accepting pages in the background. Take a page out, accept it, 
release it back to the buddy. If we place accepted pages to the front of 
the free list, allocations would be quite fast.

Sure, you'd have some slow early allocations, but I'd guess it really 
wouldn't make a big impact overall. We'd have to try.

It's quite similar to the free page reporting infrastructure, except 
that with free page reporting we want to place pages at the tail, not at 
the front. That would be easy to add.


> Only accepting memory in allocation size might be too slow, as there is
> a lot of code doing order-0 allocations. I think this approach will also
> be more intrusive to the page allocator, as it needs more changes on the
> free path to check for acceptance states before pages can be merged.

That's already mostly done in this patch IIRC.

-- 
Thanks,

David / dhildenb




Thread overview: 49+ messages
-- links below jump to the message on this page --
2021-08-10  6:26 [PATCH 0/5] x86: Implement support for unaccepted memory Kirill A. Shutemov
2021-08-10  6:26 ` [PATCH 1/5] mm: Add " Kirill A. Shutemov
2021-08-10  7:48   ` David Hildenbrand
2021-08-10 15:02     ` Kirill A. Shutemov
2021-08-10 15:21       ` David Hildenbrand
2021-08-12 20:34         ` Kirill A. Shutemov
2021-08-10 18:13   ` Dave Hansen
2021-08-10 18:30     ` Andi Kleen
2021-08-10 18:56       ` Dave Hansen
2021-08-10 19:23         ` Andi Kleen
2021-08-10 19:46           ` Dave Hansen
2021-08-10 21:20             ` Andi Kleen
2021-08-12  8:19               ` Joerg Roedel
2021-08-12 14:14                 ` Dave Hansen
2021-08-12 20:49                   ` Kirill A. Shutemov
2021-08-12 20:59                     ` Dave Hansen
2021-08-12 21:23                       ` Kirill A. Shutemov
2021-08-13 14:49                   ` Joerg Roedel
2021-08-17 15:00                     ` David Hildenbrand
2021-08-19  9:55                       ` Joerg Roedel
2021-08-19 10:06                         ` David Hildenbrand
2021-08-10 20:50     ` Dave Hansen
2021-08-12 21:08       ` Kirill A. Shutemov
2021-08-10  6:26 ` [PATCH 2/5] efi/x86: Implement " Kirill A. Shutemov
2021-08-10 17:50   ` Dave Hansen
2021-08-12 21:14     ` Kirill A. Shutemov
2021-08-12 21:43       ` Dave Hansen
2021-08-10 18:30   ` Dave Hansen
2021-08-10 19:08     ` Kirill A. Shutemov
2021-08-10 19:19       ` Dave Hansen
2021-08-12 21:17         ` Kirill A. Shutemov
2021-08-10  6:26 ` [PATCH 3/5] x86/boot/compressed: Handle " Kirill A. Shutemov
2021-08-10  6:26 ` [PATCH 4/5] x86/mm: Provide helpers for " Kirill A. Shutemov
2021-08-10 18:16   ` Dave Hansen
2021-08-12 20:31     ` Kirill A. Shutemov
2021-08-10  6:26 ` [PATCH 5/5] x86/tdx: Unaccepted memory support Kirill A. Shutemov
2021-08-10 14:08 ` [PATCH 0/5] x86: Implement support for unaccepted memory Dave Hansen
2021-08-10 15:15   ` Kirill A. Shutemov
2021-08-10 15:51     ` Dave Hansen
2021-08-10 17:31       ` Kirill A. Shutemov
2021-08-10 17:36         ` Dave Hansen
2021-08-10 17:51           ` Kirill A. Shutemov
2021-08-10 18:19             ` Dave Hansen
2021-08-10 18:39               ` Kirill A. Shutemov
2021-08-12  8:23 ` Joerg Roedel
2021-08-12 10:10   ` Kirill A. Shutemov
2021-08-12 19:33     ` Andi Kleen
2021-08-12 20:22       ` Kirill A. Shutemov
2021-08-13 14:56         ` Joerg Roedel
