LKML Archive on lore.kernel.org
* [PATCH v3 0/6] Kernel huge I/O mapping support
@ 2015-03-03 17:44 Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 1/6] mm: Change __get_vm_area_node() to use fls_long() Toshi Kani
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott

ioremap() and its related interfaces are used to create I/O
mappings to memory-mapped I/O devices.  The mapping sizes of
traditional I/O devices are relatively small.  Non-volatile
memory (NVM), however, already spans many gigabytes and will
soon reach terabytes.  Creating such large I/O mappings with
4KB pages is not very efficient.

This patchset extends the ioremap() interfaces to transparently
create I/O mappings with huge pages whenever possible.  ioremap()
continues to use 4KB mappings when a huge page does not fit into
a requested range.  No changes are necessary to drivers using
ioremap().  Note, however, that a requested physical address
must be aligned to a huge page size (1GB or 2MB on x86) for a
huge page mapping to be used.  Kernel huge I/O mappings improve
performance for NVM and other devices with large memory, and
also reduce the time needed to create those mappings.

On x86, MTRRs can override PAT memory types with a 4KB granularity.
When a huge page is used, MTRRs can override the memory type of
the huge page, which may lead to a performance penalty.  The
processor can also behave in an undefined manner if a huge page is
mapped to a memory range that MTRRs have mapped with multiple
different memory types.  Therefore, the mapping code falls back to
smaller page sizes, down to 4KB, when a mapping range is covered by
a non-WB type of MTRR.  The WB type of MTRR has no effect on the
PAT memory types.

The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that
the arch supports huge KVA mappings for ioremap().  Users may
specify a new kernel option "nohugeiomap" to disable the huge I/O
mapping capability of ioremap() when necessary.

Patches 1-4 change common files to support huge I/O mappings.  There
is no change in functionality unless HAVE_ARCH_HUGE_VMAP is
defined on the architecture of the system.

Patches 5-6 implement the HAVE_ARCH_HUGE_VMAP functions on x86, and
select HAVE_ARCH_HUGE_VMAP for x86.

--
v3:
 - Removed config HUGE_IOMAP. Always enable huge page mappings for
   ioremap() when supported by the arch. (Ingo Molnar)
 - Added checks to use 4KB mappings when a memory range is covered
   by MTRRs. (Ingo Molnar, Andrew Morton)
 - Added missing PAT bit handling to the huge page mapping funcs.

v2:
 - Addressed review comments. (Andrew Morton)
 - Changed HAVE_ARCH_HUGE_VMAP to require X86_PAE set on X86_32.
 - Documented an x86 restriction on ranges mapped by multiple MTRRs
   with different memory types.

---
Toshi Kani (6):
  1/6 mm: Change __get_vm_area_node() to use fls_long()
  2/6 lib: Add huge I/O map capability interfaces
  3/6 mm: Change ioremap to set up huge I/O mappings
  4/6 mm: Change vunmap to tear down huge KVA mappings
  5/6 x86, mm: Support huge I/O mapping capability I/F
  6/6 x86, mm: Support huge KVA mappings on x86

---
 Documentation/kernel-parameters.txt |  2 ++
 arch/Kconfig                        |  3 ++
 arch/x86/Kconfig                    |  1 +
 arch/x86/include/asm/page_types.h   |  2 ++
 arch/x86/mm/ioremap.c               | 23 +++++++++++--
 arch/x86/mm/pgtable.c               | 65 +++++++++++++++++++++++++++++++++++++
 include/asm-generic/pgtable.h       | 19 +++++++++++
 include/linux/io.h                  |  7 ++++
 init/main.c                         |  2 ++
 lib/ioremap.c                       | 54 ++++++++++++++++++++++++++++++
 mm/vmalloc.c                        |  8 ++++-
 11 files changed, 183 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/6] mm: Change __get_vm_area_node() to use fls_long()
  2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
@ 2015-03-03 17:44 ` Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 2/6] lib: Add huge I/O map capability interfaces Toshi Kani
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott, Toshi Kani

__get_vm_area_node() takes an unsigned long size, which is a 64-bit
value on a 64-bit kernel.  However, fls(size) simply ignores the
upper 32 bits.  Change the code to use fls_long(), which handles the
size properly.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 mm/vmalloc.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 35b25e1..fe1672d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -29,6 +29,7 @@
 #include <linux/atomic.h>
 #include <linux/compiler.h>
 #include <linux/llist.h>
+#include <linux/bitops.h>
 
 #include <asm/uaccess.h>
 #include <asm/tlbflush.h>
@@ -1314,7 +1315,8 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 
 	BUG_ON(in_interrupt());
 	if (flags & VM_IOREMAP)
-		align = 1ul << clamp(fls(size), PAGE_SHIFT, IOREMAP_MAX_ORDER);
+		align = 1ul << clamp_t(int, fls_long(size),
+				       PAGE_SHIFT, IOREMAP_MAX_ORDER);
 
 	size = PAGE_ALIGN(size);
 	if (unlikely(!size))


* [PATCH v3 2/6] lib: Add huge I/O map capability interfaces
  2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 1/6] mm: Change __get_vm_area_node() to use fls_long() Toshi Kani
@ 2015-03-03 17:44 ` Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings Toshi Kani
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott, Toshi Kani

Added ioremap_pud_enabled() and ioremap_pmd_enabled(), which
return 1 when I/O mappings with pud/pmd are enabled in the
kernel.

ioremap_huge_init() calls arch_ioremap_pud_supported() and
arch_ioremap_pmd_supported() to initialize the capabilities
at boot-time.

A new kernel option "nohugeiomap" is also added so that users
can disable the huge I/O map capabilities when necessary.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 Documentation/kernel-parameters.txt |    2 ++
 arch/Kconfig                        |    3 +++
 include/linux/io.h                  |    7 ++++++
 init/main.c                         |    2 ++
 lib/ioremap.c                       |   38 +++++++++++++++++++++++++++++++++++
 5 files changed, 52 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a6..55a4ec7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			register save and restore. The kernel will only save
 			legacy floating-point registers on task switch.
 
+	nohugeiomap	[KNL,x86] Disable kernel huge I/O mappings.
+
 	noxsave		[BUGS=X86] Disables x86 extended register state save
 			and restore using xsave. The kernel will fallback to
 			enabling legacy floating-point and sse state.
diff --git a/arch/Kconfig b/arch/Kconfig
index 05d7a8a..55c4440 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -446,6 +446,9 @@ config HAVE_IRQ_TIME_ACCOUNTING
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	bool
 
+config HAVE_ARCH_HUGE_VMAP
+	bool
+
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
diff --git a/include/linux/io.h b/include/linux/io.h
index fa02e55..1ce8b4e 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -38,6 +38,13 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end,
 }
 #endif
 
+void __init ioremap_huge_init(void);
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+int arch_ioremap_pud_supported(void);
+int arch_ioremap_pmd_supported(void);
+#endif
+
 /*
  * Managed iomap interface
  */
diff --git a/init/main.c b/init/main.c
index 6f0f1c5f..119cdf1 100644
--- a/init/main.c
+++ b/init/main.c
@@ -80,6 +80,7 @@
 #include <linux/list.h>
 #include <linux/integrity.h>
 #include <linux/proc_ns.h>
+#include <linux/io.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -484,6 +485,7 @@ static void __init mm_init(void)
 	percpu_init_late();
 	pgtable_init();
 	vmalloc_init();
+	ioremap_huge_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 0c9216c..0ce18aa 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -13,6 +13,44 @@
 #include <asm/cacheflush.h>
 #include <asm/pgtable.h>
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+int __read_mostly ioremap_pud_capable;
+int __read_mostly ioremap_pmd_capable;
+int __read_mostly ioremap_huge_disabled;
+
+static int __init set_nohugeiomap(char *str)
+{
+	ioremap_huge_disabled = 1;
+	return 0;
+}
+early_param("nohugeiomap", set_nohugeiomap);
+
+void __init ioremap_huge_init(void)
+{
+	if (!ioremap_huge_disabled) {
+		if (arch_ioremap_pud_supported())
+			ioremap_pud_capable = 1;
+		if (arch_ioremap_pmd_supported())
+			ioremap_pmd_capable = 1;
+	}
+}
+
+static inline int ioremap_pud_enabled(void)
+{
+	return ioremap_pud_capable;
+}
+
+static inline int ioremap_pmd_enabled(void)
+{
+	return ioremap_pmd_capable;
+}
+
+#else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+void __init ioremap_huge_init(void) { }
+static inline int ioremap_pud_enabled(void) { return 0; }
+static inline int ioremap_pmd_enabled(void) { return 0; }
+#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
 static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 		unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
 {


* [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings
  2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 1/6] mm: Change __get_vm_area_node() to use fls_long() Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 2/6] lib: Add huge I/O map capability interfaces Toshi Kani
@ 2015-03-03 17:44 ` Toshi Kani
  2015-03-04 22:09   ` Ingo Molnar
  2015-03-03 17:44 ` [PATCH v3 4/6] mm: Change vunmap to tear down huge KVA mappings Toshi Kani
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott, Toshi Kani

ioremap_pud_range() and ioremap_pmd_range() are changed to create
huge I/O mappings when their capability is enabled and a request
meets the required conditions -- both the virtual and physical
addresses are aligned to the huge page size, and the requested
range covers the full huge page size.  When pud_set_huge() or
pmd_set_huge() returns zero, i.e. no operation is performed, the
code simply falls back to the next level.

The changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is
defined on the architecture.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 include/asm-generic/pgtable.h |   15 +++++++++++++++
 lib/ioremap.c                 |   16 ++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 4d46085..bf6e86c 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -6,6 +6,7 @@
 
 #include <linux/mm_types.h>
 #include <linux/bug.h>
+#include <linux/errno.h>
 
 /*
  * On almost all architectures and configurations, 0 can be used as the
@@ -697,4 +698,18 @@ static inline int pmd_protnone(pmd_t pmd)
 #define io_remap_pfn_range remap_pfn_range
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
+int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
+#else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
+static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
+#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 0ce18aa..3055ada 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -81,6 +81,14 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
+
+		if (ioremap_pmd_enabled() &&
+		    ((next - addr) == PMD_SIZE) &&
+		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
+			if (pmd_set_huge(pmd, phys_addr + addr, prot))
+				continue;
+		}
+
 		if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
@@ -99,6 +107,14 @@ static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
+
+		if (ioremap_pud_enabled() &&
+		    ((next - addr) == PUD_SIZE) &&
+		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
+			if (pud_set_huge(pud, phys_addr + addr, prot))
+				continue;
+		}
+
 		if (ioremap_pmd_range(pud, addr, next, phys_addr + addr, prot))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);


* [PATCH v3 4/6] mm: Change vunmap to tear down huge KVA mappings
  2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
                   ` (2 preceding siblings ...)
  2015-03-03 17:44 ` [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings Toshi Kani
@ 2015-03-03 17:44 ` Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 5/6] x86, mm: Support huge I/O mapping capability I/F Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86 Toshi Kani
  5 siblings, 0 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott, Toshi Kani

Changed vunmap_pmd_range() and vunmap_pud_range() to tear down
huge KVA mappings when they are present.  pud_clear_huge() and
pmd_clear_huge() return zero when no operation is performed,
i.e. a huge page mapping was not used.

These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP
is defined on the architecture.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 include/asm-generic/pgtable.h |    4 ++++
 mm/vmalloc.c                  |    4 ++++
 2 files changed, 8 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bf6e86c..b583235 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -701,6 +701,8 @@ static inline int pmd_protnone(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
 int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
+int pud_clear_huge(pud_t *pud);
+int pmd_clear_huge(pmd_t *pmd);
 #else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
 static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
@@ -710,6 +712,8 @@ static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
 {
 	return 0;
 }
+static inline int pud_clear_huge(pud_t *pud) { return 0; }
+static inline int pmd_clear_huge(pmd_t *pmd) { return 0; }
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fe1672d..9184cf7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -75,6 +75,8 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_clear_huge(pmd))
+			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		vunmap_pte_range(pmd, addr, next);
@@ -89,6 +91,8 @@ static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		if (pud_clear_huge(pud))
+			continue;
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		vunmap_pmd_range(pud, addr, next);


* [PATCH v3 5/6] x86, mm: Support huge I/O mapping capability I/F
  2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
                   ` (3 preceding siblings ...)
  2015-03-03 17:44 ` [PATCH v3 4/6] mm: Change vunmap to tear down huge KVA mappings Toshi Kani
@ 2015-03-03 17:44 ` Toshi Kani
  2015-03-03 17:44 ` [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86 Toshi Kani
  5 siblings, 0 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott, Toshi Kani

This patch implements huge I/O mapping capability interfaces
for ioremap() on x86.

IOREMAP_MAX_ORDER is defined to PUD_SHIFT on x86/64 and
PMD_SHIFT on x86/32, which overrides the default value
defined in <linux/vmalloc.h>.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 arch/x86/include/asm/page_types.h |    2 ++
 arch/x86/mm/ioremap.c             |   23 +++++++++++++++++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 95e11f7..b526093 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -40,8 +40,10 @@
 
 #ifdef CONFIG_X86_64
 #include <asm/page_64_types.h>
+#define IOREMAP_MAX_ORDER       (PUD_SHIFT)
 #else
 #include <asm/page_32_types.h>
+#define IOREMAP_MAX_ORDER       (PMD_SHIFT)
 #endif	/* CONFIG_X86_64 */
 
 #ifndef __ASSEMBLY__
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index fdf617c..5ead4d6 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -67,8 +67,13 @@ static int __ioremap_check_ram(unsigned long start_pfn, unsigned long nr_pages,
 
 /*
  * Remap an arbitrary physical address space into the kernel virtual
- * address space. Needed when the kernel wants to access high addresses
- * directly.
+ * address space. It transparently creates kernel huge I/O mapping when
+ * the physical address is aligned by a huge page size (1GB or 2MB) and
+ * the requested size is at least the huge page size.
+ *
+ * NOTE: MTRRs can override PAT memory types with a 4KB granularity.
+ * Therefore, the mapping code falls back to use a smaller page toward 4KB
+ * when a mapping range is covered by non-WB type of MTRRs.
  *
  * NOTE! We need to allow non-page-aligned mappings too: we will obviously
  * have to convert them into an offset in a page-aligned mapping, but the
@@ -326,6 +331,20 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
+int arch_ioremap_pud_supported(void)
+{
+#ifdef CONFIG_X86_64
+	return cpu_has_gbpages;
+#else
+	return 0;
+#endif
+}
+
+int arch_ioremap_pmd_supported(void)
+{
+	return cpu_has_pse;
+}
+
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
  * access


* [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
                   ` (4 preceding siblings ...)
  2015-03-03 17:44 ` [PATCH v3 5/6] x86, mm: Support huge I/O mapping capability I/F Toshi Kani
@ 2015-03-03 17:44 ` Toshi Kani
  2015-03-03 22:44   ` Andrew Morton
  5 siblings, 1 reply; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 17:44 UTC (permalink / raw)
  To: akpm, hpa, tglx, mingo, arnd
  Cc: linux-mm, x86, linux-kernel, dave.hansen, Elliott, Toshi Kani

This patch implements huge KVA mapping interfaces on x86.

On x86, MTRRs can override PAT memory types with a 4KB granularity.
When a huge page is used, MTRRs can override the memory type of
the huge page, which may lead to a performance penalty.  The
processor can also behave in an undefined manner if a huge page is
mapped to a memory range that MTRRs have mapped with multiple
different memory types.  Therefore, the mapping code falls back to
smaller page sizes, down to 4KB, when a mapping range is covered by
a non-WB type of MTRR.  The WB type of MTRR has no effect on the
PAT memory types.

pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see
if a given range is covered by MTRRs.  MTRR_TYPE_WRBACK indicates
that the range is either covered by WB or not covered and the MTRR
default value is set to WB.  0xFF indicates that MTRRs are disabled.

HAVE_ARCH_HUGE_VMAP is selected on X86_64, or on X86_32 when
X86_PAE is set.  X86_32 without X86_PAE is not supported since
such a config is unlikely to benefit from this feature, and an
issue was found in testing.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
---
 arch/x86/Kconfig      |    1 +
 arch/x86/mm/pgtable.c |   65 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c2fb8a8..ef7d4a6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -99,6 +99,7 @@ config X86
 	select IRQ_FORCED_THREADING
 	select HAVE_BPF_JIT if X86_64
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_HUGE_VMAP if X86_64 || (X86_32 && X86_PAE)
 	select ARCH_HAS_SG_CHAIN
 	select CLKEVT_I8253
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7b22ada..19c897e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -4,6 +4,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlb.h>
 #include <asm/fixmap.h>
+#include <asm/mtrr.h>
 
 #define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
 
@@ -485,3 +486,67 @@ void native_set_fixmap(enum fixed_addresses idx, phys_addr_t phys,
 {
 	__native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
 }
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
+{
+	u8 mtrr;
+
+	/*
+	 * Do not use a huge page when the range is covered by non-WB type
+	 * of MTRRs.
+	 */
+	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
+	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
+		return 0;
+
+	prot = pgprot_4k_2_large(prot);
+
+	set_pte((pte_t *)pud, pfn_pte(
+		(u64)addr >> PAGE_SHIFT,
+		__pgprot(pgprot_val(prot) | _PAGE_PSE)));
+
+	return 1;
+}
+
+int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
+{
+	u8 mtrr;
+
+	/*
+	 * Do not use a huge page when the range is covered by non-WB type
+	 * of MTRRs.
+	 */
+	mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE);
+	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
+		return 0;
+
+	prot = pgprot_4k_2_large(prot);
+
+	set_pte((pte_t *)pmd, pfn_pte(
+		(u64)addr >> PAGE_SHIFT,
+		__pgprot(pgprot_val(prot) | _PAGE_PSE)));
+
+	return 1;
+}
+
+int pud_clear_huge(pud_t *pud)
+{
+	if (pud_large(*pud)) {
+		pud_clear(pud);
+		return 1;
+	}
+
+	return 0;
+}
+
+int pmd_clear_huge(pmd_t *pmd)
+{
+	if (pmd_large(*pmd)) {
+		pmd_clear(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */


* Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-03 17:44 ` [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86 Toshi Kani
@ 2015-03-03 22:44   ` Andrew Morton
  2015-03-03 23:14     ` Toshi Kani
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2015-03-03 22:44 UTC (permalink / raw)
  To: Toshi Kani
  Cc: hpa, tglx, mingo, arnd, linux-mm, x86, linux-kernel, dave.hansen,
	Elliott

On Tue,  3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@hp.com> wrote:

> This patch implements huge KVA mapping interfaces on x86.
> 
> On x86, MTRRs can override PAT memory types with a 4KB granularity.
> When using a huge page, MTRRs can override the memory type of the
> huge page, which may lead a performance penalty.  The processor
> can also behave in an undefined manner if a huge page is mapped to
> a memory range that MTRRs have mapped with multiple different memory
> types.  Therefore, the mapping code falls back to use a smaller page
> size toward 4KB when a mapping range is covered by non-WB type of
> MTRRs.  The WB type of MTRRs has no affect on the PAT memory types.
> 
> pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see
> if a given range is covered by MTRRs.  MTRR_TYPE_WRBACK indicates
> that the range is either covered by WB or not covered and the MTRR
> default value is set to WB.  0xFF indicates that MTRRs are disabled.
> 
> HAVE_ARCH_HUGE_VMAP is selected when X86_64 or X86_32 with X86_PAE
> is set.  X86_32 without X86_PAE is not supported since such config
> can unlikey be benefited from this feature, and there was an issue
> found in testing.
> 
> ...
>
> +
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> +{
> +	u8 mtrr;
> +
> +	/*
> +	 * Do not use a huge page when the range is covered by non-WB type
> +	 * of MTRRs.
> +	 */
> +	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> +		return 0;

It would be good to notify the operator in some way when this happens. 
Otherwise the kernel will run more slowly and there's no way of knowing
why.  I guess slap a pr_info() in there.  Or maybe pr_warn()?

> +	prot = pgprot_4k_2_large(prot);
> +
> +	set_pte((pte_t *)pud, pfn_pte(
> +		(u64)addr >> PAGE_SHIFT,
> +		__pgprot(pgprot_val(prot) | _PAGE_PSE)));
> +
> +	return 1;
> +}
> +
> +int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
> +{
> +	u8 mtrr;
> +
> +	/*
> +	 * Do not use a huge page when the range is covered by non-WB type
> +	 * of MTRRs.
> +	 */
> +	mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE);
> +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> +		return 0;
> +
> +	prot = pgprot_4k_2_large(prot);
> +
> +	set_pte((pte_t *)pmd, pfn_pte(
> +		(u64)addr >> PAGE_SHIFT,
> +		__pgprot(pgprot_val(prot) | _PAGE_PSE)));
> +
> +	return 1;
> +}
>
> +int pud_clear_huge(pud_t *pud)
> +{
> +	if (pud_large(*pud)) {
> +		pud_clear(pud);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +int pmd_clear_huge(pmd_t *pmd)
> +{
> +	if (pmd_large(*pmd)) {
> +		pmd_clear(pmd);
> +		return 1;
> +	}
> +
> +	return 0;
> +}

I didn't see anywhere where the return values of these functions are
documented.  It's all fairly obvious, but we could help the readers
a bit.




* Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-03 22:44   ` Andrew Morton
@ 2015-03-03 23:14     ` Toshi Kani
  2015-03-04  1:00       ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Toshi Kani @ 2015-03-03 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hpa, tglx, mingo, arnd, linux-mm, x86, linux-kernel, dave.hansen,
	Elliott

On Tue, 2015-03-03 at 14:44 -0800, Andrew Morton wrote:
> On Tue,  3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
 :
> > +
> > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> > +{
> > +	u8 mtrr;
> > +
> > +	/*
> > +	 * Do not use a huge page when the range is covered by non-WB type
> > +	 * of MTRRs.
> > +	 */
> > +	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> > +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > +		return 0;
> 
> It would be good to notify the operator in some way when this happens. 
> Otherwise the kernel will run more slowly and there's no way of knowing
> why.  I guess slap a pr_info() in there.  Or maybe pr_warn()?

We only use 4KB mappings today, so this case will not make it run
slowly, i.e. it will be the same as today.  Also, adding a message here
can generate a lot of messages when MTRRs cover a large area.  So, I
think we are fine without a message.

> 
> > +	prot = pgprot_4k_2_large(prot);
> > +
> > +	set_pte((pte_t *)pud, pfn_pte(
> > +		(u64)addr >> PAGE_SHIFT,
> > +		__pgprot(pgprot_val(prot) | _PAGE_PSE)));
> > +
> > +	return 1;
> > +}
> > +
> > +int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
> > +{
> > +	u8 mtrr;
> > +
> > +	/*
> > +	 * Do not use a huge page when the range is covered by non-WB type
> > +	 * of MTRRs.
> > +	 */
> > +	mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE);
> > +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > +		return 0;
> > +
> > +	prot = pgprot_4k_2_large(prot);
> > +
> > +	set_pte((pte_t *)pmd, pfn_pte(
> > +		(u64)addr >> PAGE_SHIFT,
> > +		__pgprot(pgprot_val(prot) | _PAGE_PSE)));
> > +
> > +	return 1;
> > +}
> >
> > +int pud_clear_huge(pud_t *pud)
> > +{
> > +	if (pud_large(*pud)) {
> > +		pud_clear(pud);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int pmd_clear_huge(pmd_t *pmd)
> > +{
> > +	if (pmd_large(*pmd)) {
> > +		pmd_clear(pmd);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> I didn't see anywhere where the return values of these functions are
> documented.  It's all fairly obvious, but we could help the rearers
> a bit.

Agreed.  I will add function headers with descriptions to the new
functions.

Thanks,
-Toshi 




* Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-03 23:14     ` Toshi Kani
@ 2015-03-04  1:00       ` Andrew Morton
  2015-03-04 16:23         ` Toshi Kani
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2015-03-04  1:00 UTC (permalink / raw)
  To: Toshi Kani
  Cc: hpa, tglx, mingo, arnd, linux-mm, x86, linux-kernel, dave.hansen,
	Elliott

On Tue, 03 Mar 2015 16:14:32 -0700 Toshi Kani <toshi.kani@hp.com> wrote:

> On Tue, 2015-03-03 at 14:44 -0800, Andrew Morton wrote:
> > On Tue,  3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
>  :
> > > +
> > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > > +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> > > +{
> > > +	u8 mtrr;
> > > +
> > > +	/*
> > > +	 * Do not use a huge page when the range is covered by non-WB type
> > > +	 * of MTRRs.
> > > +	 */
> > > +	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> > > +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > > +		return 0;
> > 
> > It would be good to notify the operator in some way when this happens. 
> > Otherwise the kernel will run more slowly and there's no way of knowing
> > why.  I guess slap a pr_info() in there.  Or maybe pr_warn()?
> 
> We only use 4KB mappings today, so this case will not make it run
> slowly, i.e. it will be the same as today.

Yes, but it would be slower than it would be if the operator fixed the
mtrr settings!  How do we let the operator know this?

>  Also, adding a message here
> can generate a lot of messages when MTRRs cover a large area.

Really?  This is only going to happen when a device driver requests a
huge io mapping, isn't it?  That's rare.  We could emit a warning,
return an error code and fall all the way back to the top-level ioremap
code which can then retry with 4k mappings.  Or something similar -
somehow record the fact that this warning has been emitted or use
printk ratelimiting (bad option).



* Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-04  1:00       ` Andrew Morton
@ 2015-03-04 16:23         ` Toshi Kani
  2015-03-04 20:17           ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Toshi Kani @ 2015-03-04 16:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: hpa, tglx, mingo, arnd, linux-mm, x86, linux-kernel, dave.hansen,
	Elliott, Robert (Server Storage)

On Wed, 2015-03-04 at 01:00 +0000, Andrew Morton wrote:
> On Tue, 03 Mar 2015 16:14:32 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
> 
> > On Tue, 2015-03-03 at 14:44 -0800, Andrew Morton wrote:
> > > On Tue,  3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
> >  :
> > > > +
> > > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > > > +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> > > > +{
> > > > +	u8 mtrr;
> > > > +
> > > > +	/*
> > > > +	 * Do not use a huge page when the range is covered by non-WB type
> > > > +	 * of MTRRs.
> > > > +	 */
> > > > +	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> > > > +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > > > +		return 0;
> > > 
> > > It would be good to notify the operator in some way when this happens. 
> > > Otherwise the kernel will run more slowly and there's no way of knowing
> > > why.  I guess slap a pr_info() in there.  Or maybe pr_warn()?
> > 
> > We only use 4KB mappings today, so this case will not make it run
> > slowly, i.e. it will be the same as today.
> 
> Yes, but it would be slower than it would be if the operator fixed the
> mtrr settings!  How do we let the operator know this?
> 
> >  Also, adding a message here
> > can generate a lot of messages when MTRRs cover a large area.
> 
> Really?  This is only going to happen when a device driver requests a
> huge io mapping, isn't it?  That's rare.  We could emit a warning,
> return an error code and fall all the way back to the top-level ioremap
> code which can then retry with 4k mappings.  Or something similar -
> somehow record the fact that this warning has been emitted or use
> printk ratelimiting (bad option).

Yes, an I/O device with a huge MMIO space that is covered by MTRRs is a
rare case.  BIOS does not need to specify with MTRRs how the MMIO range
of each card is to be accessed (and should not, since an MMIO address
is configurable on each card).

However, PCIe has the MMCONFIG space, the memory-mapped PCIe config
space, which must be accessed with UC.  The PCI subsystem calls
ioremap_nocache() to map the entire MMCONFIG space, which covers the
PCIe config space of all possible cards.  Here are the boot messages on
my test system.

  :
PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xc0000000-0xcfffffff] (base 0xc0000000)
PCI: MMCONFIG at [mem 0xc0000000-0xcfffffff] reserved in E820
  :

And MTRRs cover this MMCONFIG space with UC to ensure that the range is
always accessed with UC.

# cat /proc/mtrr
reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable

So, if we add a message to the code, it will be emitted many times
during this ioremap_nocache() call from PCI.

Ideally, pud_set_huge() and pmd_set_huge() should allow a huge page
mapping when the entire map range is covered by a single MTRR entry,
which is the case with MMCONFIG.  But I did not include such handling
in the patch because a UC mapping is slow by itself, MMCONFIG is only
accessed at boot time, and mtrr_type_lookup() does not provide the
level of info necessary.
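The single-entry case described here can be sketched in user space (an illustrative layout and helper, not the kernel's MTRR structures; the numbers below are the reg00 entry and the 256MB MMCONFIG window from this message):

```c
#include <assert.h>
#include <stdint.h>

/* One MTRR register entry -- illustrative layout, not the kernel's. */
struct mtrr_entry {
	uint64_t base;
	uint64_t size;
	uint8_t  type;
};

/*
 * True when [addr, addr + len) lies entirely within one MTRR entry,
 * i.e. the whole range is guaranteed a single, uniform memory type and
 * could safely be mapped with one huge page of that type.  Written to
 * avoid overflow in addr + len.
 */
static int range_in_single_mtrr(const struct mtrr_entry *e,
				uint64_t addr, uint64_t len)
{
	return addr >= e->base && len <= e->size &&
	       addr - e->base <= e->size - len;
}
```

With reg00 from above (base 0xc0000000, size 1024MB), the 256MB MMCONFIG window at 0xc0000000 sits entirely inside one entry, so a uniform-UC huge mapping would be type-safe there.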

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-04 16:23         ` Toshi Kani
@ 2015-03-04 20:17           ` Ingo Molnar
  2015-03-04 21:16             ` Toshi Kani
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2015-03-04 20:17 UTC (permalink / raw)
  To: Toshi Kani
  Cc: Andrew Morton, hpa, tglx, mingo, arnd, linux-mm, x86,
	linux-kernel, dave.hansen, Elliott, Robert (Server Storage)


* Toshi Kani <toshi.kani@hp.com> wrote:

> On Wed, 2015-03-04 at 01:00 +0000, Andrew Morton wrote:
> > On Tue, 03 Mar 2015 16:14:32 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
> > 
> > > On Tue, 2015-03-03 at 14:44 -0800, Andrew Morton wrote:
> > > > On Tue,  3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
> > >  :
> > > > > +
> > > > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > > > > +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> > > > > +{
> > > > > +	u8 mtrr;
> > > > > +
> > > > > +	/*
> > > > > +	 * Do not use a huge page when the range is covered by non-WB type
> > > > > +	 * of MTRRs.
> > > > > +	 */
> > > > > +	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> > > > > +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > > > > +		return 0;
> > > > 
> > > > It would be good to notify the operator in some way when this happens. 
> > > > Otherwise the kernel will run more slowly and there's no way of knowing
> > > > why.  I guess slap a pr_info() in there.  Or maybe pr_warn()?
> > > 
> > > We only use 4KB mappings today, so this case will not make it run
> > > slowly, i.e. it will be the same as today.
> > 
> > Yes, but it would be slower than it would be if the operator fixed the
> > mtrr settings!  How do we let the operator know this?
> > 
> > >  Also, adding a message here
> > > can generate a lot of messages when MTRRs cover a large area.
> > 
> > Really?  This is only going to happen when a device driver 
> > requests a huge io mapping, isn't it?  That's rare.  We could emit 
> > a warning, return an error code and fall all the way back to the 
> > top-level ioremap code which can then retry with 4k mappings.  Or 
> > something similar - somehow record the fact that this warning has 
> > been emitted or use printk ratelimiting (bad option).
> 
> Yes, an IO device with a huge MMIO space that is covered by MTRRs is 
> a rare case.  BIOS does not need to specify how MMIO of each card 
> needs to be accessed with MTRRs (or BIOS should not do it since an 
> MMIO address is configurable on each card).
> 
> However, PCIe has the MMCONFIG space, PCIe config space, which is 
> also memory mapped and must be accessed with UC.  The PCI subsystem 
> calls ioremap_nocache() to map the entire MMCONFIG space, which 
> covers the PCIe config space of all possible cards.  Here are boot 
> messages on my test system.
> 
>   :
> PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xc0000000-0xcfffffff] (base 0xc0000000)
> PCI: MMCONFIG at [mem 0xc0000000-0xcfffffff] reserved in E820
>   :
> 
> And MTRRs cover this MMCONFIG space with UC to assure that the range is
> always accessed with UC.

So the PCI code ioremap()s this 256 MB mmconfig space in its entirety 
currently?

> 
> # cat /proc/mtrr
> reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
> 
> So, if we add a message into the code, it will be displayed many 
> times in this ioremap_nocache() call from PCI.

So, in this specific case, when a single MTRR covers it with a single 
cache policy, I think we can safely map it UC using hugepmds?

That will 'shut up' the warning the right way: by making the code 
work?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86
  2015-03-04 20:17           ` Ingo Molnar
@ 2015-03-04 21:16             ` Toshi Kani
  0 siblings, 0 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-04 21:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, hpa, tglx, mingo, arnd, linux-mm, x86,
	linux-kernel, dave.hansen, Elliott, Robert (Server Storage)

On Wed, 2015-03-04 at 21:17 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hp.com> wrote:
> 
> > On Wed, 2015-03-04 at 01:00 +0000, Andrew Morton wrote:
> > > On Tue, 03 Mar 2015 16:14:32 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
> > > 
> > > > On Tue, 2015-03-03 at 14:44 -0800, Andrew Morton wrote:
> > > > > On Tue,  3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@hp.com> wrote:
> > > >  :
> > > > > > +
> > > > > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > > > > > +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> > > > > > +{
> > > > > > +	u8 mtrr;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Do not use a huge page when the range is covered by non-WB type
> > > > > > +	 * of MTRRs.
> > > > > > +	 */
> > > > > > +	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> > > > > > +	if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > > > > > +		return 0;
> > > > > 
> > > > > It would be good to notify the operator in some way when this happens. 
> > > > > Otherwise the kernel will run more slowly and there's no way of knowing
> > > > > why.  I guess slap a pr_info() in there.  Or maybe pr_warn()?
> > > > 
> > > > We only use 4KB mappings today, so this case will not make it run
> > > > slowly, i.e. it will be the same as today.
> > > 
> > > Yes, but it would be slower than it would be if the operator fixed the
> > > mtrr settings!  How do we let the operator know this?
> > > 
> > > >  Also, adding a message here
> > > > can generate a lot of messages when MTRRs cover a large area.
> > > 
> > > Really?  This is only going to happen when a device driver 
> > > requests a huge io mapping, isn't it?  That's rare.  We could emit 
> > > a warning, return an error code and fall all the way back to the 
> > > top-level ioremap code which can then retry with 4k mappings.  Or 
> > > something similar - somehow record the fact that this warning has 
> > > been emitted or use printk ratelimiting (bad option).
> > 
> > Yes, an IO device with a huge MMIO space that is covered by MTRRs is 
> > a rare case.  BIOS does not need to specify how MMIO of each card 
> > needs to be accessed with MTRRs (or BIOS should not do it since an 
> > MMIO address is configurable on each card).
> > 
> > However, PCIe has the MMCONFIG space, PCIe config space, which is 
> > also memory mapped and must be accessed with UC.  The PCI subsystem 
> > calls ioremap_nocache() to map the entire MMCONFIG space, which 
> > covers the PCIe config space of all possible cards.  Here are boot 
> > messages on my test system.
> > 
> >   :
> > PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xc0000000-0xcfffffff] (base 0xc0000000)
> > PCI: MMCONFIG at [mem 0xc0000000-0xcfffffff] reserved in E820
> >   :
> > 
> > And MTRRs cover this MMCONFIG space with UC to assure that the range is
> > always accessed with UC.
> 
> So the PCI code ioremap()s this 256 MB mmconfig space in its entirety 
> currently?

Yes.

> > # cat /proc/mtrr
> > reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
> > 
> > So, if we add a message into the code, it will be displayed many 
> > times in this ioremap_nocache() call from PCI.
> 
> So, in this specific case, when a single MTRR covers it with a single 
> cache policy, I think we can safely map it UC using hugepmds?

Yes.

> That will 'shut up' the warning the right way: by making the code 
> work?

I see your point.  I will look into mtrr_type_lookup() to see if we can
make it work in a manageable way.

Thanks,
-Toshi


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings
  2015-03-03 17:44 ` [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings Toshi Kani
@ 2015-03-04 22:09   ` Ingo Molnar
  2015-03-04 23:15     ` Toshi Kani
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2015-03-04 22:09 UTC (permalink / raw)
  To: Toshi Kani
  Cc: akpm, hpa, tglx, mingo, arnd, linux-mm, x86, linux-kernel,
	dave.hansen, Elliott


* Toshi Kani <toshi.kani@hp.com> wrote:

> ioremap_pud_range() and ioremap_pmd_range() are changed to create 
> huge I/O mappings when their capability is enabled, and a request 
> meets required conditions -- both virtual & physical addresses are 
> aligned to their huge page size, and a requested range fulfills their 
> huge page size.  When pud_set_huge() or pmd_set_huge() returns zero, 
> i.e. no-operation is performed, the code simply falls back to the 
> next level.
> 
> The changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is
> defined on the architecture.
> 
> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> ---
>  include/asm-generic/pgtable.h |   15 +++++++++++++++
>  lib/ioremap.c                 |   16 ++++++++++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 4d46085..bf6e86c 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -6,6 +6,7 @@
>  
>  #include <linux/mm_types.h>
>  #include <linux/bug.h>
> +#include <linux/errno.h>
>  
>  /*
>   * On almost all architectures and configurations, 0 can be used as the
> @@ -697,4 +698,18 @@ static inline int pmd_protnone(pmd_t pmd)
>  #define io_remap_pfn_range remap_pfn_range
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
> +int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
> +#else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> +{
> +	return 0;
> +}
> +static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
> +{
> +	return 0;
> +}
> +#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
> +
>  #endif /* _ASM_GENERIC_PGTABLE_H */
> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index 0ce18aa..3055ada 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -81,6 +81,14 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = pmd_addr_end(addr, end);
> +
> +		if (ioremap_pmd_enabled() &&
> +		    ((next - addr) == PMD_SIZE) &&
> +		    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
> +			if (pmd_set_huge(pmd, phys_addr + addr, prot))
> +				continue;
> +		}
> +
>  		if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot))
>  			return -ENOMEM;
>  	} while (pmd++, addr = next, addr != end);
> @@ -99,6 +107,14 @@ static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
>  		return -ENOMEM;
>  	do {
>  		next = pud_addr_end(addr, end);
> +
> +		if (ioremap_pud_enabled() &&
> +		    ((next - addr) == PUD_SIZE) &&
> +		    IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
> +			if (pud_set_huge(pud, phys_addr + addr, prot))
> +				continue;
> +		}
> +
>  		if (ioremap_pmd_range(pud, addr, next, phys_addr + addr, prot))
>  			return -ENOMEM;
>  	} while (pud++, addr = next, addr != end);
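The condition added in the hunks quoted above can be modeled as a stand-alone sketch (`try_pmd_huge()` is an illustrative name; in the patch the test is open-coded in ioremap_pmd_range() before calling pmd_set_huge()):

```c
#include <assert.h>

#define PMD_SHIFT 21
#define PMD_SIZE  (1UL << PMD_SHIFT)	/* 2MB on x86 */

/*
 * Models the gate in ioremap_pmd_range(): a 2MB mapping is attempted
 * only when the current iteration spans a full PMD and the physical
 * address is PMD-aligned.  (The virtual address already is, because
 * pmd_addr_end() clamped `next` to a PMD boundary.)  A zero result
 * means "fall back to ioremap_pte_range() and map with 4KB PTEs".
 */
static int try_pmd_huge(unsigned long addr, unsigned long next,
			unsigned long phys_addr)
{
	return ((next - addr) == PMD_SIZE) &&
	       (((phys_addr + addr) & (PMD_SIZE - 1)) == 0);
}
```

The PUD-level condition is identical with PUD_SIZE (1GB on x86) substituted for PMD_SIZE.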

Hm, so I don't see where you set the proper x86 PAT table attributes 
for the pmds.

MTRR's are basically a legacy mechanism, the proper way to set cache 
attribute is PAT and I don't see where this generic code does that, 
but I might be missing something?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings
  2015-03-04 22:09   ` Ingo Molnar
@ 2015-03-04 23:15     ` Toshi Kani
  0 siblings, 0 replies; 15+ messages in thread
From: Toshi Kani @ 2015-03-04 23:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, hpa, tglx, mingo, arnd, linux-mm, x86, linux-kernel,
	dave.hansen, Elliott

On Wed, 2015-03-04 at 23:09 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hp.com> wrote:
 :
> Hm, so I don't see where you set the proper x86 PAT table attributes 
> for the pmds.
> 
> MTRR's are basically a legacy mechanism, the proper way to set cache 
> attribute is PAT and I don't see where this generic code does that, 
> but I might be missing something?

It's done by the x86 code, not by this generic code.  __ioremap_caller()
takes a page_cache_mode and converts it to a pgprot_t using the PAT table
attributes.  It then calls this generic function, ioremap_page_range().
When creating a huge page mapping, pud_set_huge() and pmd_set_huge()
handle the relocation of the PAT bit.
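The PAT-bit relocation mentioned here can be illustrated with a small sketch (on x86 the PAT bit is bit 7 in a 4KB PTE, but bit 7 is the PS bit in a PMD/PUD entry, so the PAT bit moves to bit 12 in a huge-page entry; `prot_4k_to_large()` is an illustrative helper, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define _PAGE_BIT_PSE        7	/* PS: marks a huge (2MB/1GB) entry */
#define _PAGE_BIT_PAT        7	/* PAT bit position in a 4KB PTE */
#define _PAGE_BIT_PAT_LARGE 12	/* PAT bit position in a huge entry */

/*
 * Sketch of the pgprot fixup a pud_set_huge()/pmd_set_huge()
 * implementation must perform: the caller computed `prot` in 4KB-PTE
 * layout, so if the PAT bit (bit 7) is set it must be moved to bit 12,
 * since bit 7 means PS in a huge entry and is set unconditionally.
 */
static uint64_t prot_4k_to_large(uint64_t prot)
{
	if (prot & (1ULL << _PAGE_BIT_PAT)) {
		prot &= ~(1ULL << _PAGE_BIT_PAT);
		prot |= 1ULL << _PAGE_BIT_PAT_LARGE;
	}
	return prot | (1ULL << _PAGE_BIT_PSE);
}
```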

Thanks,
-Toshi    


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-03-04 23:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-03 17:44 [PATCH v3 0/6] Kernel huge I/O mapping support Toshi Kani
2015-03-03 17:44 ` [PATCH v3 1/6] mm: Change __get_vm_area_node() to use fls_long() Toshi Kani
2015-03-03 17:44 ` [PATCH v3 2/6] lib: Add huge I/O map capability interfaces Toshi Kani
2015-03-03 17:44 ` [PATCH v3 3/6] mm: Change ioremap to set up huge I/O mappings Toshi Kani
2015-03-04 22:09   ` Ingo Molnar
2015-03-04 23:15     ` Toshi Kani
2015-03-03 17:44 ` [PATCH v3 4/6] mm: Change vunmap to tear down huge KVA mappings Toshi Kani
2015-03-03 17:44 ` [PATCH v3 5/6] x86, mm: Support huge I/O mapping capability I/F Toshi Kani
2015-03-03 17:44 ` [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86 Toshi Kani
2015-03-03 22:44   ` Andrew Morton
2015-03-03 23:14     ` Toshi Kani
2015-03-04  1:00       ` Andrew Morton
2015-03-04 16:23         ` Toshi Kani
2015-03-04 20:17           ` Ingo Molnar
2015-03-04 21:16             ` Toshi Kani
