LKML Archive on lore.kernel.org
* [QUICKLIST 1/5] Quicklists for page table pages V4
@ 2007-03-23  6:28 Christoph Lameter
  2007-03-23  6:28 ` [QUICKLIST 2/5] Quicklist support for IA64 Christoph Lameter
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23  6:28 UTC
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

Quicklists for page table pages V4

V3->V4
- Rename quicklist_check to quicklist_trim and allow parameters
  to specify how to clean quicklists.
- Remove dead code

V2->V3
- Fix Kconfig issues by setting CONFIG_QUICKLIST explicitly
  and default to one quicklist if NR_QUICK is not set.
- Fix i386 support. (Cannot mix PMD and PTE allocs.)
- Discussion of V2.
  http://marc.info/?l=linux-kernel&m=117391339914767&w=2

V1->V2
- Add sparc64 patch
- Single i386 and x86_64 patch
- Update attribution
- Update justification
- Update approvals
- Earlier discussion of V1 was at
  http://marc.info/?l=linux-kernel&m=117357922219342&w=2

This patchset introduces an arch independent framework to handle lists
of recently used page table pages to replace the existing (ab)use of the
slab for that purpose.

1. Proven code from the IA64 arch.

	The method used here has been fine tuned for years and
	is NUMA aware. It is based on the knowledge that accesses
	to page table pages are sparse in nature. Taking a page
	off the freelists instead of allocating a zeroed page
	allows a reduction of the number of cachelines touched
	in addition to getting rid of the slab overhead. So
	performance improves. This is particularly useful if pgds
	contain standard mappings. We can save on the teardown
	and setup of such a page if we have some on the quicklists.
	This includes avoiding list operations that are otherwise
	necessary on alloc and free to track pgds.

2. Lightweight alternative to using the slab to manage page size pages

	Slab overhead is significant and even page allocator use
	is pretty heavy weight. The use of a per cpu quicklist
	means that we touch only two cachelines for an allocation.
	There is no need to access the struct page (unless arch code
	needs to fiddle around with it). So the fast path just
	means bringing in one cacheline at the beginning of the
	page. That same cacheline may then be used to store the
	page table entry. Or a second cacheline may be used
	if the page table entry is not in the first cacheline of
	the page. The current code will zero the page which means
	touching 32 cachelines (assuming 4k pages and 128 byte
	cachelines). We get down from 32 to 2 cachelines in the
	fast path.

3. Fix conflicting use of struct page by slab and arch code.

	F.e. both arches use the ->private and ->index fields to
	create lists of pgds and i386 also uses other page flags. The slab
	can also use the ->private field for allocations that
	are larger than page size which would occur if one enables
	debugging. In that case the arch code would overwrite the
	pointer to the first page of the compound page allocated
	by the slab. SLAB has been modified to not enable
	debugging for such slabs (!).

	There is the potential for additional conflicts
	here, especially since some arches also use page flags to mark
	page table pages.

	The patch removes these conflicts by no longer using
	the slab for these purposes. The page allocator is more
	suitable since PAGE_SIZE chunks are its domain.
	Then we can start using standard list operations via
	page->lru instead of improvising linked lists.

	SLUB makes more extensive use of the page struct and so
	far had to create workarounds for these slabs. The ->index
	field is used for the SLUB freelist. So SLUB cannot allow
	the use of a freelist for these slabs and--like slab--
	currently does not allow debugging and forces slabs to
	only contain a single object (avoids freelist).

	If we do not get rid of these issues then both SLAB and SLUB
	have to continue to provide special code paths to support these
	slabs.

4. i386 gets lightweight NUMA aware management of page table pages.

	Note that the use of SLAB on NUMA systems will require the
	use of alien caches to efficiently remove remote page
	table pages, which (for a PAGE_SIZEd allocation) is a lengthy
	and expensive process. With quicklists no alien caches are
	needed. Pages can be simply returned to the correct node.

5. x86_64 gets lightweight page table page management.

	This will allow x86_64 arch code to repopulate pgds and
	other page table entries faster. The list operations for
	pgds are reduced in the same way as for i386: they are only
	needed when a pgd is allocated from the page allocator and
	when it is freed back to the page allocator. A pgd can pass
	through the quicklists without having to be reinitialized.

6. Consolidation of code from multiple arches

	So far arches have their own implementation of quicklist
	management. This patch moves that feature into the core allowing
	easier maintenance and consistent management of quicklists.

Page table pages have the characteristic that they are typically zero
or in a known state when they are freed. This is usually exactly the
same state as needed after allocation. So it makes sense to build a list
of freed page table pages and then consume the pages that were already
in use first. Those pages have already been initialized correctly (thus no
need to zero them) and are likely already cached in such a way that
the MMU can use them most effectively. Page table pages are used in
a sparse way so zeroing them on allocation is not too useful.

Such an implementation already exists for ia64. However, that implementation
did not support constructors and destructors as needed by i386 / x86_64.
It also only supported a single quicklist. The implementation here has
constructor and destructor support as well as the ability for an arch to
specify how many quicklists are needed.

Quicklists are defined by an arch defining CONFIG_QUICKLIST. If more
than one quicklist is necessary then we can define NR_QUICK for additional
lists. F.e. i386 needs two and thus has

config NR_QUICK
	int
	default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:


quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)


Page table pages can be freed using:


quicklist_free(<quicklist-nr>, <destructor>, <page>)


Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.

If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.

Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.

Tested on:
i386 UP / SMP, x86_64 UP, NUMA emulation, IA64 NUMA.

Index: linux-2.6.21-rc4-mm1/include/linux/quicklist.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.21-rc4-mm1/include/linux/quicklist.h	2007-03-20 15:03:05.000000000 -0700
@@ -0,0 +1,84 @@
+#ifndef LINUX_QUICKLIST_H
+#define LINUX_QUICKLIST_H
+/*
+ * Fast allocations and disposal of pages. Pages must be in the condition
+ * as needed after allocation when they are freed. Per cpu lists of pages
+ * are kept that only contain node local pages.
+ *
+ * (C) 2007, SGI. Christoph Lameter <clameter@sgi.com>
+ */
+#include <linux/kernel.h>
+#include <linux/gfp.h>
+#include <linux/percpu.h>
+
+#ifdef CONFIG_QUICKLIST
+
+struct quicklist {
+	void *page;
+	int nr_pages;
+};
+
+DECLARE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];
+
+/*
+ * The two key functions quicklist_alloc and quicklist_free are inline so
+ * that they may be custom compiled for the platform.
+ * Specifying a NULL ctor can remove constructor support. Specifying
+ * a constant quicklist allows the determination of the exact address
+ * in the per cpu area.
+ *
+ * The fast path in quicklist_alloc touches only a per cpu cacheline and
+ * the first cacheline of the page itself. There is minimal overhead involved.
+ */
+static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
+{
+	struct quicklist *q;
+	void **p = NULL;
+
+	q = &get_cpu_var(quicklist)[nr];
+	p = q->page;
+	if (likely(p)) {
+		q->page = p[0];
+		p[0] = NULL;
+		q->nr_pages--;
+	}
+	put_cpu_var(quicklist);
+	if (likely(p))
+		return p;
+
+	p = (void *)__get_free_page(flags | __GFP_ZERO);
+	if (ctor && p)
+		ctor(p);
+	return p;
+}
+
+static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
+{
+	struct quicklist *q;
+	void **p = pp;
+	struct page *page = virt_to_page(p);
+	int nid = page_to_nid(page);
+
+	if (unlikely(nid != numa_node_id())) {
+		if (dtor)
+			dtor(p);
+		free_page((unsigned long)p);
+		return;
+	}
+
+	q = &get_cpu_var(quicklist)[nr];
+	p[0] = q->page;
+	q->page = p;
+	q->nr_pages++;
+	put_cpu_var(quicklist);
+}
+
+void quicklist_trim(int nr, void (*dtor)(void *),
+	unsigned long min_pages, unsigned long max_free);
+
+unsigned long quicklist_total_size(void);
+
+#endif
+
+#endif /* LINUX_QUICKLIST_H */
+
Index: linux-2.6.21-rc4-mm1/mm/Makefile
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/Makefile	2007-03-20 15:02:58.000000000 -0700
+++ linux-2.6.21-rc4-mm1/mm/Makefile	2007-03-20 15:59:50.000000000 -0700
@@ -30,3 +30,5 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
+obj-$(CONFIG_QUICKLIST) += quicklist.o
+
Index: linux-2.6.21-rc4-mm1/mm/quicklist.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.21-rc4-mm1/mm/quicklist.c	2007-03-20 15:03:05.000000000 -0700
@@ -0,0 +1,88 @@
+/*
+ * Quicklist support.
+ *
+ * Quicklists are light weight lists of pages that have a defined state
+ * on alloc and free. Pages must be in the quicklist specific defined state
+ * (zero by default) when the page is freed. It seems that the initial idea
+ * for such lists first came from Dave Miller and then various other people
+ * improved on it.
+ *
+ * Copyright (C) 2007 SGI,
+ * 	Christoph Lameter <clameter@sgi.com>
+ * 		Generalized, added support for multiple lists and
+ * 		constructors / destructors.
+ */
+#include <linux/kernel.h>
+
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/quicklist.h>
+
+DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK];
+
+#define FRACTION_OF_NODE_MEM	16
+
+static unsigned long max_pages(unsigned long min_pages)
+{
+	unsigned long node_free_pages, max;
+
+	node_free_pages = node_page_state(numa_node_id(),
+			NR_FREE_PAGES);
+	max = node_free_pages / FRACTION_OF_NODE_MEM;
+	return max(max, min_pages);
+}
+
+static long min_pages_to_free(struct quicklist *q,
+	unsigned long min_pages, long max_free)
+{
+	long pages_to_free;
+
+	pages_to_free = q->nr_pages - max_pages(min_pages);
+
+	return min(pages_to_free, max_free);
+}
+
+/*
+ * Trim down the number of pages in the quicklist
+ */
+void quicklist_trim(int nr, void (*dtor)(void *),
+	unsigned long min_pages, unsigned long max_free)
+{
+	long pages_to_free;
+	struct quicklist *q;
+
+	q = &get_cpu_var(quicklist)[nr];
+	if (q->nr_pages > min_pages) {
+		pages_to_free = min_pages_to_free(q, min_pages, max_free);
+
+		while (pages_to_free > 0) {
+			/*
+			 * We pass a gfp_t of 0 to quicklist_alloc here
+			 * because we will never call into the page allocator.
+			 */
+			void *p = quicklist_alloc(nr, 0, NULL);
+
+			if (dtor)
+				dtor(p);
+			free_page((unsigned long)p);
+			pages_to_free--;
+		}
+	}
+	put_cpu_var(quicklist);
+}
+
+unsigned long quicklist_total_size(void)
+{
+	unsigned long count = 0;
+	int cpu;
+	struct quicklist *ql, *q;
+
+	for_each_online_cpu(cpu) {
+		ql = per_cpu(quicklist, cpu);
+		for (q = ql; q < ql + CONFIG_NR_QUICK; q++)
+			count += q->nr_pages;
+	}
+	return count;
+}
+
Index: linux-2.6.21-rc4-mm1/mm/Kconfig
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/Kconfig	2007-03-20 15:03:04.000000000 -0700
+++ linux-2.6.21-rc4-mm1/mm/Kconfig	2007-03-20 16:00:22.000000000 -0700
@@ -220,3 +220,8 @@ config DEBUG_READAHEAD
 
 	  Say N for production servers.
 
+config NR_QUICK
+	int
+	depends on QUICKLIST
+	default "1"
+


* [QUICKLIST 2/5] Quicklist support for IA64
  2007-03-23  6:28 [QUICKLIST 1/5] Quicklists for page table pages V4 Christoph Lameter
@ 2007-03-23  6:28 ` Christoph Lameter
  2007-03-23  6:28 ` [QUICKLIST 3/5] Quicklist support for i386 Christoph Lameter
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23  6:28 UTC
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

Quicklist for IA64

IA64 is the origin of the quicklist implementation. So cut out the pieces
that are now in core code and modify the functions called.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc4-mm1/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/ia64/mm/init.c	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/ia64/mm/init.c	2007-03-20 14:21:47.000000000 -0700
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr;	/* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES			25UL
-#define MAX_PGT_FREES_PER_PASS		16L
-#define PGT_FRACTION_OF_NODE_MEM	16
-
-static inline long
-max_pgt_pages(void)
-{
-	u64 node_free_pages, max_pgt_pages;
-
-#ifndef	CONFIG_NUMA
-	node_free_pages = nr_free_pages();
-#else
-	node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-	max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-	max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-	return max_pgt_pages;
-}
-
-static inline long
-min_pages_to_free(void)
-{
-	long pages_to_free;
-
-	pages_to_free = pgtable_quicklist_size - max_pgt_pages();
-	pages_to_free = min(pages_to_free, MAX_PGT_FREES_PER_PASS);
-	return pages_to_free;
-}
-
-void
-check_pgt_cache(void)
-{
-	long pages_to_free;
-
-	if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES))
-		return;
-
-	preempt_disable();
-	while (unlikely((pages_to_free = min_pages_to_free()) > 0)) {
-		while (pages_to_free--) {
-			free_page((unsigned long)pgtable_quicklist_alloc());
-		}
-		preempt_enable();
-		preempt_disable();
-	}
-	preempt_enable();
-}
-
 void
 lazy_mmu_prot_update (pte_t pte)
 {
Index: linux-2.6.21-rc4-mm1/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/asm-ia64/pgalloc.h	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-ia64/pgalloc.h	2007-03-20 14:55:47.000000000 -0700
@@ -18,71 +18,18 @@
 #include <linux/mm.h>
 #include <linux/page-flags.h>
 #include <linux/threads.h>
+#include <linux/quicklist.h>
 
 #include <asm/mmu_context.h>
 
-DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
-#define pgtable_quicklist __ia64_per_cpu_var(__pgtable_quicklist)
-DECLARE_PER_CPU(long, __pgtable_quicklist_size);
-#define pgtable_quicklist_size __ia64_per_cpu_var(__pgtable_quicklist_size)
-
-static inline long pgtable_quicklist_total_size(void)
-{
-	long ql_size = 0;
-	int cpuid;
-
-	for_each_online_cpu(cpuid) {
-		ql_size += per_cpu(__pgtable_quicklist_size, cpuid);
-	}
-	return ql_size;
-}
-
-static inline void *pgtable_quicklist_alloc(void)
-{
-	unsigned long *ret = NULL;
-
-	preempt_disable();
-
-	ret = pgtable_quicklist;
-	if (likely(ret != NULL)) {
-		pgtable_quicklist = (unsigned long *)(*ret);
-		ret[0] = 0;
-		--pgtable_quicklist_size;
-		preempt_enable();
-	} else {
-		preempt_enable();
-		ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
-	}
-
-	return ret;
-}
-
-static inline void pgtable_quicklist_free(void *pgtable_entry)
-{
-#ifdef CONFIG_NUMA
-	int nid = page_to_nid(virt_to_page(pgtable_entry));
-
-	if (unlikely(nid != numa_node_id())) {
-		free_page((unsigned long)pgtable_entry);
-		return;
-	}
-#endif
-
-	preempt_disable();
-	*(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
-	pgtable_quicklist = (unsigned long *)pgtable_entry;
-	++pgtable_quicklist_size;
-	preempt_enable();
-}
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t * pgd)
 {
-	pgtable_quicklist_free(pgd);
+	quicklist_free(0, NULL, pgd);
 }
 
 #ifdef CONFIG_PGTABLE_4
@@ -94,12 +41,12 @@ pgd_populate(struct mm_struct *mm, pgd_t
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pud_free(pud_t * pud)
 {
-	pgtable_quicklist_free(pud);
+	quicklist_free(0, NULL, pud);
 }
 #define __pud_free_tlb(tlb, pud)	pud_free(pud)
 #endif /* CONFIG_PGTABLE_4 */
@@ -112,12 +59,12 @@ pud_populate(struct mm_struct *mm, pud_t
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t * pmd)
 {
-	pgtable_quicklist_free(pmd);
+	quicklist_free(0, NULL, pmd);
 }
 
 #define __pmd_free_tlb(tlb, pmd)	pmd_free(pmd)
@@ -137,28 +84,31 @@ pmd_populate_kernel(struct mm_struct *mm
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 					 unsigned long addr)
 {
-	void *pg = pgtable_quicklist_alloc();
+	void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
 	return pg ? virt_to_page(pg) : NULL;
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long addr)
 {
-	return pgtable_quicklist_alloc();
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pte_free(struct page *pte)
 {
-	pgtable_quicklist_free(page_address(pte));
+	quicklist_free(0, NULL, page_address(pte));
 }
 
 static inline void pte_free_kernel(pte_t * pte)
 {
-	pgtable_quicklist_free(pte);
+	quicklist_free(0, NULL, pte);
 }
 
-#define __pte_free_tlb(tlb, pte)	pte_free(pte)
+static inline void check_pgt_cache(void)
+{
+	quicklist_trim(0, NULL, 25, 16);
+}
 
-extern void check_pgt_cache(void);
+#define __pte_free_tlb(tlb, pte)	pte_free(pte)
 
 #endif				/* _ASM_IA64_PGALLOC_H */
Index: linux-2.6.21-rc4-mm1/arch/ia64/mm/contig.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/ia64/mm/contig.c	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/ia64/mm/contig.c	2007-03-20 14:21:47.000000000 -0700
@@ -88,7 +88,7 @@ void show_mem(void)
 	printk(KERN_INFO "%d pages shared\n", total_shared);
 	printk(KERN_INFO "%d pages swap cached\n", total_cached);
 	printk(KERN_INFO "Total of %ld pages in page table cache\n",
-	       pgtable_quicklist_total_size());
+	       quicklist_total_size());
 	printk(KERN_INFO "%d free buffer pages\n", nr_free_buffer_pages());
 }
 
Index: linux-2.6.21-rc4-mm1/arch/ia64/mm/discontig.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/ia64/mm/discontig.c	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/ia64/mm/discontig.c	2007-03-20 14:21:47.000000000 -0700
@@ -563,7 +563,7 @@ void show_mem(void)
 	printk(KERN_INFO "%d pages shared\n", total_shared);
 	printk(KERN_INFO "%d pages swap cached\n", total_cached);
 	printk(KERN_INFO "Total of %ld pages in page table cache\n",
-	       pgtable_quicklist_total_size());
+	       quicklist_total_size());
 	printk(KERN_INFO "%d free buffer pages\n", nr_free_buffer_pages());
 }
 
Index: linux-2.6.21-rc4-mm1/arch/ia64/Kconfig
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/ia64/Kconfig	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/ia64/Kconfig	2007-03-20 14:21:47.000000000 -0700
@@ -29,6 +29,10 @@ config ZONE_DMA
 	def_bool y
 	depends on !IA64_SGI_SN2
 
+config QUICKLIST
+	bool
+	default y
+
 config MMU
 	bool
 	default y


* [QUICKLIST 3/5] Quicklist support for i386
  2007-03-23  6:28 [QUICKLIST 1/5] Quicklists for page table pages V4 Christoph Lameter
  2007-03-23  6:28 ` [QUICKLIST 2/5] Quicklist support for IA64 Christoph Lameter
@ 2007-03-23  6:28 ` Christoph Lameter
  2007-03-23  6:28 ` [QUICKLIST 4/5] Quicklist support for x86_64 Christoph Lameter
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23  6:28 UTC
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

i386: Convert to quicklists

Implement the i386 management of pgd and pmds using quicklists.

The i386 management of page table pages currently uses page sized slabs.
Getting rid of that using quicklists allows full use of the page flags
and the page->lru. So get rid of the improvised linked lists using
page->index and page->private.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc4-mm1/arch/i386/mm/init.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/mm/init.c	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/mm/init.c	2007-03-20 14:21:52.000000000 -0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-	if (PTRS_PER_PMD > 1) {
-		pmd_cache = kmem_cache_create("pmd",
-					PTRS_PER_PMD*sizeof(pmd_t),
-					PTRS_PER_PMD*sizeof(pmd_t),
-					0,
-					pmd_ctor,
-					NULL);
-		if (!pmd_cache)
-			panic("pgtable_cache_init(): cannot create pmd cache");
-	}
-	pgd_cache = kmem_cache_create("pgd",
-				PTRS_PER_PGD*sizeof(pgd_t),
-				PTRS_PER_PGD*sizeof(pgd_t),
-				0,
-				pgd_ctor,
-				PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
-	if (!pgd_cache)
-		panic("pgtable_cache_init(): Cannot create pgd cache");
-}
-
 /*
  * This function cannot be __init, since exceptions don't work in that
  * section.  Put this after the callers, so that it cannot be inlined.
Index: linux-2.6.21-rc4-mm1/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/mm/pgtable.c	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/mm/pgtable.c	2007-03-20 14:55:47.000000000 -0700
@@ -13,6 +13,7 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/quicklist.h>
 
 #include <asm/system.h>
 #include <asm/pgtable.h>
@@ -198,11 +199,6 @@ struct page *pte_alloc_one(struct mm_str
 	return pte;
 }
 
-void pmd_ctor(void *pmd, struct kmem_cache *cache, unsigned long flags)
-{
-	memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -211,36 +207,18 @@ void pmd_ctor(void *pmd, struct kmem_cac
  * against pageattr.c; it is the unique case in which a valid change
  * of kernel pagetables can't be lazily synchronized by vmalloc faults.
  * vmalloc faults work because attached pagetables are never freed.
- * The locking scheme was chosen on the basis of manfred's
- * recommendations and having no core impact whatsoever.
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
-static inline void pgd_list_add(pgd_t *pgd)
-{
-	struct page *page = virt_to_page(pgd);
-	page->index = (unsigned long)pgd_list;
-	if (pgd_list)
-		set_page_private(pgd_list, (unsigned long)&page->index);
-	pgd_list = page;
-	set_page_private(page, (unsigned long)&pgd_list);
-}
+LIST_HEAD(pgd_list);
 
-static inline void pgd_list_del(pgd_t *pgd)
-{
-	struct page *next, **pprev, *page = virt_to_page(pgd);
-	next = (struct page *)page->index;
-	pprev = (struct page **)page_private(page);
-	*pprev = next;
-	if (next)
-		set_page_private(next, (unsigned long)pprev);
-}
+#define QUICK_PGD 0
+#define QUICK_PMD 1
 
-void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_ctor(void *pgd)
 {
 	unsigned long flags;
+	struct page *page = virt_to_page(pgd);
 
 	if (PTRS_PER_PMD == 1) {
 		memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
@@ -259,31 +237,32 @@ void pgd_ctor(void *pgd, struct kmem_cac
 			__pa(swapper_pg_dir) >> PAGE_SHIFT,
 			USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
 
-	pgd_list_add(pgd);
+	list_add(&page->lru, &pgd_list);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 /* never called when PTRS_PER_PMD > 1 */
-void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_dtor(void *pgd)
 {
 	unsigned long flags; /* can be called from interrupt context */
+	struct page *page = virt_to_page(pgd);
 
 	paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
 	spin_lock_irqsave(&pgd_lock, flags);
-	pgd_list_del(pgd);
+	list_del(&page->lru);
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	int i;
-	pgd_t *pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);
+	pgd_t *pgd = quicklist_alloc(QUICK_PGD, GFP_KERNEL, pgd_ctor);
 
 	if (PTRS_PER_PMD == 1 || !pgd)
 		return pgd;
 
 	for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-		pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
+		pmd_t *pmd = quicklist_alloc(QUICK_PMD, GFP_KERNEL, NULL);
 		if (!pmd)
 			goto out_oom;
 		paravirt_alloc_pd(__pa(pmd) >> PAGE_SHIFT);
@@ -296,9 +275,9 @@ out_oom:
 		pgd_t pgdent = pgd[i];
 		void* pmd = (void *)__va(pgd_val(pgdent)-1);
 		paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-		kmem_cache_free(pmd_cache, pmd);
+		quicklist_free(QUICK_PMD, NULL, pmd);
 	}
-	kmem_cache_free(pgd_cache, pgd);
+	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
 	return NULL;
 }
 
@@ -312,8 +291,14 @@ void pgd_free(pgd_t *pgd)
 			pgd_t pgdent = pgd[i];
 			void* pmd = (void *)__va(pgd_val(pgdent)-1);
 			paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
-			kmem_cache_free(pmd_cache, pmd);
+			quicklist_free(QUICK_PMD, NULL, pmd);
 		}
 	/* in the non-PAE case, free_pgtables() clears user pgd entries */
-	kmem_cache_free(pgd_cache, pgd);
+	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
+}
+
+void check_pgt_cache(void)
+{
+	quicklist_trim(QUICK_PGD, pgd_dtor, 25, 16);
+	quicklist_trim(QUICK_PMD, NULL, 25, 16);
 }
Index: linux-2.6.21-rc4-mm1/arch/i386/Kconfig
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/Kconfig	2007-03-20 14:20:27.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/Kconfig	2007-03-20 14:21:52.000000000 -0700
@@ -55,6 +55,14 @@ config ZONE_DMA
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
+config NR_QUICK
+	int
+	default 2
+
 config SBUS
 	bool
 
Index: linux-2.6.21-rc4-mm1/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/asm-i386/pgtable.h	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-i386/pgtable.h	2007-03-20 14:21:52.000000000 -0700
@@ -35,15 +35,12 @@ struct vm_area_struct;
 #define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
 extern unsigned long empty_zero_page[1024];
 extern pgd_t swapper_pg_dir[1024];
-extern struct kmem_cache *pgd_cache;
-extern struct kmem_cache *pmd_cache;
-extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
 
-void pmd_ctor(void *, struct kmem_cache *, unsigned long);
-void pgd_ctor(void *, struct kmem_cache *, unsigned long);
-void pgd_dtor(void *, struct kmem_cache *, unsigned long);
-void pgtable_cache_init(void);
+void check_pgt_cache(void);
+
+extern spinlock_t pgd_lock;
+extern struct list_head pgd_list;
+static inline void pgtable_cache_init(void) {};
 void paging_init(void);
 
 /*
Index: linux-2.6.21-rc4-mm1/arch/i386/kernel/smp.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/kernel/smp.c	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/kernel/smp.c	2007-03-20 14:21:52.000000000 -0700
@@ -437,7 +437,7 @@ void flush_tlb_mm (struct mm_struct * mm
 	}
 	if (!cpus_empty(cpu_mask))
 		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
-
+	check_pgt_cache();
 	preempt_enable();
 }
 
Index: linux-2.6.21-rc4-mm1/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/kernel/process.c	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/kernel/process.c	2007-03-20 14:21:52.000000000 -0700
@@ -181,6 +181,7 @@ void cpu_idle(void)
 			if (__get_cpu_var(cpu_idle_state))
 				__get_cpu_var(cpu_idle_state) = 0;
 
+			check_pgt_cache();
 			rmb();
 			idle = pm_idle;
 
Index: linux-2.6.21-rc4-mm1/include/asm-i386/pgalloc.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/asm-i386/pgalloc.h	2007-03-20 14:21:00.000000000 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-i386/pgalloc.h	2007-03-20 14:21:52.000000000 -0700
@@ -65,6 +65,6 @@ do {									\
 #define pud_populate(mm, pmd, pte)	BUG()
 #endif
 
-#define check_pgt_cache()	do { } while (0)
+extern void check_pgt_cache(void);
 
 #endif /* _I386_PGALLOC_H */
Index: linux-2.6.21-rc4-mm1/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/mm/fault.c	2007-03-20 14:20:28.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/mm/fault.c	2007-03-20 14:21:52.000000000 -0700
@@ -623,11 +623,10 @@ void vmalloc_sync_all(void)
 			struct page *page;
 
 			spin_lock_irqsave(&pgd_lock, flags);
-			for (page = pgd_list; page; page =
-					(struct page *)page->index)
+			list_for_each_entry(page, &pgd_list, lru)
 				if (!vmalloc_sync_one(page_address(page),
 								address)) {
-					BUG_ON(page != pgd_list);
+					BUG();
 					break;
 				}
 			spin_unlock_irqrestore(&pgd_lock, flags);
Index: linux-2.6.21-rc4-mm1/arch/i386/mm/pageattr.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/i386/mm/pageattr.c	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/mm/pageattr.c	2007-03-20 14:21:52.000000000 -0700
@@ -95,7 +95,7 @@ static void set_pmd_pte(pte_t *kpte, uns
 		return;
 
 	spin_lock_irqsave(&pgd_lock, flags);
-	for (page = pgd_list; page; page = (struct page *)page->index) {
+	list_for_each_entry(page, &pgd_list, lru) {
 		pgd_t *pgd;
 		pud_t *pud;
 		pmd_t *pmd;


* [QUICKLIST 4/5] Quicklist support for x86_64
  2007-03-23  6:28 [QUICKLIST 1/5] Quicklists for page table pages V4 Christoph Lameter
  2007-03-23  6:28 ` [QUICKLIST 2/5] Quicklist support for IA64 Christoph Lameter
  2007-03-23  6:28 ` [QUICKLIST 3/5] Quicklist support for i386 Christoph Lameter
@ 2007-03-23  6:28 ` Christoph Lameter
  2007-03-23  6:29 ` [QUICKLIST 5/5] Quicklist support for sparc64 Christoph Lameter
  2007-03-23  6:39 ` [QUICKLIST 1/5] Quicklists for page table pages V4 Andrew Morton
  4 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23  6:28 UTC
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

Convert x86_64 to use quicklists

This adds caching of pgds, puds, pmds and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is used to separate out PGD handling. Thus we can carry
the initialized pgds of terminating processes over to the next process
needing them.

Also clean up the pgd_list handling to use regular list macros. Not using
the slab allocator frees up the lru field so we can use regular list macros.
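
As a rough userspace illustration of the list cleanup described above (not kernel code): instead of chaining pages by hand through page->index and page->private, each page embeds a link node and generic helpers do the pointer work. The names node, fake_page, node_add, node_del, page_of and count_pages are hypothetical stand-ins for struct list_head, struct page, list_add(), list_del(), list_entry() and a list_for_each_entry() walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct node {
	struct node *next, *prev;
};

/* Hypothetical stand-in for struct page; only the embedded link matters. */
struct fake_page {
	int id;
	struct node lru;
};

/* An empty list is a head that points at itself. */
static void node_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert n right after head, as list_add() does. */
static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n from whatever list it is on, as list_del() does. */
static void node_del(struct node *n)
{
	assert(n->next && n->prev);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
}

/* Recover the containing page from its embedded link. */
static struct fake_page *page_of(struct node *n)
{
	return (struct fake_page *)((char *)n - offsetof(struct fake_page, lru));
}

/* Walk the whole list once and count the entries. */
static int count_pages(struct node *head)
{
	int n = 0;

	for (struct node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

With this layout, removal needs no knowledge of the list head and no special case for the first element, which is what the hand-rolled pprev bookkeeping was emulating.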

The adding and removal of the pgds to the pgd_list is moved into the
constructor / destructor. We can then avoid taking pgds that are still
in the quicklists off the list, which further reduces the pgd creation
and allocation overhead.
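
The constructor / destructor scheme can be sketched in userspace roughly as follows. This is a simplified model under stated assumptions, not the code from the patch: it uses a single list instead of per-CPU lists, leaves out the NUMA node check, and uses calloc() in place of a zeroed page from the page allocator. All names (quicklist_sketch, ql_alloc, ql_free, ql_trim, ctor_sketch, dtor_sketch) are hypothetical, and the reading of the two numeric trim arguments as "pages to keep" and "most pages to free per call" is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* One LIFO list of cached pages; the first word of each page links to the next. */
struct quicklist_sketch {
	void *page;
	long nr_pages;
};

static int constructed;	/* pages constructed and not yet destructed */

/* Stand-in for copying the kernel mappings into the upper part of a pgd. */
static void ctor_sketch(void *p)
{
	memset((char *)p + SKETCH_PAGE_SIZE / 2, 0xAB, SKETCH_PAGE_SIZE / 2);
	constructed++;
}

static void dtor_sketch(void *p)
{
	(void)p;
	constructed--;
}

/* Reuse a cached page if possible; otherwise get a zeroed page and construct it. */
static void *ql_alloc(struct quicklist_sketch *q, void (*ctor)(void *))
{
	void **p = q->page;

	if (p) {
		q->page = p[0];
		p[0] = NULL;	/* the link word goes back to zero */
		q->nr_pages--;
		return p;
	}
	p = calloc(1, SKETCH_PAGE_SIZE);
	if (p && ctor)
		ctor(p);
	return p;
}

/* Cache the page. No destructor runs, so it stays constructed. */
static void ql_free(struct quicklist_sketch *q, void *pp)
{
	void **p = pp;

	p[0] = q->page;
	q->page = p;
	q->nr_pages++;
}

/* Release at most max_free pages while more than min_pages are cached. */
static void ql_trim(struct quicklist_sketch *q, void (*dtor)(void *),
		    long min_pages, long max_free)
{
	assert(min_pages >= 0);
	while (q->nr_pages > min_pages && max_free-- > 0) {
		void *p = ql_alloc(q, NULL);

		if (dtor)
			dtor(p);
		free(p);
	}
}
```

In this model a page that goes through ql_free() and then ql_alloc() comes back with its constructed contents intact and without a second constructor call; the destructor only runs when ql_trim() actually hands the page back.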

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc4-mm1/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/x86_64/Kconfig	2007-03-20 14:20:34.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/x86_64/Kconfig	2007-03-20 14:21:57.000000000 -0700
@@ -56,6 +56,14 @@ config ZONE_DMA
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
+config NR_QUICK
+	int
+	default 2
+
 config ISA
 	bool
 
Index: linux-2.6.21-rc4-mm1/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/asm-x86_64/pgalloc.h	2007-03-20 14:21:06.000000000 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-x86_64/pgalloc.h	2007-03-20 14:55:47.000000000 -0700
@@ -4,6 +4,10 @@
 #include <asm/pda.h>
 #include <linux/threads.h>
 #include <linux/mm.h>
+#include <linux/quicklist.h>
+
+#define QUICK_PGD 0	/* We preserve special mappings over free */
+#define QUICK_PT 1	/* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -20,86 +24,77 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
 	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-	free_page((unsigned long)pmd);
+	quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
 	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-	free_page((unsigned long)pud);
+	quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+	unsigned boundary;
+	pgd_t *pgd = x;
 	struct page *page = virt_to_page(pgd);
 
+	/*
+	 * Copy kernel pointers in from init.
+	 */
+	boundary = pgd_index(__PAGE_OFFSET);
+	memcpy(pgd + boundary,
+		init_level4_pgt + boundary,
+		(PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
 	spin_lock(&pgd_lock);
-	page->index = (pgoff_t)pgd_list;
-	if (pgd_list)
-		pgd_list->private = (unsigned long)&page->index;
-	pgd_list = page;
-	page->private = (unsigned long)&pgd_list;
+	list_add(&page->lru, &pgd_list);
 	spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-	struct page *next, **pprev, *page = virt_to_page(pgd);
+	pgd_t *pgd = x;
+	struct page *page = virt_to_page(pgd);
 
 	spin_lock(&pgd_lock);
-	next = (struct page *)page->index;
-	pprev = (struct page **)page->private;
-	*pprev = next;
-	if (next)
-		next->private = (unsigned long)pprev;
+	list_del(&page->lru);
 	spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	unsigned boundary;
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (!pgd)
-		return NULL;
-	pgd_list_add(pgd);
-	/*
-	 * Copy kernel pointers in from init.
-	 * Could keep a freelist or slab cache of those because the kernel
-	 * part never changes.
-	 */
-	boundary = pgd_index(__PAGE_OFFSET);
-	memset(pgd, 0, boundary * sizeof(pgd_t));
-	memcpy(pgd + boundary,
-	       init_level4_pgt + boundary,
-	       (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+	pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+			 GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
+
 	return pgd;
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
 	BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
-	pgd_list_del(pgd);
-	free_page((unsigned long)pgd);
+	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 	if (!p)
 		return NULL;
 	return virt_to_page(p);
@@ -111,17 +106,22 @@ static inline struct page *pte_alloc_one
 static inline void pte_free_kernel(pte_t *pte)
 {
 	BUG_ON((unsigned long)pte & (PAGE_SIZE-1));
-	free_page((unsigned long)pte); 
+	quicklist_free(QUICK_PT, NULL, pte);
 }
 
 static inline void pte_free(struct page *pte)
 {
 	__free_page(pte);
-} 
+}
 
 #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
 
 #define __pmd_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 #define __pud_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 
+static inline void check_pgt_cache(void)
+{
+	quicklist_trim(QUICK_PGD, pgd_dtor, 25, 16);
+	quicklist_trim(QUICK_PT, NULL, 25, 16);
+}
 #endif /* _X86_64_PGALLOC_H */
Index: linux-2.6.21-rc4-mm1/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/x86_64/kernel/process.c	2007-03-20 14:20:35.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/x86_64/kernel/process.c	2007-03-20 14:21:57.000000000 -0700
@@ -207,6 +207,7 @@ void cpu_idle (void)
 			if (__get_cpu_var(cpu_idle_state))
 				__get_cpu_var(cpu_idle_state) = 0;
 
+			check_pgt_cache();
 			rmb();
 			idle = pm_idle;
 			if (!idle)
Index: linux-2.6.21-rc4-mm1/arch/x86_64/kernel/smp.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/x86_64/kernel/smp.c	2007-03-20 14:20:35.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/x86_64/kernel/smp.c	2007-03-20 14:21:57.000000000 -0700
@@ -242,7 +242,7 @@ void flush_tlb_mm (struct mm_struct * mm
 	}
 	if (!cpus_empty(cpu_mask))
 		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
-
+	check_pgt_cache();
 	preempt_enable();
 }
 EXPORT_SYMBOL(flush_tlb_mm);
Index: linux-2.6.21-rc4-mm1/arch/x86_64/mm/fault.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/x86_64/mm/fault.c	2007-03-20 14:20:35.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/x86_64/mm/fault.c	2007-03-20 14:21:57.000000000 -0700
@@ -585,7 +585,7 @@ do_sigbus:
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
+LIST_HEAD(pgd_list);
 
 void vmalloc_sync_all(void)
 {
@@ -605,8 +605,7 @@ void vmalloc_sync_all(void)
 			if (pgd_none(*pgd_ref))
 				continue;
 			spin_lock(&pgd_lock);
-			for (page = pgd_list; page;
-			     page = (struct page *)page->index) {
+			list_for_each_entry(page, &pgd_list, lru) {
 				pgd_t *pgd;
 				pgd = (pgd_t *)page_address(page) + pgd_index(address);
 				if (pgd_none(*pgd))
Index: linux-2.6.21-rc4-mm1/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/asm-x86_64/pgtable.h	2007-03-20 14:21:06.000000000 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-x86_64/pgtable.h	2007-03-20 14:21:57.000000000 -0700
@@ -402,7 +402,7 @@ static inline pte_t pte_modify(pte_t pte
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
 extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
+extern struct list_head pgd_list;
 void vmalloc_sync_all(void);
 
 #endif /* !__ASSEMBLY__ */
@@ -419,7 +419,6 @@ extern int kern_addr_valid(unsigned long
 #define HAVE_ARCH_UNMAPPED_AREA
 
 #define pgtable_cache_init()   do { } while (0)
-#define check_pgt_cache()      do { } while (0)
 
 #define PAGE_AGP    PAGE_KERNEL_NOCACHE
 #define HAVE_PAGE_AGP 1

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [QUICKLIST 5/5] Quicklist support for sparc64
  2007-03-23  6:28 [QUICKLIST 1/5] Quicklists for page table pages V4 Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-03-23  6:28 ` [QUICKLIST 4/5] Quicklist support for x86_64 Christoph Lameter
@ 2007-03-23  6:29 ` Christoph Lameter
  2007-03-23  6:39 ` [QUICKLIST 1/5] Quicklists for page table pages V4 Andrew Morton
  4 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23  6:29 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

From: David Miller <davem@davemloft.net>

[QUICKLIST]: Add sparc64 quicklist support.

I ported this to sparc64 as per the patch below, tested on
UP SunBlade1500 and 24 cpu Niagara T1000.

Signed-off-by: David S. Miller <davem@davemloft.net>

Index: linux-2.6.21-rc4-mm1/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/sparc64/Kconfig	2007-03-20 14:20:33.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/sparc64/Kconfig	2007-03-20 14:22:03.000000000 -0700
@@ -26,6 +26,10 @@ config MMU
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
 config STACKTRACE_SUPPORT
 	bool
 	default y
Index: linux-2.6.21-rc4-mm1/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/sparc64/mm/init.c	2007-03-20 14:20:33.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/sparc64/mm/init.c	2007-03-20 14:22:03.000000000 -0700
@@ -178,30 +178,6 @@ unsigned long sparc64_kern_sec_context _
 
 int bigkernel = 0;
 
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long flags)
-{
-	clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
-	pgtable_cache = kmem_cache_create("pgtable_cache",
-					  PAGE_SIZE, PAGE_SIZE,
-					  SLAB_HWCACHE_ALIGN |
-					  SLAB_MUST_HWCACHE_ALIGN,
-					  zero_ctor,
-					  NULL);
-	if (!pgtable_cache) {
-		prom_printf("Could not create pgtable_cache\n");
-		prom_halt();
-	}
-	tsb_cache_init();
-}
-
 #ifdef CONFIG_DEBUG_DCFLUSH
 atomic_t dcpage_flushes = ATOMIC_INIT(0);
 #ifdef CONFIG_SMP
Index: linux-2.6.21-rc4-mm1/arch/sparc64/mm/tsb.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/arch/sparc64/mm/tsb.c	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/arch/sparc64/mm/tsb.c	2007-03-20 14:22:03.000000000 -0700
@@ -252,7 +252,7 @@ static const char *tsb_cache_names[8] = 
 	"tsb_1MB",
 };
 
-void __init tsb_cache_init(void)
+void __init pgtable_cache_init(void)
 {
 	unsigned long i;
 
Index: linux-2.6.21-rc4-mm1/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/asm-sparc64/pgalloc.h	2007-03-15 17:20:01.000000000 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-sparc64/pgalloc.h	2007-03-20 14:55:47.000000000 -0700
@@ -6,6 +6,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/quicklist.h>
 
 #include <asm/spitfire.h>
 #include <asm/cpudata.h>
@@ -13,52 +14,50 @@
 #include <asm/page.h>
 
 /* Page table allocation/freeing. */
-extern struct kmem_cache *pgtable_cache;
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
-	kmem_cache_free(pgtable_cache, pgd);
+	quicklist_free(0, NULL, pgd);
 }
 
 #define pud_populate(MM, PUD, PMD)	pud_set(PUD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache,
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache, pmd);
+	quicklist_free(0, NULL, pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return kmem_cache_alloc(pgtable_cache,
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 					 unsigned long address)
 {
-	return virt_to_page(pte_alloc_one_kernel(mm, address));
+	void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
+	return pg ? virt_to_page(pg) : NULL;
 }
 		
 static inline void pte_free_kernel(pte_t *pte)
 {
-	kmem_cache_free(pgtable_cache, pte);
+	quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
 {
-	pte_free_kernel(page_address(ptepage));
+	quicklist_free(0, NULL, page_address(ptepage));
 }
 
 
@@ -66,6 +65,9 @@ static inline void pte_free(struct page 
 #define pmd_populate(MM,PMD,PTE_PAGE)		\
 	pmd_populate_kernel(MM,PMD,page_address(PTE_PAGE))
 
-#define check_pgt_cache()	do { } while (0)
+static inline void check_pgt_cache(void)
+{
+	quicklist_trim(0, NULL, 25, 16);
+}
 
 #endif /* _SPARC64_PGALLOC_H */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  6:28 [QUICKLIST 1/5] Quicklists for page table pages V4 Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-03-23  6:29 ` [QUICKLIST 5/5] Quicklist support for sparc64 Christoph Lameter
@ 2007-03-23  6:39 ` Andrew Morton
  2007-03-23  6:52   ` Christoph Lameter
  4 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2007-03-23  6:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

On Thu, 22 Mar 2007 23:28:41 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> 1. Proven code from the IA64 arch.
> 
> 	The method used here has been fine tuned for years and
> 	is NUMA aware. It is based on the knowledge that accesses
> 	to page table pages are sparse in nature. Taking a page
> 	off the freelists instead of allocating a zeroed pages
> 	allows a reduction of number of cachelines touched
> 	in addition to getting rid of the slab overhead. So
> 	performance improves.

By how much?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  6:39 ` [QUICKLIST 1/5] Quicklists for page table pages V4 Andrew Morton
@ 2007-03-23  6:52   ` Christoph Lameter
  2007-03-23  7:48     ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23  6:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Thu, 22 Mar 2007, Andrew Morton wrote:

> On Thu, 22 Mar 2007 23:28:41 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> 
> > 1. Proven code from the IA64 arch.
> > 
> > 	The method used here has been fine tuned for years and
> > 	is NUMA aware. It is based on the knowledge that accesses
> > 	to page table pages are sparse in nature. Taking a page
> > 	off the freelists instead of allocating a zeroed pages
> > 	allows a reduction of number of cachelines touched
> > 	in addition to getting rid of the slab overhead. So
> > 	performance improves.
> 
> By how much?

About 40% on fork+exit. See 

http://marc.info/?l=linux-ia64&m=110942798406005&w=2

 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  6:52   ` Christoph Lameter
@ 2007-03-23  7:48     ` Andrew Morton
  2007-03-23 11:23       ` William Lee Irwin III
                         ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Andrew Morton @ 2007-03-23  7:48 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

On Thu, 22 Mar 2007 23:52:05 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 22 Mar 2007, Andrew Morton wrote:
> 
> > On Thu, 22 Mar 2007 23:28:41 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > 
> > > 1. Proven code from the IA64 arch.
> > > 
> > > 	The method used here has been fine tuned for years and
> > > 	is NUMA aware. It is based on the knowledge that accesses
> > > 	to page table pages are sparse in nature. Taking a page
> > > 	off the freelists instead of allocating a zeroed pages
> > > 	allows a reduction of number of cachelines touched
> > > 	in addition to getting rid of the slab overhead. So
> > > 	performance improves.
> > 
> > By how much?
> 
> About 40% on fork+exit. See 
> 
> http://marc.info/?l=linux-ia64&m=110942798406005&w=2
> 

afacit that two-year-old, totally-different patch has nothing to do with my
repeatedly-asked question.  It appears to be consolidating three separate
quicklist allocators into one common implementation.

In an attempt to answer my own question (and hence to justify the retention
of this custom allocator) I did this:


diff -puN include/linux/quicklist.h~qlhack include/linux/quicklist.h
--- a/include/linux/quicklist.h~qlhack
+++ a/include/linux/quicklist.h
@@ -32,45 +32,17 @@ DECLARE_PER_CPU(struct quicklist, quickl
  */
 static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
 {
-	struct quicklist *q;
-	void **p = NULL;
-
-	q =&get_cpu_var(quicklist)[nr];
-	p = q->page;
-	if (likely(p)) {
-		q->page = p[0];
-		p[0] = NULL;
-		q->nr_pages--;
-	}
-	put_cpu_var(quicklist);
-	if (likely(p))
-		return p;
-
-	p = (void *)__get_free_page(flags | __GFP_ZERO);
+	void *p = (void *)__get_free_page(flags | __GFP_ZERO);
 	if (ctor && p)
 		ctor(p);
 	return p;
 }
 
-static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
+static inline void quicklist_free(int nr, void (*dtor)(void *), void *p)
 {
-	struct quicklist *q;
-	void **p = pp;
-	struct page *page = virt_to_page(p);
-	int nid = page_to_nid(page);
-
-	if (unlikely(nid != numa_node_id())) {
-		if (dtor)
-			dtor(p);
-		free_page((unsigned long)p);
-		return;
-	}
-
-	q = &get_cpu_var(quicklist)[nr];
-	p[0] = q->page;
-	q->page = p;
-	q->nr_pages++;
-	put_cpu_var(quicklist);
+	if (dtor)
+		dtor(p);
+	free_page((unsigned long)p);
 }
 
 void quicklist_trim(int nr, void (*dtor)(void *),
@@ -81,4 +53,3 @@ unsigned long quicklist_total_size(void)
 #endif
 
 #endif /* LINUX_QUICKLIST_H */
-
_

but it crashes early in the page allocator (i386) and I don't see why.  It
makes me wonder if we have a use-after-free which is hidden by the presence
of the quicklist buffering or something.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  7:48     ` Andrew Morton
@ 2007-03-23 11:23       ` William Lee Irwin III
  2007-03-23 14:58         ` Christoph Lameter
  2007-03-23 11:29       ` William Lee Irwin III
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: William Lee Irwin III @ 2007-03-23 11:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

On Thu, Mar 22, 2007 at 11:48:48PM -0800, Andrew Morton wrote:
> afacit that two-year-old, totally-different patch has nothing to do with my
> repeatedly-asked question.  It appears to be consolidating three separate
> quicklist allocators into one common implementation.
> In an attempt to answer my own question (and hence to justify the retention
> of this custom allocator) I did this:
[... patch changing allocator alloc()/free() to bare page allocations ...]
> but it crashes early in the page allocator (i386) and I don't see why.  It
> makes me wonder if we have a use-after-free which is hidden by the presence
> of the quicklist buffering or something.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  7:48     ` Andrew Morton
  2007-03-23 11:23       ` William Lee Irwin III
@ 2007-03-23 11:29       ` William Lee Irwin III
  2007-03-23 14:57         ` William Lee Irwin III
  2007-03-23 11:39       ` Nick Piggin
  2007-03-23 15:08       ` Christoph Lameter
  3 siblings, 1 reply; 26+ messages in thread
From: William Lee Irwin III @ 2007-03-23 11:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

On Thu, Mar 22, 2007 at 11:48:48PM -0800, Andrew Morton wrote:
> afacit that two-year-old, totally-different patch has nothing to do with my
> repeatedly-asked question.  It appears to be consolidating three separate
> quicklist allocators into one common implementation.
> In an attempt to answer my own question (and hence to justify the retention
> of this custom allocator) I did this:
[... patch changing allocator alloc()/free() to bare page allocations ...]
> but it crashes early in the page allocator (i386) and I don't see why.  It
> makes me wonder if we have a use-after-free which is hidden by the presence
> of the quicklist buffering or something.

Sorry I flubbed the first message. Anyway this does mean something is
seriously wrong and needs to be debugged. Looking into it now.


-- wli

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  7:48     ` Andrew Morton
  2007-03-23 11:23       ` William Lee Irwin III
  2007-03-23 11:29       ` William Lee Irwin III
@ 2007-03-23 11:39       ` Nick Piggin
  2007-03-24  5:14         ` Andrew Morton
  2007-03-23 15:08       ` Christoph Lameter
  3 siblings, 1 reply; 26+ messages in thread
From: Nick Piggin @ 2007-03-23 11:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

Andrew Morton wrote:

> but it crashes early in the page allocator (i386) and I don't see why.  It
> makes me wonder if we have a use-after-free which is hidden by the presence
> of the quicklist buffering or something.

Does CONFIG_DEBUG_PAGEALLOC catch it?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23 11:29       ` William Lee Irwin III
@ 2007-03-23 14:57         ` William Lee Irwin III
  2007-03-23 19:17           ` William Lee Irwin III
  0 siblings, 1 reply; 26+ messages in thread
From: William Lee Irwin III @ 2007-03-23 14:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

On Thu, Mar 22, 2007 at 11:48:48PM -0800, Andrew Morton wrote:
>> afacit that two-year-old, totally-different patch has nothing to do with my
>> repeatedly-asked question.  It appears to be consolidating three separate
>> quicklist allocators into one common implementation.
>> In an attempt to answer my own question (and hence to justify the retention
>> of this custom allocator) I did this:
> [... patch changing allocator alloc()/free() to bare page allocations ...]
>> but it crashes early in the page allocator (i386) and I don't see why.  It
>> makes me wonder if we have a use-after-free which is hidden by the presence
>> of the quicklist buffering or something.

On Fri, Mar 23, 2007 at 04:29:20AM -0700, William Lee Irwin III wrote:
> Sorry I flubbed the first message. Anyway this does mean something is
> seriously wrong and needs to be debugged. Looking into it now.

I know what's happening. I just need to catch the culprit.


-- wli

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23 11:23       ` William Lee Irwin III
@ 2007-03-23 14:58         ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23 14:58 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, 23 Mar 2007, William Lee Irwin III wrote:

> [... patch changing allocator alloc()/free() to bare page allocations ...]
> > but it crashes early in the page allocator (i386) and I don't see why.  It
> > makes me wonder if we have a use-after-free which is hidden by the presence
> > of the quicklist buffering or something.

Sorry, there seem to be some email dropouts today. I am getting
fragments of slab and quicklist discussions. Maybe I can get the whole
story from the mailing lists.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23  7:48     ` Andrew Morton
                         ` (2 preceding siblings ...)
  2007-03-23 11:39       ` Nick Piggin
@ 2007-03-23 15:08       ` Christoph Lameter
  2007-03-23 17:54         ` Christoph Lameter
  3 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23 15:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Thu, 22 Mar 2007, Andrew Morton wrote:

> > About 40% on fork+exit. See 
> > 
> > http://marc.info/?l=linux-ia64&m=110942798406005&w=2
> > 
> 
> afacit that two-year-old, totally-different patch has nothing to do with my
> repeatedly-asked question.  It appears to be consolidating three separate
> quicklist allocators into one common implementation.

Yes, it shows the performance gains from the quicklist approach. This is
the work Robin Holt did on the problem. The problem is how to validate the
patch, because there should be no change at all on ia64, and on i386 we
basically measure the overhead of the slab allocations. One could
measure the impact on x86_64 because this introduces quicklists to that
platform.

The earlier discussion focused on avoiding zeroing of pte as far as I can 
recall.
 
> but it crashes early in the page allocator (i386) and I don't see why.  It
> makes me wonder if we have a use-after-free which is hidden by the presence
> of the quicklist buffering or something.

This was on i386? Could be hidden now by the slab use there.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23 15:08       ` Christoph Lameter
@ 2007-03-23 17:54         ` Christoph Lameter
  2007-03-24  6:21           ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2007-03-23 17:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

Here are the results of aim9 tests on x86_64. There are some minor performance 
improvements and some fluctuations. Page size is only a fourth of that on 
ia64 so the resulting benefit is less in terms of saved cacheline fetches.

The benefit is also likely higher on i386 because it can fit double the 
page table entries into a page.
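
The page-size argument can be made concrete with a little arithmetic. The sketch below assumes 4 KB pages on x86_64 and i386, 16 KB pages on ia64 (which is what "a fourth" implies), 8-byte page table entries on the 64-bit platforms and 4-byte entries on non-PAE i386. The 64-byte cache line used in the usage note is purely illustrative, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of page table entries that fit into one page. */
static size_t entries_per_page(size_t page_size, size_t entry_size)
{
	assert(entry_size != 0);
	return page_size / entry_size;
}

/* Number of cache lines touched when a whole page is zeroed. */
static size_t cachelines_per_page(size_t page_size, size_t line_size)
{
	assert(line_size != 0);
	return page_size / line_size;
}
```

Under these assumptions entries_per_page(4096, 8) gives 512 for x86_64, entries_per_page(4096, 4) gives 1024 for i386, and entries_per_page(16384, 8) gives 2048 for ia64. With a 64-byte line, zeroing a 4 KB page touches 64 lines and a 16 KB page touches 256, so skipping the zeroing saves four times as many line fetches on the larger page.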

No Test                  Bare   Quicklists       Delta Change Operation
 1 add_double      1096039.60   1096039.60        0.00  0.00% Thousand Double Precision Additions/second
 2 add_float       1087128.71   1099009.90    11881.19  1.09% Thousand Single Precision Additions/second
 3 add_long        4019704.43   4374384.24   354679.81  8.82% Thousand Long Integer Additions/second
 4 add_int         3772277.23   3772277.23        0.00  0.00% Thousand Integer Additions/second
 5 add_short       3754455.45   3761194.03     6738.58  0.18% Thousand Short Integer Additions/second
 6 creat-clo        259405.94    267164.18     7758.24  2.99% File Creations and Closes/second
 7 page_test        233118.81    235970.15     2851.34  1.22% System Allocations & Pages/second
 8 brk_test        3425247.52   3408457.71   -16789.81 -0.49% System Memory Allocations/second
 9 jmp_test       21819306.93  21808457.71   -10849.22 -0.05% Non-local gotos/second
10 signal_test      669154.23    689552.24    20398.01  3.05% Signal Traps/second
11 exec_test           747.52       743.78       -3.74 -0.50% Program Loads/second
12 fork_test          8267.33      8457.71      190.38  2.30% Task Creations/second
13 link_test         43819.31     44318.32      499.01  1.14% Link/Unlink Pairs/second
28 fun_cal       326463366.34 326559203.98    95837.64  0.03% Function Calls (no arguments)/second
29 fun_cal1      358906930.69 388202985.07 29296054.38  8.16% Function Calls (1 argument)/second
30 fun_cal2      356372277.23 356362189.05   -10088.18 -0.00% Function Calls (2 arguments)/second
31 fun_cal15     156641584.16 156656716.42    15132.26  0.01% Function Calls (15 arguments)/second
45 mem_rtns_2      1588762.38   1610298.51    21536.13  1.36% Block Memory Operations/second
46 sort_rtns_1         935.32      1004.98       69.66  7.45% Sort Operations/second
47 misc_rtns_1       17099.01     17268.66      169.65  0.99% Auxiliary Loops/second
48 dir_rtns_1      5925742.57   6313432.84   387690.27  6.54% Directory Operations/second
52 series_1       11469950.50  11625771.14   155820.64  1.36% Series Evaluations/second
53 shared_memory   1187313.43   1177910.45    -9402.98 -0.79% Shared Memory Operations/second
54 tcp_test          83183.17     83507.46      324.29  0.39% TCP/IP Messages/second
55 udp_test         273514.85    269801.00    -3713.85 -1.36% UDP/IP DataGrams/second
56 fifo_test        741237.62    803930.35    62692.73  8.46% FIFO Messages/second
57 stream_pipe      885099.01   1058059.70   172960.69 19.54% Stream Pipe Messages/second
58 dgram_pipe       881782.18    957213.93    75431.75  8.55% DataGram Pipe Messages/second
59 pipe_cpy        1355891.09   1316766.17   -39124.92 -2.89% Pipe Messages/second
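
For reference, the difference and percentage columns in the table above appear to follow the usual relative-change formula. The sketch below assumes that the first numeric column is the unpatched kernel (its values match the "2.6.21-rc4 bare" listing that follows) and that the second is the kernel with the patchset applied. The function names are hypothetical.

```c
#include <assert.h>

/* Absolute difference between the two runs, in operations per second. */
static double rate_delta(double before, double after)
{
	return after - before;
}

/* Relative change of the second run against the first, in percent. */
static double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For fork_test, rate_delta(8267.33, 8457.71) is about 190.38 and percent_change(8267.33, 8457.71) is about 2.30, which agrees with the row in the table. The same check on stream_pipe (885099.01 against 1058059.70) gives roughly 19.54 percent.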


2.6.21-rc4 bare

------------------------------------------------------------------------------------------------------------
 Test        Test        Elapsed  Iteration    Iteration          Operation
Number       Name      Time (sec)   Count   Rate (loops/sec)    Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
     1 add_double           2.02        123   60.89109      1096039.60 Thousand Double Precision Additions/second
     2 add_float            2.02        183   90.59406      1087128.71 Thousand Single Precision Additions/second
     3 add_long             2.03        136   66.99507      4019704.43 Thousand Long Integer Additions/second
     4 add_int              2.02        127   62.87129      3772277.23 Thousand Integer Additions/second
     5 add_short            2.02        316  156.43564      3754455.45 Thousand Short Integer Additions/second
     6 creat-clo            2.02        524  259.40594       259405.94 File Creations and Closes/second
     7 page_test            2.02        277  137.12871       233118.81 System Allocations & Pages/second
     8 brk_test             2.02        407  201.48515      3425247.52 System Memory Allocations/second
     9 jmp_test             2.02      44075 21819.30693     21819306.93 Non-local gotos/second
    10 signal_test          2.01       1345  669.15423       669154.23 Signal Traps/second
    11 exec_test            2.02        302  149.50495          747.52 Program Loads/second
    12 fork_test            2.02        167   82.67327         8267.33 Task Creations/second
    13 link_test            2.02       1405  695.54455        43819.31 Link/Unlink Pairs/second
    14 disk_rr              2.02         65   32.17822       164752.48 Random Disk Reads (K)/second
    15 disk_rw              2.03         55   27.09360       138719.21 Random Disk Writes (K)/second
    16 disk_rd              2.02        467  231.18812      1183683.17 Sequential Disk Reads (K)/second
    17 disk_wrt             2.02         81   40.09901       205306.93 Sequential Disk Writes (K)/second
    18 disk_cp              2.04         65   31.86275       163137.25 Disk Copies (K)/second
    19 sync_disk_rw         2.05          2    0.97561         2497.56 Sync Random Disk Writes (K)/second
    20 sync_disk_wrt        2.15          1    0.46512         1190.70 Sync Sequential Disk Writes (K)/second
    21 sync_disk_cp         2.49          1    0.40161         1028.11 Sync Disk Copies (K)/second
    22 disk_src             2.02       1049  519.30693        38948.02 Directory Searches/second
    23 div_double           2.02        141   69.80198       209405.94 Thousand Double Precision Divides/second
    24 div_float            2.02        141   69.80198       209405.94 Thousand Single Precision Divides/second
    25 div_long             2.04         68   33.33333        30000.00 Thousand Long Integer Divides/second
    26 div_int              2.02        120   59.40594        53465.35 Thousand Integer Divides/second
    27 div_short            2.03        119   58.62069        52758.62 Thousand Short Integer Divides/second
    28 fun_cal              2.02       1288  637.62376    326463366.34 Function Calls (no arguments)/second
    29 fun_cal1             2.02       1416  700.99010    358906930.69 Function Calls (1 argument)/second
    30 fun_cal2             2.02       1406  696.03960    356372277.23 Function Calls (2 arguments)/second
    31 fun_cal15            2.02        618  305.94059    156641584.16 Function Calls (15 arguments)/second
    32 sieve                2.15         10    4.65116           23.26 Integer Sieves/second
    33 mul_double           2.02        185   91.58416      1099009.90 Thousand Double Precision Multiplies/second
    34 mul_float            2.02        185   91.58416      1099009.90 Thousand Single Precision Multiplies/second
    35 mul_long             2.02       6129 3034.15842       728198.02 Thousand Long Integer Multiplies/second
    36 mul_int              2.02       8517 4216.33663      1011920.79 Thousand Integer Multiplies/second
    37 mul_short            2.02       6817 3374.75248      1012425.74 Thousand Short Integer Multiplies/second
    38 num_rtns_1           2.02       6260 3099.00990       309900.99 Numeric Functions/second
    39 new_raph             2.02       6305 3121.28713       624257.43 Zeros Found/second
    40 trig_rtns            2.02        243  120.29703      1202970.30 Trigonometric Functions/second
    41 matrix_rtns          2.02      69718 34513.86139      3451386.14 Point Transformations/second
    42 array_rtns           2.01        165   82.08955         1641.79 Linear Systems Solved/second
    43 string_rtns          2.02        120   59.40594         5940.59 String Manipulations/second
    44 mem_rtns_1           2.02        370  183.16832      5495049.50 Dynamic Memory Operations/second
    45 mem_rtns_2           2.02      32093 15887.62376      1588762.38 Block Memory Operations/second
    46 sort_rtns_1          2.01        188   93.53234          935.32 Sort Operations/second
    47 misc_rtns_1          2.02       3454 1709.90099        17099.01 Auxiliary Loops/second
    48 dir_rtns_1           2.02       1197  592.57426      5925742.57 Directory Operations/second
    49 shell_rtns_1         2.02        328  162.37624          162.38 Shell Scripts/second
    50 shell_rtns_2         2.02        327  161.88119          161.88 Shell Scripts/second
    51 shell_rtns_3         2.02        327  161.88119          161.88 Shell Scripts/second
    52 series_1             2.02     231693 114699.50495     11469950.50 Series Evaluations/second
    53 shared_memory        2.01      23865 11873.13433      1187313.43 Shared Memory Operations/second
    54 tcp_test             2.02       1867  924.25743        83183.17 TCP/IP Messages/second
    55 udp_test             2.02       5525 2735.14851       273514.85 UDP/IP DataGrams/second
    56 fifo_test            2.02      14973 7412.37624       741237.62 FIFO Messages/second
    57 stream_pipe          2.02      17879 8850.99010       885099.01 Stream Pipe Messages/second
    58 dgram_pipe           2.02      17812 8817.82178       881782.18 DataGram Pipe Messages/second
    59 pipe_cpy             2.02      27389 13558.91089      1355891.09 Pipe Messages/second
    60 ram_copy             2.02     353141 174822.27723   4374053376.24 Memory to Memory Copy/second

2.6.21-rc4 x86_64 quicklist

------------------------------------------------------------------------------------------------------------
 Test        Test        Elapsed  Iteration    Iteration          Operation
Number       Name      Time (sec)   Count   Rate (loops/sec)    Rate (ops/sec)
------------------------------------------------------------------------------------------------------------
     1 add_double           2.02        123   60.89109      1096039.60 Thousand Double Precision Additions/second
     2 add_float            2.02        185   91.58416      1099009.90 Thousand Single Precision Additions/second
     3 add_long             2.03        148   72.90640      4374384.24 Thousand Long Integer Additions/second
     4 add_int              2.02        127   62.87129      3772277.23 Thousand Integer Additions/second
     5 add_short            2.01        315  156.71642      3761194.03 Thousand Short Integer Additions/second
     6 creat-clo            2.01        537  267.16418       267164.18 File Creations and Closes/second
     7 page_test            2.01        279  138.80597       235970.15 System Allocations & Pages/second
     8 brk_test             2.01        403  200.49751      3408457.71 System Memory Allocations/second
     9 jmp_test             2.01      43835 21808.45771     21808457.71 Non-local gotos/second
    10 signal_test          2.01       1386  689.55224       689552.24 Signal Traps/second
    11 exec_test            2.01        299  148.75622          743.78 Program Loads/second
    12 fork_test            2.01        170   84.57711         8457.71 Task Creations/second
    13 link_test            2.02       1421  703.46535        44318.32 Link/Unlink Pairs/second
    14 disk_rr              2.01         63   31.34328       160477.61 Random Disk Reads (K)/second
    15 disk_rw              2.02         53   26.23762       134336.63 Random Disk Writes (K)/second
    16 disk_rd              2.01        498  247.76119      1268537.31 Sequential Disk Reads (K)/second
    17 disk_wrt             2.02         78   38.61386       197702.97 Sequential Disk Writes (K)/second
    18 disk_cp              2.04         64   31.37255       160627.45 Disk Copies (K)/second
    19 sync_disk_rw         2.65          2    0.75472         1932.08 Sync Random Disk Writes (K)/second
    20 sync_disk_wrt        3.96          2    0.50505         1292.93 Sync Sequential Disk Writes (K)/second
    21 sync_disk_cp         2.31          1    0.43290         1108.23 Sync Disk Copies (K)/second
    22 disk_src             2.01       1079  536.81592        40261.19 Directory Searches/second
    23 div_double           2.02        141   69.80198       209405.94 Thousand Double Precision Divides/second
    24 div_float            2.01        140   69.65174       208955.22 Thousand Single Precision Divides/second
    25 div_long             2.01         67   33.33333        30000.00 Thousand Long Integer Divides/second
    26 div_int              2.02        120   59.40594        53465.35 Thousand Integer Divides/second
    27 div_short            2.01        118   58.70647        52835.82 Thousand Short Integer Divides/second
    28 fun_cal              2.01       1282  637.81095    326559203.98 Function Calls (no arguments)/second
    29 fun_cal1             2.01       1524  758.20896    388202985.07 Function Calls (1 argument)/second
    30 fun_cal2             2.01       1399  696.01990    356362189.05 Function Calls (2 arguments)/second
    31 fun_cal15            2.01        615  305.97015    156656716.42 Function Calls (15 arguments)/second
    32 sieve                2.16         10    4.62963           23.15 Integer Sieves/second
    33 mul_double           2.02        185   91.58416      1099009.90 Thousand Double Precision Multiplies/second
    34 mul_float            2.02        185   91.58416      1099009.90 Thousand Single Precision Multiplies/second
    35 mul_long             2.02       6128 3033.66337       728079.21 Thousand Long Integer Multiplies/second
    36 mul_int              2.01       8475 4216.41791      1011940.30 Thousand Integer Multiplies/second
    37 mul_short            2.01       6783 3374.62687      1012388.06 Thousand Short Integer Multiplies/second
    38 num_rtns_1           2.01       6264 3116.41791       311641.79 Numeric Functions/second
    39 new_raph             2.01       6261 3114.92537       622985.07 Zeros Found/second
    40 trig_rtns            2.01        239  118.90547      1189054.73 Trigonometric Functions/second
    41 matrix_rtns          2.01      69555 34604.47761      3460447.76 Point Transformations/second
    42 array_rtns           2.01        165   82.08955         1641.79 Linear Systems Solved/second
    43 string_rtns          2.01        118   58.70647         5870.65 String Manipulations/second
    44 mem_rtns_1           2.01        370  184.07960      5522388.06 Dynamic Memory Operations/second
    45 mem_rtns_2           2.01      32367 16102.98507      1610298.51 Block Memory Operations/second
    46 sort_rtns_1          2.01        202  100.49751         1004.98 Sort Operations/second
    47 misc_rtns_1          2.01       3471 1726.86567        17268.66 Auxiliary Loops/second
    48 dir_rtns_1           2.01       1269  631.34328      6313432.84 Directory Operations/second
    49 shell_rtns_1         2.01        321  159.70149          159.70 Shell Scripts/second
    50 shell_rtns_2         2.01        324  161.19403          161.19 Shell Scripts/second
    51 shell_rtns_3         2.01        325  161.69154          161.69 Shell Scripts/second
    52 series_1             2.01     233678 116257.71144     11625771.14 Series Evaluations/second
    53 shared_memory        2.01      23676 11779.10448      1177910.45 Shared Memory Operations/second
    54 tcp_test             2.01       1865  927.86070        83507.46 TCP/IP Messages/second
    55 udp_test             2.01       5423 2698.00995       269801.00 UDP/IP DataGrams/second
    56 fifo_test            2.01      16159 8039.30348       803930.35 FIFO Messages/second
    57 stream_pipe          2.01      21267 10580.59701      1058059.70 Stream Pipe Messages/second
    58 dgram_pipe           2.01      19240 9572.13930       957213.93 DataGram Pipe Messages/second
    59 pipe_cpy             2.01      26467 13167.66169      1316766.17 Pipe Messages/second
    60 ram_copy             2.01     351052 174652.73632   4369811462.69 Memory to Memory Copy/second

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23 14:57         ` William Lee Irwin III
@ 2007-03-23 19:17           ` William Lee Irwin III
  0 siblings, 0 replies; 26+ messages in thread
From: William Lee Irwin III @ 2007-03-23 19:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

On Thu, Mar 22, 2007 at 11:48:48PM -0800, Andrew Morton wrote:
>>> afacit that two-year-old, totally-different patch has nothing to do with my
>>> repeatedly-asked question.  It appears to be consolidating three separate
>>> quicklist allocators into one common implementation.
>>> In an attempt to answer my own question (and hence to justify the retention
>>> of this custom allocator) I did this:
>> [... patch changing allocator alloc()/free() to bare page allocations ...]
>>> but it crashes early in the page allocator (i386) and I don't see why.  It
>>> makes me wonder if we have a use-after-free which is hidden by the presence
>>> of the quicklist buffering or something.

On Fri, Mar 23, 2007 at 04:29:20AM -0700, William Lee Irwin III wrote:
>> Sorry I flubbed the first message. Anyway this does mean something is
>> seriously wrong and needs to be debugged. Looking into it now.

On Fri, Mar 23, 2007 at 07:57:07AM -0700, William Lee Irwin III wrote:
> I know what's happening. I just need to catch the culprit.

Are you tripping the BUG_ON() in include/linux/mm.h:256 with
CONFIG_DEBUG_VM set?


-- wli

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23 11:39       ` Nick Piggin
@ 2007-03-24  5:14         ` Andrew Morton
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2007-03-24  5:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, linux-mm, linux-kernel, William Lee Irwin III

On Fri, 23 Mar 2007 22:39:24 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Andrew Morton wrote:
> 
> > but it crashes early in the page allocator (i386) and I don't see why.  It
> > makes me wonder if we have a use-after-free which is hidden by the presence
> > of the quicklist buffering or something.
> 
> Does CONFIG_DEBUG_PAGEALLOC catch it?

It'll be a while before I can get onto doing anything with this.
I do have an oops trace:


kjournald starting.  Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 296k freed
Write protecting the kernel read-only data: 921k
BUG: unable to handle kernel paging request at virtual address 00100104
 printing eip:
c015b676
*pde = 00000000
Oops: 0002 [#1]
SMP 
Modules linked in:
CPU:    1
EIP:    0060:[<c015b676>]    Not tainted VLI
EFLAGS: 00010002   (2.6.21-rc4 #6)
EIP is at get_page_from_freelist+0x166/0x3d0
eax: c1b110bc   ebx: 00000001   ecx: 00100100   edx: 00200200
esi: c1b11090   edi: c04cc500   ebp: f67d3b88   esp: f67d3b34
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process default.hotplug (pid: 872, ti=f67d2000 task=f6748030 task.ti=f67d2000)
Stack: 00000001 00000044 c067eae8 00000001 00000001 00000000 c04cc6c0 c04cc4a0 
       00000001 00000000 000284d0 c04ccb78 00000286 00000001 00000000 f67b6000 
       00000000 00000001 c04cc4a0 f6748030 000084d0 f67d3bcc c015b92e 00000044 
Call Trace:
 [<c0103e6a>] show_trace_log_lvl+0x1a/0x30
 [<c0103f29>] show_stack_log_lvl+0xa9/0xd0
 [<c0104139>] show_registers+0x1e9/0x2f0
 [<c0104355>] die+0x115/0x250
 [<c011561e>] do_page_fault+0x27e/0x630
 [<c03d5f64>] error_code+0x7c/0x84
 [<c015b92e>] __alloc_pages+0x4e/0x2f0
 [<c0114c84>] pte_alloc_one+0x14/0x20
 [<c0163d1b>] __pte_alloc+0x1b/0xa0
 [<c016459d>] __handle_mm_fault+0x7fd/0x940
 [<c01154b9>] do_page_fault+0x119/0x630
 [<c03d5f64>] error_code+0x7c/0x84
 [<c01a5e8f>] padzero+0x1f/0x30
 [<c01a744e>] load_elf_binary+0x76e/0x1a80
 [<c017c2c7>] search_binary_handler+0x97/0x220
 [<c01a5886>] load_script+0x1d6/0x220
 [<c017c2c7>] search_binary_handler+0x97/0x220
 [<c017dd0f>] do_execve+0x14f/0x200
 [<c010140e>] sys_execve+0x2e/0x80
 [<c0102dcc>] sysenter_past_esp+0x5d/0x99
 =======================
Code: 06 8b 4d c0 8b 7d c8 8d 04 81 8d 44 82 20 01 c7 9c 8f 45 dc fa e8 4b f4 fd ff 8b 07 85 c0 74 7b 8b 47 0c 8b 08 8d 70 d4 8b 50 04 <89> 51 04 89 0a c7 40 04 00 02 20 00 c7 00 00 01 10 00 ff 0f 8b 
EIP: [<c015b676>] get_page_from_freelist+0x166/0x3d0 SS:ESP 0068:f67d3b34

Not pretty.  That was bare mainline+christoph's patches+that patch which I sent.
Using http://userweb.kernel.org/~akpm/config-vmm.txt
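
One way to read the register dump: ecx and edx hold 00100100 and 00200200,
which are the LIST_POISON1 and LIST_POISON2 values from
include/linux/poison.h that list_del() stores into an entry it has just
unlinked. The faulting address 00100104 is then LIST_POISON1 plus the
offset of ->prev in a 32-bit struct list_head, which is where the first
store of a list_del() on an already-deleted entry would land. That is
consistent with a page whose ->lru was unlinked twice, though the trace
alone does not say which caller did it. A small userspace sketch of the
arithmetic follows; struct list_head32 and the helper names are made up for
illustration, and the fields are fixed-width so the i386 layout is
reproduced on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Poison values as defined in include/linux/poison.h. */
#define LIST_POISON1 0x00100100u
#define LIST_POISON2 0x00200200u

/* Hypothetical model of the 32-bit struct list_head layout. */
struct list_head32 {
	uint32_t next;	/* offset 0 */
	uint32_t prev;	/* offset 4 */
};

/*
 * list_del() starts with "next->prev = prev". Return the address that
 * this first store writes to for a given entry.
 */
static uint32_t first_store_of_list_del(const struct list_head32 *entry)
{
	return entry->next + (uint32_t)offsetof(struct list_head32, prev);
}

/* The same computation for an entry that list_del() has already poisoned. */
static uint32_t fault_address_for_poisoned_entry(void)
{
	struct list_head32 poisoned = { LIST_POISON1, LIST_POISON2 };

	return first_store_of_list_del(&poisoned);
}
```

fault_address_for_poisoned_entry() evaluates to 0x00100104, the address
reported in the oops.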

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-23 17:54         ` Christoph Lameter
@ 2007-03-24  6:21           ` Andrew Morton
  2007-03-26 16:52             ` Christoph Lameter
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2007-03-24  6:21 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

On Fri, 23 Mar 2007 10:54:12 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> Here are the results of aim9 tests on x86_64. There are some minor performance 
> improvements and some fluctuations.

There are a lot of numbers there - what do they tell us?

> 2.6.21-rc4 bare
> 2.6.21-rc4 x86_64 quicklist

So what has changed here?  From a quick look it appears that x86_64 is
using get_zeroed_page() for ptes, puds and pmds and is using a custom
quicklist for pgds.

After your patches, x86_64 is using a common quicklist allocator for puds,
pmds and pgds and continues to use get_zeroed_page() for ptes.

Or something totally different, dunno.  I tire.


My question is pretty simple: how do we justify the retention of this
custom allocator?

Because simply removing it is the preferable way of fixing the SLUB
problem.

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-24  6:21           ` Andrew Morton
@ 2007-03-26 16:52             ` Christoph Lameter
  2007-03-26 18:14               ` Christoph Lameter
  2007-03-26 18:26               ` Andrew Morton
  0 siblings, 2 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-26 16:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Fri, 23 Mar 2007, Andrew Morton wrote:

> On Fri, 23 Mar 2007 10:54:12 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Here are the results of aim9 tests on x86_64. There are some minor performance 
> > improvements and some fluctuations.
> 
> There are a lot of numbers there - what do they tell us?

That there are performance improvements because of quicklists.

> So what has changed here?  From a quick look it appears that x86_64 is
> using get_zeroed_page() for ptes, puds and pmds and is using a custom
> quicklist for pgds.

x86_64 is only using a list in order to track pgds. There is no 
quicklist without this patchset.
 
> After your patches, x86_64 is using a common quicklist allocator for puds,
> pmds and pgds and continues to use get_zeroed_page() for ptes.

x86_64 should be using quicklists for all ptes after this patch. I did not 
convert pte_free() since it is only used for freeing ptes during races 
(see __pte_alloc). Since pte_free gets passed a page struct it would require 
virt_to_page before being put onto the freelist. Not worth doing.

Hmmm... Then how does x86_64 free the ptes? Seems that we do 
free_page_and_swap_cache() in tlb_remove_pages. Yup so ptes are not 
handled which limits the speed improvements that we see.

> My question is pretty simple: how do we justify the retention of this
> custom allocator?

I would expect this functionality (never thought about it as an allocator) 
to extract common code from many arches that use one or the other form of 
preserving zeroed pages for page table pages. I saw lots of arches doing 
the same with some getting into trouble with the page structs. Having a 
common code base that does not have this issue would clean up the kernel 
and deal with the slab issue.

> Because simply removing it is the preferable way of fixing the SLUB
> problem.

That would reduce performance. I did not think that a common feature 
that is used throughout many arches would need rejustification.

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-26 16:52             ` Christoph Lameter
@ 2007-03-26 18:14               ` Christoph Lameter
  2007-03-26 18:26               ` Andrew Morton
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-26 18:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Mon, 26 Mar 2007, Christoph Lameter wrote:

> > After your patches, x86_64 is using a common quicklist allocator for puds,
> > pmds and pgds and continues to use get_zeroed_page() for ptes.
> 
> x86_64 should be using quicklists for all ptes after this patch. I did not 
> convert pte_free() since it is only used for freeing ptes during races 
> (see __pte_alloc). Since pte_free gets passed a page struct it would require 
> virt_to_page before being put onto the freelist. Not worth doing.
> 
> Hmmm... Then how does x86_64 free the ptes? Seems that we do 
> free_page_and_swap_cache() in tlb_remove_pages. Yup so ptes are not 
> handled which limits the speed improvements that we see.

And if we would try to put the ptes onto quicklists then we would get into 
more difficulties with the tlb shootdown code. Sigh. We cannot easily 
deal with ptes. Quicklists on i386 and x86_64 only work for pgds, puds and
pmds. And as was pointed out elsewhere in this thread: The performance 
gains are therefore limited on these platforms.
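
For reference, the mechanism under discussion can be modelled in a few
lines of userspace C. This is only a sketch, assuming 4 KiB pages:
struct ql_model and the ql_model_* helpers are hypothetical stand-ins for
the per-cpu struct quicklist, calloc() stands in for
__get_free_page(__GFP_ZERO), and the NUMA node check, constructors and
destructors are left out. What it shows is that a page coming back from
the list needs only its first word cleared instead of a full page of
zeroing, provided whoever freed it left the rest zero.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Hypothetical userspace stand-in for the per-cpu struct quicklist. */
struct ql_model {
	void *page;	/* most recently freed page, or NULL if empty */
	int nr_pages;	/* number of pages currently cached */
};

/* Pop a cached page if there is one, otherwise get a fresh zeroed page. */
static void *ql_model_alloc(struct ql_model *q)
{
	void **p = q->page;

	if (p) {
		q->page = p[0];	/* the link lives in the page's first word */
		p[0] = NULL;	/* clear the link: page is all zero again */
		q->nr_pages--;
		return p;
	}
	return calloc(1, MODEL_PAGE_SIZE);
}

/* Push a page; the caller promises every other word is already zero. */
static void ql_model_free(struct ql_model *q, void *pp)
{
	void **p = pp;

	p[0] = q->page;
	q->page = p;
	q->nr_pages++;
}
```

The list is LIFO, so the page handed out next is the one freed most
recently and therefore the one most likely to still be cache hot.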

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-26 16:52             ` Christoph Lameter
  2007-03-26 18:14               ` Christoph Lameter
@ 2007-03-26 18:26               ` Andrew Morton
  2007-03-27  1:06                 ` William Lee Irwin III
  2007-03-27 11:19                 ` William Lee Irwin III
  1 sibling, 2 replies; 26+ messages in thread
From: Andrew Morton @ 2007-03-26 18:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, linux-kernel

On Mon, 26 Mar 2007 09:52:17 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 23 Mar 2007, Andrew Morton wrote:
> 
> > On Fri, 23 Mar 2007 10:54:12 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > 
> > > Here are the results of aim9 tests on x86_64. There are some minor performance 
> > > improvements and some fluctuations.
> > 
> > There are a lot of numbers there - what do they tell us?
> 
> That there are performance improvements because of quicklists.

Christoph, you can continue to be obtuse, and I can continue to ignore
these patches until

a) it has been demonstrated that this patch is superior to simply removing
   the quicklists and

b) we understand why the below simple modification crashes i386.


diff -puN include/linux/quicklist.h~qlhack include/linux/quicklist.h
--- a/include/linux/quicklist.h~qlhack
+++ a/include/linux/quicklist.h
@@ -32,45 +32,17 @@ DECLARE_PER_CPU(struct quicklist, quickl
  */
 static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
 {
-	struct quicklist *q;
-	void **p = NULL;
-
-	q =&get_cpu_var(quicklist)[nr];
-	p = q->page;
-	if (likely(p)) {
-		q->page = p[0];
-		p[0] = NULL;
-		q->nr_pages--;
-	}
-	put_cpu_var(quicklist);
-	if (likely(p))
-		return p;
-
-	p = (void *)__get_free_page(flags | __GFP_ZERO);
+	void *p = (void *)__get_free_page(flags | __GFP_ZERO);
 	if (ctor && p)
 		ctor(p);
 	return p;
 }
 
-static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
+static inline void quicklist_free(int nr, void (*dtor)(void *), void *p)
 {
-	struct quicklist *q;
-	void **p = pp;
-	struct page *page = virt_to_page(p);
-	int nid = page_to_nid(page);
-
-	if (unlikely(nid != numa_node_id())) {
-		if (dtor)
-			dtor(p);
-		free_page((unsigned long)p);
-		return;
-	}
-
-	q = &get_cpu_var(quicklist)[nr];
-	p[0] = q->page;
-	q->page = p;
-	q->nr_pages++;
-	put_cpu_var(quicklist);
+	if (dtor)
+		dtor(p);
+	free_page((unsigned long)p);
 }
 
 void quicklist_trim(int nr, void (*dtor)(void *),
@@ -81,4 +53,3 @@ unsigned long quicklist_total_size(void)
 #endif
 
 #endif /* LINUX_QUICKLIST_H */
-
_



* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-26 18:26               ` Andrew Morton
@ 2007-03-27  1:06                 ` William Lee Irwin III
  2007-03-27  1:22                   ` Christoph Lameter
  2007-03-27  1:45                   ` David Miller
  2007-03-27 11:19                 ` William Lee Irwin III
  1 sibling, 2 replies; 26+ messages in thread
From: William Lee Irwin III @ 2007-03-27  1:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

On Mon, Mar 26, 2007 at 10:26:51AM -0800, Andrew Morton wrote:
> a) it has been demonstrated that this patch is superior to simply removing
>    the quicklists and

Not that clameter really needs my help, but I agree with his position
on several fronts, and advocate accordingly, so here is where I'm at.

From prior experience, I believe I know how to extract positive results,
and that's primarily by PTE caching because they're the most frequently
zeroed pagetable nodes. The upper levels of pagetables will remain in
the noise until the leaf level bottleneck is dealt with.

PTE's need a custom tlb.h to deal with the TLB issues noted above; the
asm-generic variant will not suffice. Results above the noise level
need PTE caching. Sparse fault handling (esp. after execve() is done)
is one place in particular where improvements should be most readily
demonstrable, as only single cachelines on each allocated node should
be touched. lmbench should have a fault handling latency test for this.
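
To put rough numbers on the sparse-access argument, here is a
back-of-the-envelope sketch. The 4096-byte page, 64-byte cacheline and
8-byte entry sizes are assumptions for illustration, and the helper names
are made up; nothing here is measured.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB page table page */
#define SKETCH_CACHELINE 64u	/* assumption: 64-byte cachelines */

/* Cachelines written when a freshly allocated page is zeroed in full. */
static size_t lines_touched_by_zeroing(void)
{
	return SKETCH_PAGE_SIZE / SKETCH_CACHELINE;
}

/*
 * Cachelines written when `count` adjacent entries of `entry_size` bytes,
 * starting at index `first`, are filled in on a page that is already zero.
 */
static size_t lines_touched_by_entries(size_t first, size_t count,
				       size_t entry_size)
{
	size_t first_byte, last_byte;

	if (count == 0)
		return 0;
	first_byte = first * entry_size;
	last_byte = (first + count) * entry_size - 1;
	return last_byte / SKETCH_CACHELINE - first_byte / SKETCH_CACHELINE + 1;
}
```

Under these assumptions, zeroing a fresh page writes 64 lines, while
filling in one 8-byte entry of a page already known to be zero writes one.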


On Mon, Mar 26, 2007 at 10:26:51AM -0800, Andrew Morton wrote:
> b) we understand why the below simple modification crashes i386.

Full eager zeroing patches not dependent on quicklist code don't crash,
so there is no latent use-after-free issue covered up by caching. I'll
help out more on the i386 front as-needed.


-- wli

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-27  1:06                 ` William Lee Irwin III
@ 2007-03-27  1:22                   ` Christoph Lameter
  2007-03-27  1:45                   ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-27  1:22 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, linux-mm, linux-kernel

On Mon, 26 Mar 2007, William Lee Irwin III wrote:

> Not that clameter really needs my help, but I agree with his position
> on several fronts, and advocate accordingly, so here is where I'm at.

Yes thank you. I386 is not my field, I have no interest per se in 
improving i386 performance and without your help I would have to drop this 
and keep the special casing in SLUB for i386. Generic tlb.h changes may 
also help to introduce quicklists to x86_64. The current quicklist patches 
can only work on higher levels due to the freeing of ptes via 
tlb_remove_page().


* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-27  1:06                 ` William Lee Irwin III
  2007-03-27  1:22                   ` Christoph Lameter
@ 2007-03-27  1:45                   ` David Miller
  1 sibling, 0 replies; 26+ messages in thread
From: David Miller @ 2007-03-27  1:45 UTC (permalink / raw)
  To: wli; +Cc: akpm, clameter, linux-mm, linux-kernel

From: William Lee Irwin III <wli@holomorphy.com>
Date: Mon, 26 Mar 2007 18:06:24 -0700

> On Mon, Mar 26, 2007 at 10:26:51AM -0800, Andrew Morton wrote:
> > b) we understand why the below simple modification crashes i386.
> 
> Full eager zeroing patches not dependent on quicklist code don't crash,
> so there is no latent use-after-free issue covered up by caching. I'll
> help out more on the i386 front as-needed.

I've looked into this a few times and I am quite mystified as
to why that simple test patch crashes.

* Re: [QUICKLIST 1/5] Quicklists for page table pages V4
  2007-03-26 18:26               ` Andrew Morton
  2007-03-27  1:06                 ` William Lee Irwin III
@ 2007-03-27 11:19                 ` William Lee Irwin III
  1 sibling, 0 replies; 26+ messages in thread
From: William Lee Irwin III @ 2007-03-27 11:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-mm, linux-kernel

On Mon, Mar 26, 2007 at 10:26:51AM -0800, Andrew Morton wrote:
> b) we understand why the below simple modification crashes i386.

This doesn't crash i386 in qemu here on a port of the quicklist patches
to 2.6.21-rc5-mm2. I suppose I'll have to dump it on some real hardware
to see if I can reproduce it there.


-- wli

* [QUICKLIST 4/5] Quicklist support for x86_64
  2007-03-19 23:37 [QUICKLIST 1/5] Quicklists for page table pages V3 Christoph Lameter
@ 2007-03-19 23:37 ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2007-03-19 23:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Christoph Lameter, linux-kernel

Convert x86_64 to use quicklists

This adds caching of pgds, puds, pmds and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is used to separate out PGD handling. Thus we can carry
the initialized pgds of terminating processes over to the next process
needing them.

Also clean up the pgd_list handling to use regular list macros. Not using
the slab allocator frees up the lru field so we can use regular list macros.

The adding and removal of the pgds to and from the pgd_list is moved into
the constructor / destructor. We can then avoid moving pgds off the list
while they are still in the quicklists, reducing the pgd creation and
allocation overhead further.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/Kconfig	2007-03-13 00:09:50.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/Kconfig	2007-03-15 22:00:04.000000000 -0700
@@ -56,6 +56,14 @@ config ZONE_DMA
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
+config NR_QUICK
+	int
+	default 2
+
 config ISA
 	bool
 
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgalloc.h	2007-03-13 00:09:50.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgalloc.h	2007-03-15 21:59:31.000000000 -0700
@@ -4,6 +4,10 @@
 #include <asm/pda.h>
 #include <linux/threads.h>
 #include <linux/mm.h>
+#include <linux/quicklist.h>
+
+#define QUICK_PGD 0	/* We preserve special mappings over free */
+#define QUICK_PT 1	/* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -20,86 +24,77 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
 	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-	free_page((unsigned long)pmd);
+	quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
 	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-	free_page((unsigned long)pud);
+	quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+	unsigned boundary;
+	pgd_t *pgd = x;
 	struct page *page = virt_to_page(pgd);
 
+	/*
+	 * Copy kernel pointers in from init.
+	 */
+	boundary = pgd_index(__PAGE_OFFSET);
+	memcpy(pgd + boundary,
+		init_level4_pgt + boundary,
+		(PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
 	spin_lock(&pgd_lock);
-	page->index = (pgoff_t)pgd_list;
-	if (pgd_list)
-		pgd_list->private = (unsigned long)&page->index;
-	pgd_list = page;
-	page->private = (unsigned long)&pgd_list;
+	list_add(&page->lru, &pgd_list);
 	spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-	struct page *next, **pprev, *page = virt_to_page(pgd);
+	pgd_t *pgd = x;
+	struct page *page = virt_to_page(pgd);
 
 	spin_lock(&pgd_lock);
-	next = (struct page *)page->index;
-	pprev = (struct page **)page->private;
-	*pprev = next;
-	if (next)
-		next->private = (unsigned long)pprev;
+	list_del(&page->lru);
 	spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	unsigned boundary;
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (!pgd)
-		return NULL;
-	pgd_list_add(pgd);
-	/*
-	 * Copy kernel pointers in from init.
-	 * Could keep a freelist or slab cache of those because the kernel
-	 * part never changes.
-	 */
-	boundary = pgd_index(__PAGE_OFFSET);
-	memset(pgd, 0, boundary * sizeof(pgd_t));
-	memcpy(pgd + boundary,
-	       init_level4_pgt + boundary,
-	       (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+	pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+			 GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
+
 	return pgd;
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
 	BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
-	pgd_list_del(pgd);
-	free_page((unsigned long)pgd);
+	quicklist_free(QUICK_PGD, pgd_dtor, pgd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+	void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 	if (!p)
 		return NULL;
 	return virt_to_page(p);
@@ -111,17 +106,22 @@ static inline struct page *pte_alloc_one
 static inline void pte_free_kernel(pte_t *pte)
 {
 	BUG_ON((unsigned long)pte & (PAGE_SIZE-1));
-	free_page((unsigned long)pte); 
+	quicklist_free(QUICK_PT, NULL, pte);
 }
 
 static inline void pte_free(struct page *pte)
 {
 	__free_page(pte);
-} 
+}
 
 #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
 
 #define __pmd_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 #define __pud_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 
+static inline void check_pgt_cache(void)
+{
+	quicklist_check(QUICK_PGD, pgd_dtor);
+	quicklist_check(QUICK_PT, NULL);
+}
 #endif /* _X86_64_PGALLOC_H */
Index: linux-2.6.21-rc3-mm2/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/kernel/process.c	2007-03-13 00:09:50.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/kernel/process.c	2007-03-15 21:59:31.000000000 -0700
@@ -207,6 +207,7 @@ void cpu_idle (void)
 			if (__get_cpu_var(cpu_idle_state))
 				__get_cpu_var(cpu_idle_state) = 0;
 
+			check_pgt_cache();
 			rmb();
 			idle = pm_idle;
 			if (!idle)
Index: linux-2.6.21-rc3-mm2/arch/x86_64/kernel/smp.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/kernel/smp.c	2007-03-13 00:09:50.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/kernel/smp.c	2007-03-15 21:59:31.000000000 -0700
@@ -242,7 +242,7 @@ void flush_tlb_mm (struct mm_struct * mm
 	}
 	if (!cpus_empty(cpu_mask))
 		flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
-
+	check_pgt_cache();
 	preempt_enable();
 }
 EXPORT_SYMBOL(flush_tlb_mm);
Index: linux-2.6.21-rc3-mm2/arch/x86_64/mm/fault.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/arch/x86_64/mm/fault.c	2007-03-13 00:09:50.000000000 -0700
+++ linux-2.6.21-rc3-mm2/arch/x86_64/mm/fault.c	2007-03-15 21:59:31.000000000 -0700
@@ -585,7 +585,7 @@ do_sigbus:
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
+LIST_HEAD(pgd_list);
 
 void vmalloc_sync_all(void)
 {
@@ -605,8 +605,7 @@ void vmalloc_sync_all(void)
 			if (pgd_none(*pgd_ref))
 				continue;
 			spin_lock(&pgd_lock);
-			for (page = pgd_list; page;
-			     page = (struct page *)page->index) {
+			list_for_each_entry(page, &pgd_list, lru) {
 				pgd_t *pgd;
 				pgd = (pgd_t *)page_address(page) + pgd_index(address);
 				if (pgd_none(*pgd))
Index: linux-2.6.21-rc3-mm2/include/asm-x86_64/pgtable.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/asm-x86_64/pgtable.h	2007-03-13 00:09:50.000000000 -0700
+++ linux-2.6.21-rc3-mm2/include/asm-x86_64/pgtable.h	2007-03-15 21:59:31.000000000 -0700
@@ -402,7 +402,7 @@ static inline pte_t pte_modify(pte_t pte
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })
 
 extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
+extern struct list_head pgd_list;
 void vmalloc_sync_all(void);
 
 #endif /* !__ASSEMBLY__ */
@@ -419,7 +419,6 @@ extern int kern_addr_valid(unsigned long
 #define HAVE_ARCH_UNMAPPED_AREA
 
 #define pgtable_cache_init()   do { } while (0)
-#define check_pgt_cache()      do { } while (0)
 
 #define PAGE_AGP    PAGE_KERNEL_NOCACHE
 #define HAVE_PAGE_AGP 1
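
For readers following the pgd_alloc()/pgd_free() conversion above, here is a
minimal userspace sketch of the quicklist idea. It is an illustration under
stated assumptions, not the code from patch 1/5: the names ql_alloc, ql_free,
ql_trim, demo_ctor and demo_dtor are hypothetical, calloc() stands in for
get_zeroed_page(), and there is no per-CPU or NUMA handling. The diff above
still calls quicklist_check(), while the V4 changelog describes a
quicklist_trim() taking extra parameters; the max_keep argument below is an
assumed parameter, not that interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
enum { QUICK_PGD, QUICK_PT, NR_QUICK };

/* Hypothetical stand-in for one CPU's quicklists. */
struct quicklist {
	void *head;		/* first cached page, linked through word 0 */
	size_t nr_pages;	/* number of pages currently cached */
};
static struct quicklist ql[NR_QUICK];

/* Stand-ins for pgd_ctor/pgd_dtor: count pages in the constructed state. */
static int live_pgds;
static void demo_ctor(void *p) { (void)p; live_pgds++; }
static void demo_dtor(void *p) { (void)p; live_pgds--; }

/* Take a cached page if one exists; otherwise get a zeroed page and
 * run the constructor once on it. */
static void *ql_alloc(int nr, void (*ctor)(void *))
{
	struct quicklist *q = &ql[nr];
	void **p = q->head;

	if (p) {
		q->head = p[0];
		p[0] = NULL;	/* link word goes back to zero */
		q->nr_pages--;
		return p;
	}
	p = calloc(1, PAGE_SIZE);
	if (p && ctor)
		ctor(p);
	return p;
}

/* Cache the page. The caller must have returned it to its post-ctor
 * state. dtor is unused here because this sketch never bypasses the list. */
static void ql_free(int nr, void (*dtor)(void *), void *page)
{
	struct quicklist *q = &ql[nr];
	void **p = page;

	(void)dtor;
	p[0] = q->head;
	q->head = p;
	q->nr_pages++;
}

/* Release cached pages until at most max_keep remain, running the
 * destructor on each page before it is freed. Returns pages released. */
static size_t ql_trim(int nr, void (*dtor)(void *), size_t max_keep)
{
	struct quicklist *q = &ql[nr];
	size_t freed = 0;

	while (q->nr_pages > max_keep) {
		void **p = q->head;

		assert(p != NULL);
		q->head = p[0];
		q->nr_pages--;
		if (dtor)
			dtor(p);
		free(p);
		freed++;
	}
	return freed;
}
```

The point of the split is that the constructor runs only when a page comes
fresh from the allocator and the destructor only when it leaves the cache, so
a pgd that cycles through the list keeps its kernel mappings and its pgd_list
membership.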

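The patch above also replaces the hand-rolled page->index / page->private
chaining of pgd_list with a list_head threaded through page->lru. The
following is a simplified userspace sketch of that pattern, assuming a
hypothetical struct fake_page in place of struct page. list_add and list_del
are re-implemented here in reduced form (the kernel versions poison the
removed entry rather than setting it to NULL), and page_of and count_pgds are
illustrative helpers that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Insert 'new' right after 'head'. */
static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* Unlink 'entry'; no special case is needed for the first or last
 * element because the list is circular. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
}

/* Hypothetical stand-in for struct page with its lru member. */
struct fake_page {
	int id;
	struct list_head lru;
};

/* Recover the containing fake_page from a pointer to its lru member. */
#define page_of(ptr) \
	((struct fake_page *)((char *)(ptr) - offsetof(struct fake_page, lru)))

static struct list_head pgd_list = LIST_HEAD_INIT(pgd_list);

/* Walk the list the way list_for_each_entry() would. */
static int count_pgds(void)
{
	int n = 0;
	struct list_head *p;

	for (p = pgd_list.next; p != &pgd_list; p = p->next)
		n++;
	return n;
}
```

Compared with the removed pgd_list_add()/pgd_list_del(), the circular list
needs no NULL checks on the neighbours, which is why the new constructor and
destructor each shrink to a single list call under pgd_lock.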
end of thread (newest message: 2007-03-27 11:19 UTC)

Thread overview: 26+ messages
2007-03-23  6:28 [QUICKLIST 1/5] Quicklists for page table pages V4 Christoph Lameter
2007-03-23  6:28 ` [QUICKLIST 2/5] Quicklist support for IA64 Christoph Lameter
2007-03-23  6:28 ` [QUICKLIST 3/5] Quicklist support for i386 Christoph Lameter
2007-03-23  6:28 ` [QUICKLIST 4/5] Quicklist support for x86_64 Christoph Lameter
2007-03-23  6:29 ` [QUICKLIST 5/5] Quicklist support for sparc64 Christoph Lameter
2007-03-23  6:39 ` [QUICKLIST 1/5] Quicklists for page table pages V4 Andrew Morton
2007-03-23  6:52   ` Christoph Lameter
2007-03-23  7:48     ` Andrew Morton
2007-03-23 11:23       ` William Lee Irwin III
2007-03-23 14:58         ` Christoph Lameter
2007-03-23 11:29       ` William Lee Irwin III
2007-03-23 14:57         ` William Lee Irwin III
2007-03-23 19:17           ` William Lee Irwin III
2007-03-23 11:39       ` Nick Piggin
2007-03-24  5:14         ` Andrew Morton
2007-03-23 15:08       ` Christoph Lameter
2007-03-23 17:54         ` Christoph Lameter
2007-03-24  6:21           ` Andrew Morton
2007-03-26 16:52             ` Christoph Lameter
2007-03-26 18:14               ` Christoph Lameter
2007-03-26 18:26               ` Andrew Morton
2007-03-27  1:06                 ` William Lee Irwin III
2007-03-27  1:22                   ` Christoph Lameter
2007-03-27  1:45                   ` David Miller
2007-03-27 11:19                 ` William Lee Irwin III
  -- strict thread matches above, loose matches on Subject: below --
2007-03-19 23:37 [QUICKLIST 1/5] Quicklists for page table pages V3 Christoph Lameter
2007-03-19 23:37 ` [QUICKLIST 4/5] Quicklist support for x86_64 Christoph Lameter
