LKML Archive on lore.kernel.org
* [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
@ 2007-04-28 4:43 Rik van Riel
2007-05-04 10:53 ` Nick Piggin
0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2007-04-28 4:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, linux-mm
[-- Attachment #1: Type: text/plain, Size: 917 bytes --]
With lazy freeing of anonymous pages through MADV_FREE, performance of
the MySQL sysbench workload more than doubles on my quad-core system.
Madvise with MADV_FREE is used by applications to tell the kernel that
memory no longer contains useful data and can be reclaimed by the
kernel if it is needed elsewhere. However, if the application puts
new data in the page (dirty bit gets set by hardware), the kernel
will not throw away the data.
This makes applications that free() and then later on malloc() the
same data again run a lot faster, since page faults are avoided.
In low memory situations, the kernel still knows which pages to
reclaim.
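From userspace, the behaviour described above could be exercised roughly as follows. This is a minimal sketch, not part of the patch: `hint_unused` and `reuse_after_hint` are hypothetical helper names, the MADV_FREE constant is taken from the system headers only if they define it, and a failed hint is simply ignored because the advice is optional.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: tell the kernel that [addr, addr + len) holds no
 * useful data. The hint is advisory, so a failure is deliberately ignored. */
static void hint_unused(void *addr, size_t len)
{
#ifdef MADV_FREE
    (void)madvise(addr, len, MADV_FREE);
#else
    (void)madvise(addr, len, MADV_DONTNEED);   /* assumption: eager fallback */
#endif
}

/* Returns 1 if data written after the hint is intact, 0 if it was lost,
 * -1 if the mapping could not be set up. */
static int reuse_after_hint(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t i;
    int intact = 1;

    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);   /* old contents, like a chunk before free() */
    hint_unused(p, len);    /* like free(): the old contents may be dropped */
    memset(p, 0x55, len);   /* like malloc() + use: pages are dirtied again */

    for (i = 0; i < len; i++) {
        if (p[i] != 0x55) {
            intact = 0;
            break;
        }
    }
    munmap(p, len);
    return intact;
}
```

The point of the sketch is the last memset(): once the application has written new data, that data has to survive, whether or not the kernel got around to reclaiming the old page in between.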
"Doing it all in userspace" is not a good solution for this problem,
because if the system needs the memory it is way cheaper to just throw
away these freed pages than to do the disk IO of swapping them out and
back in.
Signed-off-by: Rik van Riel <riel@redhat.com>
[-- Attachment #2: linux-2.6.21-madv_free.patch --]
[-- Type: text/x-patch, Size: 13537 bytes --]
--- linux-2.6.21.noarch/mm/rmap.c.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/mm/rmap.c 2007-04-27 16:03:22.000000000 -0400
@@ -656,7 +656,17 @@ static int try_to_unmap_one(struct page
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageAnon(page)) {
+ /* MADV_FREE is used to lazily free memory from userspace. */
+ if (PageLazyFree(page) && !migration) {
+ if (unlikely(pte_dirty(pteval))) {
+ /* There is new data in the page. Reinstate it. */
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ /* Throw the page away. */
+ dec_mm_counter(mm, anon_rss);
+ } else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };
if (PageSwapCache(page)) {
--- linux-2.6.21.noarch/mm/page_alloc.c.madv_free 2007-04-27 16:03:22.000000000 -0400
+++ linux-2.6.21.noarch/mm/page_alloc.c 2007-04-27 16:03:22.000000000 -0400
@@ -203,6 +203,7 @@ static void bad_page(struct page *page)
1 << PG_slab |
1 << PG_swapcache |
1 << PG_writeback |
+ 1 << PG_lazyfree |
1 << PG_buddy );
set_page_count(page, 0);
reset_page_mapcount(page);
@@ -442,6 +443,8 @@ static inline int free_pages_check(struc
bad_page(page);
if (PageDirty(page))
__ClearPageDirty(page);
+ if (PageLazyFree(page))
+ __ClearPageLazyFree(page);
/*
* For now, we report if PG_reserved was found set, but do not
* clear it, and do not free the page. But we shall soon need
@@ -588,6 +591,7 @@ static int prep_new_page(struct page *pa
1 << PG_swapcache |
1 << PG_writeback |
1 << PG_reserved |
+ 1 << PG_lazyfree |
1 << PG_buddy ))))
bad_page(page);
--- linux-2.6.21.noarch/mm/memory.c.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/mm/memory.c 2007-04-27 21:12:57.000000000 -0400
@@ -432,6 +432,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ int dirty = 0;
/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -466,6 +467,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
* in the parent and the child
*/
if (is_cow_mapping(vm_flags)) {
+ dirty = pte_dirty(pte);
ptep_set_wrprotect(src_mm, addr, src_pte);
pte = pte_wrprotect(pte);
}
@@ -483,6 +485,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
get_page(page);
page_dup_rmap(page);
rss[!!PageAnon(page)]++;
+ if (dirty && PageLazyFree(page))
+ ClearPageLazyFree(page);
}
out_set_pte:
@@ -661,6 +665,28 @@ static unsigned long zap_pte_range(struc
(page->index < details->first_index ||
page->index > details->last_index))
continue;
+
+ /*
+ * MADV_FREE is used to lazily recycle
+ * anon memory. The process no longer
+ * needs the data and wants to avoid IO.
+ */
+ if (details->madv_free && PageAnon(page)) {
+ if (unlikely(PageSwapCache(page)) &&
+ !TestSetPageLocked(page)) {
+ remove_exclusive_swap_page(page);
+ unlock_page(page);
+ }
+ ptep_test_and_clear_dirty(vma, addr, pte);
+ ptep_test_and_clear_young(vma, addr, pte);
+ SetPageLazyFree(page);
+ if (PageActive(page))
+ deactivate_tail_page(page);
+ /* tlb_remove_page frees it again */
+ get_page(page);
+ tlb_remove_page(tlb, page);
+ continue;
+ }
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
@@ -689,7 +715,8 @@ static unsigned long zap_pte_range(struc
* If details->check_mapping, we leave swap entries;
* if details->nonlinear_vma, we leave file entries.
*/
- if (unlikely(details))
+ if (unlikely(details && (details->check_mapping ||
+ details->nonlinear_vma)))
continue;
if (!pte_file(ptent))
free_swap_and_cache(pte_to_swp_entry(ptent));
@@ -755,7 +782,8 @@ static unsigned long unmap_page_range(st
pgd_t *pgd;
unsigned long next;
- if (details && !details->check_mapping && !details->nonlinear_vma)
+ if (details && !details->check_mapping && !details->nonlinear_vma
+ && !details->madv_free)
details = NULL;
BUG_ON(addr >= end);
--- linux-2.6.21.noarch/mm/vmscan.c.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/mm/vmscan.c 2007-04-27 16:03:22.000000000 -0400
@@ -473,6 +473,24 @@ static unsigned long shrink_page_list(st
sc->nr_scanned++;
+ /*
+ * MADV_FREE pages get reclaimed lazily, unless the
+ * process reuses them before we get to them.
+ */
+ if (PageLazyFree(page)) {
+ switch (try_to_unmap(page, 0)) {
+ case SWAP_FAIL:
+ ClearPageLazyFree(page);
+ goto activate_locked;
+ case SWAP_AGAIN:
+ ClearPageLazyFree(page);
+ goto keep_locked;
+ case SWAP_SUCCESS:
+ ClearPageLazyFree(page);
+ goto free_it;
+ }
+ }
+
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
--- linux-2.6.21.noarch/mm/madvise.c.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/mm/madvise.c 2007-04-27 21:20:11.000000000 -0400
@@ -130,7 +130,8 @@ static long madvise_willneed(struct vm_a
*/
static long madvise_dontneed(struct vm_area_struct * vma,
struct vm_area_struct ** prev,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ int behavior)
{
*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
@@ -142,8 +143,14 @@ static long madvise_dontneed(struct vm_a
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
- } else
+ } else if (behavior == MADV_FREE) {
+ struct zap_details details = {
+ .madv_free = 1,
+ };
+ zap_page_range(vma, start, end - start, &details);
+ } else /* behavior == MADV_DONTNEED */
zap_page_range(vma, start, end - start, NULL);
+
return 0;
}
@@ -215,5 +222,6 @@ madvise_vma(struct vm_area_struct *vma,
break;
case MADV_DONTNEED:
- error = madvise_dontneed(vma, prev, start, end);
+ case MADV_FREE:
+ error = madvise_dontneed(vma, prev, start, end, behavior);
break;
--- linux-2.6.21.noarch/mm/swap.c.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/mm/swap.c 2007-04-27 16:03:22.000000000 -0400
@@ -151,6 +151,20 @@ void fastcall activate_page(struct page
spin_unlock_irq(&zone->lru_lock);
}
+void fastcall deactivate_tail_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page) && PageActive(page)) {
+ del_page_from_active_list(zone, page);
+ ClearPageActive(page);
+ add_page_to_inactive_list_tail(zone, page);
+ __count_vm_event(PGDEACTIVATE);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* Mark a page as having seen activity.
*
--- linux-2.6.21.noarch/include/linux/page-flags.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/linux/page-flags.h 2007-04-27 16:03:22.000000000 -0400
@@ -91,6 +91,8 @@
#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_lazyfree 20 /* MADV_FREE potential throwaway */
+
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
@@ -237,6 +239,11 @@ static inline void SetPageUptodate(struc
#define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags)
#define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
+#define PageLazyFree(page) test_bit(PG_lazyfree, &(page)->flags)
+#define SetPageLazyFree(page) set_bit(PG_lazyfree, &(page)->flags)
+#define ClearPageLazyFree(page) clear_bit(PG_lazyfree, &(page)->flags)
+#define __ClearPageLazyFree(page) __clear_bit(PG_lazyfree, &(page)->flags)
+
#define PageCompound(page) test_bit(PG_compound, &(page)->flags)
#define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags)
#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
--- linux-2.6.21.noarch/include/linux/mm.h.madv_free 2007-04-27 16:03:22.000000000 -0400
+++ linux-2.6.21.noarch/include/linux/mm.h 2007-04-27 16:03:22.000000000 -0400
@@ -716,6 +716,7 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */
unsigned long truncate_count; /* Compare vm_truncate_count */
+ short madv_free; /* MADV_FREE anonymous memory */
};
struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
--- linux-2.6.21.noarch/include/linux/swap.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/linux/swap.h 2007-04-27 16:03:22.000000000 -0400
@@ -181,6 +181,7 @@ extern unsigned int nr_free_pagecache_pa
extern void FASTCALL(lru_cache_add(struct page *));
extern void FASTCALL(lru_cache_add_active(struct page *));
extern void FASTCALL(activate_page(struct page *));
+extern void FASTCALL(deactivate_tail_page(struct page *));
extern void FASTCALL(mark_page_accessed(struct page *));
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
--- linux-2.6.21.noarch/include/linux/mm_inline.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/linux/mm_inline.h 2007-04-27 16:03:22.000000000 -0400
@@ -13,6 +13,13 @@ add_page_to_inactive_list(struct zone *z
}
static inline void
+add_page_to_inactive_list_tail(struct zone *zone, struct page *page)
+{
+ list_add_tail(&page->lru, &zone->inactive_list);
+ __inc_zone_state(zone, NR_INACTIVE);
+}
+
+static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
--- linux-2.6.21.noarch/include/asm-sparc/mman.h.madv_free 2007-04-27 21:13:53.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-sparc/mman.h 2007-04-27 21:14:13.000000000 -0400
@@ -33,8 +33,6 @@
#define MC_LOCKAS 5 /* Lock an entire address space of the calling process */
#define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */
-#define MADV_FREE 0x5 /* (Solaris) contents can be freed */
-
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
#define arch_mmap_check sparc_mmap_check
--- linux-2.6.21.noarch/include/asm-parisc/mman.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-parisc/mman.h 2007-04-27 16:03:22.000000000 -0400
@@ -38,6 +38,7 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* don't need the pages or the data */
/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21.noarch/include/asm-xtensa/mman.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-xtensa/mman.h 2007-04-27 16:03:22.000000000 -0400
@@ -72,6 +72,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21.noarch/include/asm-generic/mman.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-generic/mman.h 2007-04-27 16:03:22.000000000 -0400
@@ -29,6 +29,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21.noarch/include/asm-mips/mman.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-mips/mman.h 2007-04-27 16:03:22.000000000 -0400
@@ -65,6 +65,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--- linux-2.6.21.noarch/include/asm-sparc64/mman.h.madv_free 2007-04-27 21:14:00.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-sparc64/mman.h 2007-04-27 21:14:16.000000000 -0400
@@ -33,8 +33,6 @@
#define MC_LOCKAS 5 /* Lock an entire address space of the calling process */
#define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */
-#define MADV_FREE 0x5 /* (Solaris) contents can be freed */
-
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
#define arch_mmap_check sparc64_mmap_check
--- linux-2.6.21.noarch/include/asm-alpha/mman.h.madv_free 2007-04-25 23:08:32.000000000 -0400
+++ linux-2.6.21.noarch/include/asm-alpha/mman.h 2007-04-27 16:03:22.000000000 -0400
@@ -42,6 +42,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* don't need the pages or the data */
/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
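The reclaim-side decision made by the rmap.c and vmscan.c hunks can be summarised in a small userspace model. This is only an illustrative sketch: `struct page_model`, `enum reclaim_action` and `reclaim_decision` are hypothetical names, the SWAP_AGAIN case and the migration check are left out, and none of it is kernel code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two bits the patch looks at. */
struct page_model {
    bool lazy_free;   /* models PG_lazyfree, set by madvise(MADV_FREE) */
    bool pte_dirty;   /* models pte_dirty(): written to since the madvise */
};

enum reclaim_action {
    RECLAIM_NORMAL_PATH,   /* not a lazy-free page: usual reclaim path */
    RECLAIM_KEEP,          /* new data present: keep and reactivate */
    RECLAIM_DISCARD        /* no new data: free without any swap IO */
};

static enum reclaim_action reclaim_decision(struct page_model *page)
{
    if (!page->lazy_free)
        return RECLAIM_NORMAL_PATH;

    /* The vmscan.c hunk clears the flag in every branch of its switch. */
    page->lazy_free = false;

    /* Models SWAP_FAIL (dirty pte found) vs. SWAP_SUCCESS. */
    return page->pte_dirty ? RECLAIM_KEEP : RECLAIM_DISCARD;
}
```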
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-04-28 4:43 [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory Rik van Riel
@ 2007-05-04 10:53 ` Nick Piggin
2007-05-04 11:58 ` Rik van Riel
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Nick Piggin @ 2007-05-04 10:53 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, linux-kernel, linux-mm, Andrew Morton,
Ulrich Drepper, Jakub Jelinek
Rik van Riel wrote:
> With lazy freeing of anonymous pages through MADV_FREE, performance of
> the MySQL sysbench workload more than doubles on my quad-core system.
OK, I've run some tests on a 16 core Opteron system, both sysbench with
MySQL 5.33 (set up as described in the freebsd vs linux page), and with
ebizzy.
What I found is that, on this system, MADV_FREE performance improvement
was in the noise when you look at it on top of the MADV_DONTNEED glibc
and down_read(mmap_sem) patch in sysbench.
In ebizzy it was slightly up at low loads and slightly down at high loads,
though I wouldn't put as much stock in ebizzy as the real workload,
because the numbers are going to be highly dependent on access patterns.
Now these numbers are collected under best-case conditions for MADV_FREE,
ie. no page reclaim going on. If you consider page reclaim, then you would
think there might be room for regressions.
So far, I'm not convinced this is a good use of a page flag or the added
complexity. There are lots of ways we can improve performance using a page
flag (my recent PG_waiters, PG_mlock, PG_replicated, etc.), so I think we
need some more numbers.
(I'll be away for the weekend...)
LHS is # threads, numbers are +/- 99.9% confidence.
sysbench transactions per sec (higher is better)
2.6.21
1, 453.092000 +/- 7.089284
2, 831.722000 +/- 13.138541
4, 1468.590000 +/- 40.160654
8, 2139.822000 +/- 62.223220
16, 2118.802000 +/- 83.247076
32, 1051.596000 +/- 62.455236
64, 917.078000 +/- 21.086954
new glibc
1, 466.376000 +/- 9.018054
2, 867.020000 +/- 26.163901
4, 1535.880000 +/- 25.784081
8, 2261.856000 +/- 53.350146
16, 2249.020000 +/- 120.361138
32, 1521.858000 +/- 110.236781
64, 1405.262000 +/- 85.260624
mmap_sem
1, 476.144000 +/- 15.865284
2, 871.778000 +/- 12.736486
4, 1529.348000 +/- 21.400517
8, 2235.590000 +/- 54.192125
16, 2177.422000 +/- 27.416498
32, 2120.986000 +/- 58.499708
64, 1949.362000 +/- 51.177977
madv_free
1, 475.056000 +/- 6.943168
2, 861.438000 +/- 22.101826
4, 1564.782000 +/- 55.190110
8, 2211.792000 +/- 59.843995
16, 2163.232000 +/- 46.031627
32, 2100.544000 +/- 86.744497
64, 1947.058000 +/- 62.392049
ebizzy elapsed time (lower is better)
mmap_sem
1, 45.544000 +/- 3.538529
4, 78.492000 +/- 8.881464
16, 224.538000 +/- 7.762784
64, 913.466000 +/- 53.506338
madv_free
1, 43.350000 +/- 0.778292
4, 68.190000 +/- 8.623731
16, 225.568000 +/- 14.940109
64, 899.136000 +/- 56.153209
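One rough way to read "in the noise" from the tables above is to check whether two rows' confidence intervals overlap. A minimal sketch, assuming each row is a mean and a symmetric half-width; `struct estimate` and `intervals_overlap` are names made up for illustration, and overlapping intervals are only a heuristic, not a proper significance test.

```c
#include <stdbool.h>

/* One table row: mean and the "+/-" half-width of its interval. */
struct estimate {
    double mean;
    double half_width;
};

/* True if [mean - hw, mean + hw] of a and b share at least one point. */
static bool intervals_overlap(struct estimate a, struct estimate b)
{
    return a.mean - a.half_width <= b.mean + b.half_width &&
           b.mean - b.half_width <= a.mean + a.half_width;
}
```

With the 64-thread sysbench rows, mmap_sem (1949.362 +/- 51.177977) and madv_free (1947.058 +/- 62.392049) overlap, while 2.6.21 (917.078 +/- 21.086954) and mmap_sem do not.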
--
SUSE Labs, Novell Inc.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 10:53 ` Nick Piggin
@ 2007-05-04 11:58 ` Rik van Riel
2007-05-04 23:49 ` Nick Piggin
2007-05-04 16:04 ` Ulrich Drepper
2007-05-29 16:59 ` Rik van Riel
2 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2007-05-04 11:58 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, linux-kernel, linux-mm, Andrew Morton,
Ulrich Drepper, Jakub Jelinek
Nick Piggin wrote:
> Rik van Riel wrote:
>> With lazy freeing of anonymous pages through MADV_FREE, performance of
>> the MySQL sysbench workload more than doubles on my quad-core system.
>
> OK, I've run some tests on a 16 core Opteron system, both sysbench with
> MySQL 5.33 (set up as described in the freebsd vs linux page), and with
> ebizzy.
>
> What I found is that, on this system, MADV_FREE performance improvement
> was in the noise when you look at it on top of the MADV_DONTNEED glibc
> and down_read(mmap_sem) patch in sysbench.
Interesting, very different results from my system.
First, did you run with the properly TLB batched version of
the MADV_FREE patch? And did you make sure that MADV_FREE
takes the mmap_sem for reading? Without that, I did see
a similar thing to what you saw...
Secondly, I'll have to try some test runs on one of the larger
systems in the lab.
Maybe the results from my quad core Intel system are not
typical; maybe the results from your 16 core Opteron are
not typical. Either way, I want to find out :)
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 10:53 ` Nick Piggin
2007-05-04 11:58 ` Rik van Riel
@ 2007-05-04 16:04 ` Ulrich Drepper
2007-05-04 23:47 ` Nick Piggin
2007-05-09 16:38 ` [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory Hugh Dickins
2007-05-29 16:59 ` Rik van Riel
2 siblings, 2 replies; 22+ messages in thread
From: Ulrich Drepper @ 2007-05-04 16:04 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
[-- Attachment #1: Type: text/plain, Size: 835 bytes --]
Nick Piggin wrote:
> What I found is that, on this system, MADV_FREE performance improvement
> was in the noise when you look at it on top of the MADV_DONTNEED glibc
> and down_read(mmap_sem) patch in sysbench.
I don't want to judge the numbers since I cannot but I want to make an
observation: even if in the SMP case MADV_FREE turns out to not be a
bigger boost then there is still the UP case to keep in mind where Rik
measured a significant speed-up. As long as the SMP case isn't hurt
this is reason enough to use the patch. With more and more cores on one
processor SMP systems are pushed evermore to the high-end side. You'll
find many installations which today use SMP will be happy enough with
many-core UP machines.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 16:04 ` Ulrich Drepper
@ 2007-05-04 23:47 ` Nick Piggin
2007-05-05 0:10 ` Ulrich Drepper
` (2 more replies)
2007-05-09 16:38 ` [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory Hugh Dickins
1 sibling, 3 replies; 22+ messages in thread
From: Nick Piggin @ 2007-05-04 23:47 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Rik van Riel, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Ulrich Drepper wrote:
> Nick Piggin wrote:
>
>>What I found is that, on this system, MADV_FREE performance improvement
>>was in the noise when you look at it on top of the MADV_DONTNEED glibc
>>and down_read(mmap_sem) patch in sysbench.
>
>
> I don't want to judge the numbers since I cannot but I want to make an
> observation: even if in the SMP case MADV_FREE turns out to not be a
> bigger boost then there is still the UP case to keep in mind where Rik
> measured a significant speed-up. As long as the SMP case isn't hurt
> this is reason enough to use the patch. With more and more cores on one
> processor SMP systems are pushed evermore to the high-end side. You'll
> find many installations which today use SMP will be happy enough with
> many-core UP machines.
OK, sure. I think we need more numbers though.
And even if this was a patch with _no_ possibility for regressions and it
was a completely trivial one that improves performance in some cases...
one big problem is that it uses another page flag.
I literally have about 4 or 5 new page flags I'd like to add today :) I
can't of course, because we have very few spare ones left.
From the MySQL numbers on this system, it seems like performance is in the
noise, and MADV_DONTNEED makes the _vast_ majority of the improvement.
This is also the case with Rik's benchmarks, and while he did see some
improvement, I found the runs to be quite variable, so it would be ideal
to get a larger sample.
And the fact that the poor behaviour of the old style malloc/free went
unnoticed for so long indicates that it won't be the end of the world if
we didn't merge MADV_FREE right now.
--
SUSE Labs, Novell Inc.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 11:58 ` Rik van Riel
@ 2007-05-04 23:49 ` Nick Piggin
0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2007-05-04 23:49 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, linux-kernel, linux-mm, Andrew Morton,
Ulrich Drepper, Jakub Jelinek
Rik van Riel wrote:
> Nick Piggin wrote:
>
>> Rik van Riel wrote:
>>
>>> With lazy freeing of anonymous pages through MADV_FREE, performance of
>>> the MySQL sysbench workload more than doubles on my quad-core system.
>>
>>
>> OK, I've run some tests on a 16 core Opteron system, both sysbench with
>> MySQL 5.33 (set up as described in the freebsd vs linux page), and with
>> ebizzy.
>>
>> What I found is that, on this system, MADV_FREE performance improvement
>> was in the noise when you look at it on top of the MADV_DONTNEED glibc
>> and down_read(mmap_sem) patch in sysbench.
>
>
> Interesting, very different results from my system.
>
> First, did you run with the properly TLB batched version of
> the MADV_FREE patch? And did you make sure that MADV_FREE
> takes the mmap_sem for reading? Without that, I did see
> a similar thing to what you saw...
Yes and yes (I initially forgot to add MADV_FREE to the down_read
case and saw horrible performance!)
> Secondly, I'll have to try some test runs on one of the larger
> systems in the lab.
>
> Maybe the results from my quad core Intel system are not
> typical; maybe the results from your 16 core Opteron are
> not typical. Either way, I want to find out :)
Yep. We might have something like that here, and I'll try with
some other architectures as well next week, if I can get glibc
built.
--
SUSE Labs, Novell Inc.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 23:47 ` Nick Piggin
@ 2007-05-05 0:10 ` Ulrich Drepper
2007-05-06 22:43 ` Rik van Riel
2007-05-08 3:51 ` [PATCH] stub MADV_FREE implementation Rik van Riel
2 siblings, 0 replies; 22+ messages in thread
From: Ulrich Drepper @ 2007-05-05 0:10 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
[-- Attachment #1: Type: text/plain, Size: 362 bytes --]
Nick Piggin wrote:
> I literally have about 4 or 5 new page flags I'd like to add today :) I
> can't of course, because we have very few spare ones left.
I remember Rik saying that if need be he can (try to?) think of a method
to implement it without a page flag.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 23:47 ` Nick Piggin
2007-05-05 0:10 ` Ulrich Drepper
@ 2007-05-06 22:43 ` Rik van Riel
2007-05-07 2:42 ` Ulrich Drepper
2007-05-08 6:12 ` Nick Piggin
2007-05-08 3:51 ` [PATCH] stub MADV_FREE implementation Rik van Riel
2 siblings, 2 replies; 22+ messages in thread
From: Rik van Riel @ 2007-05-06 22:43 UTC (permalink / raw)
To: Nick Piggin
Cc: Ulrich Drepper, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Nick Piggin wrote:
> OK, sure. I think we need more numbers though.
Thinking about the issue some more, I think I know just the
number we might want to know.
It is pretty obvious that the kernel needs to do less work
with the MADV_FREE code present. However, it is possible
that userspace needs to do more work, by accessing pages
that are not in the CPU cache, or in another CPU's cache.
In the test cases where you see similar performance on the
workload with and without the MADV_FREE code, are you by any
chance seeing lower system time and higher user time?
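The user/system split asked about here can be measured from inside a test program with getrusage(). A minimal sketch, assuming a single-process workload; `struct cpu_split`, `measure_cpu_split` and the `touch_pages` sample workload are hypothetical names, not anything taken from sysbench or ebizzy.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/time.h>

struct cpu_split {
    double user_s;   /* CPU seconds spent in userspace */
    double sys_s;    /* CPU seconds spent in the kernel */
};

static double tv_to_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run work(arg) and report how much user and system time it consumed. */
static int measure_cpu_split(void (*work)(void *), void *arg,
                             struct cpu_split *out)
{
    struct rusage before, after;

    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    work(arg);
    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;

    out->user_s = tv_to_seconds(after.ru_utime) - tv_to_seconds(before.ru_utime);
    out->sys_s = tv_to_seconds(after.ru_stime) - tv_to_seconds(before.ru_stime);
    return 0;
}

/* Sample workload: map, touch and unmap *arg pages' worth of memory,
 * which costs page faults (system time) and stores (user time). */
static void touch_pages(void *arg)
{
    size_t len = *(size_t *)arg * 4096;   /* assumes 4 KiB pages */
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t off;

    if (p == MAP_FAILED)
        return;
    for (off = 0; off < len; off += 4096)
        p[off] = 1;
    munmap((void *)p, len);
}
```

Comparing the two fields across kernels, rather than only the wall-clock total, would show whether work has merely moved from the kernel to userspace.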
I think that maybe for 2.6.22 we should just alias MADV_FREE
to run with the MADV_DONTNEED functionality, so that the glibc
people can make the change on their side while we figure out
what will be the best thing to do on the kernel side.
I'll send in a patch that does that once Linus has committed
your most recent flood of patches. What do you think?
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-06 22:43 ` Rik van Riel
@ 2007-05-07 2:42 ` Ulrich Drepper
2007-05-07 4:56 ` Rik van Riel
2007-05-08 6:12 ` Nick Piggin
1 sibling, 1 reply; 22+ messages in thread
From: Ulrich Drepper @ 2007-05-07 2:42 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Rik van Riel wrote:
> I think that maybe for 2.6.22 we should just alias MADV_FREE
> to run with the MADV_DONTNEED functionality, so that the glibc
> people can make the change on their side while we figure out
> what will be the best thing to do on the kernel side.
No need for that. We can later extend glibc to use MADV_FREE and fall
back on MADV_DONTNEED.
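Such a fallback could be done at run time, since a kernel that does not know MADV_FREE is expected to reject the call with EINVAL. A minimal sketch under that assumption; `release_pages` and `lazy_free_supported` are hypothetical names, and this is not glibc's actual code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() on glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Optimistic until the running kernel rejects MADV_FREE. */
static int lazy_free_supported = 1;

/* Release [addr, addr + len): lazily if possible, eagerly otherwise.
 * Returns 0 on success, -1 with errno set if even MADV_DONTNEED fails. */
static int release_pages(void *addr, size_t len)
{
#ifdef MADV_FREE
    if (lazy_free_supported) {
        if (madvise(addr, len, MADV_FREE) == 0)
            return 0;
        if (errno == EINVAL)
            lazy_free_supported = 0;   /* do not retry on later calls */
    }
#endif
    return madvise(addr, len, MADV_DONTNEED);
}
```

Remembering the EINVAL result keeps old kernels from paying for a failed system call on every free.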
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-07 4:56 ` Rik van Riel
@ 2007-05-07 4:53 ` Ulrich Drepper
2007-05-07 16:51 ` Rik van Riel
0 siblings, 1 reply; 22+ messages in thread
From: Ulrich Drepper @ 2007-05-07 4:53 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Rik van Riel wrote:
> It's trivial to merge the MADV_FREE #defines into the kernel
> though, and aliasing MADV_FREE to MADV_DONTNEED for the time
> being is a one-liner - just an extra constant into the big
> switch statement in sys_madvise().
Until the semantics of the implementation is cut into stone by having it
in the kernel I'll not start using it.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-07 2:42 ` Ulrich Drepper
@ 2007-05-07 4:56 ` Rik van Riel
2007-05-07 4:53 ` Ulrich Drepper
0 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2007-05-07 4:56 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Nick Piggin, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Ulrich Drepper wrote:
> Rik van Riel wrote:
>> I think that maybe for 2.6.22 we should just alias MADV_FREE
>> to run with the MADV_DONTNEED functionality, so that the glibc
>> people can make the change on their side while we figure out
>> what will be the best thing to do on the kernel side.
>
> No need for that. We can later extend glibc to use MADV_FREE and fall
> back on MADV_DONTNEED.
It's trivial to merge the MADV_FREE #defines into the kernel
though, and aliasing MADV_FREE to MADV_DONTNEED for the time
being is a one-liner - just an extra constant into the big
switch statement in sys_madvise().
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-07 4:53 ` Ulrich Drepper
@ 2007-05-07 16:51 ` Rik van Riel
0 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2007-05-07 16:51 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Nick Piggin, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Ulrich Drepper wrote:
> Rik van Riel wrote:
>> It's trivial to merge the MADV_FREE #defines into the kernel
>> though, and aliasing MADV_FREE to MADV_DONTNEED for the time
>> being is a one-liner - just an extra constant into the big
>> switch statement in sys_madvise().
>
> Until the semantics of the implementation is cut into stone by having it
> in the kernel I'll not start using it.
The current MADV_DONTNEED implementation conforms to the
semantics of MADV_FREE :)
With MADV_FREE you can get back either your old data, or
a freshly zeroed out new page. Always getting back the
second alternative is conformant :)
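That contract can be written down as a check a test program might run right after the madvise call: every page reads back either as the old data or as zeroes. A minimal sketch, assuming 4 KiB pages and a non-zero fill byte; `classify_page` and `contract_holds` are hypothetical names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SZ 4096u   /* assumption: 4 KiB pages */

enum page_state { PAGE_ZEROED, PAGE_OLD_DATA, PAGE_MIXED };

/* Classify one page as all zeroes, all `fill`, or anything else. */
static enum page_state classify_page(const unsigned char *page,
                                     unsigned char fill)
{
    size_t i, zeroes = 0, old = 0;

    for (i = 0; i < PAGE_SZ; i++) {
        zeroes += (page[i] == 0);
        old += (page[i] == fill);
    }
    if (zeroes == PAGE_SZ)
        return PAGE_ZEROED;
    if (old == PAGE_SZ)
        return PAGE_OLD_DATA;
    return PAGE_MIXED;
}

/* Returns 1 if every page is old data or zeroes after the hint,
 * 0 if some page is neither, -1 if the mapping could not be set up. */
static int contract_holds(size_t npages, unsigned char fill)
{
    size_t len = npages * PAGE_SZ, i;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ok = 1;

    if (p == MAP_FAILED)
        return -1;
    memset(p, fill, len);

#ifdef MADV_FREE
    if (madvise(p, len, MADV_FREE) != 0)
#endif
        (void)madvise(p, len, MADV_DONTNEED);   /* eager variant */

    for (i = 0; i < npages; i++) {
        const unsigned char *page = p + i * PAGE_SZ;

        /* A page may be discarded while it is being scanned, so a
         * mixed result is rechecked once before it counts as a failure. */
        if (classify_page(page, fill) == PAGE_MIXED &&
            classify_page(page, fill) == PAGE_MIXED)
            ok = 0;
    }
    munmap(p, len);
    return ok;
}
```

An implementation that always zeroes, as MADV_DONTNEED does for private anonymous memory, passes this check just as a lazy one does.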
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* [PATCH] stub MADV_FREE implementation
2007-05-04 23:47 ` Nick Piggin
2007-05-05 0:10 ` Ulrich Drepper
2007-05-06 22:43 ` Rik van Riel
@ 2007-05-08 3:51 ` Rik van Riel
2007-05-08 23:05 ` Andrew Morton
2 siblings, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2007-05-08 3:51 UTC (permalink / raw)
To: Nick Piggin
Cc: Ulrich Drepper, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek, Dave Jones
[-- Attachment #1: Type: text/plain, Size: 2344 bytes --]
Until we have better performance numbers on the lazy reclaim path,
we can just alias MADV_FREE to MADV_DONTNEED with this trivial
patch.
This way glibc can go ahead with the optimization on their side
and we can figure out the kernel side later.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
When I get back from the Red Hat Summit (Saturday), I will run more
performance numbers with and without the lazy reclaiming of pages.
Nick Piggin wrote:
> Ulrich Drepper wrote:
>> Nick Piggin wrote:
>>
>>> What I found is that, on this system, MADV_FREE performance improvement
>>> was in the noise when you look at it on top of the MADV_DONTNEED glibc
>>> and down_read(mmap_sem) patch in sysbench.
>>
>>
>> I don't want to judge the numbers since I cannot but I want to make an
>> observation: even if in the SMP case MADV_FREE turns out to not be a
>> bigger boost then there is still the UP case to keep in mind where Rik
>> measured a significant speed-up. As long as the SMP case isn't hurt
>> this is reason enough to use the patch. With more and more cores on one
>> processor SMP systems are pushed evermore to the high-end side. You'll
>> find many installations which today use SMP will be happy enough with
>> many-core UP machines.
>
> OK, sure. I think we need more numbers though.
>
> And even if this was a patch with _no_ possibility for regressions and it
> was a completely trivial one that improves performance in some cases...
> one big problem is that it uses another page flag.
>
> I literally have about 4 or 5 new page flags I'd like to add today :) I
> can't of course, because we have very few spare ones left.
>
> From the MySQL numbers on this system, it seems like performance is in the
> noise, and MADV_DONTNEED makes the _vast_ majority of the improvement.
> This is also the case with Rik's benchmarks, and while he did see some
> improvement, I found the runs to be quite variable, so it would be ideal
> to get a larger sample.
>
> And the fact that the poor behaviour of the old style malloc/free went
> unnoticed for so long indicates that it won't be the end of the world if
> we didn't merge MADV_FREE right now.
>
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
[-- Attachment #2: stub-madv_free --]
[-- Type: text/plain, Size: 4878 bytes --]
include/asm-alpha/mman.h | 1 +
include/asm-generic/mman.h | 1 +
include/asm-mips/mman.h | 1 +
include/asm-parisc/mman.h | 1 +
include/asm-sparc/mman.h | 2 --
include/asm-sparc64/mman.h | 2 --
include/asm-xtensa/mman.h | 1 +
mm/madvise.c | 2 ++
8 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/asm-alpha/mman.h b/include/asm-alpha/mman.h
index 90d7c35..d47b5a3 100644
--- a/include/asm-alpha/mman.h
+++ b/include/asm-alpha/mman.h
@@ -42,6 +42,7 @@ #define MADV_SEQUENTIAL 2 /* expect seq
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* don't need the pages or the data */
/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 5e3dde2..34a9ff1 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -29,6 +29,7 @@ #define MADV_RANDOM 1 /* expect random
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/include/asm-mips/mman.h b/include/asm-mips/mman.h
index e4d6f1f..68067ff 100644
--- a/include/asm-mips/mman.h
+++ b/include/asm-mips/mman.h
@@ -65,6 +65,7 @@ #define MADV_RANDOM 1 /* expect random
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/include/asm-parisc/mman.h b/include/asm-parisc/mman.h
index defe752..347fbca 100644
--- a/include/asm-parisc/mman.h
+++ b/include/asm-parisc/mman.h
@@ -38,6 +38,7 @@ #define MADV_DONTNEED 4
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* don't need the pages or the data */
/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/include/asm-sparc/mman.h b/include/asm-sparc/mman.h
index b7dc40b..5ec7106 100644
--- a/include/asm-sparc/mman.h
+++ b/include/asm-sparc/mman.h
@@ -33,8 +33,6 @@ #define MC_UNLOCK 3 /* Unlock pag
#define MC_LOCKAS 5 /* Lock an entire address space of the calling process */
#define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */
-#define MADV_FREE 0x5 /* (Solaris) contents can be freed */
-
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
#define arch_mmap_check sparc_mmap_check
diff --git a/include/asm-sparc64/mman.h b/include/asm-sparc64/mman.h
index 8cc1860..03b05d5 100644
--- a/include/asm-sparc64/mman.h
+++ b/include/asm-sparc64/mman.h
@@ -33,8 +33,6 @@ #define MC_UNLOCK 3 /* Unlock pag
#define MC_LOCKAS 5 /* Lock an entire address space of the calling process */
#define MC_UNLOCKAS 6 /* Unlock entire address space of calling process */
-#define MADV_FREE 0x5 /* (Solaris) contents can be freed */
-
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
#define arch_mmap_check sparc64_mmap_check
diff --git a/include/asm-xtensa/mman.h b/include/asm-xtensa/mman.h
index 9b92620..1345703 100644
--- a/include/asm-xtensa/mman.h
+++ b/include/asm-xtensa/mman.h
@@ -72,6 +72,7 @@ #define MADV_RANDOM 1 /* expect random
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* don't need the pages or the data */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index e75096b..ad067f2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -22,6 +22,7 @@ static int madvise_need_mmap_write(int b
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -234,6 +235,7 @@ madvise_vma(struct vm_area_struct *vma,
break;
case MADV_DONTNEED:
+ case MADV_FREE:
error = madvise_dontneed(vma, prev, start, end);
break;
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-06 22:43 ` Rik van Riel
2007-05-07 2:42 ` Ulrich Drepper
@ 2007-05-08 6:12 ` Nick Piggin
2007-05-08 14:59 ` Rik van Riel
2007-05-08 18:35 ` Jakub Jelinek
1 sibling, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2007-05-08 6:12 UTC (permalink / raw)
To: Rik van Riel
Cc: Ulrich Drepper, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Rik van Riel wrote:
> Nick Piggin wrote:
>
>> OK, sure. I think we need more numbers though.
>
>
> Thinking about the issue some more, I think I know just the
> number we might want to know.
>
> It is pretty obvious that the kernel needs to do less work
> with the MADV_FREE code present. However, it is possible
> that userspace needs to do more work, by accessing pages
> that are not in the CPU cache, or in another CPU's cache.
>
> In the test cases where you see similar performance on the
> workload with and without the MADV_FREE code, are you by any
> chance seeing lower system time and higher user time?
I didn't actually check system and user times for the mysql
benchmark, but that's exactly what I had in mind when I
mentioned the poor cache behaviour this patch could cause. I
definitely did see user times go up in benchmarks where I
measured.
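The split between system and user time discussed here can be read from inside a benchmark process. The following is a minimal sketch, assuming a POSIX system with getrusage(); the struct cpu_split type and both helper names are invented for illustration and are not part of sysbench or the patch.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical container for the two CPU time components, in seconds. */
struct cpu_split {
    double user_s;
    double sys_s;
};

/* Snapshot the calling process's user and system CPU time.
 * Returns 0 on success, -1 if getrusage() fails. */
static int cpu_split_now(struct cpu_split *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->user_s = (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    out->sys_s  = (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return 0;
}

/* CPU time consumed between two snapshots (after minus before). */
static struct cpu_split cpu_split_delta(struct cpu_split before,
                                        struct cpu_split after)
{
    struct cpu_split d;

    d.user_s = after.user_s - before.user_s;
    d.sys_s  = after.sys_s - before.sys_s;
    return d;
}
```

Taking one snapshot before and one after the workload, on kernels with and without the change, would show whether time moved from the system column to the user column even when the total stayed the same.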
We have percpu and cache affine page allocators, so when
userspace just frees a page, it is likely to be cache hot, so
we want to free it up so it can be reused by this CPU ASAP.
Likewise, when we newly allocate a page, we want it to be one
that is cache hot on this CPU.
> I think that maybe for 2.6.22 we should just alias MADV_FREE
> to run with the MADV_DONTNEED functionality, so that the glibc
> people can make the change on their side while we figure out
> what will be the best thing to do on the kernel side.
>
> I'll send in a patch that does that once Linus has committed
> your most recent flood of patches. What do you think?
I'll let you and Ulrich decide on that. Keep in mind that older
kernels (without the mmap_sem patch for MADV_DONTNEED) still
seem to get a pretty decent improvement from using MADV_DONTNEED,
so it is possible glibc will want to start using that anyway.
--
SUSE Labs, Novell Inc.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-08 6:12 ` Nick Piggin
@ 2007-05-08 14:59 ` Rik van Riel
2007-05-08 23:23 ` Nick Piggin
2007-05-08 18:35 ` Jakub Jelinek
1 sibling, 1 reply; 22+ messages in thread
From: Rik van Riel @ 2007-05-08 14:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Ulrich Drepper, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Nick Piggin wrote:
> We have percpu and cache affine page allocators, so when
> userspace just frees a page, it is likely to be cache hot, so
> we want to free it up so it can be reused by this CPU ASAP.
> Likewise, when we newly allocate a page, we want it to be one
> that is cache hot on this CPU.
Actually, isn't the clear page function capable of doing
some magic, when it writes all zeroes into the page, that
causes the zeroes to just live in CPU cache without the old
data ever being loaded from RAM?
That would sure be faster than touching RAM. Not sure if
we use/trigger that kind of magic, though :)
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-08 6:12 ` Nick Piggin
2007-05-08 14:59 ` Rik van Riel
@ 2007-05-08 18:35 ` Jakub Jelinek
2007-05-08 23:43 ` Nick Piggin
1 sibling, 1 reply; 22+ messages in thread
From: Jakub Jelinek @ 2007-05-08 18:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Ulrich Drepper, Linus Torvalds, linux-kernel,
linux-mm, Andrew Morton
On Tue, May 08, 2007 at 04:12:00PM +1000, Nick Piggin wrote:
> I didn't actually check system and user times for the mysql
> benchmark, but that's exactly what I had in mind when I
> mentioned the poor cache behaviour this patch could cause. I
> definitely did see user times go up in benchmarks where I
> measured.
>
> We have percpu and cache affine page allocators, so when
> userspace just frees a page, it is likely to be cache hot, so
> we want to free it up so it can be reused by this CPU ASAP.
> Likewise, when we newly allocate a page, we want it to be one
> that is cache hot on this CPU.
malloc has per-thread arenas, so when using MADV_FREE the pages
should be local to the thread as well (unless the thread has since
moved to a different CPU) and in the case of sysbench should
be cache hot as well (it is reused RSN). With MADV_DONTNEED you need to
clear the pages, while that is not necessary with MADV_FREE.
Jakub
* Re: [PATCH] stub MADV_FREE implementation
2007-05-08 3:51 ` [PATCH] stub MADV_FREE implementation Rik van Riel
@ 2007-05-08 23:05 ` Andrew Morton
2007-05-09 17:15 ` Ulrich Drepper
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2007-05-08 23:05 UTC (permalink / raw)
To: Rik van Riel
Cc: Nick Piggin, Ulrich Drepper, Linus Torvalds, linux-kernel,
linux-mm, Jakub Jelinek, Dave Jones
On Mon, 07 May 2007 23:51:47 -0400
Rik van Riel <riel@redhat.com> wrote:
> Until we have better performance numbers on the lazy reclaim path,
> we can just alias MADV_FREE to MADV_DONTNEED with this trivial
> patch.
>
> This way glibc can go ahead with the optimization on their side
> and we can figure out the kernel side later.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Could someone please explain what is going on here?
And has Ulrich indicated that glibc would indeed go out ahead of
the kernel in this fashion?
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-08 14:59 ` Rik van Riel
@ 2007-05-08 23:23 ` Nick Piggin
0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2007-05-08 23:23 UTC (permalink / raw)
To: Rik van Riel
Cc: Ulrich Drepper, Linus Torvalds, linux-kernel, linux-mm,
Andrew Morton, Jakub Jelinek
Rik van Riel wrote:
> Nick Piggin wrote:
>
>> We have percpu and cache affine page allocators, so when
>> userspace just frees a page, it is likely to be cache hot, so
>> we want to free it up so it can be reused by this CPU ASAP.
>> Likewise, when we newly allocate a page, we want it to be one
>> that is cache hot on this CPU.
>
>
> Actually, isn't the clear page function capable of doing
> some magic, when it writes all zeroes into the page, that
> causes the zeroes to just live in CPU cache without the old
> data ever being loaded from RAM?
>
> That would sure be faster than touching RAM. Not sure if
> we use/trigger that kind of magic, though :)
>
powerpc has and uses an instruction to zero a full cacheline, yes.
Not sure about x86-64 CPUs... I don't think they can do it.
--
SUSE Labs, Novell Inc.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-08 18:35 ` Jakub Jelinek
@ 2007-05-08 23:43 ` Nick Piggin
0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2007-05-08 23:43 UTC (permalink / raw)
To: Jakub Jelinek
Cc: Rik van Riel, Ulrich Drepper, Linus Torvalds, linux-kernel,
linux-mm, Andrew Morton
Jakub Jelinek wrote:
> On Tue, May 08, 2007 at 04:12:00PM +1000, Nick Piggin wrote:
>
>>I didn't actually check system and user times for the mysql
>>benchmark, but that's exactly what I had in mind when I
>>mentioned the poor cache behaviour this patch could cause. I
>>definitely did see user times go up in benchmarks where I
>>measured.
>>
>>We have percpu and cache affine page allocators, so when
>>userspace just frees a page, it is likely to be cache hot, so
>>we want to free it up so it can be reused by this CPU ASAP.
>>Likewise, when we newly allocate a page, we want it to be one
>>that is cache hot on this CPU.
>
>
> malloc has per-thread arenas, so when using MADV_FREE the pages
> should be local to the thread as well (unless the thread has since
> moved to a different CPU) and in the case of sysbench should
> be cache hot as well (it is reused RSN).
Right, but the kernel also wants to use cache hot pages for other
things, and it also frees back its own cache hot pages into the
allocator.
The fact that sysbench is a good candidate for this but does not
show any improvements is telling... if the workload does not reuse
the page RSN, or if it is reclaiming them, we could actually see
regressions.
> With MADV_DONTNEED you need to
> clear the pages while that is not necessary with MADV_FREE.
With MADV_FREE, you don't need to zero the memory, but the page
is uninitialised. So you need to initialise it *somehow* (ie. use
either a zeroing alloc, or initialise it with application specific
data). At that point, you have to touch the cachelines anyway, so
the extra zeroing is going to cost very little (and you can see
that single threaded performance isn't improved).
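The point about initialisation can be sketched from the allocator's side. This is only an illustration under stated assumptions: the enum and the function reuse_zeroed are hypothetical, the chunk is assumed to be whole pages of private anonymous memory, and the release is assumed to have covered the entire chunk. It shows where the zeroing cost ends up for a zeroing allocation in each case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record of how a chunk was handed back to the kernel. */
enum release_kind {
    RELEASED_DONTNEED,  /* kernel supplies zeroed pages on next touch */
    RELEASED_FREE       /* contents are old data or zeroes, unknown   */
};

/* Prepare a previously released chunk for a zeroing allocation.
 * After MADV_DONTNEED the kernel has already done the clearing at
 * fault time, so userspace can skip it. After MADV_FREE the contents
 * are indeterminate, so userspace has to clear the chunk itself. */
static void *reuse_zeroed(void *chunk, size_t len, enum release_kind how)
{
    if (how == RELEASED_FREE)
        memset(chunk, 0, len);
    return chunk;
}
```

Either way every cacheline of the chunk is written once before use; the difference is only whether the kernel or userspace does the writing.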
--
SUSE Labs, Novell Inc.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 16:04 ` Ulrich Drepper
2007-05-04 23:47 ` Nick Piggin
@ 2007-05-09 16:38 ` Hugh Dickins
1 sibling, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2007-05-09 16:38 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Nick Piggin, Rik van Riel, Linus Torvalds, linux-kernel,
linux-mm, Andrew Morton, Jakub Jelinek
On Fri, 4 May 2007, Ulrich Drepper wrote:
>
> I don't want to judge the numbers since I cannot but I want to make an
> observation: even if in the SMP case MADV_FREE turns out to not be a
> bigger boost then there is still the UP case to keep in mind where Rik
> measured a significant speed-up. As long as the SMP case isn't hurt
> this is reason enough to use the patch. With more and more cores on one
> processor SMP systems are pushed evermore to the high-end side. You'll
> find many installations which today use SMP will be happy enough with
> many-core UP machines.
Just remembered this mail from a few days ago, and how puzzled I'd been
by your last sentence or two: I seem to be reading it in the wrong way,
and don't understand why users of SMP kernels will be moving to UP?
UP in the sense of one processor but many cores? But that still needs
an SMP kernel to use all those cores. Or you're thinking of growing
virtualization? Would you please explain further?
Thanks,
Hugh
* Re: [PATCH] stub MADV_FREE implementation
2007-05-08 23:05 ` Andrew Morton
@ 2007-05-09 17:15 ` Ulrich Drepper
0 siblings, 0 replies; 22+ messages in thread
From: Ulrich Drepper @ 2007-05-09 17:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Nick Piggin, Ulrich Drepper, Linus Torvalds,
linux-kernel, linux-mm, Jakub Jelinek, Dave Jones
On 5/8/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> And has Ulrich indicated that glibc would indeed go out ahead of
> the kernel in this fashion?
Rik is concerned to get a glibc version which allows him to test the
improvements. That's really not a big problem. We already have a
patch for this and can provide appropriate RPMs easily.
I don't want to set a precedent for adding glibc support for phantom
features. So, I would not add support to the official glibc anyway
until there is a fixed implementation which then also means a fixed
ABI. So, Andrew, applying the patch won't do any good.
* Re: [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory
2007-05-04 10:53 ` Nick Piggin
2007-05-04 11:58 ` Rik van Riel
2007-05-04 16:04 ` Ulrich Drepper
@ 2007-05-29 16:59 ` Rik van Riel
2 siblings, 0 replies; 22+ messages in thread
From: Rik van Riel @ 2007-05-29 16:59 UTC (permalink / raw)
To: Nick Piggin
Cc: Linus Torvalds, linux-kernel, linux-mm, Andrew Morton,
Ulrich Drepper, Jakub Jelinek
Nick Piggin wrote:
> Rik van Riel wrote:
>> With lazy freeing of anonymous pages through MADV_FREE, performance of
>> the MySQL sysbench workload more than doubles on my quad-core system.
>
> OK, I've run some tests on a 16 core Opteron system, both sysbench with
> MySQL 5.33 (set up as described in the freebsd vs linux page), and with
> ebizzy.
>
> What I found is that, on this system, MADV_FREE performance improvement
> was in the noise when you look at it on top of the MADV_DONTNEED glibc
> and down_read(mmap_sem) patch in sysbench.
It turns out that setting the pte accessed bit in hardware
can apparently take a few thousand CPU cycles - 3000 cycles
is the number I've heard for one CPU family.
This is a similar number of cycles as is needed to zero out
a page. Giving a cache hot page to userspace could cancel
out the rest of the cost of the page fault handling.
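One side of this comparison, the cost of zeroing a page, can be estimated with a rough userspace sketch. The assumptions are significant: a 4096-byte page, a cache-hot buffer, the libc memset rather than the kernel's own page-clearing routine, and wall-clock nanoseconds that still need multiplying by the clock frequency to get cycles. The function name ns_per_page_clear is made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { PAGE_BYTES = 4096 };       /* assumed page size */

/* Average wall-clock nanoseconds to zero one page-sized buffer.
 * Returns a negative value on error or if iters is 0. */
static double ns_per_page_clear(unsigned iters)
{
    static unsigned char page[PAGE_BYTES];
    /* Call memset through a volatile pointer so the compiler cannot
     * inline the call or drop the repeated clears. */
    void *(*volatile clear)(void *, int, size_t) = memset;
    struct timespec t0, t1;

    if (iters == 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    for (unsigned i = 0; i < iters; i++)
        clear(page, 0, sizeof page);
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

The result is only an order-of-magnitude guide. It says nothing about the cost of the hardware setting the accessed bit, which would need a separate measurement.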
Let's stick with the simpler MADV_DONTNEED code for now and
save the page flag for something else...
--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
end of thread, other threads:[~2007-05-29 16:59 UTC | newest]
Thread overview: 22+ messages
2007-04-28 4:43 [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory Rik van Riel
2007-05-04 10:53 ` Nick Piggin
2007-05-04 11:58 ` Rik van Riel
2007-05-04 23:49 ` Nick Piggin
2007-05-04 16:04 ` Ulrich Drepper
2007-05-04 23:47 ` Nick Piggin
2007-05-05 0:10 ` Ulrich Drepper
2007-05-06 22:43 ` Rik van Riel
2007-05-07 2:42 ` Ulrich Drepper
2007-05-07 4:56 ` Rik van Riel
2007-05-07 4:53 ` Ulrich Drepper
2007-05-07 16:51 ` Rik van Riel
2007-05-08 6:12 ` Nick Piggin
2007-05-08 14:59 ` Rik van Riel
2007-05-08 23:23 ` Nick Piggin
2007-05-08 18:35 ` Jakub Jelinek
2007-05-08 23:43 ` Nick Piggin
2007-05-08 3:51 ` [PATCH] stub MADV_FREE implementation Rik van Riel
2007-05-08 23:05 ` Andrew Morton
2007-05-09 17:15 ` Ulrich Drepper
2007-05-09 16:38 ` [PATCH] MM: implement MADV_FREE lazy freeing of anonymous memory Hugh Dickins
2007-05-29 16:59 ` Rik van Riel