LKML Archive on lore.kernel.org help / color / mirror / Atom feed
* [patch 0/6] [RFC] MMU Notifiers V3 @ 2008-01-30 2:29 Christoph Lameter 2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter ` (5 more replies) 0 siblings, 6 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel. The known immediate users are KVM (establishes a refcount to the page. External references called spte) GRU (simple TLB shootdown without refcount. Has its own pagetable/tlb) XPmem (uses its own reverse mappings and refcount. Remote ptes, Needs to sleep when sending messages) Issues: - Feedback from uses of the callbacks for KVM, RDMA, XPmem and GRU Early tests with the GRU were successful. - Pages may be freed before the external mapping are torn down through invalidate_range() if no refcount on the page is taken. There is the chance that page content may be visible after they have been reallocated (mainly an issue for the GRU that takes no refcount). - invalidate_range() callbacks are sometimes called under i_mmap_lock. These need to be dealt with or XPmem needs to be able to work around these. - filemap_xip.c does not follow conventions for Rmap callbacks. We could depends on XIP support not being active to avoid the issue. Things that we leave as is: - RCU quiescent periods are required on registering and unregistering notifiers to guarantee visibility to other processors. Currently only mmu_notifier_release() does the correct thing. It is up to the user to provide RCU quiescent periods for register/unregister functions if they are called outside of the ->release method. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() V2->V3: - Further RCU fixes - Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing. -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 1/6] mmu_notifier: Core code 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter @ 2008-01-30 2:29 ` Christoph Lameter 2008-01-30 15:37 ` Andrea Arcangeli 2008-01-30 18:02 ` Robin Holt 2008-01-30 2:29 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter ` (4 subsequent siblings) 5 siblings, 2 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_core --] [-- Type: text/plain, Size: 15337 bytes --] Core code for mmu notifiers. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> --- include/linux/list.h | 14 ++ include/linux/mm_types.h | 6 + include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++ include/linux/page-flags.h | 10 ++ kernel/fork.c | 2 mm/Kconfig | 4 mm/Makefile | 1 mm/mmap.c | 2 mm/mmu_notifier.c | 101 ++++++++++++++++++++ 9 files changed, 350 insertions(+) Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-01-29 16:56:36.000000000 -0800 @@ -153,6 +153,10 @@ struct vm_area_struct { #endif }; +struct mmu_notifier_head { + struct hlist_head head; +}; + struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; @@ -219,6 +223,8 @@ struct mm_struct { /* aio bits */ rwlock_t ioctx_list_lock; struct kioctx *ioctx_list; + + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-29 16:56:36.000000000 -0800 @@ -0,0 +1,210 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +/* + * MMU motifier + * + * Notifier functions for hardware and software that establishes external + * references to pages of a Linux system. The notifier calls ensure that + * the external mappings are removed when the Linux VM removes memory ranges + * or individual pages from a process. + * + * These fall into two classes + * + * 1. mmu_notifier + * + * These are callbacks registered with an mm_struct. If mappings are + * removed from an address space then callbacks are performed. + * Spinlocks must be held in order to the walk reverse maps and the + * notifications are performed while the spinlock is held. + * + * + * 2. mmu_rmap_notifier + * + * Callbacks for subsystems that provide their own rmaps. These + * need to walk their own rmaps for a page. The invalidate_page + * callback is outside of locks so that we are not in a strictly + * atomic context (but we may be in a PF_MEMALLOC context if the + * notifier is called from reclaim code) and are able to sleep. + * Rmap notifiers need an extra page bit and are only available + * on 64 bit platforms. It is up to the subsystem to mark pags + * as PageExternalRmap as needed to trigger the callbacks. Pages + * must be marked dirty if dirty bits are set in the external + * pte. + */ + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/rcupdate.h> +#include <linux/mm_types.h> + +struct mmu_notifier_ops; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +struct mmu_notifier_ops { + /* + * Note: The mmu_notifier structure must be released with + * call_rcu() since other processors are only guaranteed to + * see the changes after a quiescent period. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + int (*age_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * lock indicates that the function is called under spinlock. + */ + void (*invalidate_range)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end, + int lock); +}; + +struct mmu_rmap_notifier_ops; + +struct mmu_rmap_notifier { + struct hlist_node hlist; + const struct mmu_rmap_notifier_ops *ops; +}; + +struct mmu_rmap_notifier_ops { + /* + * Called with the page lock held after ptes are modified or removed + * so that a subsystem with its own rmap's can remove remote ptes + * mapping a page. + */ + void (*invalidate_page)(struct mmu_rmap_notifier *mrn, + struct page *page); +}; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * Must hold the mmap_sem for write. + * + * RCU is used to traverse the list. A quiescent period needs to pass + * before the notifier is guaranteed to be visible to all threads + */ +extern void __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +/* Will acquire mmap_sem for write*/ +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +/* + * Will acquire mmap_sem for write. + * + * A quiescent period needs to pass before the mmu_notifier structure + * can be released. mmu_notifier_release() will wait for a quiescent period + * after calling the ->release callback. So it is safe to call + * mmu_notifier_unregister from the ->release function. + */ +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); + + +extern void mmu_notifier_release(struct mm_struct *mm); +extern int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address); + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +{ + INIT_HLIST_HEAD(&mnh->head); +} + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mn, __n, \ + &(mm)->mmu_notifier.head, \ + hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, \ + mm, \ + args); \ + rcu_read_unlock(); \ + } \ + } while (0) + +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn); +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn); + +extern struct hlist_head mmu_rmap_notifier_list; + +#define mmu_rmap_notifier(function, args...) \ + do { \ + struct mmu_rmap_notifier *__mrn; \ + struct hlist_node *__n; \ + \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mrn, __n, \ + &mmu_rmap_notifier_list, \ + hlist) \ + if (__mrn->ops->function) \ + __mrn->ops->function(__mrn, args); \ + rcu_read_unlock(); \ + } while (0); + +#else /* CONFIG_MMU_NOTIFIER */ + +/* + * Notifiers that use the parameters that they were passed so that the + * compiler does not complain about unused variables but does proper + * parameter checks even if !CONFIG_MMU_NOTIFIER. + * Macros generate no code. + */ +#define mmu_notifier(function, mm, args...) \ + do { \ + if (0) { \ + struct mmu_notifier *__mn; \ + \ + __mn = (struct mmu_notifier *)(0x00ff); \ + __mn->ops->function(__mn, mm, args); \ + }; \ + } while (0) + +#define mmu_rmap_notifier(function, args...) \ + do { \ + if (0) { \ + struct mmu_rmap_notifier *__mrn; \ + \ + __mrn = (struct mmu_rmap_notifier *)(0x00ff); \ + __mrn->ops->function(__mrn, args); \ + } \ + } while (0); + +static inline void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_release(struct mm_struct *mm) {} +static inline int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} + +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) + {} +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) + {} + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/include/linux/page-flags.h 2008-01-29 16:56:36.000000000 -0800 @@ -105,6 +105,7 @@ * 64 bit | FIELDS | ?????? FLAGS | * 63 32 0 */ +#define PG_external_rmap 30 /* Page has external rmap */ #define PG_uncached 31 /* Page has been mapped as uncached */ #endif @@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags) +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT) +#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags) +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags) +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \ + &(page)->flags) +#else +#define PageExternalRmap(page) 0 +#endif + struct page; /* forward declaration */ extern void cancel_dirty_page(struct page *page, unsigned int account_size); Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/mm/Kconfig 2008-01-29 16:56:36.000000000 -0800 @@ -193,3 +193,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" Index: linux-2.6/mm/Makefile =================================================================== --- linux-2.6.orig/mm/Makefile 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/mm/Makefile 2008-01-29 16:56:36.000000000 -0800 @@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/mm/mmu_notifier.c 2008-01-29 16:57:26.000000000 -0800 @@ -0,0 +1,101 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <clameter@sgi.com> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> + +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *t; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_safe_rcu(mn, n, t, + &mm->mmu_notifier.head, hlist) { + hlist_del_rcu(&mn->hlist); + if (mn->ops->release) + mn->ops->release(mn, mm); + } + rcu_read_unlock(); + synchronize_rcu(); + } +} + +/* + * If no young bitflag is supported by the hardware, ->age_page can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->age_page) + young |= mn->ops->age_page(mn, mm, address); + } + rcu_read_unlock(); + } + + return young; +} + +/* + * Note that all notifiers use RCU. The updates are only guaranteed to be + * visible to other processes after a RCU quiescent period! + */ +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + down_write(&mm->mmap_sem); + __mmu_notifier_register(mn, mm); + up_write(&mm->mmap_sem); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + down_write(&mm->mmap_sem); + hlist_del_rcu(&mn->hlist); + up_write(&mm->mmap_sem); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); + +static DEFINE_SPINLOCK(mmu_notifier_list_lock); +HLIST_HEAD(mmu_rmap_notifier_list); + +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) +{ + spin_lock(&mmu_notifier_list_lock); + hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list); + spin_unlock(&mmu_notifier_list_lock); +} +EXPORT_SYMBOL(mmu_rmap_notifier_register); + +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) +{ + spin_lock(&mmu_notifier_list_lock); + hlist_del_rcu(&mrn->hlist); + spin_unlock(&mmu_notifier_list_lock); +} +EXPORT_SYMBOL(mmu_rmap_notifier_unregister); + Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/kernel/fork.c 2008-01-29 16:56:36.000000000 -0800 @@ -52,6 +52,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_head_init(&mm->mmu_notifier); return mm; } free_mm(mm); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-01-29 16:56:36.000000000 -0800 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + mmu_notifier_release(mm); /* * Walk the list again, actually closing and freeing it, Index: linux-2.6/include/linux/list.h =================================================================== --- linux-2.6.orig/include/linux/list.h 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/include/linux/list.h 2008-01-29 16:56:36.000000000 -0800 @@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ pos = pos->next) +/** + * hlist_for_each_entry_safe_rcu - iterate over list of given type + * @tpos: the type * to use as a loop cursor. + * @pos: the &struct hlist_node to use as a loop cursor. + * @n: temporary pointer + * @head: the head for your list. + * @member: the name of the hlist_node within the struct. + */ +#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \ + for (pos = (head)->first; \ + rcu_dereference(pos) && ({ n = pos->next; 1;}) && \ + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ + pos = n) + #else #warning "don't include kernel headers in userspace" #endif /* __KERNEL__ */ -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter @ 2008-01-30 15:37 ` Andrea Arcangeli 2008-01-30 15:53 ` Jack Steiner 2008-01-30 17:10 ` Peter Zijlstra 2008-01-30 18:02 ` Robin Holt 1 sibling, 2 replies; 48+ messages in thread From: Andrea Arcangeli @ 2008-01-30 15:37 UTC (permalink / raw) To: Christoph Lameter Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote: > +void mmu_notifier_release(struct mm_struct *mm) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n, *t; > + > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > + rcu_read_lock(); > + hlist_for_each_entry_safe_rcu(mn, n, t, > + &mm->mmu_notifier.head, hlist) { > + hlist_del_rcu(&mn->hlist); This will race and kernel crash against mmu_notifier_register in SMP. You should resurrect the per-mmu_notifier_head lock in my last patch (except it can be converted from a rwlock_t to a regular spinlock_t) and drop the mmap_sem from mmu_notifier_register/unregister. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 15:37 ` Andrea Arcangeli @ 2008-01-30 15:53 ` Jack Steiner 2008-01-30 16:38 ` Andrea Arcangeli 2008-01-30 19:19 ` Christoph Lameter 2008-01-30 17:10 ` Peter Zijlstra 1 sibling, 2 replies; 48+ messages in thread From: Jack Steiner @ 2008-01-30 15:53 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, Jan 30, 2008 at 04:37:49PM +0100, Andrea Arcangeli wrote: > On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote: > > +void mmu_notifier_release(struct mm_struct *mm) > > +{ > > + struct mmu_notifier *mn; > > + struct hlist_node *n, *t; > > + > > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > > + rcu_read_lock(); > > + hlist_for_each_entry_safe_rcu(mn, n, t, > > + &mm->mmu_notifier.head, hlist) { > > + hlist_del_rcu(&mn->hlist); > > This will race and kernel crash against mmu_notifier_register in > SMP. You should resurrect the per-mmu_notifier_head lock in my last > patch (except it can be converted from a rwlock_t to a regular > spinlock_t) and drop the mmap_sem from > mmu_notifier_register/unregister. Agree. That will also resolve the problem we discussed yesterday. I want to unregister my mmu_notifier when a GRU segment is unmapped. This would not necessarily be at task termination. However, the mmap_sem is already held for write by the core VM at the point I would call the unregister function. Currently, there is no __mmu_notifier_unregister() defined. Moving to a different lock solves the problem. -- jack ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 15:53 ` Jack Steiner @ 2008-01-30 16:38 ` Andrea Arcangeli 2008-01-30 19:19 ` Christoph Lameter 1 sibling, 0 replies; 48+ messages in thread From: Andrea Arcangeli @ 2008-01-30 16:38 UTC (permalink / raw) To: Jack Steiner Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, Jan 30, 2008 at 09:53:06AM -0600, Jack Steiner wrote: > That will also resolve the problem we discussed yesterday. > I want to unregister my mmu_notifier when a GRU segment is > unmapped. This would not necessarily be at task termination. My proof that there is something wrong in the smp locking of the current code is very simple: it can't be right to use hlist_for_each_entry_safe_rcu and rcu_read_lock inside mmu_notifier_release, and then to call hlist_del_rcu without any spinlock or semaphore. If we walk the list with hlist_for_each_entry_safe_rcu (and not with hlist_for_each_entry_safe), it means the list _can_ change from under us, and in turn the hlist_del_rcu must be surrounded by a spinlock or sempahore too! If by design the list _can't_ change from under us and calling hlist_del_rcu was safe w/o locks, then hlist_for_each_entry_safe is _sure_ enough for mmu_notifier_release, and rcu_read_lock most certainly can be removed too. To make an usage case where the race could trigger, I was thinking at somebody bumping the mm_count (not mm_users) and registering a notifier while mmu_notifier_release runs and relaying on ->release to know if it has to run mmu_notifier_unregister. However I now started wondering how it can relay on ->release to know that if ->release is called after hlist_del_rcu because with the latest changes ->release will also allow the mn to release itself ;). It's unsafe to call list_del_rcu twice (the second will crash on a poisoned entry). This starts to make me think we should remove the auto-disarming feature and require the notifier-user to have the ->release call mmu_notifier_unregister first and to free the "mn" inside ->release too if needed. Or alternatively the notifier-user can bump mm_count and to call a mmu_notifier_unregister before calling mmdrop (like kvm could do). Another approach is to simply define mmu_notifier_release as implicitly serialized by other code design, with a real lock (not rcu) against the whole register/unregister operations. So to guarantee the notifier list can't change from under us while mmu_notifier_release runs. If we go this route, yes, the auto-disarming hlist_del can be kept, the current code would have been safe, but to avoid confusion the mmu_notifier_release shall become this: void mmu_notifier_release(struct mm_struct *mm) { struct mmu_notifier *mn; struct hlist_node *n, *t; if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { hlist_for_each_entry_safe(mn, n, t, &mm->mmu_notifier.head, hlist) { hlist_del(&mn->hlist); if (mn->ops->release) mn->ops->release(mn, mm); } } } > However, the mmap_sem is already held for write by the core > VM at the point I would call the unregister function. > Currently, there is no __mmu_notifier_unregister() defined. > > Moving to a different lock solves the problem. Unless the mmu_notifier_release becomes like above and we rely on the user of the mmu notifiers to implement a highlevel external lock that will we definitely forbid to bump the mm_count of the mm, and to call register/unregister while mmu_notifier_release could run, 1) moving to a different lock and 2) removing the auto-disarming hlist_del_rcu from mmu_notifier_release sounds the only possible smp safe way. As far as KVM is concerned mmu_notifier_released could be changed to the version I written above and everything should be ok. For KVM the mm_count bump is done by the task that also holds a mm_user, so when exit_mmap runs I don't think the list could possible change anymore. Anyway those are details that can be perfected after mainline merging, so this isn't something to worry about too much right now. My idea is to keep working to perfect it while I hope progress is being made by Christoph to merge the mmu notifiers V3 patchset in mainline ;). ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 15:53 ` Jack Steiner 2008-01-30 16:38 ` Andrea Arcangeli @ 2008-01-30 19:19 ` Christoph Lameter 2008-01-30 22:20 ` Robin Holt 1 sibling, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 19:19 UTC (permalink / raw) To: Jack Steiner Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, 30 Jan 2008, Jack Steiner wrote: > Moving to a different lock solves the problem. Well it gets us back to the issue why we removed the lock. As Robin said before: If its global then we can have a huge number of tasks contending for the lock on startup of a process with a large number of ranks. The reason to go to mmap_sem was that it was placed in the mm_struct and so we would just have a couple of contentions per mm_struct. I'll be looking for some other way to do this. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 19:19 ` Christoph Lameter @ 2008-01-30 22:20 ` Robin Holt 2008-01-30 23:38 ` Andrea Arcangeli 0 siblings, 1 reply; 48+ messages in thread From: Robin Holt @ 2008-01-30 22:20 UTC (permalink / raw) To: Christoph Lameter Cc: Jack Steiner, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote: > On Wed, 30 Jan 2008, Jack Steiner wrote: > > > Moving to a different lock solves the problem. > > Well it gets us back to the issue why we removed the lock. As Robin said > before: If its global then we can have a huge number of tasks contending > for the lock on startup of a process with a large number of ranks. The > reason to go to mmap_sem was that it was placed in the mm_struct and so we > would just have a couple of contentions per mm_struct. > > I'll be looking for some other way to do this. I think Andrea's original concept of the lock in the mmu_notifier_head structure was the best. I agree with him that it should be a spinlock instead of the rw_lock. Thanks, Robin ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 22:20 ` Robin Holt @ 2008-01-30 23:38 ` Andrea Arcangeli 2008-01-30 23:55 ` Christoph Lameter 0 siblings, 1 reply; 48+ messages in thread From: Andrea Arcangeli @ 2008-01-30 23:38 UTC (permalink / raw) To: Robin Holt Cc: Christoph Lameter, Jack Steiner, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, Jan 30, 2008 at 04:20:35PM -0600, Robin Holt wrote: > On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote: > > On Wed, 30 Jan 2008, Jack Steiner wrote: > > > > > Moving to a different lock solves the problem. > > > > Well it gets us back to the issue why we removed the lock. As Robin said > > before: If its global then we can have a huge number of tasks contending > > for the lock on startup of a process with a large number of ranks. The > > reason to go to mmap_sem was that it was placed in the mm_struct and so we > > would just have a couple of contentions per mm_struct. > > > > I'll be looking for some other way to do this. > > I think Andrea's original concept of the lock in the mmu_notifier_head > structure was the best. I agree with him that it should be a spinlock > instead of the rw_lock. BTW, I don't see the scalability concern with huge number of tasks: the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction; up_write(mm->mmap_sem) is always going to scale worse than spin_lock(mm->somethingelse); oneinstruction; spin_unlock(mm->somethinglese). Furthermore if we go this route and we don't relay on implicit serialization of all the mmu notifier users against exit_mmap (i.e. the mmu notifier user must agree to stop calling mmu_notifier_register on a mm after the last mmput) the autodisarming feature will likely have to be removed or it can't possibly be safe to run mmu_notifier_unregister while mmu_notifier_release runs. With the auto-disarming feature, there is no way to safely know if mmu_notifier_unregister has to be called or not. I'm ok with removing the auto-disarming feature and to have as self-contained-as-possible locking. Then mmu_notifier_release can just become the invalidate_all_after and invalidate_all, invalidate_all_before. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 23:38 ` Andrea Arcangeli @ 2008-01-30 23:55 ` Christoph Lameter 2008-01-31 0:12 ` [kvm-devel] " Andrea Arcangeli 0 siblings, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 23:55 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Jack Steiner, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > > I think Andrea's original concept of the lock in the mmu_notifier_head > > structure was the best. I agree with him that it should be a spinlock > > instead of the rw_lock. > > BTW, I don't see the scalability concern with huge number of tasks: > the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction; > up_write(mm->mmap_sem) is always going to scale worse than > spin_lock(mm->somethingelse); oneinstruction; > spin_unlock(mm->somethinglese). If we put it elsewhere in the mm then we increase the size of the memory used in the mm_struct. > Furthermore if we go this route and we don't relay on implicit > serialization of all the mmu notifier users against exit_mmap > (i.e. the mmu notifier user must agree to stop calling > mmu_notifier_register on a mm after the last mmput) the autodisarming > feature will likely have to be removed or it can't possibly be safe to > run mmu_notifier_unregister while mmu_notifier_release runs. With the > auto-disarming feature, there is no way to safely know if > mmu_notifier_unregister has to be called or not. I'm ok with removing > the auto-disarming feature and to have as self-contained-as-possible > locking. Then mmu_notifier_release can just become the > invalidate_all_after and invalidate_all, invalidate_all_before. Hmmmm.. exit_mmap is only called when the last reference is removed against the mm right? So no tasks are running anymore. No pages are left. Do we need to serialize at all for mmu_notifier_release? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code 2008-01-30 23:55 ` Christoph Lameter @ 2008-01-31 0:12 ` Andrea Arcangeli 2008-01-31 1:27 ` Christoph Lameter 0 siblings, 1 reply; 48+ messages in thread From: Andrea Arcangeli @ 2008-01-31 0:12 UTC (permalink / raw) To: Christoph Lameter Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt, Jack Steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman, Robin Holt, Hugh Dickins On Wed, Jan 30, 2008 at 03:55:37PM -0800, Christoph Lameter wrote: > On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > > > > I think Andrea's original concept of the lock in the mmu_notifier_head > > > structure was the best. I agree with him that it should be a spinlock > > > instead of the rw_lock. > > > > BTW, I don't see the scalability concern with huge number of tasks: > > the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction; > > up_write(mm->mmap_sem) is always going to scale worse than > > spin_lock(mm->somethingelse); oneinstruction; > > spin_unlock(mm->somethinglese). > > If we put it elsewhere in the mm then we increase the size of the memory > used in the mm_struct. Yes, and it will increase of the same amount of RAM that you pretend everyone to pay even if MMU_NOTIFIER=n after your patch is applied (vs mine that generated 0 ram utilization increase when MMU_NOTIFIER=n). And the additional ram will provide not just self-contained locking but higher scalability too. I think it's much more important to generate zero ram and CPU overhead for the embedded (this is something I was very careful to enforce in all my patches), than to reduce scalability and not having a self contained locking on full configurations with MMU_NOTIFIER=y. > Hmmmm.. exit_mmap is only called when the last reference is removed > against the mm right? So no tasks are running anymore. No pages are left. > Do we need to serialize at all for mmu_notifier_release? KVM sure doesn't need any locking there. I thought somebody had to possibly take a pin on the "mm_count" and pretend to call mmu_notifier_register at will until mmdrop was finally called, in a out of order fashion given mmu_notifier_release was implemented like if the list could change from under it. Note mmdrop != mmput. mmput and in turn mm_users is the serialization point if you prefer to drop all locking from _release. Nobody must ever attempt a mmu_notifier_* after calling mmput for that mm. That should be enough to be safe. I'm fine either ways... ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [kvm-devel] [patch 1/6] mmu_notifier: Core code 2008-01-31 0:12 ` [kvm-devel] " Andrea Arcangeli @ 2008-01-31 1:27 ` Christoph Lameter 0 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-31 1:27 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt, Jack Steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman, Robin Holt, Hugh Dickins On Thu, 31 Jan 2008, Andrea Arcangeli wrote: > > Hmmmm.. exit_mmap is only called when the last reference is removed > > against the mm right? So no tasks are running anymore. No pages are left. > > Do we need to serialize at all for mmu_notifier_release? > > KVM sure doesn't need any locking there. I thought somebody had to > possibly take a pin on the "mm_count" and pretend to call > mmu_notifier_register at will until mmdrop was finally called, in a > out of order fashion given mmu_notifier_release was implemented like > if the list could change from under it. Note mmdrop != mmput. mmput > and in turn mm_users is the serialization point if you prefer to drop > all locking from _release. Nobody must ever attempt a mmu_notifier_* > after calling mmput for that mm. That should be enough to be > safe. I'm fine either ways... exit_mmap (where we call invalidate_all() and release()) is called when mm_users == 0: void mmput(struct mm_struct *mm) { might_sleep(); if (atomic_dec_and_test(&mm->mm_users)) { exit_aio(mm); exit_mmap(mm); if (!list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); list_del(&mm->mmlist); spin_unlock(&mmlist_lock); } put_swap_token(mm); mmdrop(mm); } } EXPORT_SYMBOL_GPL(mmput); So there is only a single thread executing at the time when invalidate_all() is called from exit_mmap(). Then we drop the pages, and the page tables. After the page tables we call the ->release method and then remove the vmas. So even dropping off the mmu_notifier chain in invalidate_all() could be done without an issue and without locking. Trouble is if other callbacks attempt the same. Do we need to support the removal from the mmu_notifier list in invalidate_range()? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 15:37 ` Andrea Arcangeli 2008-01-30 15:53 ` Jack Steiner @ 2008-01-30 17:10 ` Peter Zijlstra 2008-01-30 19:28 ` Christoph Lameter 1 sibling, 1 reply; 48+ messages in thread From: Peter Zijlstra @ 2008-01-30 17:10 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, 2008-01-30 at 16:37 +0100, Andrea Arcangeli wrote: > On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote: > > +void mmu_notifier_release(struct mm_struct *mm) > > +{ > > + struct mmu_notifier *mn; > > + struct hlist_node *n, *t; > > + > > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > > + rcu_read_lock(); > > + hlist_for_each_entry_safe_rcu(mn, n, t, > > + &mm->mmu_notifier.head, hlist) { > > + hlist_del_rcu(&mn->hlist); > > This will race and kernel crash against mmu_notifier_register in > SMP. You should resurrect the per-mmu_notifier_head lock in my last > patch (except it can be converted from a rwlock_t to a regular > spinlock_t) and drop the mmap_sem from > mmu_notifier_register/unregister. Agreed, sorry for this oversight. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 17:10 ` Peter Zijlstra @ 2008-01-30 19:28 ` Christoph Lameter 0 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 19:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins How about just taking the mmap_sem writelock in release? We have only a single caller of mmu_notifier_release() in mm/mmap.c and we know that we are not holding mmap_sem at that point. So just acquire it when needed? Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c 2008-01-30 11:21:57.000000000 -0800 +++ linux-2.6/mm/mmu_notifier.c 2008-01-30 11:24:59.000000000 -0800 @@ -18,6 +19,7 @@ void mmu_notifier_release(struct mm_stru struct hlist_node *n, *t; if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + down_write(&mm->mmap_sem); rcu_read_lock(); hlist_for_each_entry_safe_rcu(mn, n, t, &mm->mmu_notifier.head, hlist) { @@ -26,6 +28,7 @@ void mmu_notifier_release(struct mm_stru mn->ops->release(mn, mm); } rcu_read_unlock(); + up_write(&mm->mmap_sem); synchronize_rcu(); } } ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-30 15:37 ` Andrea Arcangeli @ 2008-01-30 18:02 ` Robin Holt 2008-01-30 19:08 ` Christoph Lameter 2008-01-30 19:14 ` Christoph Lameter 1 sibling, 2 replies; 48+ messages in thread From: Robin Holt @ 2008-01-30 18:02 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins Back to one of Andrea's points from a couple days ago, I think we still have a problem with the PageExternalRmap page flag. If I had two drivers with external rmap implementations, there is no way I can think of for a simple flag to coordinate a single page being exported and maintained by the two. Since the intended use seems to point in the direction of the external rmap must be maintained consistent with the all pages the driver has exported and the driver will already need to handle cases where the page does not appear in its rmap, I would propose the setting and clearing should be handled in the mmu_notifier code. This is the first of two patches. This one is intended as an addition to patch 1/6. I will post the other shortly under the patch 3/6 thread. Index: git-linus/include/linux/mmu_notifier.h =================================================================== --- git-linus.orig/include/linux/mmu_notifier.h 2008-01-30 11:43:45.000000000 -0600 +++ git-linus/include/linux/mmu_notifier.h 2008-01-30 11:44:35.000000000 -0600 @@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn); extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn); +extern void mmu_rmap_export_page(struct page *page); extern struct hlist_head mmu_rmap_notifier_list; Index: git-linus/mm/mmu_notifier.c =================================================================== --- git-linus.orig/mm/mmu_notifier.c 2008-01-30 11:43:45.000000000 -0600 +++ git-linus/mm/mmu_notifier.c 2008-01-30 11:56:08.000000000 -0600 @@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct } EXPORT_SYMBOL(mmu_rmap_notifier_unregister); +void mmu_rmap_export_page(struct page *page) +{ + SetPageExternalRmap(page); +} +EXPORT_SYMBOL(mmu_rmap_export_page); ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 18:02 ` Robin Holt @ 2008-01-30 19:08 ` Christoph Lameter 2008-01-30 19:14 ` Christoph Lameter 1 sibling, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 19:08 UTC (permalink / raw) To: Robin Holt Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Wed, 30 Jan 2008, Robin Holt wrote: > Index: git-linus/mm/mmu_notifier.c > =================================================================== > --- git-linus.orig/mm/mmu_notifier.c 2008-01-30 11:43:45.000000000 -0600 > +++ git-linus/mm/mmu_notifier.c 2008-01-30 11:56:08.000000000 -0600 > @@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct > } > EXPORT_SYMBOL(mmu_rmap_notifier_unregister); > > +void mmu_rmap_export_page(struct page *page) > +{ > + SetPageExternalRmap(page); > +} > +EXPORT_SYMBOL(mmu_rmap_export_page); Then mmu_rmap_export_page would have to be called before the subsystem establishes the rmap entry for the page. Could we do all PageExternalRmap modifications under Pagelock? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-30 18:02 ` Robin Holt 2008-01-30 19:08 ` Christoph Lameter @ 2008-01-30 19:14 ` Christoph Lameter 1 sibling, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 19:14 UTC (permalink / raw) To: Robin Holt Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins Ok. So I added the following patch: --- include/linux/mmu_notifier.h | 1 + mm/mmu_notifier.c | 12 ++++++++++++ 2 files changed, 13 insertions(+) Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 11:09:06.000000000 -0800 +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 11:10:38.000000000 -0800 @@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn); extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn); +extern void mmu_rmap_export_page(struct page *page); extern struct hlist_head mmu_rmap_notifier_list; Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c 2008-01-30 11:09:01.000000000 -0800 +++ linux-2.6/mm/mmu_notifier.c 2008-01-30 11:12:10.000000000 -0800 @@ -99,3 +99,15 @@ void mmu_rmap_notifier_unregister(struct } EXPORT_SYMBOL(mmu_rmap_notifier_unregister); +/* + * Export a page. + * + * Pagelock must be held. + * Must be called before a page is put on an external rmap. + */ +void mmu_rmap_export_page(struct page *page) +{ + BUG_ON(!PageLocked(page)); + SetPageExternalRmap(page); +} +EXPORT_SYMBOL(mmu_rmap_export_page); ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter 2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter @ 2008-01-30 2:29 ` Christoph Lameter 2008-01-30 2:29 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter ` (3 subsequent siblings) 5 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_invalidate_range_callbacks --] [-- Type: text/plain, Size: 4641 bytes --] The invalidation of address ranges in a mm_struct needs to be performed when pages are removed or permissions etc change. Most of the VM address space changes can use the range invalidate callback. invalidate_range() is generally called with mmap_sem held but no spinlocks are active. If invalidate_range() is called with locks held then we pass a flag into invalidate_range() Comments state that mmap_sem must be held for remap_pfn_range() but various drivers do not seem to do this. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/fremap.c | 2 ++ mm/hugetlb.c | 2 ++ mm/memory.c | 11 +++++++++-- mm/mmap.c | 1 + 4 files changed, 14 insertions(+), 2 deletions(-) Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/mm/fremap.c 2008-01-29 16:59:24.000000000 -0800 @@ -15,6 +15,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/mmu_context.h> #include <asm/cacheflush.h> @@ -212,6 +213,7 @@ asmlinkage long sys_remap_file_pages(uns } err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier(invalidate_range, mm, start, start + size, 0); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/mm/memory.c 2008-01-29 16:59:24.000000000 -0800 @@ -50,6 +50,7 @@ #include <linux/delayacct.h> #include <linux/init.h> #include <linux/writeback.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range, mm, address, end, + (details ? (details->i_mmap_lock != NULL) : 0)); return end; } @@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc { pgd_t *pgd; unsigned long next; - unsigned long end = addr + PAGE_ALIGN(size); + unsigned long start = addr, end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; @@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, start, end, 0); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); @@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, start, end, 0); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1669,6 +1674,8 @@ gotten: page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + mmu_notifier(invalidate_range, mm, address, + address + PAGE_SIZE - 1, 0); if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-01-29 16:56:36.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-01-29 16:58:15.000000000 -0800 @@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mmu_notifier(invalidate_range, mm, start, end, 0); } /* Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-01-29 16:56:33.000000000 -0800 +++ linux-2.6/mm/hugetlb.c 2008-01-29 16:58:15.000000000 -0800 @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier(invalidate_range, mm, start, end, 1); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter 2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-30 2:29 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter @ 2008-01-30 2:29 ` Christoph Lameter 2008-01-30 18:03 ` Robin Holt 2008-01-30 2:29 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter ` (2 subsequent siblings) 5 siblings, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_invalidate_page_rmap_callbacks --] [-- Type: text/plain, Size: 1416 bytes --] Callbacks to remove individual pages if the subsystem has an rmap capability. The pagelock is held but no spinlocks are held. The refcount of the page is elevated so that dropping the refcount in the subsystem will not directly free the page. The callbacks occur after the Linux rmaps have been walked. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/rmap.c | 6 ++++++ 1 file changed, 6 insertions(+) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-01-25 14:24:19.000000000 -0800 +++ linux-2.6/mm/rmap.c 2008-01-25 14:24:38.000000000 -0800 @@ -49,6 +49,7 @@ #include <linux/rcupdate.h> #include <linux/module.h> #include <linux/kallsyms.h> +#include <linux/mmu_notifier.h> #include <asm/tlbflush.h> @@ -473,6 +474,8 @@ int page_mkclean(struct page *page) struct address_space *mapping = page_mapping(page); if (mapping) { ret = page_mkclean_file(mapping, page); + if (unlikely(PageExternalRmap(page))) + mmu_rmap_notifier(invalidate_page, page); if (page_test_dirty(page)) { page_clear_dirty(page); ret = 1; @@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + if (unlikely(PageExternalRmap(page))) + mmu_rmap_notifier(invalidate_page, page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap 2008-01-30 2:29 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter @ 2008-01-30 18:03 ` Robin Holt 0 siblings, 0 replies; 48+ messages in thread From: Robin Holt @ 2008-01-30 18:03 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins This is the second part of a patch posted to patch 1/6. Index: git-linus/mm/rmap.c =================================================================== --- git-linus.orig/mm/rmap.c 2008-01-30 11:55:56.000000000 -0600 +++ git-linus/mm/rmap.c 2008-01-30 12:01:28.000000000 -0600 @@ -476,8 +476,10 @@ int page_mkclean(struct page *page) struct address_space *mapping = page_mapping(page); if (mapping) { ret = page_mkclean_file(mapping, page); - if (unlikely(PageExternalRmap(page))) + if (unlikely(PageExternalRmap(page))) { mmu_rmap_notifier(invalidate_page, page); + ClearPageExported(page); + } if (page_test_dirty(page)) { page_clear_dirty(page); ret = 1; @@ -980,8 +982,10 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); - if (unlikely(PageExternalRmap(page))) + if (unlikely(PageExternalRmap(page))) { mmu_rmap_notifier(invalidate_page, page); + ClearPageExported(page); + } if (!page_mapped(page)) ret = SWAP_SUCCESS; ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter ` (2 preceding siblings ...) 2008-01-30 2:29 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter @ 2008-01-30 2:29 ` Christoph Lameter 2008-01-30 2:29 ` [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c Christoph Lameter 2008-01-30 2:29 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter 5 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_invalidate_page_callbacks --] [-- Type: text/plain, Size: 2465 bytes --] These notifiers here use the Linux rmaps to perform the callbacks. In order to walk the rmaps locks must be held. Callbacks can therefore only operate in an atomic context. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/rmap.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-01-29 16:58:25.000000000 -0800 +++ linux-2.6/mm/rmap.c 2008-01-29 16:58:39.000000000 -0800 @@ -285,7 +285,8 @@ static int page_referenced_one(struct pa if (!pte) goto out; - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)) referenced++; /* Pretend the page is referenced if the task has the @@ -435,6 +436,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -680,7 +682,8 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)))) { ret = SWAP_FAIL; goto out_unmap; } @@ -688,6 +691,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -812,12 +816,14 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte) | + mmu_notifier_age_page(mm, address)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter ` (3 preceding siblings ...) 2008-01-30 2:29 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter @ 2008-01-30 2:29 ` Christoph Lameter 2008-01-30 2:29 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter 5 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_xip --] [-- Type: text/plain, Size: 1242 bytes --] Problem for external rmaps: There is no pagelock held on the page. Signed-off-by: Robin Holt <holt@sgi.com> --- mm/filemap_xip.c | 5 +++++ 1 file changed, 5 insertions(+) Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-01-25 19:39:04.000000000 -0800 +++ linux-2.6/mm/filemap_xip.c 2008-01-25 19:39:06.000000000 -0800 @@ -13,6 +13,7 @@ #include <linux/module.h> #include <linux/uio.h> #include <linux/rmap.h> +#include <linux/mmu_notifier.h> #include <linux/sched.h> #include <asm/tlbflush.h> @@ -183,6 +184,9 @@ __xip_unmap (struct address_space * mapp if (!page) return; + if (PageExternalRmap(page)) + mmu_rmap_notifier(invalidate_page, page); + spin_lock(&mapping->i_mmap_lock); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; @@ -194,6 +198,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mmu_notifier(invalidate_page, mm, address); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 6/6] mmu_notifier: Add invalidate_all() 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter ` (4 preceding siblings ...) 2008-01-30 2:29 ` [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c Christoph Lameter @ 2008-01-30 2:29 ` Christoph Lameter 5 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_all --] [-- Type: text/plain, Size: 1542 bytes --] when a task exits we can remove all external pts at once. At that point the extern mmu may also unregister itself from the mmu notifier chain to avoid future calls. Note the complications because of RCU. Other processors may not see that the notifier was unlinked until a quiescent period has passed! Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/mmu_notifier.h | 4 ++++ mm/mmap.c | 1 + 2 files changed, 5 insertions(+) Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-28 14:02:18.000000000 -0800 +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-28 14:15:49.000000000 -0800 @@ -62,6 +62,10 @@ struct mmu_notifier_ops { struct mm_struct *mm, unsigned long address); + /* Dummy needed because the mmu_notifier() macro requires it */ + void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm, + int dummy); + /* * lock indicates that the function is called under spinlock. */ Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-01-28 14:15:49.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-01-28 14:15:49.000000000 -0800 @@ -2034,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm) unsigned long end; /* mm's last user has gone, and its about to be pulled down */ + mmu_notifier(invalidate_all, mm, 0); arch_exit_mmap(mm); lru_add_drain(); -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 0/6] MMU Notifiers V7 @ 2008-02-15 6:48 Christoph Lameter 2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 0 siblings, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-02-15 6:48 UTC (permalink / raw) To: akpm Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel (an external MMU). MMU notifiers allow us to get rid of the page pinning for RDMA and various other purposes. It gets rid of the broken use of mlock for page pinning and avoids having to lock pages by increasing the refcount. (mlock really does *not* pin pages....) More information on the rationale and the technical details can be found in the first patch and the README provided by that patch in Documentation/mmu_notifiers. The known immediate users are KVM - Establishes a refcount to the page via get_user_pages(). - External references are called spte. - Has page tables to track pages whose refcount was elevated but no reverse maps. GRU - Simple additional hardware TLB (possibly covering multiple instances of Linux) - Needs TLB shootdown when the VM unmaps pages. - Determines page address via follow_page (from interrupt context) but can fall back to get_user_pages(). - No page reference possible since no page status is kept.. XPmem - Allows use of a processes memory by remote instances of Linux. - Provides its own reverse mappings to track remote pte. - Established refcounts on the exported pages. - Must sleep in order to wait for remote acks of ptes that are being cleared. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() V2->V3: - Further RCU fixes - Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing. V3->V4: - Drop locking and synchronize_rcu() on ->release since we know on release that we are the only executing thread. This is also true for invalidate_all() so we could drop off the mmu_notifier there early. Use hlist_del_init instead of hlist_del_rcu. - Do the invalidation as begin/end pairs with the requirement that the driver holds off new references in between. - Fixup filemap_xip.c - Figure out a potential way in which XPmem can deal with locks that are held. - Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit. - Strip cc list down a bit. - Drop Peters new rcu list macro - Add description to the core patch V4->V5: - Provide missing callouts for mremap. - Provide missing callouts for copy_page_range. - Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out structure contents. - Get rid of the invalidate_all() callback by moving ->release in place of invalidate_all. - Require holding mmap_sem on register/unregister instead of acquiring it ourselves. In some contexts where we want to register/unregister we are already holding mmap_sem. - Split out the rmap support patch so that there is no need to apply all patches for KVM and GRU. V5->V6: - Provide missing range callouts for mprotect - Fix do_wp_page control path sequencing - Clarify locking conventions - GRU and XPmem confirmed to work with this patchset. - Provide skeleton code for GRU/KVM type callback and for XPmem type. - Rework documentation and put it into Documentation/mmu_notifier. V6->V7: - Code our own page table traversal in the skeletons so that we can perform the insertion of a remote pte under pte lock. - Discuss page pinning by increasing page refcount -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 1/6] mmu_notifier: Core code 2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter @ 2008-02-15 6:49 ` Christoph Lameter 2008-02-16 3:37 ` Andrew Morton 2008-02-18 22:33 ` Roland Dreier 0 siblings, 2 replies; 48+ messages in thread From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw) To: akpm Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman [-- Attachment #1: mmu_core --] [-- Type: text/plain, Size: 19064 bytes --] MMU notifiers are used for hardware and software that establishes external references to pages managed by the Linux kernel. These are page table entriews or tlb entries or something else that allows hardware (such as DMA engines, scatter gather devices, networking, sharing of address spaces across operating system boundaries) and software (Virtualization solutions such as KVM, Xen etc) to access memory managed by the Linux kernel. The MMU notifier will notify the device driver that subscribes to such a notifier that the VM is going to do something with the memory mapped by that device. The device must then drop references for the indicated memory area. The references may be reestablished later. The notification scheme is much better than the current schemes of avoiding the danger of the VM removing pages that are externally mapped. We currently either mlock pages used for RDMA, XPmem etc in memory or increase the refcount to pin the pages. Increasing the refcount makes it impossible for the VM to reclaim the page. Mlock causes problems with reclaim and may lead to OOM if too many pages are pinned in memory. It is also incorrect in terms what the POSIX specificies for what role mlock should play. Mlock does *not* pin pages in memory. Mlock just means do not allow the page to be moved to swap. Linux can move pages in memory (for example through the page migration mechanism). These pages can be moved even if they are mlocked(!!!!). The current approach of page pinning in use by RDMA etc is conceptually broken but there are currently no other easy solutions. The alternate of increasing the page count to pin pages is also not that enticing since there will be continual attempts to reclaim or migrate these pages. The solution here allows us to finally fix this issue by requiring such devices to subscribe to a notification chain that will allow them to work without pinning. The VM gains control of its memory again and the memory that has external references can be managed like regular memory. This patch: Core portion Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> --- Documentation/mmu_notifier/README | 105 ++++++++++++++++++++++ include/linux/mm_types.h | 7 + include/linux/mmu_notifier.h | 180 ++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 2 mm/Kconfig | 4 mm/Makefile | 1 mm/mmap.c | 2 mm/mmu_notifier.c | 76 ++++++++++++++++ 8 files changed, 377 insertions(+) Index: linux-2.6/Documentation/mmu_notifier/README =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/Documentation/mmu_notifier/README 2008-02-14 22:27:19.000000000 -0800 @@ -0,0 +1,105 @@ +Linux MMU Notifiers +------------------- + +MMU notifiers are used for hardware and software that establishes +external references to pages managed by the Linux kernel. These are +page table entriews or tlb entries or something else that allows +hardware (such as DMA engines, scatter gather devices, networking, +sharing of address spaces across operating system boundaries) and +software (Virtualization solutions such as KVM, Xen etc) to +access memory managed by the Linux kernel. + +The MMU notifier will notify the device driver that subscribes to such +a notifier that the VM is going to do something with the memory +mapped by that device. The device must then drop references for the +indicated memory area. The references may be reestablished later. + +The notification scheme is much better than the current schemes of +dealing with the danger of the VM removing pages. +We currently mlock pages used for RDMA, XPmem etc in memory or +increase the refcount of the pages. + +Both cause problems with reclaim and may lead to OOM if too many +pages are pinned in memory. Mlock is also incorrect in terms of the POSIX +specification of the role of mlock. Mlock does *not* pin pages in +memory. It just does not allow the page to be moved to swap. +The page refcount is used to track current users of a page struct. +Artificially inflating the refcount means that the VM cannot track +down all references to a page. It will not be able to reclaim or +move a page. However, the core code will try again and again because +the assumption is that an elevated refcount is a temporary situation. + +Linux can move pages in memory (for example through the page migration +mechanism). These pages can be moved even if they are mlocked(!!!!). +So the current approach in use by RDMA etc etc is conceptually broken +but there are currently no other easy solutions. + +The solution here allows us to finally fix this issue by requiring +such devices to subscribe to a notification chain that will allow +them to work without pinning. + +The notifier chains provide two callback mechanisms. The +first one is required for any device that establishes external mappings. +The second (rmap) mechanism is required if a device needs to be +able to sleep when invalidating references. Sleeping may be necessary +if we are mapping across a network or to different Linux instances +in the same address space. + +mmu_notifier mechanism (for KVM/GRU etc) +---------------------------------------- +Callbacks are registered with an mm_struct from a device driver using +mmu_notifier_register(). When the VM removes pages (or changes +permissions on pages etc) then callbacks are triggered. + +The invalidation function for a single page (*invalidate_page) +is called with spinlocks (in particular the pte lock) held. This allow +for an easy implementation of external ptes that are on the local system. + +The invalidation mechanism for a range (*invalidate_range_begin/end*) is +called most of the time without any locks held. It is only called with +locks held for file backed mappings that are truncated. A flag indicates +in which mode we are. A driver can use that mechanism to f.e. +delay the freeing of the pages during truncate until no locks are held. + +Pages must be marked dirty if dirty bits are found to be set in +the external ptes during unmap. + +The *release* method is called when a Linux process exits. It is run before +the pages and mappings of a process are torn down and gives the device driver +a chance to zap all the external mappings in one go. + +An example for a code that can be used to build a notifier mechanism into +a device driver can be found in the file +Documentation/mmu_notifier/skeleton.c + +mmu_rmap_notifier mechanism (XPMEM etc) +--------------------------------------- +The mmu_rmap_notifier allows the device driver to implement their own rmap +and allows the device driver to sleep during page eviction. This is necessary +for complex drivers that f.e. allow the sharing of memory between processes +running on different Linux instances (typically over a network or in a +partitioned NUMA system). + +The mmu_rmap_notifier adds another invalidate_page() callout that is called +*before* the Linux rmaps are walked. At that point only the page lock is +held. The invalidate_page() function must walk the driver rmaps and evict +all the references to the page. + +There is no process information available before the rmaps are consulted. +The notifier mechanism can therefore not be attached to an mm_struct. Instead +it is a global callback list. Having to perform a callback for each and every +page that is reclaimed would be inefficient. Therefore we add an additional +page flag: PageRmapExternal(). Only pages that are marked with this bit can +be exported and the rmap callbacks will only be performed for pages marked +that way. + +The required additional Page flag is only availabe in 64 bit mode and +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms. + +An example of code to build a mmu_notifier mechanism with rmap capabilty +can be found in Documentation/mmu_notifier/skeleton_rmap.c + +February 9, 2008, + Christoph Lameter <clameter@sgi.com + +Index: linux-2.6/include/linux/mm_types.h Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-02-14 20:59:01.000000000 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-02-14 21:17:51.000000000 -0800 @@ -159,6 +159,12 @@ struct vm_area_struct { #endif }; +struct mmu_notifier_head { +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head head; +#endif +}; + struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; @@ -228,6 +234,7 @@ struct mm_struct { #ifdef CONFIG_CGROUP_MEM_CONT struct mem_cgroup *mem_cgroup; #endif + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/include/linux/mmu_notifier.h 2008-02-14 22:42:28.000000000 -0800 @@ -0,0 +1,180 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +/* + * MMU motifier + * + * Notifier functions for hardware and software that establishes external + * references to pages of a Linux system. The notifier calls ensure that + * external mappings are removed when the Linux VM removes memory ranges + * or individual pages from a process. + * + * These fall into two classes: + * + * 1. mmu_notifier + * + * These are callbacks registered with an mm_struct. If pages are + * removed from an address space then callbacks are performed. + * + * Spinlocks must be held in order to walk reverse maps. The + * invalidate_page() callbacks are performed with spinlocks held. + * + * The invalidate_range_start/end callbacks can be performed in contexts + * where sleeping is allowed or in atomic contexts. A flag is passed + * to indicate an atomic context. + * + * Pages must be marked dirty if dirty bits are found to be set in + * the external ptes. + */ + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/rcupdate.h> +#include <linux/mm_types.h> + +struct mmu_notifier_ops; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +struct mmu_notifier_ops { + /* + * The release notifier is called when no other execution threads + * are left. Synchronization is not necessary. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * age_page is called from contexts where the pte_lock is held + */ + int (*age_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_page is called from contexts where the pte_lock is held. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_begin() and invalidate_range_end() must be paired. + * + * Multiple invalidate_range_begin/ends may be nested or called + * concurrently. That is legit. However, no new external references + * may be established as long as any invalidate_xxx is running or + * any invalidate_range_begin() and has not been completed through a + * corresponding call to invalidate_range_end(). + * + * Locking within the notifier needs to serialize events correspondingly. + * + * invalidate_range_begin() must clear all references in the range + * and stop the establishment of new references. + * + * invalidate_range_end() reenables the establishment of references. + * + * atomic indicates that the function is called in an atomic context. + * We can sleep if atomic == 0. + * + * invalidate_range_begin() must remove all external references. + * There will be no retries as with invalidate_page(). + */ + void (*invalidate_range_begin)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end, + int atomic); + + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end, + int atomic); +}; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * Must hold the mmap_sem for write. + * + * RCU is used to traverse the list. A quiescent period needs to pass + * before the notifier is guaranteed to be visible to all threads + */ +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); + +/* + * Must hold mmap_sem for write. + * + * A quiescent period needs to pass before the mmu_notifier structure + * can be released. mmu_notifier_release() will wait for a quiescent period + * after calling the ->release callback. So it is safe to call + * mmu_notifier_unregister from the ->release function. + */ +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); + + +extern void mmu_notifier_release(struct mm_struct *mm); +extern int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address); + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +{ + INIT_HLIST_HEAD(&mnh->head); +} + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mn, __n, \ + &(mm)->mmu_notifier.head, \ + hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, \ + mm, \ + args); \ + rcu_read_unlock(); \ + } \ + } while (0) + +#else /* CONFIG_MMU_NOTIFIER */ + +/* + * Notifiers that use the parameters that they were passed so that the + * compiler does not complain about unused variables but does proper + * parameter checks even if !CONFIG_MMU_NOTIFIER. + * Macros generate no code. + */ +#define mmu_notifier(function, mm, args...) \ + do { \ + if (0) { \ + struct mmu_notifier *__mn; \ + \ + __mn = (struct mmu_notifier *)(0x00ff); \ + __mn->ops->function(__mn, mm, args); \ + }; \ + } while (0) + +static inline void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_release(struct mm_struct *mm) {} +static inline int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-02-14 20:59:01.000000000 -0800 +++ linux-2.6/mm/Kconfig 2008-02-14 21:17:51.000000000 -0800 @@ -193,3 +193,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" Index: linux-2.6/mm/Makefile =================================================================== --- linux-2.6.orig/mm/Makefile 2008-02-14 20:59:01.000000000 -0800 +++ linux-2.6/mm/Makefile 2008-02-14 21:17:51.000000000 -0800 @@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/mm/mmu_notifier.c 2008-02-14 22:41:55.000000000 -0800 @@ -0,0 +1,76 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <clameter@sgi.com> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/mmu_notifier.h> + +/* + * No synchronization. This function can only be called when only a single + * process remains that performs teardown. + */ +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *t; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + hlist_for_each_entry_safe(mn, n, t, + &mm->mmu_notifier.head, hlist) { + hlist_del_init(&mn->hlist); + if (mn->ops->release) + mn->ops->release(mn, mm); + } + } +} + +/* + * If no young bitflag is supported by the hardware, ->age_page can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->age_page) + young |= mn->ops->age_page(mn, mm, address); + } + rcu_read_unlock(); + } + + return young; +} + +/* + * Note that all notifiers use RCU. The updates are only guaranteed to be + * visible to other processes after a RCU quiescent period! + * + * Must hold mmap_sem writably when calling registration functions. + */ +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_del_rcu(&mn->hlist); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); + Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-02-14 20:59:01.000000000 -0800 +++ linux-2.6/kernel/fork.c 2008-02-14 21:17:51.000000000 -0800 @@ -53,6 +53,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_head_init(&mm->mmu_notifier); return mm; } Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-02-14 20:59:01.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-02-14 22:42:02.000000000 -0800 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm) unsigned long end; /* mm's last user has gone, and its about to be pulled down */ + mmu_notifier_release(mm); arch_exit_mmap(mm); lru_add_drain(); -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter @ 2008-02-16 3:37 ` Andrew Morton 2008-02-16 8:45 ` Avi Kivity ` (3 more replies) 2008-02-18 22:33 ` Roland Dreier 1 sibling, 4 replies; 48+ messages in thread From: Andrew Morton @ 2008-02-16 3:37 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman On Thu, 14 Feb 2008 22:49:00 -0800 Christoph Lameter <clameter@sgi.com> wrote: > MMU notifiers are used for hardware and software that establishes > external references to pages managed by the Linux kernel. These are > page table entriews or tlb entries or something else that allows > hardware (such as DMA engines, scatter gather devices, networking, > sharing of address spaces across operating system boundaries) and > software (Virtualization solutions such as KVM, Xen etc) to > access memory managed by the Linux kernel. > > The MMU notifier will notify the device driver that subscribes to such > a notifier that the VM is going to do something with the memory > mapped by that device. The device must then drop references for the > indicated memory area. The references may be reestablished later. > > The notification scheme is much better than the current schemes of > avoiding the danger of the VM removing pages that are externally > mapped. We currently either mlock pages used for RDMA, XPmem etc > in memory or increase the refcount to pin the pages. Increasing > the refcount makes it impossible for the VM to reclaim the page. > > Mlock causes problems with reclaim and may lead to OOM if too many > pages are pinned in memory. It is also incorrect in terms what the POSIX > specificies for what role mlock should play. Mlock does *not* pin pages in > memory. Mlock just means do not allow the page to be moved to swap. > > Linux can move pages in memory (for example through the page migration > mechanism). These pages can be moved even if they are mlocked(!!!!). > The current approach of page pinning in use by RDMA etc is conceptually > broken but there are currently no other easy solutions. > > The alternate of increasing the page count to pin pages is also not > that enticing since there will be continual attempts to reclaim > or migrate these pages. > > The solution here allows us to finally fix this issue by requiring > such devices to subscribe to a notification chain that will allow > them to work without pinning. The VM gains control of its memory again > and the memory that has external references can be managed like regular > memory. > > This patch: Core portion > What is the status of getting infiniband to use this facility? How important is this feature to KVM? To xpmem? Which other potential clients have been identified and how important it it to those? > Index: linux-2.6/Documentation/mmu_notifier/README > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6/Documentation/mmu_notifier/README 2008-02-14 22:27:19.000000000 -0800 > @@ -0,0 +1,105 @@ > +Linux MMU Notifiers > +------------------- > + > +MMU notifiers are used for hardware and software that establishes > +external references to pages managed by the Linux kernel. These are > +page table entriews or tlb entries or something else that allows > +hardware (such as DMA engines, scatter gather devices, networking, > +sharing of address spaces across operating system boundaries) and > +software (Virtualization solutions such as KVM, Xen etc) to > +access memory managed by the Linux kernel. > + > +The MMU notifier will notify the device driver that subscribes to such > +a notifier that the VM is going to do something with the memory > +mapped by that device. The device must then drop references for the > +indicated memory area. The references may be reestablished later. > + > +The notification scheme is much better than the current schemes of > +dealing with the danger of the VM removing pages. > +We currently mlock pages used for RDMA, XPmem etc in memory or > +increase the refcount of the pages. > + > +Both cause problems with reclaim and may lead to OOM if too many > +pages are pinned in memory. Mlock is also incorrect in terms of the POSIX > +specification of the role of mlock. Mlock does *not* pin pages in > +memory. It just does not allow the page to be moved to swap. > +The page refcount is used to track current users of a page struct. > +Artificially inflating the refcount means that the VM cannot track > +down all references to a page. It will not be able to reclaim or > +move a page. However, the core code will try again and again because > +the assumption is that an elevated refcount is a temporary situation. > + > +Linux can move pages in memory (for example through the page migration > +mechanism). These pages can be moved even if they are mlocked(!!!!). > +So the current approach in use by RDMA etc etc is conceptually broken > +but there are currently no other easy solutions. > + > +The solution here allows us to finally fix this issue by requiring > +such devices to subscribe to a notification chain that will allow > +them to work without pinning. > + > +The notifier chains provide two callback mechanisms. The > +first one is required for any device that establishes external mappings. > +The second (rmap) mechanism is required if a device needs to be > +able to sleep when invalidating references. Sleeping may be necessary > +if we are mapping across a network or to different Linux instances > +in the same address space. I'd have thought that a major reason for sleeping would be to wait for IO to complete. Worth mentioning here? > +mmu_notifier mechanism (for KVM/GRU etc) > +---------------------------------------- > +Callbacks are registered with an mm_struct from a device driver using > +mmu_notifier_register(). When the VM removes pages (or changes > +permissions on pages etc) then callbacks are triggered. > + > +The invalidation function for a single page (*invalidate_page) We already have an invalidatepage. Ho hum. > +is called with spinlocks (in particular the pte lock) held. This allow > +for an easy implementation of external ptes that are on the local system. > Why is that "easy"? I's have thought that it would only be easy if the driver happened to be using those same locks for its own purposes. Otherwise it is "awkward"? > +The invalidation mechanism for a range (*invalidate_range_begin/end*) is > +called most of the time without any locks held. It is only called with > +locks held for file backed mappings that are truncated. A flag indicates > +in which mode we are. A driver can use that mechanism to f.e. > +delay the freeing of the pages during truncate until no locks are held. That sucks big time. What do we need to do to make get the callback functions called in non-atomic context? > +Pages must be marked dirty if dirty bits are found to be set in > +the external ptes during unmap. That sentence is too vague. Define "marked dirty"? > +The *release* method is called when a Linux process exits. It is run before We'd conventionally use a notation such as "->release()" here, rather than the asterisks. > +the pages and mappings of a process are torn down and gives the device driver > +a chance to zap all the external mappings in one go. I assume what you mean here is that ->release() is called during exit() when the final reference to an mm is being dropped. > +An example for a code that can be used to build a notifier mechanism into > +a device driver can be found in the file > +Documentation/mmu_notifier/skeleton.c Should that be in samples/? > +mmu_rmap_notifier mechanism (XPMEM etc) > +--------------------------------------- > +The mmu_rmap_notifier allows the device driver to implement their own rmap s/their/its/ > +and allows the device driver to sleep during page eviction. This is necessary > +for complex drivers that f.e. allow the sharing of memory between processes > +running on different Linux instances (typically over a network or in a > +partitioned NUMA system). > + > +The mmu_rmap_notifier adds another invalidate_page() callout that is called > +*before* the Linux rmaps are walked. At that point only the page lock is > +held. The invalidate_page() function must walk the driver rmaps and evict > +all the references to the page. What happens if it cannot do so? > +There is no process information available before the rmaps are consulted. Not sure what that sentence means. I guess "available to the core VM"? > +The notifier mechanism can therefore not be attached to an mm_struct. Instead > +it is a global callback list. Having to perform a callback for each and every > +page that is reclaimed would be inefficient. Therefore we add an additional > +page flag: PageRmapExternal(). How many page flags are left? Is this feature important enough to justfy consumption of another one? > Only pages that are marked with this bit can > +be exported and the rmap callbacks will only be performed for pages marked > +that way. "exported": new term, unclear what it means. > +The required additional Page flag is only availabe in 64 bit mode and > +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms. whoa. Is that good? You just made your feature unavailable on the great majority of Linux systems. > +An example of code to build a mmu_notifier mechanism with rmap capabilty > +can be found in Documentation/mmu_notifier/skeleton_rmap.c > + > +February 9, 2008, > + Christoph Lameter <clameter@sgi.com > + > +Index: linux-2.6/include/linux/mm_types.h > Index: linux-2.6/include/linux/mm_types.h > =================================================================== > --- linux-2.6.orig/include/linux/mm_types.h 2008-02-14 20:59:01.000000000 -0800 > +++ linux-2.6/include/linux/mm_types.h 2008-02-14 21:17:51.000000000 -0800 > @@ -159,6 +159,12 @@ struct vm_area_struct { > #endif > }; > > +struct mmu_notifier_head { > +#ifdef CONFIG_MMU_NOTIFIER > + struct hlist_head head; > +#endif > +}; > + > struct mm_struct { > struct vm_area_struct * mmap; /* list of VMAs */ > struct rb_root mm_rb; > @@ -228,6 +234,7 @@ struct mm_struct { > #ifdef CONFIG_CGROUP_MEM_CONT > struct mem_cgroup *mem_cgroup; > #endif > + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ > }; > > #endif /* _LINUX_MM_TYPES_H */ > Index: linux-2.6/include/linux/mmu_notifier.h > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6/include/linux/mmu_notifier.h 2008-02-14 22:42:28.000000000 -0800 > @@ -0,0 +1,180 @@ > +#ifndef _LINUX_MMU_NOTIFIER_H > +#define _LINUX_MMU_NOTIFIER_H > + > +/* > + * MMU motifier typo > + * Notifier functions for hardware and software that establishes external > + * references to pages of a Linux system. The notifier calls ensure that > + * external mappings are removed when the Linux VM removes memory ranges > + * or individual pages from a process. So the callee cannot fail. hm. If it can't block, it's likely screwed in that case. In other cases it might be screwed anyway. I suspect we'll need to be able to handle callee failure. > + * These fall into two classes: > + * > + * 1. mmu_notifier > + * > + * These are callbacks registered with an mm_struct. If pages are > + * removed from an address space then callbacks are performed. "to be removed", I guess. It's called before the page is actually removed? > + * Spinlocks must be held in order to walk reverse maps. The > + * invalidate_page() callbacks are performed with spinlocks held. hm, yes, problem. Permitting callee failure might be good enough. > + * The invalidate_range_start/end callbacks can be performed in contexts > + * where sleeping is allowed or in atomic contexts. A flag is passed > + * to indicate an atomic context. We generally would prefer separate callbacks, rather than a unified callback with a mode flag. > + * Pages must be marked dirty if dirty bits are found to be set in > + * the external ptes. > + */ > + > +#include <linux/list.h> > +#include <linux/spinlock.h> > +#include <linux/rcupdate.h> > +#include <linux/mm_types.h> > + > +struct mmu_notifier_ops; > + > +struct mmu_notifier { > + struct hlist_node hlist; > + const struct mmu_notifier_ops *ops; > +}; > + > +struct mmu_notifier_ops { > + /* > + * The release notifier is called when no other execution threads > + * are left. Synchronization is not necessary. "and the mm is about to be destroyed"? > + */ > + void (*release)(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > + /* > + * age_page is called from contexts where the pte_lock is held > + */ > + int (*age_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); This wasn't documented. > + /* > + * invalidate_page is called from contexts where the pte_lock is held. > + */ > + void (*invalidate_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + /* > + * invalidate_range_begin() and invalidate_range_end() must be paired. > + * > + * Multiple invalidate_range_begin/ends may be nested or called > + * concurrently. Under what circumstances would they be nested? > That is legit. However, no new external references references to what? > + * may be established as long as any invalidate_xxx is running or > + * any invalidate_range_begin() and has not been completed through a stray "and". > + * corresponding call to invalidate_range_end(). > + * > + * Locking within the notifier needs to serialize events correspondingly. > + * > + * invalidate_range_begin() must clear all references in the range > + * and stop the establishment of new references. and stop the establishment of new references within the range, I assume? If so, that's putting a heck of a lot of complexity into the driver, isn't it? It needs to temporarily remember an arbitrarily large number of regions in this mm against which references may not be taken? > + * invalidate_range_end() reenables the establishment of references. within the range? > + * atomic indicates that the function is called in an atomic context. > + * We can sleep if atomic == 0. > + * > + * invalidate_range_begin() must remove all external references. > + * There will be no retries as with invalidate_page(). > + */ > + void (*invalidate_range_begin)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end, > + int atomic); > + > + void (*invalidate_range_end)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end, > + int atomic); > +}; > + > +#ifdef CONFIG_MMU_NOTIFIER > + > +/* > + * Must hold the mmap_sem for write. > + * > + * RCU is used to traverse the list. A quiescent period needs to pass > + * before the notifier is guaranteed to be visible to all threads > + */ > +extern void mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > +/* > + * Must hold mmap_sem for write. > + * > + * A quiescent period needs to pass before the mmu_notifier structure > + * can be released. mmu_notifier_release() will wait for a quiescent period > + * after calling the ->release callback. So it is safe to call > + * mmu_notifier_unregister from the ->release function. > + */ > +extern void mmu_notifier_unregister(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > + > +extern void mmu_notifier_release(struct mm_struct *mm); > +extern int mmu_notifier_age_page(struct mm_struct *mm, > + unsigned long address); There's the mysterious age_page again. > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) > +{ > + INIT_HLIST_HEAD(&mnh->head); > +} > + > +#define mmu_notifier(function, mm, args...) \ > + do { \ > + struct mmu_notifier *__mn; \ > + struct hlist_node *__n; \ > + \ > + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ > + rcu_read_lock(); \ > + hlist_for_each_entry_rcu(__mn, __n, \ > + &(mm)->mmu_notifier.head, \ > + hlist) \ > + if (__mn->ops->function) \ > + __mn->ops->function(__mn, \ > + mm, \ > + args); \ > + rcu_read_unlock(); \ > + } \ > + } while (0) The macro references its args more than once. Anyone who does mmu_notifier(function, some_function_which_has_side_effects()) will get a surprise. Use temporaries. > +#else /* CONFIG_MMU_NOTIFIER */ > + > +/* > + * Notifiers that use the parameters that they were passed so that the > + * compiler does not complain about unused variables but does proper > + * parameter checks even if !CONFIG_MMU_NOTIFIER. > + * Macros generate no code. > + */ > +#define mmu_notifier(function, mm, args...) \ > + do { \ > + if (0) { \ > + struct mmu_notifier *__mn; \ > + \ > + __mn = (struct mmu_notifier *)(0x00ff); \ > + __mn->ops->function(__mn, mm, args); \ > + }; \ > + } while (0) That's a bit weird. Can't we do the old (void)function; (void)mm; trick? Or make it a staic inline function? > +static inline void mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm) {} > +static inline void mmu_notifier_unregister(struct mmu_notifier *mn, > + struct mm_struct *mm) {} > +static inline void mmu_notifier_release(struct mm_struct *mm) {} > +static inline int mmu_notifier_age_page(struct mm_struct *mm, > + unsigned long address) > +{ > + return 0; > +} > + > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} > + > +#endif /* CONFIG_MMU_NOTIFIER */ > + > +#endif /* _LINUX_MMU_NOTIFIER_H */ > Index: linux-2.6/mm/Kconfig > =================================================================== > --- linux-2.6.orig/mm/Kconfig 2008-02-14 20:59:01.000000000 -0800 > +++ linux-2.6/mm/Kconfig 2008-02-14 21:17:51.000000000 -0800 > @@ -193,3 +193,7 @@ config NR_QUICK > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config MMU_NOTIFIER > + def_bool y > + bool "MMU notifier, for paging KVM/RDMA" Why is this not selectable? The help seems a bit brief. Does this cause 32-bit systems to drag in a bunch of code they're not allowed to ever use? > Index: linux-2.6/mm/Makefile > =================================================================== > --- linux-2.6.orig/mm/Makefile 2008-02-14 20:59:01.000000000 -0800 > +++ linux-2.6/mm/Makefile 2008-02-14 21:17:51.000000000 -0800 > @@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o > obj-$(CONFIG_SMP) += allocpercpu.o > obj-$(CONFIG_QUICKLIST) += quicklist.o > obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o > +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o > > Index: linux-2.6/mm/mmu_notifier.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6/mm/mmu_notifier.c 2008-02-14 22:41:55.000000000 -0800 > @@ -0,0 +1,76 @@ > +/* > + * linux/mm/mmu_notifier.c > + * > + * Copyright (C) 2008 Qumranet, Inc. > + * Copyright (C) 2008 SGI > + * Christoph Lameter <clameter@sgi.com> > + * > + * This work is licensed under the terms of the GNU GPL, version 2. See > + * the COPYING file in the top-level directory. > + */ > + > +#include <linux/module.h> > +#include <linux/mm.h> > +#include <linux/mmu_notifier.h> > + > +/* > + * No synchronization. This function can only be called when only a single > + * process remains that performs teardown. > + */ > +void mmu_notifier_release(struct mm_struct *mm) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n, *t; > + > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > + hlist_for_each_entry_safe(mn, n, t, > + &mm->mmu_notifier.head, hlist) { > + hlist_del_init(&mn->hlist); > + if (mn->ops->release) > + mn->ops->release(mn, mm); We do this a lot, but back in the old days people didn't like optional callbacks which can be NULL. If we expect that mmu_notifier_ops.release is usually implemented, the just unconditionally call it and require that all clients implement it. Perhaps provide an exported-to-modules stuv in core kernel for clients which didn't want to implement ->release(). > + } > + } > +} > + > +/* > + * If no young bitflag is supported by the hardware, ->age_page can > + * unmap the address and return 1 or 0 depending if the mapping previously > + * existed or not. > + */ > +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + int young = 0; > + > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, > + &mm->mmu_notifier.head, hlist) { > + if (mn->ops->age_page) > + young |= mn->ops->age_page(mn, mm, address); > + } > + rcu_read_unlock(); > + } > + > + return young; > +} should the rcu_read_lock() cover the hlist_empty() test? This function looks like it was tossed in at the last minute. It's mysterious, undocumented, poorly commented, poorly named. A better name would be one which has some correlation with the return value. Because anyone who looks at some code which does if (mmu_notifier_age_page(mm, address)) ... has to go and reverse-engineer the implementation of mmu_notifier_age_page() to work out under which circumstances the "..." will be executed. But this should be apparent just from reading the callee implementation. This function *really* does need some documentation. What does it *mean* when the ->age_page() from some of the notifiers returned "1" and the ->age_page() from some other notifiers returned zero? Dunno. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 3:37 ` Andrew Morton @ 2008-02-16 8:45 ` Avi Kivity 2008-02-16 8:56 ` Andrew Morton 2008-02-16 10:41 ` Brice Goglin ` (2 subsequent siblings) 3 siblings, 1 reply; 48+ messages in thread From: Avi Kivity @ 2008-02-16 8:45 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman Andrew Morton wrote: > How important is this feature to KVM? > Very. kvm pins pages that are referenced by the guest; a 64-bit guest will easily pin its entire memory with the kernel map. So this is critical for guest swapping to actually work. Other nice features like page migration are also enabled by this patch. -- Any sufficiently difficult bug is indistinguishable from a feature. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 8:45 ` Avi Kivity @ 2008-02-16 8:56 ` Andrew Morton 2008-02-16 9:21 ` Avi Kivity 0 siblings, 1 reply; 48+ messages in thread From: Andrew Morton @ 2008-02-16 8:56 UTC (permalink / raw) To: Avi Kivity Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman On Sat, 16 Feb 2008 10:45:50 +0200 Avi Kivity <avi@qumranet.com> wrote: > Andrew Morton wrote: > > How important is this feature to KVM? > > > > Very. kvm pins pages that are referenced by the guest; hm. Why does it do that? > a 64-bit guest > will easily pin its entire memory with the kernel map. > So this is > critical for guest swapping to actually work. Curious. If KVM can release guest pages at the request of this notifier so that they can be swapped out, why can't it release them by default, and allow swapping to proceed? > > Other nice features like page migration are also enabled by this patch. > We already have page migration. Do you mean page-migration-when-using-kvm? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 8:56 ` Andrew Morton @ 2008-02-16 9:21 ` Avi Kivity 0 siblings, 0 replies; 48+ messages in thread From: Avi Kivity @ 2008-02-16 9:21 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman Andrew Morton wrote: >> Very. kvm pins pages that are referenced by the guest; >> > > hm. Why does it do that? > > It was deemed best not to allow the guest to write to a page that has been swapped out and assigned to an unrelated host process. One way to view the kvm shadow page tables is as hardware dma descriptors. kvm pins pages for the same reason that drivers pin pages that are being dma'ed. It's also the reason why mmu notifiers are useful for such a wide range of dma capable hardware. >> a 64-bit guest >> will easily pin its entire memory with the kernel map. >> > > >> So this is >> critical for guest swapping to actually work. >> > > Curious. If KVM can release guest pages at the request of this notifier so > that they can be swapped out, why can't it release them by default, and > allow swapping to proceed? > > If kvm releases a page, it must also zap any shadow ptes pointing at the page and flush the tlb. If you do that for all of memory you can't reference any of it. Releasing a page has costs, both at the time of the release and when the guest eventually refers to the page again. >> Other nice features like page migration are also enabled by this patch. >> >> > > We already have page migration. Do you mean page-migration-when-using-kvm? > Yes, I'm obviously writing from a kvm-centric point of view. This is an important feature, as the virtualization future seems to be NUMA hosts (2- or 4- way, 4 cores per socket) running moderately sized guests. The ability to load-balance guests among the NUMA nodes is important for performance. (btw, I'm also looking forward to memory defragmentation. large pages are important for virtualization workloads and mmu notifiers are again critical to getting it to work while running kvm). -- Any sufficiently difficult bug is indistinguishable from a feature. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 3:37 ` Andrew Morton 2008-02-16 8:45 ` Avi Kivity @ 2008-02-16 10:41 ` Brice Goglin 2008-02-16 10:58 ` Andrew Morton 2008-02-16 19:21 ` Christoph Lameter 2008-02-17 5:04 ` Doug Maxey 3 siblings, 1 reply; 48+ messages in thread From: Brice Goglin @ 2008-02-16 10:41 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm Andrew Morton wrote: > What is the status of getting infiniband to use this facility? > > How important is this feature to KVM? > > To xpmem? > > Which other potential clients have been identified and how important it it > to those? > As I said when Andrea posted the first patch series, I used something very similar for non-RDMA-based HPC about 4 years ago. I haven't had time yet to look in depth and try the latest proposed API but my feeling is that it looks good. Brice ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 10:41 ` Brice Goglin @ 2008-02-16 10:58 ` Andrew Morton 2008-02-16 19:31 ` Christoph Lameter 0 siblings, 1 reply; 48+ messages in thread From: Andrew Morton @ 2008-02-16 10:58 UTC (permalink / raw) To: Brice Goglin; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm On Sat, 16 Feb 2008 11:41:35 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote: > Andrew Morton wrote: > > What is the status of getting infiniband to use this facility? > > > > How important is this feature to KVM? > > > > To xpmem? > > > > Which other potential clients have been identified and how important it it > > to those? > > > > As I said when Andrea posted the first patch series, I used something > very similar for non-RDMA-based HPC about 4 years ago. I haven't had > time yet to look in depth and try the latest proposed API but my feeling > is that it looks good. > "looks good" maybe. But it's in the details where I fear this will come unstuck. The likelihood that some callbacks really will want to be able to block in places where this interface doesn't permit that - either to wait for IO to complete or to wait for other threads to clear critical regions. >From that POV it doesn't look like a sufficiently general and useful design. Looks like it was grafted onto the current VM implementation in a way which just about suits two particular clients if they try hard enough. Which is all perfectly understandable - it would be hard to rework core MM to be able to make this interface more general. But I do think it's half-baked and there is a decent risk that future (or present) code which _could_ use something like this won't be able to use this one, and will continue to futz with mlock, page-pinning, etc. Not that I know what the fix to that is.. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 10:58 ` Andrew Morton @ 2008-02-16 19:31 ` Christoph Lameter 0 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-02-16 19:31 UTC (permalink / raw) To: Andrew Morton; +Cc: Brice Goglin, Andrea Arcangeli, linux-kernel, linux-mm On Sat, 16 Feb 2008, Andrew Morton wrote: > "looks good" maybe. But it's in the details where I fear this will come > unstuck. The likelihood that some callbacks really will want to be able to > block in places where this interface doesn't permit that - either to wait > for IO to complete or to wait for other threads to clear critical regions. We can get the invalidate_range to always be called without spinlocks if we deal with the case of the inode_mmap_lock being held in truncate case. If you always want to be able to sleep then we could drop the invalidate_page() that is called while pte locks held and require the use of a device driver rmap? > >From that POV it doesn't look like a sufficiently general and useful > design. Looks like it was grafted onto the current VM implementation in a > way which just about suits two particular clients if they try hard enough. You missed KVM. We did the best we could being as least invasive as possible. > Which is all perfectly understandable - it would be hard to rework core MM > to be able to make this interface more general. But I do think it's > half-baked and there is a decent risk that future (or present) code which > _could_ use something like this won't be able to use this one, and will > continue to futz with mlock, page-pinning, etc. > > Not that I know what the fix to that is.. You do not see a chance of this being okay if we adopt the two measures that I mentioned above? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 3:37 ` Andrew Morton 2008-02-16 8:45 ` Avi Kivity 2008-02-16 10:41 ` Brice Goglin @ 2008-02-16 19:21 ` Christoph Lameter 2008-02-17 3:01 ` Andrea Arcangeli 2008-02-17 5:04 ` Doug Maxey 3 siblings, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-02-16 19:21 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman On Fri, 15 Feb 2008, Andrew Morton wrote: > What is the status of getting infiniband to use this facility? Well we are talking about this it seems. > > How important is this feature to KVM? Andrea can answer this. > To xpmem? Without this feature we are stuck with page pinning by increasing refcounts which leads to endless lru scanning and other misbehavior. Also applications that use XPmem will not be able to swap or be able to use things like remap. > Which other potential clients have been identified and how important it it > to those? It is likely important to various DMA engines, framebuffers devices etc etc. Seems to be a generally useful feature. > > +The notifier chains provide two callback mechanisms. The > > +first one is required for any device that establishes external mappings. > > +The second (rmap) mechanism is required if a device needs to be > > +able to sleep when invalidating references. Sleeping may be necessary > > +if we are mapping across a network or to different Linux instances > > +in the same address space. > > I'd have thought that a major reason for sleeping would be to wait for IO > to complete. Worth mentioning here? Right. > Why is that "easy"? I's have thought that it would only be easy if the > driver happened to be using those same locks for its own purposes. > Otherwise it is "awkward"? Its relatively easy because it is tied directly to a process and can use external tlb shootdown / external page table clearing directly. The other method requires an rmap in the device driver where it can lookup the processes that are mapping the page. > > +The invalidation mechanism for a range (*invalidate_range_begin/end*) is > > +called most of the time without any locks held. It is only called with > > +locks held for file backed mappings that are truncated. A flag indicates > > +in which mode we are. A driver can use that mechanism to f.e. > > +delay the freeing of the pages during truncate until no locks are held. > > That sucks big time. What do we need to do to make get the callback > functions called in non-atomic context? We would have to drop the inode_mmap_lock. Could be done with some minor work. > > +Pages must be marked dirty if dirty bits are found to be set in > > +the external ptes during unmap. > > That sentence is too vague. Define "marked dirty"? Call set_page_dirty(). > > +The *release* method is called when a Linux process exits. It is run before > > We'd conventionally use a notation such as "->release()" here, rather than > the asterisks. Ok. > > > +the pages and mappings of a process are torn down and gives the device driver > > +a chance to zap all the external mappings in one go. > > I assume what you mean here is that ->release() is called during exit() > when the final reference to an mm is being dropped. Right. > > +An example for a code that can be used to build a notifier mechanism into > > +a device driver can be found in the file > > +Documentation/mmu_notifier/skeleton.c > > Should that be in samples/? Oh. We have that? > > +The mmu_rmap_notifier adds another invalidate_page() callout that is called > > +*before* the Linux rmaps are walked. At that point only the page lock is > > +held. The invalidate_page() function must walk the driver rmaps and evict > > +all the references to the page. > > What happens if it cannot do so? The page is not reclaimed if we were called from try_to_unmap(). From page_mkclean() we must always evict the page to switch off the write protect bit. > > +There is no process information available before the rmaps are consulted. > > Not sure what that sentence means. I guess "available to the core VM"? At that point we only have the page. We do not know which processes map the page. In order to find out we need to take a spinlock. > > +The notifier mechanism can therefore not be attached to an mm_struct. Instead > > +it is a global callback list. Having to perform a callback for each and every > > +page that is reclaimed would be inefficient. Therefore we add an additional > > +page flag: PageRmapExternal(). > > How many page flags are left? 30 or so. Its only available on 64bit. > Is this feature important enough to justfy consumption of another one? > > > Only pages that are marked with this bit can > > +be exported and the rmap callbacks will only be performed for pages marked > > +that way. > > "exported": new term, unclear what it means. Something external to the kernel references the page. > > +The required additional Page flag is only availabe in 64 bit mode and > > +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms. > > whoa. Is that good? You just made your feature unavailable on the great > majority of Linux systems. rmaps are usually used by complex drivers that are typically used in large systems. > > + * Notifier functions for hardware and software that establishes external > > + * references to pages of a Linux system. The notifier calls ensure that > > + * external mappings are removed when the Linux VM removes memory ranges > > + * or individual pages from a process. > > So the callee cannot fail. hm. If it can't block, it's likely screwed in > that case. In other cases it might be screwed anyway. I suspect we'll > need to be able to handle callee failure. Probably. > > > + * These fall into two classes: > > + * > > + * 1. mmu_notifier > > + * > > + * These are callbacks registered with an mm_struct. If pages are > > + * removed from an address space then callbacks are performed. > > "to be removed", I guess. It's called before the page is actually removed? Its called after the pte was cleared while holding the pte lock. > > + * The invalidate_range_start/end callbacks can be performed in contexts > > + * where sleeping is allowed or in atomic contexts. A flag is passed > > + * to indicate an atomic context. > > We generally would prefer separate callbacks, rather than a unified > callback with a mode flag. We could drop the inode_mmap_lock when doing truncate. That would make this work but its a kind of invasive thing for the VM. > > +struct mmu_notifier_ops { > > + /* > > + * The release notifier is called when no other execution threads > > + * are left. Synchronization is not necessary. > > "and the mm is about to be destroyed"? Right. > > + /* > > + * invalidate_range_begin() and invalidate_range_end() must be paired. > > + * > > + * Multiple invalidate_range_begin/ends may be nested or called > > + * concurrently. > > Under what circumstances would they be nested? Hmmmm.. Right they cannot be nested. Multiple processors can have invalidates() concurrently in progress. > > That is legit. However, no new external references > > references to what? To the ranges that are in the process of being invalidated. > > + * invalidate_range_begin() must clear all references in the range > > + * and stop the establishment of new references. > > and stop the establishment of new references within the range, I assume? Right. > If so, that's putting a heck of a lot of complexity into the driver, isn't > it? It needs to temporarily remember an arbitrarily large number of > regions in this mm against which references may not be taken? That is one implementation (XPmem does that). The other is to simply stop all references when any invalidate_range is in progress (KVM and GRU do that). > > + * invalidate_range_end() reenables the establishment of references. > > within the range? Right. > > +extern void mmu_notifier_release(struct mm_struct *mm); > > +extern int mmu_notifier_age_page(struct mm_struct *mm, > > + unsigned long address); > > There's the mysterious age_page again. Andrea put this in to check the reference status of a page. It functions like the accessed bit. > > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) > > +{ > > + INIT_HLIST_HEAD(&mnh->head); > > +} > > + > > +#define mmu_notifier(function, mm, args...) \ > > + do { \ > > + struct mmu_notifier *__mn; \ > > + struct hlist_node *__n; \ > > + \ > > + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ > > + rcu_read_lock(); \ > > + hlist_for_each_entry_rcu(__mn, __n, \ > > + &(mm)->mmu_notifier.head, \ > > + hlist) \ > > + if (__mn->ops->function) \ > > + __mn->ops->function(__mn, \ > > + mm, \ > > + args); \ > > + rcu_read_unlock(); \ > > + } \ > > + } while (0) > > The macro references its args more than once. Anyone who does > > mmu_notifier(function, some_function_which_has_side_effects()) > > will get a surprise. Use temporaries. Ok. > > +#define mmu_notifier(function, mm, args...) \ > > + do { \ > > + if (0) { \ > > + struct mmu_notifier *__mn; \ > > + \ > > + __mn = (struct mmu_notifier *)(0x00ff); \ > > + __mn->ops->function(__mn, mm, args); \ > > + }; \ > > + } while (0) > > That's a bit weird. Can't we do the old > > (void)function; > (void)mm; > > trick? Or make it a staic inline function? Static inline wont allow the checking of the parameters. (void) may be a good thing here. > > +config MMU_NOTIFIER > > + def_bool y > > + bool "MMU notifier, for paging KVM/RDMA" > > Why is this not selectable? The help seems a bit brief. > > Does this cause 32-bit systems to drag in a bunch of code they're not > allowed to ever use? I have selected it a number of times. We could make that a bit longer right. > > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > > + hlist_for_each_entry_safe(mn, n, t, > > + &mm->mmu_notifier.head, hlist) { > > + hlist_del_init(&mn->hlist); > > + if (mn->ops->release) > > + mn->ops->release(mn, mm); > > We do this a lot, but back in the old days people didn't like optional > callbacks which can be NULL. If we expect that mmu_notifier_ops.release is > usually implemented, the just unconditionally call it and require that all > clients implement it. Perhaps provide an exported-to-modules stuv in core > kernel for clients which didn't want to implement ->release(). Ok. > > +{ > > + struct mmu_notifier *mn; > > + struct hlist_node *n; > > + int young = 0; > > + > > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > > + rcu_read_lock(); > > + hlist_for_each_entry_rcu(mn, n, > > + &mm->mmu_notifier.head, hlist) { > > + if (mn->ops->age_page) > > + young |= mn->ops->age_page(mn, mm, address); > > + } > > + rcu_read_unlock(); > > + } > > + > > + return young; > > +} > > should the rcu_read_lock() cover the hlist_empty() test? > > This function looks like it was tossed in at the last minute. It's > mysterious, undocumented, poorly commented, poorly named. A better name > would be one which has some correlation with the return value. > > Because anyone who looks at some code which does > > if (mmu_notifier_age_page(mm, address)) > ... > > has to go and reverse-engineer the implementation of > mmu_notifier_age_page() to work out under which circumstances the "..." > will be executed. But this should be apparent just from reading the callee > implementation. > > This function *really* does need some documentation. What does it *mean* > when the ->age_page() from some of the notifiers returned "1" and the > ->age_page() from some other notifiers returned zero? Dunno. Andrea: Could you provide some more detail here? ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 19:21 ` Christoph Lameter @ 2008-02-17 3:01 ` Andrea Arcangeli 2008-02-17 12:24 ` Robin Holt 0 siblings, 1 reply; 48+ messages in thread From: Andrea Arcangeli @ 2008-02-17 3:01 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote: > On Fri, 15 Feb 2008, Andrew Morton wrote: > > > What is the status of getting infiniband to use this facility? > > Well we are talking about this it seems. It seems the IB folks think allowing RDMA over virtual memory is not interesting, their argument seem to be that RDMA is only interesting on RAM (and they seem not interested in allowing RDMA over a ram+swap backed _virtual_ memory allocation). They've just to decide if ram+swap allocation for RDMA is useful or not. > > How important is this feature to KVM? > > Andrea can answer this. I think I already did in separate email. > > That sucks big time. What do we need to do to make get the callback > > functions called in non-atomic context? I sure agree given I also asked to drop the lock param and enforce the invalidate_range_* to always be called in non atomic context. > We would have to drop the inode_mmap_lock. Could be done with some minor > work. The invalidate may be deferred after releasing the lock, the lock may not have to be dropped to cleanup the API (and make xpmem life easier). > That is one implementation (XPmem does that). The other is to simply stop > all references when any invalidate_range is in progress (KVM and GRU do > that). KVM doesn't stop new references. It doesn't need to because it holds a reference on the page (GRU doesn't). KVM can invalidate the spte and flush the tlb only after the linux pte has been cleared and after the page has been released by the VM (because the page doesn't go in the freelist and it remains pinned for a little while, until the spte is dropped too inside invalidate_range_end). GRU has to invalidate _before_ the linux pte is cleared so it has to stop new references from being established in the invalidate_range_start/end critical section. > Andrea put this in to check the reference status of a page. It functions > like the accessed bit. In short each pte can have some spte associated to it. So whenever we do a ptep_clear_flush protected by the PT lock, we also have to run invalidate_page that will internally invoke a sort-of sptep_clear_flush protected by a kvm->mmu_lock (equivalent of page_table_lock/PT-lock). sptes just like ptes maps virtual addresses to physical addresses, so you can read/write to RAM either through a pte or through a spte. Just like it would be insane to have any requirement that ptep_clear_flush has to run in not-atomic context (forcing a conversion of the PT lock to a mutex), it's also weird require the invalidate_page/age_page to run in atomic context. All troubles start with the xpmem requirements of having to schedule in its equivalent of the sptep_clear_flush because it's not a gigaherz-in-cpu thing but a gigabit thing where the network stack is involved with its own software linux driven skb memory allocations, schedules waiting for network I/O, etc... Imagine ptes allocated in a remote node, no surprise its brings a new set of problems (assuming it can work reliably during oom given its memory requirements in the try_to_unmap path, no page can ever be freed until the skbs have been allocated and sent and allocated again to receive the ack). Furthermore xpmem doesn't associate any pte to a spte, it associates a page_t to certain remote references, or it would be in trouble with invalidate_page that corresponds to ptep_clear_flush on a virtual address that exists thanks to the anon_vma/i_mmap lock held (and not thanks to the mmap_sem like in all invalidate_range calls). Christoph's patch is a mix of two entirely separated features. KVM can live with V7 just fine, but it's a lot more than what is needed by KVM. I don't think that invalidate_page/age_page must be allowed to sleep because invalidate_range also can sleep. You've to just ask yourself if the VM locks shall remain spinlocks, for the VM own good (not for the mmu notifiers good). It'd be bad to make the VM underperform with mutex protecting tiny critical sections to please some mmu notifier user. But if they're spinlocks, then clearly invalidate_page/age_page based on virtual addresses can't sleep or the virtual address wouldn't make sense anymore by the time the spinlock is released. > > This function looks like it was tossed in at the last minute. It's > > mysterious, undocumented, poorly commented, poorly named. A better name > > would be one which has some correlation with the return value. > > > > Because anyone who looks at some code which does > > > > if (mmu_notifier_age_page(mm, address)) > > ... > > > > has to go and reverse-engineer the implementation of > > mmu_notifier_age_page() to work out under which circumstances the "..." > > will be executed. But this should be apparent just from reading the callee > > implementation. > > > > This function *really* does need some documentation. What does it *mean* > > when the ->age_page() from some of the notifiers returned "1" and the > > ->age_page() from some other notifiers returned zero? Dunno. > > Andrea: Could you provide some more detail here? age_page is simply the ptep_clear_flush_young equivalent for sptes. It's meant to provide aging to the pages mapped by secondary mmus. Its return value is the same one of ptep_clear_flush_young but it represents the sptes associated with the pte, ptep_clear_flush_young instead only takes care of the pte itself. For KVM the below would be all that is needed, the fact invalidate_range can sleep and invalidate_page/age_page can't, is because their users are very different. With my approach the mmu notifiers callback are always protected by the PT lock (just like ptep_clear_flush and the other pte+tlb manglings) and they're called after the pte is cleared and before the VM reference on the page has been dropped. That makes it safe for GRU too, so for my initial approach _none_ of the callbacks was allowed to sleep, and that was a feature that allows GRU not to block its tlb miss interrupt with any further locking (the PT-lock taken by follow_page automatically serialized the GRU interrupt against the MMU notifiers and the linux page fault). For KVM the invalidate_pages of my patch is converted to invalidate_range_end because it doesn't matter for KVM if it's called after the PT lock has been dropped. In the try_to_unmap case invalidate_page is called by atomic context in Christoph's patch too, because a virtual address and in turn a pte and in turn certain sptes, can only exist thanks to the spinlocks taken by the VM. Changing the VM to make mmu notifiers sleepable in the try_to_unmap path sounds bad to me, especially given not even xpmem needs this. You can see how everything looks simpler and more symmetric by assuming the secondary mmu-references are established and dropped like ptes, like in the KVM case where infact sptes are a pure cpu thing exact like the ptes. XPMEM adds the requirement that sptes are infact remote entities that are mangled by a message passing protocol over the network, it's the same as ptep_clear_flush being required to schedule and send skbs to be successful and allowing try_to_unmap to do its work. Same problem. No wonder patch gets more complicated then. Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -46,6 +46,7 @@ __young = ptep_test_and_clear_young(__vma, __address, __ptep); \ if (__young) \ flush_tlb_page(__vma, __address); \ + __young |= mmu_notifier_age_page((__vma)->vm_mm, __address); \ __young; \ }) #endif @@ -86,6 +87,7 @@ do { \ pte_t __pte; \ __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ flush_tlb_page(__vma, __address); \ + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ __pte; \ }) #endif diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h --- a/include/asm-s390/pgtable.h +++ b/include/asm-s390/pgtable.h @@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str { pte_t pte = *ptep; ptep_invalidate(address, ptep); + mmu_notifier(invalidate_page, vma->vm_mm, address); return pte; } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include <linux/rbtree.h> #include <linux/rwsem.h> #include <linux/completion.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/mmu.h> @@ -219,6 +220,8 @@ struct mm_struct { /* aio bits */ rwlock_t ioctx_list_lock; struct kioctx *ioctx_list; + + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,132 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/spinlock.h> + +struct mmu_notifier; + +struct mmu_notifier_ops { + /* + * Called when nobody can register any more notifier in the mm + * and after the "mn" notifier has been disarmed already. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * invalidate_page[s] is called in atomic context + * after any pte has been updated and before + * dropping the PT lock required to update any Linux pte. + * Once the PT lock will be released the pte will have its + * final value to export through the secondary MMU. + * Before this is invoked any secondary MMU is still ok + * to read/write to the page previously pointed by the + * Linux pte because the old page hasn't been freed yet. + * If required set_page_dirty has to be called internally + * to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + void (*invalidate_pages)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + + /* + * Age page is called in atomic context inside the PT lock + * right after the VM is test-and-clearing the young/accessed + * bitflag in the pte. This way the VM will provide proper aging + * to the accesses to the page through the secondary MMUs + * and not only to the ones through the Linux pte. + */ + int (*age_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); +}; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier_head { + struct hlist_head head; + spinlock_t lock; +}; + +#include <linux/mm_types.h> + +/* + * RCU is used to traverse the list. A quiescent period needs to pass + * before the notifier is guaranteed to be visible to all threads. + */ +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +/* + * RCU is used to traverse the list. A quiescent period needs to pass + * before the "struct mmu_notifier" can be freed. Alternatively it + * can be synchronously freed inside ->release when the list can't + * change anymore and nobody could possibly walk it. + */ +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_release(struct mm_struct *mm); +extern int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address); + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +{ + INIT_HLIST_HEAD(&mnh->head); + spin_lock_init(&mnh->lock); +} + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mn, __n, \ + &(mm)->mmu_notifier.head, \ + hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, \ + mm, \ + args); \ + rcu_read_unlock(); \ + } \ + } while (0) + +#else /* CONFIG_MMU_NOTIFIER */ + +struct mmu_notifier_head {}; + +#define mmu_notifier_register(mn, mm) do {} while(0) +#define mmu_notifier_unregister(mn, mm) do {} while (0) +#define mmu_notifier_release(mm) do {} while (0) +#define mmu_notifier_age_page(mm, address) ({ 0; }) +#define mmu_notifier_head_init(mmh) do {} while (0) + +/* + * Notifiers that use the parameters that they were passed so that the + * compiler does not complain about unused variables but does proper + * parameter checks even if !CONFIG_MMU_NOTIFIER. + * Macros generate no code. + */ +#define mmu_notifier(function, mm, args...) \ + do { \ + if (0) { \ + struct mmu_notifier *__mn; \ + \ + __mn = (struct mmu_notifier *)(0x00ff); \ + __mn->ops->function(__mn, mm, args); \ + }; \ + } while (0) + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -360,6 +360,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_head_init(&mm->mmu_notifier); return mm; } free_mm(mm); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,3 +193,7 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -756,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar if (pte_none(pte)) continue; + mmu_notifier(invalidate_page, mm, address); page = pte_page(pte); if (pte_dirty(pte)) set_page_dirty(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -494,6 +494,7 @@ static int copy_pte_range(struct mm_stru spinlock_t *src_ptl, *dst_ptl; int progress = 0; int rss[2]; + unsigned long start; again: rss[1] = rss[0] = 0; @@ -505,6 +506,7 @@ again: spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); arch_enter_lazy_mmu_mode(); + start = addr; do { /* * We are holding two locks at this point - either of them @@ -525,6 +527,8 @@ again: } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier(invalidate_pages, vma->vm_mm, start, addr); spin_unlock(src_ptl); pte_unmap_nested(src_pte - 1); add_mm_rss(dst_mm, rss[0], rss[1]); @@ -660,6 +664,7 @@ static unsigned long zap_pte_range(struc } ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + mmu_notifier(invalidate_page, mm, addr); tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; @@ -1248,6 +1253,7 @@ static int remap_pte_range(struct mm_str { pte_t *pte; spinlock_t *ptl; + unsigned long start = addr; pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) @@ -1259,6 +1265,7 @@ static int remap_pte_range(struct mm_str pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); + mmu_notifier(invalidate_pages, mm, start, addr); pte_unmap_unlock(pte - 1, ptl); return 0; } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2044,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + mmu_notifier_release(mm); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,73 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <clameter@sgi.com> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> +#include <linux/rcupdate.h> + +/* + * No synchronization. This function can only be called when only a single + * process remains that performs teardown. + */ +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *tmp; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + hlist_for_each_entry_safe(mn, n, tmp, + &mm->mmu_notifier.head, hlist) { + hlist_del(&mn->hlist); + if (mn->ops->release) + mn->ops->release(mn, mm); + } + } +} + +/* + * If no young bitflag is supported by the hardware, ->age_page can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->age_page) + young |= mn->ops->age_page(mn, mm, address); + } + rcu_read_unlock(); + } + + return young; +} + +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + spin_lock(&mm->mmu_notifier.lock); + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); + spin_unlock(&mm->mmu_notifier.lock); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + spin_lock(&mm->mmu_notifier.lock); + hlist_del_rcu(&mn->hlist); + spin_unlock(&mm->mmu_notifier.lock); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -32,6 +32,7 @@ static void change_pte_range(struct mm_s { pte_t *pte, oldpte; spinlock_t *ptl; + unsigned long start = addr; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); arch_enter_lazy_mmu_mode(); @@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); + mmu_notifier(invalidate_pages, mm, start, addr); pte_unmap_unlock(pte - 1, ptl); } ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-17 3:01 ` Andrea Arcangeli @ 2008-02-17 12:24 ` Robin Holt 0 siblings, 0 replies; 48+ messages in thread From: Robin Holt @ 2008-02-17 12:24 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Lameter, Andrew Morton, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman On Sun, Feb 17, 2008 at 04:01:20AM +0100, Andrea Arcangeli wrote: > On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote: > > On Fri, 15 Feb 2008, Andrew Morton wrote: > > > > > What is the status of getting infiniband to use this facility? > > > > Well we are talking about this it seems. > > It seems the IB folks think allowing RDMA over virtual memory is not > interesting, their argument seem to be that RDMA is only interesting > on RAM (and they seem not interested in allowing RDMA over a ram+swap > backed _virtual_ memory allocation). They've just to decide if > ram+swap allocation for RDMA is useful or not. I don't think that is a completely fair characterization. It would be more fair to say that the changes required to their library/user api would be too significant to allow an adaptation to any scheme which allowed removal of physical memory below a virtual mapping. I agree with the IB folks when they say it is impossible with their current scheme. The fact that any consumer of their endpoint identifier can use any identifier without notifying the kernel prior to its use certainly makes any implementation under any scheme impossible. I guess we could possibly make things work for IB if we did some heavy work. Let's assume, instead of passing around the physical endpoint identifiers, they passed around a handle. In order for any IB endpoint to commuicate, it would need to request the kernel translate a handle into an endpoint identifier. In order for the kernel to put a TLB entry into the processes address space allowing the process access to the _CARD_, it would need to ensure all the current endpoint identifiers for this process were "active" meaning we have verified with the other endpoint that all pages are faulted and TLB/PFN information is in the owning card's TLB/PFN tables. Once all of a processes endoints are "active" we would drop in the PFN for the adapter into the pages tables. Any time pages are being revoked from under an active handle, we would shoot-down the IB adapter card TLB entries for all the remote users of this handle and quiesce the cards state to ensure transfers are either complete or terminated. When their are no active transfers, we would respond back to the owner and they could complete the source process page table cleaning. Any time all of the pages for a handle can not be mapped from virtual to physical, the remote process would be SIGBUS'd instead of having it IB adapter TLB installed. This is essentially how XPMEM does it except we have the benefit of working on individual pages. Again, not knowing what I am talking about, but under the assumption that MPI IB use is contained to a library, I would hope the changes could be contained under the MPI-to-IB library interface and would not need any changes at the MPI-user library interface. We do keep track of the virtual address ranges within a handle that are being used. I assume the IB folks will find that helpful as well. Otherwise, I think they could make things operate this way. XPMEM has the advantage of not needing to have virtual-to-physical at all times, but otherwise it is essentially the same. Thanks, Robin ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-16 3:37 ` Andrew Morton ` (2 preceding siblings ...) 2008-02-16 19:21 ` Christoph Lameter @ 2008-02-17 5:04 ` Doug Maxey 3 siblings, 0 replies; 48+ messages in thread From: Doug Maxey @ 2008-02-17 5:04 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman, Ben Herrenschmidt, Jan-Bernd Themann On Fri, 15 Feb 2008 19:37:19 PST, Andrew Morton wrote: > Which other potential clients have been identified and how important it it > to those? The powerpc ehea utilizes its own mmu. Not sure about the importance to the driver. (But will investigate :) ++doug ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-02-16 3:37 ` Andrew Morton @ 2008-02-18 22:33 ` Roland Dreier 1 sibling, 0 replies; 48+ messages in thread From: Roland Dreier @ 2008-02-18 22:33 UTC (permalink / raw) To: Christoph Lameter Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise, Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman It seems that we've come up with two reasonable cases where it makes sense to use these notifiers for InfiniBand/RDMA: First, the ability to safely to DMA to/from userspace memory with the memory regions mlock()ed but the pages not pinned. In this case the notifiers here would seem to suit us well: > + void (*invalidate_range_begin)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end, > + int atomic); > + > + void (*invalidate_range_end)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end, > + int atomic); If I understand correctly, the IB stack would have to get the hardware driver to shoot down translation entries and suspend access to the region when an invalidate_range_begin notifier is called, and wait for the invalidate_range_end notifier to repopulate the adapter translation tables. This will probably work OK as long as the interval between the invalidate_range_begin and invalidate_range_end calls is not "too long." Also, using this effectively requires us to figure out how we want to mlock() regions that are going to be used for RDMA. We could require userspace to do it, but it's not clear to me that we're safe in the case where userspace decides not to... what happens if some pages get swapped out after the invalidate_range_begin notifier? The second case where some form of notifiers are useful is for userspace to know when a memory registration is still valid, ie Pete Wyckoff's work: http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf http://www.osc.edu/~pw/dreg/ however these MMU notifiers seem orthogonal to that: the registration cache is concerned with address spaces, not page mapping, and hence the existing vma operations seem to be a better fit. - R. ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 0/6] MMU Notifiers V6 @ 2008-02-08 22:06 Christoph Lameter 2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 0 siblings, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw) To: akpm Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel (an external MMU). MMU notifiers allow us to get rid of the page pinning for RDMA and various other purposes. It gets rid of the broken use of mlock for page pinning. (mlock really does *not* pin pages....) More information on the rationale and the technical details can be found in the first patch and the README provided by that patch in Documentation/mmu_notifiers. The known immediate users are KVM - Establishes a refcount to the page via get_user_pages(). - External references are called spte. - Has page tables to track pages whose refcount was elevated but no reverse maps. GRU - Simple additional hardware TLB (possibly covering multiple instances of Linux) - Needs TLB shootdown when the VM unmaps pages. - Determines page address via follow_page (from interrupt context) but can fall back to get_user_pages(). - No page reference possible since no page status is kept.. XPmem - Allows use of a processes memory by remote instances of Linux. - Provides its own reverse mappings to track remote pte. - Established refcounts on the exported pages. - Must sleep in order to wait for remote acks of ptes that are being cleared. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() V2->V3: - Further RCU fixes - Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page and sys_remap_file_pages() after the pte clearing. V3->V4: - Drop locking and synchronize_rcu() on ->release since we know on release that we are the only executing thread. This is also true for invalidate_all() so we could drop off the mmu_notifier there early. Use hlist_del_init instead of hlist_del_rcu. - Do the invalidation as begin/end pairs with the requirement that the driver holds off new references in between. - Fixup filemap_xip.c - Figure out a potential way in which XPmem can deal with locks that are held. - Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit. - Strip cc list down a bit. - Drop Peters new rcu list macro - Add description to the core patch V4->V5: - Provide missing callouts for mremap. - Provide missing callouts for copy_page_range. - Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out structure contents. - Get rid of the invalidate_all() callback by moving ->release in place of invalidate_all. - Require holding mmap_sem on register/unregister instead of acquiring it ourselves. In some contexts where we want to register/unregister we are already holding mmap_sem. - Split out the rmap support patch so that there is no need to apply all patches for KVM and GRU. V5->V6: - Provide missing range callouts for mprotect - Fix do_wp_page control path sequencing - Clarify locking conventions - GRU and XPmem confirmed to work with this patchset. - Provide skeleton code for GRU/KVM type callback and for XPmem type. - Rework documentation and put it into Documentation/mmu_notifier. -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 1/6] mmu_notifier: Core code 2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter @ 2008-02-08 22:06 ` Christoph Lameter 0 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw) To: akpm Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman [-- Attachment #1: mmu_core --] [-- Type: text/plain, Size: 18134 bytes --] MMU notifiers are used for hardware and software that establishes external references to pages managed by the Linux kernel. These are page table entriews or tlb entries or something else that allows hardware (such as DMA engines, scatter gather devices, networking, sharing of address spaces across operating system boundaries) and software (Virtualization solutions such as KVM, Xen etc) to access memory managed by the Linux kernel. The MMU notifier will notify the device driver that subscribes to such a notifier that the VM is going to do something with the memory mapped by that device. The device must then drop references for the indicated memory area. The references may be reestablished later. The notification scheme is much better than the current scheme of avoiding the danger of the VM removing pages that are externally mapped. We currently mlock pages used for RDMA, XPmem etc in memory. Mlock causes problems with reclaim and may lead to OOM if too many pages are pinned in memory. It is also incorrect in terms what the POSIX specificies for what role mlock should play. Mlock does *not* pin pages in memory. Mlock just means do not allow the page to be moved to swap. Linux can move pages in memory (for example through the page migration mechanism). These pages can be moved even if they are mlocked(!!!!). The current approach of page pinning in use by RDMA etc is conceptually broken but there are currently no other easy solutions. The solution here allows us to finally fix this issue by requiring such devices to subscribe to a notification chain that will allow them to work without pinning. This patch: Core portion Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> --- Documentation/mmu_notifier/README | 99 +++++++++++++++++++++ include/linux/mm_types.h | 7 + include/linux/mmu_notifier.h | 175 ++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 2 mm/Kconfig | 4 mm/Makefile | 1 mm/mmap.c | 2 mm/mmu_notifier.c | 76 ++++++++++++++++ 8 files changed, 366 insertions(+) Index: linux-2.6/Documentation/mmu_notifier/README =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/Documentation/mmu_notifier/README 2008-02-08 12:30:47.000000000 -0800 @@ -0,0 +1,99 @@ +Linux MMU Notifiers +------------------- + +MMU notifiers are used for hardware and software that establishes +external references to pages managed by the Linux kernel. These are +page table entriews or tlb entries or something else that allows +hardware (such as DMA engines, scatter gather devices, networking, +sharing of address spaces across operating system boundaries) and +software (Virtualization solutions such as KVM, Xen etc) to +access memory managed by the Linux kernel. + +The MMU notifier will notify the device driver that subscribes to such +a notifier that the VM is going to do something with the memory +mapped by that device. The device must then drop references for the +indicated memory area. The references may be reestablished later. + +The notification scheme is much better than the current scheme of +dealing with the danger of the VM removing pages. +We currently mlock pages used for RDMA, XPmem etc in memory. + +Mlock causes problems with reclaim and may lead to OOM if too many +pages are pinned in memory. It is also incorrect in terms of the POSIX +specification of the role of mlock. Mlock does *not* pin pages in +memory. It just does not allow the page to be moved to swap. + +Linux can move pages in memory (for example through the page migration +mechanism). These pages can be moved even if they are mlocked(!!!!). +So the current approach in use by RDMA etc etc is conceptually broken +but there are currently no other easy solutions. + +The solution here allows us to finally fix this issue by requiring +such devices to subscribe to a notification chain that will allow +them to work without pinning. + +The notifier chains provide two callback mechanisms. The +first one is required for any device that establishes external mappings. +The second (rmap) mechanism is required if a device needs to be +able to sleep when invalidating references. Sleeping may be necessary +if we are mapping across a network or to different Linux instances +in the same address space. + +mmu_notifier mechanism (for KVM/GRU etc) +---------------------------------------- +Callbacks are registered with an mm_struct from a device driver using +mmu_notifier_register(). When the VM removes pages (or changes +permissions on pages etc) then callbacks are triggered. + +The invalidation function for a single page (*invalidate_page) +is called with spinlocks (in particular the pte lock) held. This allow +for an easy implementation of external ptes that are on the local system. + +The invalidation mechanism for a range (*invalidate_range_begin/end*) is +called most of the time without any locks held. It is only called with +locks held for file backed mappings that are truncated. A flag indicates +in which mode we are. A driver can use that mechanism to f.e. +delay the freeing of the pages during truncate until no locks are held. + +Pages must be marked dirty if dirty bits are found to be set in +the external ptes during unmap. + +The *release* method is called when a Linux process exits. It is run before +the pages and mappings of a process are torn down and gives the device driver +a chance to zap all the external mappings in one go. + +An example for a code that can be used to build a notifier mechanism into +a device driver can be found in the file +Documentation/mmu_notifier/skeleton.c + +mmu_rmap_notifier mechanism (XPMEM etc) +--------------------------------------- +The mmu_rmap_notifier allows the device driver to implement their own rmap +and allows the device driver to sleep during page eviction. This is necessary +for complex drivers that f.e. allow the sharing of memory between processes +running on different Linux instances (typically over a network or in a +partitioned NUMA system). + +The mmu_rmap_notifier adds another invalidate_page() callout that is called +*before* the Linux rmaps are walked. At that point only the page lock is +held. The invalidate_page() function must walk the driver rmaps and evict +all the references to the page. + +There is no process information available before the rmaps are consulted. +The notifier mechanism can therefore not be attached to an mm_struct. Instead +it is a global callback list. Having to perform a callback for each and every +page that is reclaimed would be inefficient. Therefore we add an additional +page flag: PageRmapExternal(). Only pages that are marked with this bit can +be exported and the rmap callbacks will only be performed for pages marked +that way. + +The required additional Page flag is only availabe in 64 bit mode and +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms. + +An example of code to build a mmu_notifier mechanism with rmap capabilty +can be found in Documentation/mmu_notifier/skeleton_rmap.c + +February 9, 2008, + Christoph Lameter <clameter@sgi.com + +Index: linux-2.6/include/linux/mm_types.h Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-02-08 12:28:06.000000000 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-02-08 12:30:47.000000000 -0800 @@ -159,6 +159,12 @@ struct vm_area_struct { #endif }; +struct mmu_notifier_head { +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head head; +#endif +}; + struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; @@ -228,6 +234,7 @@ struct mm_struct { #ifdef CONFIG_CGROUP_MEM_CONT struct mem_cgroup *mem_cgroup; #endif + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/include/linux/mmu_notifier.h 2008-02-08 12:35:14.000000000 -0800 @@ -0,0 +1,175 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +/* + * MMU motifier + * + * Notifier functions for hardware and software that establishes external + * references to pages of a Linux system. The notifier calls ensure that + * external mappings are removed when the Linux VM removes memory ranges + * or individual pages from a process. + * + * These fall into two classes: + * + * 1. mmu_notifier + * + * These are callbacks registered with an mm_struct. If pages are + * removed from an address space then callbacks are performed. + * + * Spinlocks must be held in order to walk reverse maps. The + * invalidate_page() callbacks are performed with spinlocks held. + * + * The invalidate_range_start/end callbacks can be performed in contexts + * where sleeping is allowed or in atomic contexts. A flag is passed + * to indicate an atomic context. + * + * Pages must be marked dirty if dirty bits are found to be set in + * the external ptes. + */ + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/rcupdate.h> +#include <linux/mm_types.h> + +struct mmu_notifier_ops; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +struct mmu_notifier_ops { + /* + * The release notifier is called when no other execution threads + * are left. Synchronization is not necessary. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * age_page is called from contexts where the pte_lock is held + */ + int (*age_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* invalidate_page is called from contexts where the pte_lock is held */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_begin() and invalidate_range_end() must paired. + * + * Multiple invalidate_range_begin/ends may be nested or called + * concurrently. That is legit. However, no new external references + * may be established as long as any invalidate_xxx is running or + * any invalidate_range_begin() and has not been completed through a + * corresponding call to invalidate_range_end(). + * + * Locking within the notifier needs to serialize events correspondingly. + * + * invalidate_range_begin() must clear all references in the range + * and stop the establishment of new references. + * + * invalidate_range_end() reenables the establishment of references. + * + * atomic indicates that the function is called in an atomic context. + * We can sleep if atomic == 0. + */ + void (*invalidate_range_begin)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end, + int atomic); + + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end, + int atomic); +}; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * Must hold the mmap_sem for write. + * + * RCU is used to traverse the list. A quiescent period needs to pass + * before the notifier is guaranteed to be visible to all threads + */ +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); + +/* + * Must hold mmap_sem for write. + * + * A quiescent period needs to pass before the mmu_notifier structure + * can be released. mmu_notifier_release() will wait for a quiescent period + * after calling the ->release callback. So it is safe to call + * mmu_notifier_unregister from the ->release function. + */ +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); + + +extern void mmu_notifier_release(struct mm_struct *mm); +extern int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address); + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +{ + INIT_HLIST_HEAD(&mnh->head); +} + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mn, __n, \ + &(mm)->mmu_notifier.head, \ + hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, \ + mm, \ + args); \ + rcu_read_unlock(); \ + } \ + } while (0) + +#else /* CONFIG_MMU_NOTIFIER */ + +/* + * Notifiers that use the parameters that they were passed so that the + * compiler does not complain about unused variables but does proper + * parameter checks even if !CONFIG_MMU_NOTIFIER. + * Macros generate no code. + */ +#define mmu_notifier(function, mm, args...) \ + do { \ + if (0) { \ + struct mmu_notifier *__mn; \ + \ + __mn = (struct mmu_notifier *)(0x00ff); \ + __mn->ops->function(__mn, mm, args); \ + }; \ + } while (0) + +static inline void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_release(struct mm_struct *mm) {} +static inline int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-02-08 12:28:06.000000000 -0800 +++ linux-2.6/mm/Kconfig 2008-02-08 12:30:47.000000000 -0800 @@ -193,3 +193,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" Index: linux-2.6/mm/Makefile =================================================================== --- linux-2.6.orig/mm/Makefile 2008-02-08 12:28:06.000000000 -0800 +++ linux-2.6/mm/Makefile 2008-02-08 12:30:47.000000000 -0800 @@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/mm/mmu_notifier.c 2008-02-08 12:44:24.000000000 -0800 @@ -0,0 +1,76 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <clameter@sgi.com> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/mmu_notifier.h> + +/* + * No synchronization. This function can only be called when only a single + * process remains that performs teardown. + */ +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *t; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + hlist_for_each_entry_safe(mn, n, t, + &mm->mmu_notifier.head, hlist) { + hlist_del_init(&mn->hlist); + if (mn->ops->release) + mn->ops->release(mn, mm); + } + } +} + +/* + * If no young bitflag is supported by the hardware, ->age_page can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->age_page) + young |= mn->ops->age_page(mn, mm, address); + } + rcu_read_unlock(); + } + + return young; +} + +/* + * Note that all notifiers use RCU. The updates are only guaranteed to be + * visible to other processes after a RCU quiescent period! + * + * Must hold mmap_sem writably when calling registration functions. + */ +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_del_rcu(&mn->hlist); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); + Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-02-08 12:28:06.000000000 -0800 +++ linux-2.6/kernel/fork.c 2008-02-08 12:30:47.000000000 -0800 @@ -53,6 +53,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_head_init(&mm->mmu_notifier); return mm; } Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-02-08 12:28:06.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-02-08 12:43:59.000000000 -0800 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm) unsigned long end; /* mm's last user has gone, and its about to be pulled down */ + mmu_notifier_release(mm); arch_exit_mmap(mm); lru_add_drain();` -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 0/6] [RFC] MMU Notifiers V2 @ 2008-01-28 20:28 Christoph Lameter 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 0 siblings, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins This is a patchset implementing MMU notifier callbacks based on Andrea's earlier work. These are needed if Linux pages are referenced from something else than tracked by the rmaps of the kernel. Issues: - Feedback from uses of the callbacks for KVM, RDMA, XPmem and GRU - RCU quiescent periods are required on registering and unregistering notifiers to guarantee visibility to other processors. Currently only mmu_notifier_release() does the correct thing. It is up to the user to provide RCU quiescent periods for register/unregister functions if they are called outside of the ->release method. Andrea's mmu_notifier #4 -> RFC V1 - Merge subsystem rmap based with Linux rmap based approach - Move Linux rmap based notifiers out of macro - Try to account for what locks are held while the notifiers are called. - Develop a patch sequence that separates out the different types of hooks so that we can review their use. - Avoid adding include to linux/mm_types.h - Integrate RCU logic suggested by Peter. V1->V2: - Improve RCU support - Use mmap_sem for mmu_notifier register / unregister - Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we already have invalidate_range() callbacks there. - Clean compile for !MMU_NOTIFIER - Isolate filemap_xip strangeness into its own diff - Pass a the flag to invalidate_range to indicate if a spinlock is held. - Add invalidate_all() -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* [patch 1/6] mmu_notifier: Core code 2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter @ 2008-01-28 20:28 ` Christoph Lameter 2008-01-28 22:06 ` Christoph Lameter ` (4 more replies) 0 siblings, 5 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins [-- Attachment #1: mmu_core --] [-- Type: text/plain, Size: 15333 bytes --] Core code for mmu notifiers. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> --- include/linux/list.h | 14 ++ include/linux/mm_types.h | 6 + include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++ include/linux/page-flags.h | 10 ++ kernel/fork.c | 2 mm/Kconfig | 4 mm/Makefile | 1 mm/mmap.c | 2 mm/mmu_notifier.c | 101 ++++++++++++++++++++ 9 files changed, 350 insertions(+) Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/include/linux/mm_types.h 2008-01-28 11:35:22.000000000 -0800 @@ -153,6 +153,10 @@ struct vm_area_struct { #endif }; +struct mmu_notifier_head { + struct hlist_head head; +}; + struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; @@ -219,6 +223,8 @@ struct mm_struct { /* aio bits */ rwlock_t ioctx_list_lock; struct kioctx *ioctx_list; + + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/mmu_notifier.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-28 11:43:03.000000000 -0800 @@ -0,0 +1,210 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +/* + * MMU motifier + * + * Notifier functions for hardware and software that establishes external + * references to pages of a Linux system. The notifier calls ensure that + * the external mappings are removed when the Linux VM removes memory ranges + * or individual pages from a process. + * + * These fall into two classes + * + * 1. mmu_notifier + * + * These are callbacks registered with an mm_struct. If mappings are + * removed from an address space then callbacks are performed. + * Spinlocks must be held in order to the walk reverse maps and the + * notifications are performed while the spinlock is held. + * + * + * 2. mmu_rmap_notifier + * + * Callbacks for subsystems that provide their own rmaps. These + * need to walk their own rmaps for a page. The invalidate_page + * callback is outside of locks so that we are not in a strictly + * atomic context (but we may be in a PF_MEMALLOC context if the + * notifier is called from reclaim code) and are able to sleep. + * Rmap notifiers need an extra page bit and are only available + * on 64 bit platforms. It is up to the subsystem to mark pags + * as PageExternalRmap as needed to trigger the callbacks. Pages + * must be marked dirty if dirty bits are set in the external + * pte. + */ + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/rcupdate.h> +#include <linux/mm_types.h> + +struct mmu_notifier_ops; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +struct mmu_notifier_ops { + /* + * Note: The mmu_notifier structure must be released with + * call_rcu() since other processors are only guaranteed to + * see the changes after a quiescent period. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + int (*age_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * lock indicates that the function is called under spinlock. + */ + void (*invalidate_range)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end, + int lock); +}; + +struct mmu_rmap_notifier_ops; + +struct mmu_rmap_notifier { + struct hlist_node hlist; + const struct mmu_rmap_notifier_ops *ops; +}; + +struct mmu_rmap_notifier_ops { + /* + * Called with the page lock held after ptes are modified or removed + * so that a subsystem with its own rmap's can remove remote ptes + * mapping a page. + */ + void (*invalidate_page)(struct mmu_rmap_notifier *mrn, + struct page *page); +}; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * Must hold the mmap_sem for write. + * + * RCU is used to traverse the list. A quiescent period needs to pass + * before the notifier is guaranteed to be visible to all threads + */ +extern void __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +/* Will acquire mmap_sem for write*/ +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +/* + * Will acquire mmap_sem for write. + * + * A quiescent period needs to pass before the mmu_notifier structure + * can be released. mmu_notifier_release() will wait for a quiescent period + * after calling the ->release callback. So it is safe to call + * mmu_notifier_unregister from the ->release function. + */ +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); + + +extern void mmu_notifier_release(struct mm_struct *mm); +extern int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address); + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +{ + INIT_HLIST_HEAD(&mnh->head); +} + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mn, __n, \ + &(mm)->mmu_notifier.head, \ + hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, \ + mm, \ + args); \ + rcu_read_unlock(); \ + } \ + } while (0) + +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn); +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn); + +extern struct hlist_head mmu_rmap_notifier_list; + +#define mmu_rmap_notifier(function, args...) \ + do { \ + struct mmu_rmap_notifier *__mrn; \ + struct hlist_node *__n; \ + \ + rcu_read_lock(); \ + hlist_for_each_entry_rcu(__mrn, __n, \ + &mmu_rmap_notifier_list, \ + hlist) \ + if (__mrn->ops->function) \ + __mrn->ops->function(__mrn, args); \ + rcu_read_unlock(); \ + } while (0); + +#else /* CONFIG_MMU_NOTIFIER */ + +/* + * Notifiers that use the parameters that they were passed so that the + * compiler does not complain about unused variables but does proper + * parameter checks even if !CONFIG_MMU_NOTIFIER. + * Macros generate no code. + */ +#define mmu_notifier(function, mm, args...) \ + do { \ + if (0) { \ + struct mmu_notifier *__mn; \ + \ + __mn = (struct mmu_notifier *)(0x00ff); \ + __mn->ops->function(__mn, mm, args); \ + }; \ + } while (0) + +#define mmu_rmap_notifier(function, args...) \ + do { \ + if (0) { \ + struct mmu_rmap_notifier *__mrn; \ + \ + __mrn = (struct mmu_rmap_notifier *)(0x00ff); \ + __mrn->ops->function(__mrn, args); \ + } \ + } while (0); + +static inline void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm) {} +static inline void mmu_notifier_release(struct mm_struct *mm) {} +static inline int mmu_notifier_age_page(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} + +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) + {} +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) + {} + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/include/linux/page-flags.h 2008-01-28 11:35:22.000000000 -0800 @@ -105,6 +105,7 @@ * 64 bit | FIELDS | ?????? FLAGS | * 63 32 0 */ +#define PG_external_rmap 30 /* Page has external rmap */ #define PG_uncached 31 /* Page has been mapped as uncached */ #endif @@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags) +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT) +#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags) +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags) +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \ + &(page)->flags) +#else +#define PageExternalRmap(page) 0 +#endif + struct page; /* forward declaration */ extern void cancel_dirty_page(struct page *page, unsigned int account_size); Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/mm/Kconfig 2008-01-28 11:35:22.000000000 -0800 @@ -193,3 +193,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" Index: linux-2.6/mm/Makefile =================================================================== --- linux-2.6.orig/mm/Makefile 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/mm/Makefile 2008-01-28 11:35:22.000000000 -0800 @@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6/mm/mmu_notifier.c 2008-01-28 11:35:22.000000000 -0800 @@ -0,0 +1,101 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <clameter@sgi.com> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> + +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *t; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_safe_rcu(mn, n, t, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->release) + mn->ops->release(mn, mm); + hlist_del(&mn->hlist); + } + rcu_read_unlock(); + synchronize_rcu(); + } +} + +/* + * If no young bitflag is supported by the hardware, ->age_page can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->age_page) + young |= mn->ops->age_page(mn, mm, address); + } + rcu_read_unlock(); + } + + return young; +} + +/* + * Note that all notifiers use RCU. The updates are only guaranteed to be + * visible to other processes after a RCU quiescent period! + */ +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + down_write(&mm->mmap_sem); + __mmu_notifier_register(mn, mm); + up_write(&mm->mmap_sem); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + down_write(&mm->mmap_sem); + hlist_del_rcu(&mn->hlist); + up_write(&mm->mmap_sem); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); + +static DEFINE_SPINLOCK(mmu_notifier_list_lock); +HLIST_HEAD(mmu_rmap_notifier_list); + +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) +{ + spin_lock(&mmu_notifier_list_lock); + hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list); + spin_unlock(&mmu_notifier_list_lock); +} +EXPORT_SYMBOL(mmu_rmap_notifier_register); + +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) +{ + spin_lock(&mmu_notifier_list_lock); + hlist_del_rcu(&mrn->hlist); + spin_unlock(&mmu_notifier_list_lock); +} +EXPORT_SYMBOL(mmu_rmap_notifier_unregister); + Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/kernel/fork.c 2008-01-28 11:35:22.000000000 -0800 @@ -51,6 +51,7 @@ #include <linux/random.h> #include <linux/tty.h> #include <linux/proc_fs.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_head_init(&mm->mmu_notifier); return mm; } free_mm(mm); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/mm/mmap.c 2008-01-28 11:37:53.000000000 -0800 @@ -26,6 +26,7 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + mmu_notifier_release(mm); /* * Walk the list again, actually closing and freeing it, Index: linux-2.6/include/linux/list.h =================================================================== --- linux-2.6.orig/include/linux/list.h 2008-01-28 11:35:20.000000000 -0800 +++ linux-2.6/include/linux/list.h 2008-01-28 11:35:22.000000000 -0800 @@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ pos = pos->next) +/** + * hlist_for_each_entry_safe_rcu - iterate over list of given type + * @tpos: the type * to use as a loop cursor. + * @pos: the &struct hlist_node to use as a loop cursor. + * @n: temporary pointer + * @head: the head for your list. + * @member: the name of the hlist_node within the struct. + */ +#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \ + for (pos = (head)->first; \ + rcu_dereference(pos) && ({ n = pos->next; 1;}) && \ + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ + pos = n) + #else #warning "don't include kernel headers in userspace" #endif /* __KERNEL__ */ -- ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter @ 2008-01-28 22:06 ` Christoph Lameter 2008-01-29 0:05 ` Robin Holt ` (3 subsequent siblings) 4 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-28 22:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins mmu core: Need to use hlist_del Wrong type of list del in mmu_notifier_release() Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/mmu_notifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c 2008-01-28 14:02:18.000000000 -0800 +++ linux-2.6/mm/mmu_notifier.c 2008-01-28 14:02:30.000000000 -0800 @@ -23,7 +23,7 @@ void mmu_notifier_release(struct mm_stru &mm->mmu_notifier.head, hlist) { if (mn->ops->release) mn->ops->release(mn, mm); - hlist_del(&mn->hlist); + hlist_del_rcu(&mn->hlist); } rcu_read_unlock(); synchronize_rcu(); ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-28 22:06 ` Christoph Lameter @ 2008-01-29 0:05 ` Robin Holt 2008-01-29 1:19 ` Christoph Lameter 2008-01-29 13:59 ` Andrea Arcangeli ` (2 subsequent siblings) 4 siblings, 1 reply; 48+ messages in thread From: Robin Holt @ 2008-01-29 0:05 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins > +void mmu_notifier_release(struct mm_struct *mm) ... > + hlist_for_each_entry_safe_rcu(mn, n, t, > + &mm->mmu_notifier.head, hlist) { > + if (mn->ops->release) > + mn->ops->release(mn, mm); > + hlist_del(&mn->hlist); USE_AFTER_FREE!!! I made this same comment as well as other relavent comments last week. Robin ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-29 0:05 ` Robin Holt @ 2008-01-29 1:19 ` Christoph Lameter 0 siblings, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-01-29 1:19 UTC (permalink / raw) To: Robin Holt Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Mon, 28 Jan 2008, Robin Holt wrote: > USE_AFTER_FREE!!! I made this same comment as well as other relavent > comments last week. Must have slipped somehow. Patch needs to be applied after the rcu fix. Please repeat the other relevant comments if they are still relevant.... I thought I had worked through them. mmu_notifier_release: remove mmu_notifier struct from list before calling ->release Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/mmu_notifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/mm/mmu_notifier.c =================================================================== --- linux-2.6.orig/mm/mmu_notifier.c 2008-01-28 17:17:05.000000000 -0800 +++ linux-2.6/mm/mmu_notifier.c 2008-01-28 17:17:10.000000000 -0800 @@ -21,9 +21,9 @@ void mmu_notifier_release(struct mm_stru rcu_read_lock(); hlist_for_each_entry_safe_rcu(mn, n, t, &mm->mmu_notifier.head, hlist) { + hlist_del_rcu(&mn->hlist); if (mn->ops->release) mn->ops->release(mn, mm); - hlist_del_rcu(&mn->hlist); } rcu_read_unlock(); synchronize_rcu(); ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-28 22:06 ` Christoph Lameter 2008-01-29 0:05 ` Robin Holt @ 2008-01-29 13:59 ` Andrea Arcangeli 2008-01-29 14:34 ` Andrea Arcangeli 2008-01-29 19:49 ` Christoph Lameter 2008-01-29 16:07 ` Robin Holt 2008-02-05 18:05 ` Andy Whitcroft 4 siblings, 2 replies; 48+ messages in thread From: Andrea Arcangeli @ 2008-01-29 13:59 UTC (permalink / raw) To: Christoph Lameter Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote: > +struct mmu_notifier_head { > + struct hlist_head head; > +}; > + > struct mm_struct { > struct vm_area_struct * mmap; /* list of VMAs */ > struct rb_root mm_rb; > @@ -219,6 +223,8 @@ struct mm_struct { > /* aio bits */ > rwlock_t ioctx_list_lock; > struct kioctx *ioctx_list; > + > + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ > }; Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a regression (a minor one though). > + /* > + * lock indicates that the function is called under spinlock. > + */ > + void (*invalidate_range)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end, > + int lock); > +}; It's out of my reach how can you be ok with lock=1. You said you have to block, if you can deal with lock=1 once, why can't you deal with lock=1 _always_? > +/* > + * Note that all notifiers use RCU. The updates are only guaranteed to be > + * visible to other processes after a RCU quiescent period! > + */ > +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); > +} > +EXPORT_SYMBOL_GPL(__mmu_notifier_register); > + > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + down_write(&mm->mmap_sem); > + __mmu_notifier_register(mn, mm); > + up_write(&mm->mmap_sem); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_register); The down_write is garbage. The caller should put it around mmu_notifier_register if something. The same way the caller should call synchronize_rcu after mmu_notifier_register if it needs synchronous behavior from the notifiers. The default version of mmu_notifier_register shouldn't be cluttered with unnecessary locking. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-29 13:59 ` Andrea Arcangeli @ 2008-01-29 14:34 ` Andrea Arcangeli 2008-01-29 19:49 ` Christoph Lameter 1 sibling, 0 replies; 48+ messages in thread From: Andrea Arcangeli @ 2008-01-29 14:34 UTC (permalink / raw) To: Christoph Lameter Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Tue, Jan 29, 2008 at 02:59:14PM +0100, Andrea Arcangeli wrote: > The down_write is garbage. The caller should put it around > mmu_notifier_register if something. The same way the caller should > call synchronize_rcu after mmu_notifier_register if it needs > synchronous behavior from the notifiers. The default version of > mmu_notifier_register shouldn't be cluttered with unnecessary locking. Ooops my spinlock was gone from the notifier head.... so the above comment is wrong sorry! I thought down_write was needed to serialize against some _external_ event, not to serialize the list updates in place of my explicit lock. The critical section is so small that a semaphore is the wrong locking choice, that's why I assumed it was for an external event. Anyway RCU won't be optimal for a huge flood of register/unregister, I agree the down_write shouldn't create much contention and it saves 4 bytes from each mm_struct, and we can always change it to a proper spinlock later if needed. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-29 13:59 ` Andrea Arcangeli 2008-01-29 14:34 ` Andrea Arcangeli @ 2008-01-29 19:49 ` Christoph Lameter 2008-01-29 20:41 ` Avi Kivity 1 sibling, 1 reply; 48+ messages in thread From: Christoph Lameter @ 2008-01-29 19:49 UTC (permalink / raw) To: Andrea Arcangeli Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ > > }; > > Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a > regression (a minor one though). Andrew does not like #ifdefs and it makes it possible to verify calling conventions if !CONFIG_MMU_NOTIFIER. > It's out of my reach how can you be ok with lock=1. You said you have > to block, if you can deal with lock=1 once, why can't you deal with > lock=1 _always_? Not sure yet. We may have to do more in that area. Need to have feedback from Robin. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-29 19:49 ` Christoph Lameter @ 2008-01-29 20:41 ` Avi Kivity 0 siblings, 0 replies; 48+ messages in thread From: Avi Kivity @ 2008-01-29 20:41 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins Christoph Lameter wrote: > On Tue, 29 Jan 2008, Andrea Arcangeli wrote: > > >>> + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ >>> }; >>> >> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a >> regression (a minor one though). >> > > Andrew does not like #ifdefs and it makes it possible to verify calling > conventions if !CONFIG_MMU_NOTIFIER. > > You could define mmu_notifier_head as an empty struct in that case. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter ` (2 preceding siblings ...) 2008-01-29 13:59 ` Andrea Arcangeli @ 2008-01-29 16:07 ` Robin Holt 2008-02-05 18:05 ` Andy Whitcroft 4 siblings, 0 replies; 48+ messages in thread From: Robin Holt @ 2008-01-29 16:07 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins I am going to seperate my comments into individual replies to help reduce the chance they are lost. > +void mmu_notifier_release(struct mm_struct *mm) ... > + hlist_for_each_entry_safe_rcu(mn, n, t, > + &mm->mmu_notifier.head, hlist) { > + if (mn->ops->release) > + mn->ops->release(mn, mm); > + hlist_del(&mn->hlist); This is a use-after-free issue. The hlist_del_rcu needs to be done before the callout as the structure containing the mmu_notifier structure will need to be freed from within the ->release callout. Thanks, Robin ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter ` (3 preceding siblings ...) 2008-01-29 16:07 ` Robin Holt @ 2008-02-05 18:05 ` Andy Whitcroft 2008-02-05 18:17 ` Peter Zijlstra 2008-02-05 18:19 ` Christoph Lameter 4 siblings, 2 replies; 48+ messages in thread From: Andy Whitcroft @ 2008-02-05 18:05 UTC (permalink / raw) To: Christoph Lameter Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote: > Core code for mmu notifiers. > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> > > --- > include/linux/list.h | 14 ++ > include/linux/mm_types.h | 6 + > include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++ > include/linux/page-flags.h | 10 ++ > kernel/fork.c | 2 > mm/Kconfig | 4 > mm/Makefile | 1 > mm/mmap.c | 2 > mm/mmu_notifier.c | 101 ++++++++++++++++++++ > 9 files changed, 350 insertions(+) > > Index: linux-2.6/include/linux/mm_types.h > =================================================================== > --- linux-2.6.orig/include/linux/mm_types.h 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/include/linux/mm_types.h 2008-01-28 11:35:22.000000000 -0800 > @@ -153,6 +153,10 @@ struct vm_area_struct { > #endif > }; > > +struct mmu_notifier_head { > + struct hlist_head head; > +}; > + > struct mm_struct { > struct vm_area_struct * mmap; /* list of VMAs */ > struct rb_root mm_rb; > @@ -219,6 +223,8 @@ struct mm_struct { > /* aio bits */ > rwlock_t ioctx_list_lock; > struct kioctx *ioctx_list; > + > + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ > }; > > #endif /* _LINUX_MM_TYPES_H */ > Index: linux-2.6/include/linux/mmu_notifier.h > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-28 11:43:03.000000000 -0800 > @@ -0,0 +1,210 @@ > +#ifndef _LINUX_MMU_NOTIFIER_H > +#define _LINUX_MMU_NOTIFIER_H > + > +/* > + * MMU motifier > + * > + * Notifier functions for hardware and software that establishes external > + * references to pages of a Linux system. The notifier calls ensure that > + * the external mappings are removed when the Linux VM removes memory ranges > + * or individual pages from a process. > + * > + * These fall into two classes > + * > + * 1. mmu_notifier > + * > + * These are callbacks registered with an mm_struct. If mappings are > + * removed from an address space then callbacks are performed. > + * Spinlocks must be held in order to the walk reverse maps and the > + * notifications are performed while the spinlock is held. > + * > + * > + * 2. mmu_rmap_notifier > + * > + * Callbacks for subsystems that provide their own rmaps. These > + * need to walk their own rmaps for a page. The invalidate_page > + * callback is outside of locks so that we are not in a strictly > + * atomic context (but we may be in a PF_MEMALLOC context if the > + * notifier is called from reclaim code) and are able to sleep. > + * Rmap notifiers need an extra page bit and are only available > + * on 64 bit platforms. It is up to the subsystem to mark pags > + * as PageExternalRmap as needed to trigger the callbacks. Pages > + * must be marked dirty if dirty bits are set in the external > + * pte. > + */ > + > +#include <linux/list.h> > +#include <linux/spinlock.h> > +#include <linux/rcupdate.h> > +#include <linux/mm_types.h> > + > +struct mmu_notifier_ops; > + > +struct mmu_notifier { > + struct hlist_node hlist; > + const struct mmu_notifier_ops *ops; > +}; > + > +struct mmu_notifier_ops { > + /* > + * Note: The mmu_notifier structure must be released with > + * call_rcu() since other processors are only guaranteed to > + * see the changes after a quiescent period. > + */ > + void (*release)(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > + int (*age_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + void (*invalidate_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + /* > + * lock indicates that the function is called under spinlock. > + */ > + void (*invalidate_range)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end, > + int lock); > +}; > + > +struct mmu_rmap_notifier_ops; > + > +struct mmu_rmap_notifier { > + struct hlist_node hlist; > + const struct mmu_rmap_notifier_ops *ops; > +}; > + > +struct mmu_rmap_notifier_ops { > + /* > + * Called with the page lock held after ptes are modified or removed > + * so that a subsystem with its own rmap's can remove remote ptes > + * mapping a page. > + */ > + void (*invalidate_page)(struct mmu_rmap_notifier *mrn, > + struct page *page); > +}; > + > +#ifdef CONFIG_MMU_NOTIFIER > + > +/* > + * Must hold the mmap_sem for write. > + * > + * RCU is used to traverse the list. A quiescent period needs to pass > + * before the notifier is guaranteed to be visible to all threads > + */ > +extern void __mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > +/* Will acquire mmap_sem for write*/ > +extern void mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > +/* > + * Will acquire mmap_sem for write. > + * > + * A quiescent period needs to pass before the mmu_notifier structure > + * can be released. mmu_notifier_release() will wait for a quiescent period > + * after calling the ->release callback. So it is safe to call > + * mmu_notifier_unregister from the ->release function. > + */ > +extern void mmu_notifier_unregister(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > + > +extern void mmu_notifier_release(struct mm_struct *mm); > +extern int mmu_notifier_age_page(struct mm_struct *mm, > + unsigned long address); > + > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) > +{ > + INIT_HLIST_HEAD(&mnh->head); > +} > + > +#define mmu_notifier(function, mm, args...) \ > + do { \ > + struct mmu_notifier *__mn; \ > + struct hlist_node *__n; \ > + \ > + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ > + rcu_read_lock(); \ > + hlist_for_each_entry_rcu(__mn, __n, \ > + &(mm)->mmu_notifier.head, \ > + hlist) \ > + if (__mn->ops->function) \ > + __mn->ops->function(__mn, \ > + mm, \ > + args); \ > + rcu_read_unlock(); \ > + } \ > + } while (0) > + > +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn); > +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn); > + > +extern struct hlist_head mmu_rmap_notifier_list; > + > +#define mmu_rmap_notifier(function, args...) \ > + do { \ > + struct mmu_rmap_notifier *__mrn; \ > + struct hlist_node *__n; \ > + \ > + rcu_read_lock(); \ > + hlist_for_each_entry_rcu(__mrn, __n, \ > + &mmu_rmap_notifier_list, \ > + hlist) \ > + if (__mrn->ops->function) \ > + __mrn->ops->function(__mrn, args); \ > + rcu_read_unlock(); \ > + } while (0); > + > +#else /* CONFIG_MMU_NOTIFIER */ > + > +/* > + * Notifiers that use the parameters that they were passed so that the > + * compiler does not complain about unused variables but does proper > + * parameter checks even if !CONFIG_MMU_NOTIFIER. > + * Macros generate no code. > + */ > +#define mmu_notifier(function, mm, args...) \ > + do { \ > + if (0) { \ > + struct mmu_notifier *__mn; \ > + \ > + __mn = (struct mmu_notifier *)(0x00ff); \ > + __mn->ops->function(__mn, mm, args); \ > + }; \ > + } while (0) > + > +#define mmu_rmap_notifier(function, args...) \ > + do { \ > + if (0) { \ > + struct mmu_rmap_notifier *__mrn; \ > + \ > + __mrn = (struct mmu_rmap_notifier *)(0x00ff); \ > + __mrn->ops->function(__mrn, args); \ > + } \ > + } while (0); > + > +static inline void mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm) {} > +static inline void mmu_notifier_unregister(struct mmu_notifier *mn, > + struct mm_struct *mm) {} > +static inline void mmu_notifier_release(struct mm_struct *mm) {} > +static inline int mmu_notifier_age_page(struct mm_struct *mm, > + unsigned long address) > +{ > + return 0; > +} > + > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {} > + > +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) > + {} > +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) > + {} > + > +#endif /* CONFIG_MMU_NOTIFIER */ > + > +#endif /* _LINUX_MMU_NOTIFIER_H */ > Index: linux-2.6/include/linux/page-flags.h > =================================================================== > --- linux-2.6.orig/include/linux/page-flags.h 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/include/linux/page-flags.h 2008-01-28 11:35:22.000000000 -0800 > @@ -105,6 +105,7 @@ > * 64 bit | FIELDS | ?????? FLAGS | > * 63 32 0 > */ > +#define PG_external_rmap 30 /* Page has external rmap */ > #define PG_uncached 31 /* Page has been mapped as uncached */ > #endif > > @@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc > #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) > #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags) > > +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT) > +#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags) > +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags) > +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \ > + &(page)->flags) > +#else > +#define PageExternalRmap(page) 0 > +#endif > + > struct page; /* forward declaration */ > > extern void cancel_dirty_page(struct page *page, unsigned int account_size); > Index: linux-2.6/mm/Kconfig > =================================================================== > --- linux-2.6.orig/mm/Kconfig 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/mm/Kconfig 2008-01-28 11:35:22.000000000 -0800 > @@ -193,3 +193,7 @@ config NR_QUICK > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config MMU_NOTIFIER > + def_bool y > + bool "MMU notifier, for paging KVM/RDMA" > Index: linux-2.6/mm/Makefile > =================================================================== > --- linux-2.6.orig/mm/Makefile 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/mm/Makefile 2008-01-28 11:35:22.000000000 -0800 > @@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o > obj-$(CONFIG_MIGRATION) += migrate.o > obj-$(CONFIG_SMP) += allocpercpu.o > obj-$(CONFIG_QUICKLIST) += quicklist.o > +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o > > Index: linux-2.6/mm/mmu_notifier.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6/mm/mmu_notifier.c 2008-01-28 11:35:22.000000000 -0800 > @@ -0,0 +1,101 @@ > +/* > + * linux/mm/mmu_notifier.c > + * > + * Copyright (C) 2008 Qumranet, Inc. > + * Copyright (C) 2008 SGI > + * Christoph Lameter <clameter@sgi.com> > + * > + * This work is licensed under the terms of the GNU GPL, version 2. See > + * the COPYING file in the top-level directory. > + */ > + > +#include <linux/mmu_notifier.h> > +#include <linux/module.h> > + > +void mmu_notifier_release(struct mm_struct *mm) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n, *t; > + > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > + rcu_read_lock(); > + hlist_for_each_entry_safe_rcu(mn, n, t, > + &mm->mmu_notifier.head, hlist) { > + if (mn->ops->release) > + mn->ops->release(mn, mm); Does this ->release actually release the 'nm' and its associated hlist? I see in this thread that this ordering is deemed "use after free" which implies so. If it does that seems wrong. This is an RCU hlist, therefore the list integrity must be maintained through the next grace period in case there are parallell readers using the element, in particular its forward pointer for traversal. > + hlist_del(&mn->hlist); For this to be updating the list, you must have some form of "write-side" exclusion as these primatives are not "parallel write safe". It would be helpful for this routine to state what that write side exclusion is. > + } > + rcu_read_unlock(); > + synchronize_rcu(); > + } > +} > + > +/* > + * If no young bitflag is supported by the hardware, ->age_page can > + * unmap the address and return 1 or 0 depending if the mapping previously > + * existed or not. > + */ > +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + int young = 0; > + > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, > + &mm->mmu_notifier.head, hlist) { > + if (mn->ops->age_page) > + young |= mn->ops->age_page(mn, mm, address); > + } > + rcu_read_unlock(); > + } > + > + return young; > +} > + > +/* > + * Note that all notifiers use RCU. The updates are only guaranteed to be > + * visible to other processes after a RCU quiescent period! > + */ > +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head); > +} > +EXPORT_SYMBOL_GPL(__mmu_notifier_register); > + > +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + down_write(&mm->mmap_sem); > + __mmu_notifier_register(mn, mm); > + up_write(&mm->mmap_sem); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_register); > + > +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + down_write(&mm->mmap_sem); > + hlist_del_rcu(&mn->hlist); > + up_write(&mm->mmap_sem); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); > + > +static DEFINE_SPINLOCK(mmu_notifier_list_lock); > +HLIST_HEAD(mmu_rmap_notifier_list); > + > +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn) > +{ > + spin_lock(&mmu_notifier_list_lock); > + hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list); > + spin_unlock(&mmu_notifier_list_lock); > +} > +EXPORT_SYMBOL(mmu_rmap_notifier_register); > + > +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn) > +{ > + spin_lock(&mmu_notifier_list_lock); > + hlist_del_rcu(&mrn->hlist); > + spin_unlock(&mmu_notifier_list_lock); > +} > +EXPORT_SYMBOL(mmu_rmap_notifier_unregister); > + > Index: linux-2.6/kernel/fork.c > =================================================================== > --- linux-2.6.orig/kernel/fork.c 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/kernel/fork.c 2008-01-28 11:35:22.000000000 -0800 > @@ -51,6 +51,7 @@ > #include <linux/random.h> > #include <linux/tty.h> > #include <linux/proc_fs.h> > +#include <linux/mmu_notifier.h> > > #include <asm/pgtable.h> > #include <asm/pgalloc.h> > @@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct > > if (likely(!mm_alloc_pgd(mm))) { > mm->def_flags = 0; > + mmu_notifier_head_init(&mm->mmu_notifier); > return mm; > } > free_mm(mm); > Index: linux-2.6/mm/mmap.c > =================================================================== > --- linux-2.6.orig/mm/mmap.c 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/mm/mmap.c 2008-01-28 11:37:53.000000000 -0800 > @@ -26,6 +26,7 @@ > #include <linux/mount.h> > #include <linux/mempolicy.h> > #include <linux/rmap.h> > +#include <linux/mmu_notifier.h> > > #include <asm/uaccess.h> > #include <asm/cacheflush.h> > @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm) > vm_unacct_memory(nr_accounted); > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); > tlb_finish_mmu(tlb, 0, end); > + mmu_notifier_release(mm); > > /* > * Walk the list again, actually closing and freeing it, > Index: linux-2.6/include/linux/list.h > =================================================================== > --- linux-2.6.orig/include/linux/list.h 2008-01-28 11:35:20.000000000 -0800 > +++ linux-2.6/include/linux/list.h 2008-01-28 11:35:22.000000000 -0800 > @@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s > ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ > pos = pos->next) > > +/** > + * hlist_for_each_entry_safe_rcu - iterate over list of given type > + * @tpos: the type * to use as a loop cursor. > + * @pos: the &struct hlist_node to use as a loop cursor. > + * @n: temporary pointer > + * @head: the head for your list. > + * @member: the name of the hlist_node within the struct. > + */ > +#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \ > + for (pos = (head)->first; \ > + rcu_dereference(pos) && ({ n = pos->next; 1;}) && \ > + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \ > + pos = n) > + > #else > #warning "don't include kernel headers in userspace" > #endif /* __KERNEL__ */ I am not sure it makes sense to add a _safe_rcu variant. As I understand things an _safe variant is used where we are going to remove the current list element in the middle of a list walk. However the key feature of an RCU data structure is that it will always be in a "safe" state until any parallel readers have completed. For an hlist this means that the removed entry and its forward link must remain valid for as long as there may be a parallel reader traversing this list, ie. until the next grace period. If this link is valid for the parallel reader, then it must be valid for us, and if so it feels that hlist_for_each_entry_rcu should be sufficient to cope in the face of entries being unlinked as we traverse the list. -apw ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-05 18:05 ` Andy Whitcroft @ 2008-02-05 18:17 ` Peter Zijlstra 2008-02-05 18:19 ` Christoph Lameter 1 sibling, 0 replies; 48+ messages in thread From: Peter Zijlstra @ 2008-02-05 18:17 UTC (permalink / raw) To: Andy Whitcroft Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Tue, 2008-02-05 at 18:05 +0000, Andy Whitcroft wrote: > > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > > + rcu_read_lock(); > > + hlist_for_each_entry_safe_rcu(mn, n, t, > > + &mm->mmu_notifier.head, hlist) { > > + if (mn->ops->release) > > + mn->ops->release(mn, mm); > > Does this ->release actually release the 'nm' and its associated hlist? > I see in this thread that this ordering is deemed "use after free" which > implies so. > > If it does that seems wrong. This is an RCU hlist, therefore the list > integrity must be maintained through the next grace period in case there > are parallell readers using the element, in particular its forward > pointer for traversal. That is not quite so, list elements must be preserved, not the list order. > > > + hlist_del(&mn->hlist); > > For this to be updating the list, you must have some form of "write-side" > exclusion as these primatives are not "parallel write safe". It would > be helpful for this routine to state what that write side exclusion is. Yeah, has been noticed, read on in the thread :-) > I am not sure it makes sense to add a _safe_rcu variant. As I understand > things an _safe variant is used where we are going to remove the current > list element in the middle of a list walk. However the key feature of an > RCU data structure is that it will always be in a "safe" state until any > parallel readers have completed. For an hlist this means that the removed > entry and its forward link must remain valid for as long as there may be > a parallel reader traversing this list, ie. until the next grace period. > If this link is valid for the parallel reader, then it must be valid for > us, and if so it feels that hlist_for_each_entry_rcu should be sufficient > to cope in the face of entries being unlinked as we traverse the list. It does make sense, hlist_del_rcu() maintains the fwd reference, but it does unlink it from the list proper. As long as there is a write side exclusion around the actual removal as you noted. rcu_read_lock(); hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) { if (foo) { spin_lock(write_lock); hlist_del_rcu(tpos); spin_unlock(write_unlock); } } rcu_read_unlock(); is a safe construct in that the list itself stays a proper list, and even items that might be caught in the to-be-deleted entries will have a fwd way out. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code 2008-02-05 18:05 ` Andy Whitcroft 2008-02-05 18:17 ` Peter Zijlstra @ 2008-02-05 18:19 ` Christoph Lameter 1 sibling, 0 replies; 48+ messages in thread From: Christoph Lameter @ 2008-02-05 18:19 UTC (permalink / raw) To: Andy Whitcroft Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins On Tue, 5 Feb 2008, Andy Whitcroft wrote: > > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { > > + rcu_read_lock(); > > + hlist_for_each_entry_safe_rcu(mn, n, t, > > + &mm->mmu_notifier.head, hlist) { > > + if (mn->ops->release) > > + mn->ops->release(mn, mm); > > Does this ->release actually release the 'nm' and its associated hlist? > I see in this thread that this ordering is deemed "use after free" which > implies so. Right that was fixed in a later release and discussed extensively later. See V5. > I am not sure it makes sense to add a _safe_rcu variant. As I understand > things an _safe variant is used where we are going to remove the current It was dropped in V5. ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2008-02-18 22:33 UTC | newest] Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter 2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-30 15:37 ` Andrea Arcangeli 2008-01-30 15:53 ` Jack Steiner 2008-01-30 16:38 ` Andrea Arcangeli 2008-01-30 19:19 ` Christoph Lameter 2008-01-30 22:20 ` Robin Holt 2008-01-30 23:38 ` Andrea Arcangeli 2008-01-30 23:55 ` Christoph Lameter 2008-01-31 0:12 ` [kvm-devel] " Andrea Arcangeli 2008-01-31 1:27 ` Christoph Lameter 2008-01-30 17:10 ` Peter Zijlstra 2008-01-30 19:28 ` Christoph Lameter 2008-01-30 18:02 ` Robin Holt 2008-01-30 19:08 ` Christoph Lameter 2008-01-30 19:14 ` Christoph Lameter 2008-01-30 2:29 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter 2008-01-30 2:29 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter 2008-01-30 18:03 ` Robin Holt 2008-01-30 2:29 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter 2008-01-30 2:29 ` [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c Christoph Lameter 2008-01-30 2:29 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter -- strict thread matches above, loose matches on Subject: below -- 2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter 2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-02-16 3:37 ` Andrew Morton 2008-02-16 8:45 ` Avi Kivity 2008-02-16 8:56 ` Andrew Morton 2008-02-16 9:21 ` Avi Kivity 2008-02-16 10:41 ` Brice Goglin 2008-02-16 10:58 ` Andrew Morton 2008-02-16 19:31 ` Christoph Lameter 2008-02-16 19:21 ` Christoph Lameter 2008-02-17 3:01 ` Andrea Arcangeli 2008-02-17 12:24 ` Robin Holt 2008-02-17 5:04 ` Doug Maxey 2008-02-18 22:33 ` Roland Dreier 2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter 2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter 2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter 2008-01-28 22:06 ` Christoph Lameter 2008-01-29 0:05 ` Robin Holt 2008-01-29 1:19 ` Christoph Lameter 2008-01-29 13:59 ` Andrea Arcangeli 2008-01-29 14:34 ` Andrea Arcangeli 2008-01-29 19:49 ` Christoph Lameter 2008-01-29 20:41 ` Avi Kivity 2008-01-29 16:07 ` Robin Holt 2008-02-05 18:05 ` Andy Whitcroft 2008-02-05 18:17 ` Peter Zijlstra 2008-02-05 18:19 ` Christoph Lameter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).