LKML Archive on lore.kernel.org
* [patch 0/6] MMU Notifiers V7
@ 2008-02-15 6:48 Christoph Lameter
2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
` (6 more replies)
0 siblings, 7 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:48 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
that is not tracked by the kernel's rmaps (an external MMU). MMU
notifiers allow us to get rid of the page pinning for RDMA and various
other purposes. They get rid of the broken use of mlock for page pinning and
avoid having to pin pages by elevating the refcount.
(mlock really does *not* pin pages....)
More information on the rationale and the technical details can be found in
the first patch and the README provided by that patch in
Documentation/mmu_notifier.
The known immediate users are
KVM
- Establishes a refcount on the page via get_user_pages().
- External references are called sptes.
- Has page tables to track pages whose refcount was elevated but
no reverse maps.
GRU
- Simple additional hardware TLB (possibly covering multiple instances of
Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can
fall back to get_user_pages().
- No page reference possible since no page status is kept.
XPmem
- Allows use of a process's memory by remote instances of Linux.
- Provides its own reverse mappings to track remote ptes.
- Establishes refcounts on the exported pages.
- Must sleep in order to wait for remote acks of ptes that are being
cleared.
Andrea's mmu_notifier #4 -> RFC V1
- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
called.
- Develop a patch sequence that separates out the different types of
hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.
V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate if a spinlock
is held.
- Add invalidate_all()
V2->V3:
- Further RCU fixes
- Fixes from Andrea to fixup aging and move invalidate_range() in do_wp_page
and sys_remap_file_pages() after the pte clearing.
V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
we are the only executing thread. This is also true for invalidate_all() so
we could drop off the mmu_notifier there early. Use hlist_del_init instead
of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip cc list down a bit.
- Drop Peter's new RCU list macro
- Add description to the core patch
V4->V5:
- Provide missing callouts for mremap.
- Provide missing callouts for copy_page_range.
- Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out
structure contents.
- Get rid of the invalidate_all() callback by moving ->release in place
of invalidate_all.
- Require holding mmap_sem on register/unregister instead of acquiring it
ourselves. In some contexts where we want to register/unregister we are
already holding mmap_sem.
- Split out the rmap support patch so that there is no need to apply
all patches for KVM and GRU.
V5->V6:
- Provide missing range callouts for mprotect
- Fix do_wp_page control path sequencing
- Clarify locking conventions
- GRU and XPmem confirmed to work with this patchset.
- Provide skeleton code for GRU/KVM type callback and for XPmem type.
- Rework documentation and put it into Documentation/mmu_notifier.
V6->V7:
- Code our own page table traversal in the skeletons so that we can perform
the insertion of a remote pte under pte lock.
- Discuss page pinning by increasing page refcount
--
* [patch 1/6] mmu_notifier: Core code
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
@ 2008-02-15 6:49 ` Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-18 22:33 ` Roland Dreier
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
` (5 subsequent siblings)
6 siblings, 2 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 19064 bytes --]
MMU notifiers are used for hardware and software that establish
external references to pages managed by the Linux kernel. These are
page table entries or tlb entries or something else that allows
hardware (such as DMA engines, scatter gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.
The MMU notifier will notify the device driver that subscribes to such
a notifier that the VM is going to do something with the memory
mapped by that device. The device must then drop references for the
indicated memory area. The references may be reestablished later.
The notification scheme is much better than the current schemes for
avoiding the danger of the VM removing pages that are externally
mapped. We currently either mlock pages used for RDMA, XPmem etc.
in memory or increase the refcount to pin the pages. Increasing
the refcount makes it impossible for the VM to reclaim the page.
Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of what POSIX
specifies mlock should do. Mlock does *not* pin pages in
memory. Mlock just means the page must not be moved to swap.
Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).
The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.
The alternative of increasing the page count to pin pages is also not
that enticing since there will be continual attempts to reclaim
or migrate these pages.
The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning. The VM gains control of its memory again
and the memory that has external references can be managed like regular
memory.
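To illustrate how a subsystem would hook into this interface, here is a
minimal driver-side sketch (the example_* names and the empty callback body
are placeholders, not part of this patch; only the mmu_notifier API below
is real):
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void example_invalidate_page(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long address)
{
	/* Drop the external reference for this page; pte lock is held. */
}

static const struct mmu_notifier_ops example_ops = {
	.invalidate_page = example_invalidate_page,
	/* .release, .age_page, .invalidate_range_begin/_end as needed */
};

static struct mmu_notifier example_notifier = {
	.ops = &example_ops,
};

static void example_attach(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* registration requires mmap_sem */
	mmu_notifier_register(&example_notifier, mm);
	up_write(&mm->mmap_sem);
	synchronize_rcu();		/* make registration visible to all cpus */
}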
This patch: Core portion
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---
Documentation/mmu_notifier/README | 105 ++++++++++++++++++++++
include/linux/mm_types.h | 7 +
include/linux/mmu_notifier.h | 180 ++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2
mm/Kconfig | 4
mm/Makefile | 1
mm/mmap.c | 2
mm/mmu_notifier.c | 76 ++++++++++++++++
8 files changed, 377 insertions(+)
Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README 2008-02-14 22:27:19.000000000 -0800
@@ -0,0 +1,105 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establish
+external references to pages managed by the Linux kernel. These are
+page table entries or tlb entries or something else that allows
+hardware (such as DMA engines, scatter gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (Virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier will notify the device driver that subscribes to such
+a notifier that the VM is going to do something with the memory
+mapped by that device. The device must then drop references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current schemes for
+dealing with the danger of the VM removing pages.
+We currently mlock pages used for RDMA, XPmem etc in memory or
+increase the refcount of the pages.
+
+Both cause problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. Mlock is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+The page refcount is used to track current users of a page struct.
+Artificially inflating the refcount means that the VM cannot track
+down all references to a page. It will not be able to reclaim or
+move a page. However, the core code will try again and again because
+the assumption is that an elevated refcount is a temporary situation.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allows
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example of code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows the device driver to implement its own rmap
+and allows the device driver to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageExternalRmap(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional page flag is only available in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build an mmu_notifier mechanism with rmap capability
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+ Christoph Lameter <clameter@sgi.com>
+
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h 2008-02-14 21:17:51.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
#endif
};
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+ struct hlist_head head;
+#endif
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
#ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
#endif
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h 2008-02-14 22:42:28.000000000 -0800
@@ -0,0 +1,180 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * These are callbacks registered with an mm_struct. If pages are
+ * removed from an address space then callbacks are performed.
+ *
+ * Spinlocks must be held in order to walk reverse maps. The
+ * invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * The invalidate_range_start/end callbacks can be performed in contexts
+ * where sleeping is allowed or in atomic contexts. A flag is passed
+ * to indicate an atomic context.
+ *
+ * Pages must be marked dirty if dirty bits are found to be set in
+ * the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * The release notifier is called when no other execution threads
+ * are left. Synchronization is not necessary.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * age_page is called from contexts where the pte_lock is held
+ */
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_page is called from contexts where the pte_lock is held.
+ */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_range_begin() and invalidate_range_end() must be paired.
+ *
+ * Multiple invalidate_range_begin/ends may be nested or called
+ * concurrently. That is legit. However, no new external references
+ * may be established as long as any invalidate_xxx is running or
+ * any invalidate_range_begin() has not yet been completed through a
+ * corresponding call to invalidate_range_end().
+ *
+ * Locking within the notifier needs to serialize events correspondingly.
+ *
+ * invalidate_range_begin() must clear all references in the range
+ * and stop the establishment of new references.
+ *
+ * invalidate_range_end() reenables the establishment of references.
+ *
+ * atomic indicates that the function is called in an atomic context.
+ * We can sleep if atomic == 0.
+ *
+ * invalidate_range_begin() must remove all external references.
+ * There will be no retries as with invalidate_page().
+ */
+ void (*invalidate_range_begin)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int atomic);
+
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ }; \
+ } while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/Kconfig 2008-02-14 21:17:51.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/Makefile 2008-02-14 21:17:51.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c 2008-02-14 22:41:55.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *t;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ hlist_for_each_entry_safe(mn, n, t,
+ &mm->mmu_notifier.head, hlist) {
+ hlist_del_init(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/kernel/fork.c 2008-02-14 21:17:51.000000000 -0800
@@ -53,6 +53,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-02-14 22:42:02.000000000 -0800
@@ -26,6 +26,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
unsigned long end;
/* mm's last user has gone, and its about to be pulled down */
+ mmu_notifier_release(mm);
arch_exit_mmap(mm);
lru_add_drain();
--
* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-02-15 6:49 ` Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
` (2 more replies)
2008-02-15 6:49 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
` (4 subsequent siblings)
6 siblings, 3 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 11235 bytes --]
The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.
If invalidate_range_begin() is called with locks held then we
pass a flag into invalidate_range() to indicate that no sleeping is
possible. Locks are only held for truncate and huge pages.
In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).
do_wp_page(): We hold off new references while we update the pte.
xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.
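As a rough sketch of the begin/end pairing from the driver side (the
example_* names and the simple counter are illustrative only; the skeleton
in patch 4 implements the same idea under its own lock):
#include <linux/mmu_notifier.h>
#include <asm/atomic.h>

static atomic_t example_range_invalidates = ATOMIC_INIT(0);

static void example_range_begin(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start,
		unsigned long end, int atomic)
{
	atomic_inc(&example_range_invalidates);	/* hold off new references */
	/* Zap all external ptes in [start, end); may sleep only if !atomic. */
}

static void example_range_end(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start,
		unsigned long end, int atomic)
{
	atomic_dec(&example_range_invalidates);	/* references may be established again */
}

/*
 * The driver's fault/establishment path checks the counter and retries:
 *	if (atomic_read(&example_range_invalidates))
 *		return -EAGAIN;
 */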
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/filemap_xip.c | 5 +++++
mm/fremap.c | 3 +++
mm/hugetlb.c | 3 +++
mm/memory.c | 35 +++++++++++++++++++++++++++++------
mm/mmap.c | 2 ++
mm/mprotect.c | 3 +++
mm/mremap.c | 7 ++++++-
7 files changed, 51 insertions(+), 7 deletions(-)
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}
+ mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
err = populate_range(mm, vma, start, size, pgoff);
+ mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.000000000 -0800
@@ -51,6 +51,7 @@
#include <linux/init.h>
#include <linux/writeback.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
@@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
vma, addr, next))
return -ENOMEM;
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier(invalidate_range_end, src_mm,
+ vma->vm_start, end, 0);
+
return 0;
}
@@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
+ int atomic = details ? (details->i_mmap_lock != 0) : 0;
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+ mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
if (tlb)
tlb_finish_mmu(tlb, address, end);
+ mmu_notifier(invalidate_range_end, mm, address, end, atomic);
return end;
}
@@ -1339,7 +1351,7 @@ int remap_pfn_range(struct vm_area_struc
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + PAGE_ALIGN(size);
+ unsigned long start = addr, end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
int err;
@@ -1373,6 +1385,7 @@ int remap_pfn_range(struct vm_area_struc
pfn -= addr >> PAGE_SHIFT;
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
+ mmu_notifier(invalidate_range_begin, mm, start, end, 0);
do {
next = pgd_addr_end(addr, end);
err = remap_pud_range(mm, pgd, addr, next,
@@ -1380,6 +1393,7 @@ int remap_pfn_range(struct vm_area_struc
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range_end, mm, start, end, 0);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
@@ -1463,10 +1477,11 @@ int apply_to_page_range(struct mm_struct
{
pgd_t *pgd;
unsigned long next;
- unsigned long end = addr + size;
+ unsigned long start = addr, end = addr + size;
int err;
BUG_ON(addr >= end);
+ mmu_notifier(invalidate_range_begin, mm, start, end, 0);
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
@@ -1474,6 +1489,7 @@ int apply_to_page_range(struct mm_struct
if (err)
break;
} while (pgd++, addr = next, addr != end);
+ mmu_notifier(invalidate_range_end, mm, start, end, 0);
return err;
}
EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1614,8 +1630,10 @@ static int do_wp_page(struct mm_struct *
page_table = pte_offset_map_lock(mm, pmd, address,
&ptl);
page_cache_release(old_page);
- if (!pte_same(*page_table, orig_pte))
- goto unlock;
+ if (!pte_same(*page_table, orig_pte)) {
+ pte_unmap_unlock(page_table, ptl);
+ goto check_dirty;
+ }
page_mkwrite = 1;
}
@@ -1631,7 +1649,8 @@ static int do_wp_page(struct mm_struct *
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
- goto unlock;
+ pte_unmap_unlock(page_table, ptl);
+ goto check_dirty;
}
/*
@@ -1653,6 +1672,8 @@ gotten:
if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;
+ mmu_notifier(invalidate_range_begin, mm, address,
+ address + PAGE_SIZE, 0);
/*
* Re-check the pte - we dropped the lock
*/
@@ -1691,8 +1712,10 @@ gotten:
page_cache_release(new_page);
if (old_page)
page_cache_release(old_page);
-unlock:
pte_unmap_unlock(page_table, ptl);
+ mmu_notifier(invalidate_range_end, mm,
+ address, address + PAGE_SIZE, 0);
+check_dirty:
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-02-14 18:44:56.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-02-14 18:45:07.000000000 -0800
@@ -1748,11 +1748,13 @@ static void unmap_region(struct mm_struc
lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
+ mmu_notifier(invalidate_range_begin, mm, start, end, 0);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mmu_notifier(invalidate_range_end, mm, start, end, 0);
}
/*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/hugetlb.c 2008-02-14 18:45:07.000000000 -0800
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -755,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
BUG_ON(start & ~HPAGE_MASK);
BUG_ON(end & ~HPAGE_MASK);
+ mmu_notifier(invalidate_range_begin, mm, start, end, 1);
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
@@ -775,6 +777,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mmu_notifier(invalidate_range_end, mm, start, end, 1);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c 2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c 2008-02-14 18:45:07.000000000 -0800
@@ -13,6 +13,7 @@
#include <linux/module.h>
#include <linux/uio.h>
#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>
@@ -190,6 +191,8 @@ __xip_unmap (struct address_space * mapp
address = vma->vm_start +
((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+ mmu_notifier(invalidate_range_begin, mm, address,
+ address + PAGE_SIZE, 1);
pte = page_check_address(page, mm, address, &ptl);
if (pte) {
/* Nuke the page table entry. */
@@ -201,6 +204,8 @@ __xip_unmap (struct address_space * mapp
pte_unmap_unlock(pte, ptl);
page_cache_release(page);
}
+ mmu_notifier(invalidate_range_end, mm,
+ address, address + PAGE_SIZE, 1);
}
spin_unlock(&mapping->i_mmap_lock);
}
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c 2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/mremap.c 2008-02-14 18:45:07.000000000 -0800
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -124,12 +125,15 @@ unsigned long move_page_tables(struct vm
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len)
{
- unsigned long extent, next, old_end;
+ unsigned long extent, next, old_start, old_end;
pmd_t *old_pmd, *new_pmd;
+ old_start = old_addr;
old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);
+ mmu_notifier(invalidate_range_begin, vma->vm_mm,
+ old_addr, old_end, 0);
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
cond_resched();
next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -150,6 +154,7 @@ unsigned long move_page_tables(struct vm
move_ptes(vma, old_pmd, old_addr, old_addr + extent,
new_vma, new_pmd, new_addr);
}
+ mmu_notifier(invalidate_range_end, vma->vm_mm, old_start, old_end, 0);
return len + old_addr - old_end; /* how much done */
}
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2008-02-14 18:43:31.000000000 -0800
+++ linux-2.6/mm/mprotect.c 2008-02-14 18:45:07.000000000 -0800
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -198,10 +199,12 @@ success:
dirty_accountable = 1;
}
+ mmu_notifier(invalidate_range_begin, mm, start, end, 0);
if (is_vm_hugetlb_page(vma))
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+ mmu_notifier(invalidate_range_end, mm, start, end, 0);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
--
* [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-15 6:49 ` Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-15 6:49 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
` (3 subsequent siblings)
6 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_invalidate_page --]
[-- Type: text/plain, Size: 3026 bytes --]
Two callbacks to remove individual pages as done in the rmap code:
invalidate_page()
Called from the inner loop of rmap walks to invalidate pages.
age_page()
Called for the determination of the page referenced status.
If we do not care about page referenced status then an age_page callback
may be omitted. PageLock and pte lock are held when either of the
functions is called.
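For illustration, a driver whose hardware has no referenced bit might
implement age_page roughly as follows (the example_* helpers are hypothetical
driver internals, not part of this patch):
#include <linux/mmu_notifier.h>

/* Hypothetical driver internals; a real driver must provide these. */
static int example_external_pte_present(struct mm_struct *mm,
					unsigned long address)
{
	return 0;
}

static void example_zap_external_pte(struct mm_struct *mm,
				     unsigned long address)
{
}

/*
 * Sketch of an age_page callback for hardware without a referenced bit:
 * unmap the external pte and report whether one existed.
 */
static int example_age_page(struct mmu_notifier *mn,
			    struct mm_struct *mm, unsigned long address)
{
	int referenced = example_external_pte_present(mm, address);

	if (referenced)
		example_zap_external_pte(mm, address);
	return referenced;
}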
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/rmap.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-02-07 16:49:32.000000000 -0800
+++ linux-2.6/mm/rmap.c 2008-02-07 17:25:25.000000000 -0800
@@ -49,6 +49,7 @@
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>
@@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
if (vma->vm_flags & VM_LOCKED) {
referenced++;
*mapcount = 1; /* break early from loop */
- } else if (ptep_clear_flush_young(vma, address, pte))
+ } else if (ptep_clear_flush_young(vma, address, pte) |
+ mmu_notifier_age_page(mm, address))
referenced++;
/* Pretend the page is referenced if the task has the
@@ -455,6 +457,7 @@ static int page_mkclean_one(struct page
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
+ mmu_notifier(invalidate_page, mm, address);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -712,7 +715,8 @@ static int try_to_unmap_one(struct page
* skipped over this mm) then we should reactivate it.
*/
if (!migration && ((vma->vm_flags & VM_LOCKED) ||
- (ptep_clear_flush_young(vma, address, pte)))) {
+ (ptep_clear_flush_young(vma, address, pte) |
+ mmu_notifier_age_page(mm, address)))) {
ret = SWAP_FAIL;
goto out_unmap;
}
@@ -720,6 +724,7 @@ static int try_to_unmap_one(struct page
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
+ mmu_notifier(invalidate_page, mm, address);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -844,12 +849,14 @@ static void try_to_unmap_cluster(unsigne
page = vm_normal_page(vma, address, *pte);
BUG_ON(!page || PageAnon(page));
- if (ptep_clear_flush_young(vma, address, pte))
+ if (ptep_clear_flush_young(vma, address, pte) |
+ mmu_notifier_age_page(mm, address))
continue;
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+ mmu_notifier(invalidate_page, mm, address);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
--
* [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
` (2 preceding siblings ...)
2008-02-15 6:49 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
@ 2008-02-15 6:49 ` Christoph Lameter
2008-02-15 6:49 ` [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem) Christoph Lameter
` (2 subsequent siblings)
6 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_skeleton --]
[-- Type: text/plain, Size: 7685 bytes --]
This is example code for a simple device driver interface to unmap
pages that were externally mapped.
Locking is simple through a single lock that is used to protect the
device drivers data structures as well as a counter that tracks the
active invalidates on a single address space.
The invalidation of external ptes must be possible with code that does
not require sleeping. The lock is taken for all driver operations on
the mmu that the driver manages. Locking could be made more sophisticated
but I think this is going to be okay for most uses.
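For illustration, a caller of the skeleton below would drive the populate
function with a retry loop like this (example_map_one() is a hypothetical
wrapper; my_mmu_populate_page() and struct my_mmu are from the skeleton,
and the my_mmu instance is assumed to come from wherever the driver stored
it at attach time):
/* Illustrative caller only: retry population until it does not race
 * with a range invalidate. */
static struct page *example_map_one(struct my_mmu *m,
		struct vm_area_struct *vma, unsigned long address)
{
	struct page *page;

	do {
		page = my_mmu_populate_page(m, vma, address, 0, 1);
	} while (IS_ERR(page) && PTR_ERR(page) == -EAGAIN);

	return page;
}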
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
Documentation/mmu_notifier/skeleton.c | 267 ++++++++++++++++++++++++++++++++++
1 file changed, 267 insertions(+)
Index: linux-2.6/Documentation/mmu_notifier/skeleton.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/skeleton.c 2008-02-14 22:23:18.000000000 -0800
@@ -0,0 +1,267 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+
+/*
+ * Skeleton for an mmu notifier without rmap callbacks and no need to sleep
+ * during invalidate_page().
+ *
+ * (C) 2008 Silicon Graphics, Inc.
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * Note that the locking is fairly basic. One can add various optimizations
+ * here and there. There is a single lock for an address space which should be
+ * satisfactory for most cases. If not then the lock can be split like the
+ * pte_lock in Linux. It is most likely best to place the locks in the
+ * page table structure or into whatever the external mmu uses to
+ * track the mappings.
+ */
+
+struct my_mmu {
+ /* MMU notifier specific fields */
+ struct mmu_notifier notifier;
+ spinlock_t lock; /* Protects counter and individual zaps */
+ int invalidates; /* Number of active range_invalidates */
+};
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_insert_page(struct my_mmu *m,
+ unsigned long address, unsigned long pfn)
+{
+ /* Must be provided */
+ printk(KERN_INFO "insert page %p address=%lx pfn=%ld\n",
+ m, address, pfn);
+}
+
+/*
+ * Called with m->lock held (optional but usually required to
+ * protect data structures of the driver).
+ */
+static void my_mmu_zap_page(struct my_mmu *m, unsigned long address)
+{
+ /* Must be provided */
+ printk(KERN_INFO "zap page %p address=%lx\n", m, address);
+}
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_zap_range(struct my_mmu *m,
+ unsigned long start, unsigned long end, int atomic)
+{
+ /* Must be provided */
+ printk(KERN_INFO "zap range %p address=%lx-%lx atomic=%d\n",
+ m, start, end, atomic);
+}
+
+/*
+ * Zap an individual page.
+ *
+ * Serialization with establishment of a new external pte occurs
+ * through the pte lock. The m->lock is taken to serialize access
+ * to the driver private data. If the driver does not need this
+ * serialization then the lock can be omitted.
+ */
+static void my_mmu_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long address)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ spin_lock(&m->lock);
+ my_mmu_zap_page(m, address);
+ spin_unlock(&m->lock);
+}
+
+/*
+ * Increment and decrement of the number of range invalidates
+ */
+static inline void inc_active(struct my_mmu *m)
+{
+ spin_lock(&m->lock);
+ m->invalidates++;
+ spin_unlock(&m->lock);
+}
+
+static inline void dec_active(struct my_mmu *m)
+{
+ spin_lock(&m->lock);
+ m->invalidates--;
+ spin_unlock(&m->lock);
+}
+
+static void my_mmu_invalidate_range_begin(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ int atomic)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ inc_active(m); /* Holds off new references */
+ my_mmu_zap_range(m, start, end, atomic);
+}
+
+static void my_mmu_invalidate_range_end(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ int atomic)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ dec_active(m); /* Enables new references */
+}
+
+/*
+ * Populate a page.
+ *
+ * A return value of -EAGAIN means please retry this operation.
+ *
+ * Acquisition of mmap_sem can be omitted if the caller already holds
+ * the semaphore.
+ */
+struct page *my_mmu_populate_page(struct my_mmu *m,
+ struct vm_area_struct *vma,
+ unsigned long address, int atomic, int write)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page = ERR_PTR(-EAGAIN);
+ int err;
+ int done = 0;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;
+ spinlock_t *ptl;
+
+ /* No need to do anything if a range invalidate is running */
+ if (m->invalidates)
+ return ERR_PTR(-EAGAIN);
+
+ if (atomic) {
+ if (!down_read_trylock(&mm->mmap_sem))
+ return ERR_PTR(-EAGAIN);
+ } else
+ down_read(&mm->mmap_sem);
+
+ do {
+ page = ERR_PTR(-EAGAIN);
+
+ if (m->invalidates)
+ break;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ goto check;
+
+ pud = pud_offset(pgd, address);
+ if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+ goto check;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+ goto check;
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ goto check;
+
+ pte = *ptep;
+ if (!pte_present(pte))
+ goto pte_unlock;
+ if (write && !pte_write(pte))
+ goto pte_unlock;
+
+ page = vm_normal_page(vma, address, pte);
+ if (page) {
+ done = 1;
+ /*
+ * The m->lock is held to ensure that the count of
+ * current invalidates stays constant.
+ * invalidate_page() is held off by the pte lock.
+ */
+ spin_lock(&m->lock);
+
+ if (!m->invalidates)
+ my_mmu_insert_page(m, address, page_to_pfn(page));
+ else
+ page = ERR_PTR(-EAGAIN);
+
+ spin_unlock(&m->lock);
+ }
+pte_unlock:
+ pte_unmap_unlock(ptep, ptl);
+check:
+
+ if (done || atomic)
+ break;
+
+ /*
+ * Need to run the page fault handler to get the pte entry
+ * setup right.
+ */
+ err = get_user_pages(current, vma->vm_mm, address, 1,
+ write, 1, NULL, NULL);
+
+ if (err < 0) {
+ page = ERR_PTR(err);
+ break;
+ }
+
+ } while (!done);
+
+ up_read(&vma->vm_mm->mmap_sem);
+ return page;
+}
+
+/*
+ * All other threads accessing this mm_struct must have terminated by now.
+ */
+static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ my_mmu_zap_range(m, 0, TASK_SIZE, 0);
+ kfree(m);
+ printk(KERN_INFO "MMU Notifier detaching\n");
+}
+
+static struct mmu_notifier_ops my_mmu_ops = {
+ my_mmu_release,
+ NULL, /* No aging function */
+ my_mmu_invalidate_page,
+ my_mmu_invalidate_range_begin,
+ my_mmu_invalidate_range_end
+};
+
+/*
+ * This function must be called to activate callbacks from a process
+ */
+int my_mmu_attach_to_process(struct mm_struct *mm)
+{
+ struct my_mmu *m = kzalloc(sizeof(struct my_mmu), GFP_KERNEL);
+
+ if (!m)
+ return -ENOMEM;
+
+ m->notifier.ops = &my_mmu_ops;
+ spin_lock_init(&m->lock);
+
+ /*
+ * mmap_sem handling can be omitted if it is guaranteed that
+ * the context from which my_mmu_attach_to_process is called
+ * is already holding a writelock on mmap_sem.
+ */
+ down_write(&mm->mmap_sem);
+ mmu_notifier_register(&m->notifier, mm);
+ up_write(&mm->mmap_sem);
+
+ /*
+ * RCU sync is expensive but necessary if we need to guarantee
+ * that multiple threads running on other cpus have seen the
+ * notifier changes.
+ */
+ synchronize_rcu();
+ return 0;
+}
+
--
* [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
` (3 preceding siblings ...)
2008-02-15 6:49 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
@ 2008-02-15 6:49 ` Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-19 23:55 ` Nick Piggin
2008-02-15 6:49 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
2008-02-16 10:48 ` [PATCH] KVM swapping with MMU Notifiers V7 Andrea Arcangeli
6 siblings, 2 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_rmap_support --]
[-- Type: text/plain, Size: 8290 bytes --]
These special additional callbacks are required because XPmem (and likely
other mechanisms) do use their own rmap (multiple processes on a series
of remote Linux instances may be accessing the memory of a process).
F.e. XPmem may have to send out notifications to remote Linux instances
and receive confirmation before a page can be freed.
So we handle this like an additional Linux reverse map that is walked after
the existing rmaps have been walked. We leave the walking to the driver, which
is then able to use something other than a spinlock to walk its reverse
maps. So we can actually call the driver without holding spinlocks while
we hold the page lock.
However, we cannot determine the mm_struct that a page belongs to at
that point. The mm_struct can only be determined from the rmaps by the
device driver.
We add another pageflag (PageExternalRmap) that is set if a page has
been remotely mapped (f.e. by a process from another Linux instance).
We can then only perform the callbacks for pages that are actually in
remote use.
Rmap notifiers need an extra page bit and are only available
on 64 bit platforms. This functionality is not available on 32 bit!
A notifier that uses the reverse maps callbacks does not need to provide
the invalidate_page() method that is called when locks are held.
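To illustrate, a minimal sketch of how a driver could use the rmap notifier
interface added here (the example_* names and the empty callback body are
placeholders, not part of this patch):
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/pagemap.h>

static void example_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
					 struct page *page)
{
	/*
	 * Walk the driver's own reverse map and remove all remote ptes
	 * that refer to this page. Only the page lock is held, so the
	 * driver may sleep, f.e. to wait for remote acknowledgements.
	 */
}

static const struct mmu_rmap_notifier_ops example_rmap_ops = {
	.invalidate_page = example_rmap_invalidate_page,
};

static struct mmu_rmap_notifier example_rmap_notifier = {
	.ops = &example_rmap_ops,
};

static void example_export_page(struct page *page)
{
	lock_page(page);
	mmu_rmap_export_page(page);	/* sets PageExternalRmap under the page lock */
	unlock_page(page);
	/* ... then add the page to the driver's own reverse map ... */
}

static int example_init(void)
{
	mmu_rmap_notifier_register(&example_rmap_notifier);
	return 0;
}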
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mmu_notifier.h | 65 +++++++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 11 +++++++
mm/mmu_notifier.c | 34 ++++++++++++++++++++++
mm/rmap.c | 9 +++++
4 files changed, 119 insertions(+)
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-02-14 20:58:17.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h 2008-02-14 21:21:04.000000000 -0800
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_external_rmap 30 /* Page has external rmap */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif
@@ -296,6 +297,16 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+ &(page)->flags)
+#else
+#define ClearPageExternalRmap(page) do {} while (0)
+#define PageExternalRmap(page) 0
+#endif
+
struct page; /* forward declaration */
extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-02-14 21:20:55.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h 2008-02-14 21:21:04.000000000 -0800
@@ -23,6 +23,18 @@
* where sleeping is allowed or in atomic contexts. A flag is passed
* to indicate an atomic context.
*
+ *
+ * 2. mmu_rmap_notifier
+ *
+ * Callbacks for subsystems that provide their own rmaps. These
+ * need to walk their own rmaps for a page. The invalidate_page
+ * callback is outside of locks so that we are not in a strictly
+ * atomic context (but we may be in a PF_MEMALLOC context if the
+ * notifier is called from reclaim code) and are able to sleep.
+ *
+ * Rmap notifiers need an extra page bit and are only available
+ * on 64 bit platforms.
+ *
* Pages must be marked dirty if dirty bits are found to be set in
* the external ptes.
*/
@@ -96,6 +108,23 @@ struct mmu_notifier_ops {
int atomic);
};
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+ struct hlist_node hlist;
+ const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+ /*
+ * Called with the page lock held after ptes are modified or removed
+ * so that a subsystem with its own rmap's can remove remote ptes
+ * mapping a page.
+ */
+ void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+ struct page *page);
+};
+
#ifdef CONFIG_MMU_NOTIFIER
/*
@@ -146,6 +175,27 @@ static inline void mmu_notifier_head_ini
} \
} while (0)
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+/* Must hold PageLock */
+extern void mmu_rmap_export_page(struct page *page);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ struct mmu_rmap_notifier *__mrn; \
+ struct hlist_node *__n; \
+ \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mrn, __n, \
+ &mmu_rmap_notifier_list, hlist) \
+ if (__mrn->ops->function) \
+ __mrn->ops->function(__mrn, args); \
+ rcu_read_unlock(); \
+ } while (0);
+
#else /* CONFIG_MMU_NOTIFIER */
/*
@@ -164,6 +214,16 @@ static inline void mmu_notifier_head_ini
}; \
} while (0)
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ if (0) { \
+ struct mmu_rmap_notifier *__mrn; \
+ \
+ __mrn = (struct mmu_rmap_notifier *)(0x00ff); \
+ __mrn->ops->function(__mrn, args); \
+ } \
+ } while (0);
+
static inline void mmu_notifier_register(struct mmu_notifier *mn,
struct mm_struct *mm) {}
static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
@@ -177,6 +237,11 @@ static inline int mmu_notifier_age_page(
static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+ {}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+ {}
+
#endif /* CONFIG_MMU_NOTIFIER */
#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-02-14 21:17:51.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-02-14 21:21:04.000000000 -0800
@@ -74,3 +74,37 @@ void mmu_notifier_unregister(struct mmu_
}
EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+#ifdef CONFIG_64BIT
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_del_rcu(&mrn->hlist);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
+
+#endif
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-02-14 21:21:00.000000000 -0800
+++ linux-2.6/mm/rmap.c 2008-02-14 21:21:04.000000000 -0800
@@ -497,6 +497,10 @@ int page_mkclean(struct page *page)
struct address_space *mapping = page_mapping(page);
if (mapping) {
ret = page_mkclean_file(mapping, page);
+ if (unlikely(PageExternalRmap(page))) {
+ mmu_rmap_notifier(invalidate_page, page);
+ ClearPageExternalRmap(page);
+ }
if (page_test_dirty(page)) {
page_clear_dirty(page);
ret = 1;
@@ -1013,6 +1017,11 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);
+ if (unlikely(PageExternalRmap(page))) {
+ mmu_rmap_notifier(invalidate_page, page);
+ ClearPageExternalRmap(page);
+ }
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
--
* [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
` (4 preceding siblings ...)
2008-02-15 6:49 ` [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem) Christoph Lameter
@ 2008-02-15 6:49 ` Christoph Lameter
2008-02-16 10:48 ` [PATCH] KVM swapping with MMU Notifiers V7 Andrea Arcangeli
6 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-15 6:49 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_rmap_skeleton --]
[-- Type: text/plain, Size: 8619 bytes --]
The skeleton for the rmap notifier leaves the invalidate_page method of
the mmu_notifier empty and hooks a new invalidate_page callback into the
global chain for mmu_rmap_notifiers.
There are several simplifications in here to avoid making this too complex.
The reverse maps need to consist of references to vmas, for example.
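As a rough illustration of how a driver built on this skeleton would be wired
up (the two entry points below are hypothetical glue; my_mmu_attach_to_process()
comes from the skeleton and mmu_rmap_export_page() from patch 5/6):

/* Hypothetical driver glue, not part of the skeleton itself. */
static int my_driver_enable_notifiers(void)
{
	/* Register the per-mm notifier for the calling process. */
	return my_mmu_attach_to_process(current->mm);
}

static void my_driver_export_page(struct page *page)
{
	/* mmu_rmap_export_page() requires the page lock (see patch 5/6). */
	lock_page(page);
	mmu_rmap_export_page(page);	/* sets PageExternalRmap */
	unlock_page(page);
}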
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
Documentation/mmu_notifier/skeleton_rmap.c | 311 +++++++++++++++++++++++++++++
1 file changed, 311 insertions(+)
Index: linux-2.6/Documentation/mmu_notifier/skeleton_rmap.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/skeleton_rmap.c 2008-02-14 22:23:01.000000000 -0800
@@ -0,0 +1,311 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+
+/*
+ * Skeleton for an mmu notifier with rmap callbacks and sleeping during
+ * invalidate_page.
+ *
+ * (C) 2008 Silicon Graphics, Inc.
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * Note that the locking is fairly basic. One can add various optimizations
+ * here and there. There is a single lock for an address space which should be
+ * satisfactory for most cases. If not then the lock can be split like the
+ * pte_lock in Linux. It is most likely best to place the locks in the
+ * page table structure or into whatever the external mmu uses to
+ * track the mappings.
+ */
+
+struct my_mmu {
+ /* MMU notifier specific fields */
+ struct mmu_notifier notifier;
+ spinlock_t lock; /* Protects counter and individual zaps */
+ int invalidates; /* Number of active range_invalidate */
+
+ /* Rmap support */
+ struct list_head list; /* rmap list of my_mmu structs */
+ unsigned long base;
+};
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_insert_page(struct my_mmu *m,
+ unsigned long address, unsigned long pfn)
+{
+ /* Must be provided */
+ printk(KERN_INFO "insert page %p address=%lx pfn=%ld\n",
+ m, address, pfn);
+}
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_zap_range(struct my_mmu *m,
+ unsigned long start, unsigned long end, int atomic)
+{
+ /* Must be provided */
+ printk(KERN_INFO "zap range %p address=%lx-%lx atomic=%d\n",
+ m, start, end, atomic);
+}
+
+/*
+ * Called with m->lock held (optional but usually required to
+ * protect data structures of the driver).
+ */
+static void my_mmu_zap_page(struct my_mmu *m, unsigned long address)
+{
+ /* Must be provided */
+ printk(KERN_INFO "zap page %p address=%lx\n", m, address);
+}
+
+/*
+ * Increment and decrement of the number of range invalidates
+ */
+static inline void inc_active(struct my_mmu *m)
+{
+ spin_lock(&m->lock);
+ m->invalidates++;
+ spin_unlock(&m->lock);
+}
+
+static inline void dec_active(struct my_mmu *m)
+{
+ spin_lock(&m->lock);
+ m->invalidates--;
+ spin_unlock(&m->lock);
+}
+
+static void my_mmu_invalidate_range_begin(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ int atomic)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ inc_active(m); /* Holds off new references */
+ my_mmu_zap_range(m, start, end, atomic);
+}
+
+static void my_mmu_invalidate_range_end(struct mmu_notifier *mn,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ int atomic)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ dec_active(m); /* Enables new references */
+}
+
+/*
+ * Populate a page.
+ *
+ * A return value of -EAGAIN means please retry this operation.
+ *
+ * Acquisition of mmap_sem can be omitted if the caller already holds
+ * the semaphore.
+ */
+struct page *my_mmu_populate_page(struct my_mmu *m,
+ struct vm_area_struct *vma,
+ unsigned long address, int write)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page;
+ int err;
+ int done = 0;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;
+ spinlock_t *ptl;
+
+ /* No need to do anything if a range invalidate is running */
+ if (m->invalidates)
+ return ERR_PTR(-EAGAIN);
+
+ down_read(&mm->mmap_sem);
+ do {
+ page = ERR_PTR(-EAGAIN);
+
+ if (m->invalidates)
+ break;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ goto check;
+
+ pud = pud_offset(pgd, address);
+ if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+ goto check;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+ goto check;
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ goto check;
+
+ pte = *ptep;
+ if (!pte_present(pte))
+ goto pte_unlock;
+ if (write && !pte_write(pte))
+ goto pte_unlock;
+
+ page = vm_normal_page(vma, address, pte);
+ if (page) {
+ done = 1;
+ /*
+ * The m->lock is held to ensure that the count of
+ * current invalidates stays constant.
+ * invalidate_page() is held off by the pte lock.
+ */
+ spin_lock(&m->lock);
+
+ if (!m->invalidates)
+ my_mmu_insert_page(m, address, page_to_pfn(page));
+ else
+ page = ERR_PTR(-EAGAIN);
+
+ spin_unlock(&m->lock);
+ }
+pte_unlock:
+ pte_unmap_unlock(ptep, ptl);
+check:
+
+ if (done)
+ break;
+
+ /*
+ * Need to run the page fault handler to get the pte entry
+ * setup right.
+ */
+ err = get_user_pages(current, vma->vm_mm, address, 1,
+ write, 1, NULL, NULL);
+
+ if (err < 0) {
+ page = ERR_PTR(err);
+ break;
+ }
+
+ } while (!done);
+
+ up_read(&vma->vm_mm->mmap_sem);
+ return page;
+}
+
+/*
+ * All other threads accessing this mm_struct must have terminated by now.
+ */
+static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+ my_mmu_zap_range(m, 0, TASK_SIZE, 0);
+ /* No concurrent processes thus no worries about RCU */
+ list_del(&m->list);
+ kfree(m);
+ printk(KERN_INFO "MMU Notifier terminating\n");
+}
+
+static struct mmu_notifier_ops my_mmu_ops = {
+ .release = my_mmu_release,
+ /* .age_page: no aging function */
+ /* .invalidate_page: no atomic invalidate_page function */
+ .invalidate_range_begin = my_mmu_invalidate_range_begin,
+ .invalidate_range_end = my_mmu_invalidate_range_end
+};
+
+/* Rmap specific fields */
+static LIST_HEAD(my_mmu_list);
+static DECLARE_RWSEM(listlock); /* protects my_mmu_list */
+
+/*
+ * This function must be called to activate callbacks from a process
+ */
+int my_mmu_attach_to_process(struct mm_struct *mm)
+{
+ struct my_mmu *m = kzalloc(sizeof(struct my_mmu), GFP_KERNEL);
+
+ if (!m)
+ return -ENOMEM;
+
+ m->notifier.ops = &my_mmu_ops;
+ spin_lock_init(&m->lock);
+
+ /*
+ * mmap_sem handling can be omitted if it is guaranteed that
+ * the context from which my_mmu_attach_to_process is called
+ * is already holding a writelock on mmap_sem.
+ */
+ down_write(&mm->mmap_sem);
+ mmu_notifier_register(&m->notifier, mm);
+ up_write(&mm->mmap_sem);
+ down_write(&listlock);
+ list_add(&m->list, &my_mmu_list);
+ up_write(&listlock);
+
+ /*
+ * RCU sync is expensive but necessary if we need to guarantee
+ * that multiple threads running on other cpus have seen the
+ * notifier changes.
+ */
+ synchronize_rcu();
+ return 0;
+}
+
+
+static void my_sleeping_invalidate_page(struct my_mmu *m, unsigned long address)
+{
+ /* Must be provided */
+ spin_lock(&m->lock); /* Only taken to ensure mmu data integrity */
+ my_mmu_zap_page(m, address);
+ spin_unlock(&m->lock);
+ printk(KERN_INFO "Sleeping invalidate_page %p address=%lx\n",
+ m, address);
+}
+
+static unsigned long my_mmu_find_addr(struct my_mmu *m, struct page *page)
+{
+ /* Determine the address of a page in a mmu segment */
+ return -EFAULT;
+}
+
+/*
+ * A reference must be held on the page passed and the page passed
+ * must be locked. No spinlocks are held. invalidate_page() is held
+ * off by us holding the page lock.
+ */
+static void my_mmu_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
+ struct page *page)
+{
+ struct my_mmu *m;
+
+ BUG_ON(!PageLocked(page));
+ down_read(&listlock);
+ list_for_each_entry(m, &my_mmu_list, list) {
+ unsigned long address = my_mmu_find_addr(m, page);
+
+ if (address != -EFAULT)
+ my_sleeping_invalidate_page(m, address);
+ }
+ up_read(&listlock);
+}
+
+static struct mmu_rmap_notifier_ops my_mmu_rmap_ops = {
+ .invalidate_page = my_mmu_rmap_invalidate_page
+};
+
+static struct mmu_rmap_notifier my_mmu_rmap_notifier = {
+ .ops = &my_mmu_rmap_ops
+};
+
+static int __init my_mmu_init(void)
+{
+ mmu_rmap_notifier_register(&my_mmu_rmap_notifier);
+ return 0;
+}
+
+late_initcall(my_mmu_init);
+
--
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-02-16 3:37 ` Andrew Morton
2008-02-16 8:45 ` Avi Kivity
` (3 more replies)
2008-02-18 22:33 ` Roland Dreier
1 sibling, 4 replies; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 3:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 14 Feb 2008 22:49:00 -0800 Christoph Lameter <clameter@sgi.com> wrote:
> MMU notifiers are used for hardware and software that establishes
> external references to pages managed by the Linux kernel. These are
> page table entriews or tlb entries or something else that allows
> hardware (such as DMA engines, scatter gather devices, networking,
> sharing of address spaces across operating system boundaries) and
> software (Virtualization solutions such as KVM, Xen etc) to
> access memory managed by the Linux kernel.
>
> The MMU notifier will notify the device driver that subscribes to such
> a notifier that the VM is going to do something with the memory
> mapped by that device. The device must then drop references for the
> indicated memory area. The references may be reestablished later.
>
> The notification scheme is much better than the current schemes of
> avoiding the danger of the VM removing pages that are externally
> mapped. We currently either mlock pages used for RDMA, XPmem etc
> in memory or increase the refcount to pin the pages. Increasing
> the refcount makes it impossible for the VM to reclaim the page.
>
> Mlock causes problems with reclaim and may lead to OOM if too many
> pages are pinned in memory. It is also incorrect in terms what the POSIX
> specificies for what role mlock should play. Mlock does *not* pin pages in
> memory. Mlock just means do not allow the page to be moved to swap.
>
> Linux can move pages in memory (for example through the page migration
> mechanism). These pages can be moved even if they are mlocked(!!!!).
> The current approach of page pinning in use by RDMA etc is conceptually
> broken but there are currently no other easy solutions.
>
> The alternate of increasing the page count to pin pages is also not
> that enticing since there will be continual attempts to reclaim
> or migrate these pages.
>
> The solution here allows us to finally fix this issue by requiring
> such devices to subscribe to a notification chain that will allow
> them to work without pinning. The VM gains control of its memory again
> and the memory that has external references can be managed like regular
> memory.
>
> This patch: Core portion
>
What is the status of getting infiniband to use this facility?
How important is this feature to KVM?
To xpmem?
Which other potential clients have been identified and how important is it
to those?
> Index: linux-2.6/Documentation/mmu_notifier/README
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/mmu_notifier/README 2008-02-14 22:27:19.000000000 -0800
> @@ -0,0 +1,105 @@
> +Linux MMU Notifiers
> +-------------------
> +
> +MMU notifiers are used for hardware and software that establishes
> +external references to pages managed by the Linux kernel. These are
> +page table entriews or tlb entries or something else that allows
> +hardware (such as DMA engines, scatter gather devices, networking,
> +sharing of address spaces across operating system boundaries) and
> +software (Virtualization solutions such as KVM, Xen etc) to
> +access memory managed by the Linux kernel.
> +
> +The MMU notifier will notify the device driver that subscribes to such
> +a notifier that the VM is going to do something with the memory
> +mapped by that device. The device must then drop references for the
> +indicated memory area. The references may be reestablished later.
> +
> +The notification scheme is much better than the current schemes of
> +dealing with the danger of the VM removing pages.
> +We currently mlock pages used for RDMA, XPmem etc in memory or
> +increase the refcount of the pages.
> +
> +Both cause problems with reclaim and may lead to OOM if too many
> +pages are pinned in memory. Mlock is also incorrect in terms of the POSIX
> +specification of the role of mlock. Mlock does *not* pin pages in
> +memory. It just does not allow the page to be moved to swap.
> +The page refcount is used to track current users of a page struct.
> +Artificially inflating the refcount means that the VM cannot track
> +down all references to a page. It will not be able to reclaim or
> +move a page. However, the core code will try again and again because
> +the assumption is that an elevated refcount is a temporary situation.
> +
> +Linux can move pages in memory (for example through the page migration
> +mechanism). These pages can be moved even if they are mlocked(!!!!).
> +So the current approach in use by RDMA etc etc is conceptually broken
> +but there are currently no other easy solutions.
> +
> +The solution here allows us to finally fix this issue by requiring
> +such devices to subscribe to a notification chain that will allow
> +them to work without pinning.
> +
> +The notifier chains provide two callback mechanisms. The
> +first one is required for any device that establishes external mappings.
> +The second (rmap) mechanism is required if a device needs to be
> +able to sleep when invalidating references. Sleeping may be necessary
> +if we are mapping across a network or to different Linux instances
> +in the same address space.
I'd have thought that a major reason for sleeping would be to wait for IO
to complete. Worth mentioning here?
> +mmu_notifier mechanism (for KVM/GRU etc)
> +----------------------------------------
> +Callbacks are registered with an mm_struct from a device driver using
> +mmu_notifier_register(). When the VM removes pages (or changes
> +permissions on pages etc) then callbacks are triggered.
> +
> +The invalidation function for a single page (*invalidate_page)
We already have an invalidatepage. Ho hum.
> +is called with spinlocks (in particular the pte lock) held. This allow
> +for an easy implementation of external ptes that are on the local system.
>
Why is that "easy"? I'd have thought that it would only be easy if the
driver happened to be using those same locks for its own purposes.
Otherwise it is "awkward"?
> +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> +called most of the time without any locks held. It is only called with
> +locks held for file backed mappings that are truncated. A flag indicates
> +in which mode we are. A driver can use that mechanism to f.e.
> +delay the freeing of the pages during truncate until no locks are held.
That sucks big time. What do we need to do to get the callback
functions called in non-atomic context?
> +Pages must be marked dirty if dirty bits are found to be set in
> +the external ptes during unmap.
That sentence is too vague. Define "marked dirty"?
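(For concreteness, one plausible reading, sketched as driver-side code; the
my_ext_pte_* helpers are hypothetical, set_page_dirty() is the existing kernel
interface:)

static void my_drop_external_pte(struct page *page, struct my_ext_pte *xpte)
{
	/*
	 * Hand the hardware dirty state to the VM so the data is written
	 * back before the page is reclaimed.
	 */
	if (my_ext_pte_dirty(xpte))	/* driver-specific dirty test (assumed) */
		set_page_dirty(page);
	my_ext_pte_clear(xpte);		/* driver-specific teardown (assumed) */
}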
> +The *release* method is called when a Linux process exits. It is run before
We'd conventionally use a notation such as "->release()" here, rather than
the asterisks.
> +the pages and mappings of a process are torn down and gives the device driver
> +a chance to zap all the external mappings in one go.
I assume what you mean here is that ->release() is called during exit()
when the final reference to an mm is being dropped.
> +An example for a code that can be used to build a notifier mechanism into
> +a device driver can be found in the file
> +Documentation/mmu_notifier/skeleton.c
Should that be in samples/?
> +mmu_rmap_notifier mechanism (XPMEM etc)
> +---------------------------------------
> +The mmu_rmap_notifier allows the device driver to implement their own rmap
s/their/its/
> +and allows the device driver to sleep during page eviction. This is necessary
> +for complex drivers that f.e. allow the sharing of memory between processes
> +running on different Linux instances (typically over a network or in a
> +partitioned NUMA system).
> +
> +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> +*before* the Linux rmaps are walked. At that point only the page lock is
> +held. The invalidate_page() function must walk the driver rmaps and evict
> +all the references to the page.
What happens if it cannot do so?
> +There is no process information available before the rmaps are consulted.
Not sure what that sentence means. I guess "available to the core VM"?
> +The notifier mechanism can therefore not be attached to an mm_struct. Instead
> +it is a global callback list. Having to perform a callback for each and every
> +page that is reclaimed would be inefficient. Therefore we add an additional
> +page flag: PageRmapExternal().
How many page flags are left?
Is this feature important enough to justify consumption of another one?
> Only pages that are marked with this bit can
> +be exported and the rmap callbacks will only be performed for pages marked
> +that way.
"exported": new term, unclear what it means.
> +The required additional Page flag is only availabe in 64 bit mode and
> +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
whoa. Is that good? You just made your feature unavailable on the great
majority of Linux systems.
> +An example of code to build a mmu_notifier mechanism with rmap capabilty
> +can be found in Documentation/mmu_notifier/skeleton_rmap.c
> +
> +February 9, 2008,
> + Christoph Lameter <clameter@sgi.com
> +
> +Index: linux-2.6/include/linux/mm_types.h
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h 2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/include/linux/mm_types.h 2008-02-14 21:17:51.000000000 -0800
> @@ -159,6 +159,12 @@ struct vm_area_struct {
> #endif
> };
>
> +struct mmu_notifier_head {
> +#ifdef CONFIG_MMU_NOTIFIER
> + struct hlist_head head;
> +#endif
> +};
> +
> struct mm_struct {
> struct vm_area_struct * mmap; /* list of VMAs */
> struct rb_root mm_rb;
> @@ -228,6 +234,7 @@ struct mm_struct {
> #ifdef CONFIG_CGROUP_MEM_CONT
> struct mem_cgroup *mem_cgroup;
> #endif
> + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
> };
>
> #endif /* _LINUX_MM_TYPES_H */
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/include/linux/mmu_notifier.h 2008-02-14 22:42:28.000000000 -0800
> @@ -0,0 +1,180 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +/*
> + * MMU motifier
typo
> + * Notifier functions for hardware and software that establishes external
> + * references to pages of a Linux system. The notifier calls ensure that
> + * external mappings are removed when the Linux VM removes memory ranges
> + * or individual pages from a process.
So the callee cannot fail. hm. If it can't block, it's likely screwed in
that case. In other cases it might be screwed anyway. I suspect we'll
need to be able to handle callee failure.
> + * These fall into two classes:
> + *
> + * 1. mmu_notifier
> + *
> + * These are callbacks registered with an mm_struct. If pages are
> + * removed from an address space then callbacks are performed.
"to be removed", I guess. It's called before the page is actually removed?
> + * Spinlocks must be held in order to walk reverse maps. The
> + * invalidate_page() callbacks are performed with spinlocks held.
hm, yes, problem. Permitting callee failure might be good enough.
> + * The invalidate_range_start/end callbacks can be performed in contexts
> + * where sleeping is allowed or in atomic contexts. A flag is passed
> + * to indicate an atomic context.
We generally would prefer separate callbacks, rather than a unified
callback with a mode flag.
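(A sketch of the "separate callbacks" shape being suggested; the struct and
the *_atomic names are invented here, not taken from the patch:)

struct example_range_ops {
	/* May sleep; called when no spinlocks are held. */
	void (*invalidate_range_begin)(struct mmu_notifier *mn,
			struct mm_struct *mm,
			unsigned long start, unsigned long end);
	/* Called with spinlocks held; must not sleep. */
	void (*invalidate_range_begin_atomic)(struct mmu_notifier *mn,
			struct mm_struct *mm,
			unsigned long start, unsigned long end);
};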
> + * Pages must be marked dirty if dirty bits are found to be set in
> + * the external ptes.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rcupdate.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier_ops;
> +
> +struct mmu_notifier {
> + struct hlist_node hlist;
> + const struct mmu_notifier_ops *ops;
> +};
> +
> +struct mmu_notifier_ops {
> + /*
> + * The release notifier is called when no other execution threads
> + * are left. Synchronization is not necessary.
"and the mm is about to be destroyed"?
> + */
> + void (*release)(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +
> + /*
> + * age_page is called from contexts where the pte_lock is held
> + */
> + int (*age_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
This wasn't documented.
> + /*
> + * invalidate_page is called from contexts where the pte_lock is held.
> + */
> + void (*invalidate_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> +
> + /*
> + * invalidate_range_begin() and invalidate_range_end() must be paired.
> + *
> + * Multiple invalidate_range_begin/ends may be nested or called
> + * concurrently.
Under what circumstances would they be nested?
> That is legit. However, no new external references
references to what?
> + * may be established as long as any invalidate_xxx is running or
> + * any invalidate_range_begin() and has not been completed through a
stray "and".
> + * corresponding call to invalidate_range_end().
> + *
> + * Locking within the notifier needs to serialize events correspondingly.
> + *
> + * invalidate_range_begin() must clear all references in the range
> + * and stop the establishment of new references.
and stop the establishment of new references within the range, I assume?
If so, that's putting a heck of a lot of complexity into the driver, isn't
it? It needs to temporarily remember an arbitrarily large number of
regions in this mm against which references may not be taken?
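(For reference, the simplest answer, and the one the skeleton in patch 6/6
actually takes, is not to remember ranges at all: keep a count of active
invalidations and refuse any new external reference while it is non-zero.
A stripped-down sketch, names illustrative:)

struct drv_mm {
	spinlock_t lock;
	int invalidates;	/* number of active range invalidations */
};

static void drv_range_begin(struct drv_mm *d)
{
	spin_lock(&d->lock);
	d->invalidates++;	/* holds off new external references */
	spin_unlock(&d->lock);
}

static void drv_range_end(struct drv_mm *d)
{
	spin_lock(&d->lock);
	d->invalidates--;	/* references may be established again */
	spin_unlock(&d->lock);
}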
> + * invalidate_range_end() reenables the establishment of references.
within the range?
> + * atomic indicates that the function is called in an atomic context.
> + * We can sleep if atomic == 0.
> + *
> + * invalidate_range_begin() must remove all external references.
> + * There will be no retries as with invalidate_page().
> + */
> + void (*invalidate_range_begin)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int atomic);
> +
> + void (*invalidate_range_end)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int atomic);
> +};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * Must hold the mmap_sem for write.
> + *
> + * RCU is used to traverse the list. A quiescent period needs to pass
> + * before the notifier is guaranteed to be visible to all threads
> + */
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +
> +/*
> + * Must hold mmap_sem for write.
> + *
> + * A quiescent period needs to pass before the mmu_notifier structure
> + * can be released. mmu_notifier_release() will wait for a quiescent period
> + * after calling the ->release callback. So it is safe to call
> + * mmu_notifier_unregister from the ->release function.
> + */
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +
> +
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +extern int mmu_notifier_age_page(struct mm_struct *mm,
> + unsigned long address);
There's the mysterious age_page again.
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> + INIT_HLIST_HEAD(&mnh->head);
> +}
> +
> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + struct mmu_notifier *__mn; \
> + struct hlist_node *__n; \
> + \
> + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> + rcu_read_lock(); \
> + hlist_for_each_entry_rcu(__mn, __n, \
> + &(mm)->mmu_notifier.head, \
> + hlist) \
> + if (__mn->ops->function) \
> + __mn->ops->function(__mn, \
> + mm, \
> + args); \
> + rcu_read_unlock(); \
> + } \
> + } while (0)
The macro references its args more than once. Anyone who does
mmu_notifier(function, some_function_which_has_side_effects())
will get a surprise. Use temporaries.
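(A minimal sketch of the rewrite being asked for, evaluating mm only once;
the __mm local is invented:)

#define mmu_notifier(function, mm, args...) \
	do { \
		struct mm_struct *__mm = (mm); /* evaluate once */ \
		struct mmu_notifier *__mn; \
		struct hlist_node *__n; \
		\
		if (unlikely(!hlist_empty(&__mm->mmu_notifier.head))) { \
			rcu_read_lock(); \
			hlist_for_each_entry_rcu(__mn, __n, \
					&__mm->mmu_notifier.head, hlist) \
				if (__mn->ops->function) \
					__mn->ops->function(__mn, __mm, args); \
			rcu_read_unlock(); \
		} \
	} while (0)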
> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + if (0) { \
> + struct mmu_notifier *__mn; \
> + \
> + __mn = (struct mmu_notifier *)(0x00ff); \
> + __mn->ops->function(__mn, mm, args); \
> + }; \
> + } while (0)
That's a bit weird. Can't we do the old
(void)function;
(void)mm;
trick? Or make it a static inline function?
> +static inline void mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm) {}
> +static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
> + struct mm_struct *mm) {}
> +static inline void mmu_notifier_release(struct mm_struct *mm) {}
> +static inline int mmu_notifier_age_page(struct mm_struct *mm,
> + unsigned long address)
> +{
> + return 0;
> +}
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig 2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/mm/Kconfig 2008-02-14 21:17:51.000000000 -0800
> @@ -193,3 +193,7 @@ config NR_QUICK
> config VIRT_TO_BUS
> def_bool y
> depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> + def_bool y
> + bool "MMU notifier, for paging KVM/RDMA"
Why is this not selectable? The help seems a bit brief.
Does this cause 32-bit systems to drag in a bunch of code they're not
allowed to ever use?
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile 2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/mm/Makefile 2008-02-14 21:17:51.000000000 -0800
> @@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>
> Index: linux-2.6/mm/mmu_notifier.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/mmu_notifier.c 2008-02-14 22:41:55.000000000 -0800
> @@ -0,0 +1,76 @@
> +/*
> + * linux/mm/mmu_notifier.c
> + *
> + * Copyright (C) 2008 Qumranet, Inc.
> + * Copyright (C) 2008 SGI
> + * Christoph Lameter <clameter@sgi.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +
> +/*
> + * No synchronization. This function can only be called when only a single
> + * process remains that performs teardown.
> + */
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n, *t;
> +
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + hlist_for_each_entry_safe(mn, n, t,
> + &mm->mmu_notifier.head, hlist) {
> + hlist_del_init(&mn->hlist);
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
We do this a lot, but back in the old days people didn't like optional
callbacks which can be NULL. If we expect that mmu_notifier_ops.release is
usually implemented, then just unconditionally call it and require that all
clients implement it. Perhaps provide an exported-to-modules stub in core
kernel for clients which didn't want to implement ->release().
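(Something along these lines, perhaps; the name is invented:)

/* Exported no-op for clients that have nothing to do at ->release() time. */
void mmu_notifier_noop_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
}
EXPORT_SYMBOL_GPL(mmu_notifier_noop_release);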
> + }
> + }
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->age_page can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n;
> + int young = 0;
> +
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mn, n,
> + &mm->mmu_notifier.head, hlist) {
> + if (mn->ops->age_page)
> + young |= mn->ops->age_page(mn, mm, address);
> + }
> + rcu_read_unlock();
> + }
> +
> + return young;
> +}
should the rcu_read_lock() cover the hlist_empty() test?
This function looks like it was tossed in at the last minute. It's
mysterious, undocumented, poorly commented, poorly named. A better name
would be one which has some correlation with the return value.
Because anyone who looks at some code which does
if (mmu_notifier_age_page(mm, address))
...
has to go and reverse-engineer the implementation of
mmu_notifier_age_page() to work out under which circumstances the "..."
will be executed. But this should be apparent just from reading the callee
implementation.
This function *really* does need some documentation. What does it *mean*
when the ->age_page() from some of the notifiers returned "1" and the
->age_page() from some other notifiers returned zero? Dunno.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-16 3:37 ` Andrew Morton
2008-02-16 19:26 ` Christoph Lameter
2008-02-19 8:54 ` Nick Piggin
2008-02-19 23:08 ` Nick Piggin
2 siblings, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 3:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 14 Feb 2008 22:49:01 -0800 Christoph Lameter <clameter@sgi.com> wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
hm. Do they? Why? If I'm in the process of zero-copy writing a hunk of
memory out to hardware then do I care if someone write-protects the ptes?
Spose so, but some fleshing-out of the various scenarios here would clarify
things.
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
This is so bad.
I suppose in the restricted couple of cases which you're focussed on it
works OK. But is it generally suitable? What if IO is in progress? What
if other cluster nodes need to be talked to? Does it suit RDMA?
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
Assuming that there is a missing "within the range" in this description, I
assume that all clients will just throw up their hands in horror and will
disallow all references to all parts of the mm.
Of course, to do that they will need to take a sleeping lock to prevent
other threads from establishing new references. whoops.
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
What does "stands in" mean?
> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
> Signed-off-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> mm/filemap_xip.c | 5 +++++
> mm/fremap.c | 3 +++
> mm/hugetlb.c | 3 +++
> mm/memory.c | 35 +++++++++++++++++++++++++++++------
> mm/mmap.c | 2 ++
> mm/mprotect.c | 3 +++
> mm/mremap.c | 7 ++++++-
> 7 files changed, 51 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c 2008-02-14 18:43:31.000000000 -0800
> +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.000000000 -0800
> @@ -15,6 +15,7 @@
> #include <linux/rmap.h>
> #include <linux/module.h>
> #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/mmu_context.h>
> #include <asm/cacheflush.h>
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
> spin_unlock(&mapping->i_mmap_lock);
> }
>
> + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
> err = populate_range(mm, vma, start, size, pgoff);
> + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
To avoid off-by-one confusion the changelogs, documentation and comments
should be very careful to tell the reader whether the range includes the
byte at start+size. I don't think that was done?
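(For what it's worth, the call sites read as half-open ranges; the sketch
below only spells out that presumed convention and is not taken from the
changelog:)

/*
 * Presumed convention (assumption, not stated in the changelog): the range
 * is half-open, [start, end), so the byte at "end" itself is not part of
 * the invalidation.
 */
mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
/* ... affects every address addr with start <= addr < start + size ... */
mmu_notifier(invalidate_range_end, mm, start, start + size, 0);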
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-15 6:49 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
@ 2008-02-16 3:37 ` Andrew Morton
2008-02-16 11:07 ` Andrea Arcangeli
` (2 more replies)
0 siblings, 3 replies; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 3:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <clameter@sgi.com> wrote:
> Two callbacks to remove individual pages as done in rmap code
>
> invalidate_page()
>
> Called from the inner loop of rmap walks to invalidate pages.
>
> age_page()
>
> Called for the determination of the page referenced status.
>
> If we do not care about page referenced status then an age_page callback
> may be be omitted. PageLock and pte lock are held when either of the
> functions is called.
The age_page mystery shallows.
It would be useful to have some rationale somewhere in the patchset for the
existence of this callback.
> #include <asm/tlbflush.h>
>
> @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
> if (vma->vm_flags & VM_LOCKED) {
> referenced++;
> *mapcount = 1; /* break early from loop */
> - } else if (ptep_clear_flush_young(vma, address, pte))
> + } else if (ptep_clear_flush_young(vma, address, pte) |
> + mmu_notifier_age_page(mm, address))
> referenced++;
The "|" is obviously deliberate. But no explanation is provided telling us
why we still call the callback if ptep_clear_flush_young() said the page
was recently referenced. People who read your code will want to understand
this.
> /* Pretend the page is referenced if the task has the
> @@ -455,6 +457,7 @@ static int page_mkclean_one(struct page
>
> flush_cache_page(vma, address, pte_pfn(*pte));
> entry = ptep_clear_flush(vma, address, pte);
> + mmu_notifier(invalidate_page, mm, address);
I just don't see how this can be done if the callee has another thread in
the middle of establishing IO against this region of memory.
->invalidate_page() _has_ to be able to block. Confused.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-15 6:49 ` [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem) Christoph Lameter
@ 2008-02-16 3:37 ` Andrew Morton
2008-02-16 19:28 ` Christoph Lameter
2008-02-19 23:55 ` Nick Piggin
1 sibling, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 3:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 14 Feb 2008 22:49:04 -0800 Christoph Lameter <clameter@sgi.com> wrote:
> These special additional callbacks are required because XPmem (and likely
> other mechanisms) do use their own rmap (multiple processes on a series
> of remote Linux instances may be accessing the memory of a process).
> F.e. XPmem may have to send out notifications to remote Linux instances
> and receive confirmation before a page can be freed.
>
> So we handle this like an additional Linux reverse map that is walked after
> the existing rmaps have been walked. We leave the walking to the driver that
> is then able to use something else than a spinlock to walk its reverse
> maps. So we can actually call the driver without holding spinlocks while
> we hold the Pagelock.
>
> However, we cannot determine the mm_struct that a page belongs to at
> that point. The mm_struct can only be determined from the rmaps by the
> device driver.
>
> We add another pageflag (PageExternalRmap) that is set if a page has
> been remotely mapped (f.e. by a process from another Linux instance).
> We can then only perform the callbacks for pages that are actually in
> remote use.
>
> Rmap notifiers need an extra page bit and are only available
> on 64 bit platforms. This functionality is not available on 32 bit!
>
> A notifier that uses the reverse maps callbacks does not need to provide
> the invalidate_page() method that is called when locks are held.
>
hrm.
> +#define mmu_rmap_notifier(function, args...) \
> + do { \
> + struct mmu_rmap_notifier *__mrn; \
> + struct hlist_node *__n; \
> + \
> + rcu_read_lock(); \
> + hlist_for_each_entry_rcu(__mrn, __n, \
> + &mmu_rmap_notifier_list, hlist) \
> + if (__mrn->ops->function) \
> + __mrn->ops->function(__mrn, args); \
> + rcu_read_unlock(); \
> + } while (0);
> +
buggy macro: use locals.
> +#define mmu_rmap_notifier(function, args...) \
> + do { \
> + if (0) { \
> + struct mmu_rmap_notifier *__mrn; \
> + \
> + __mrn = (struct mmu_rmap_notifier *)(0x00ff); \
> + __mrn->ops->function(__mrn, args); \
> + } \
> + } while (0);
> +
Same observation as in the other patch.
> ===================================================================
> --- linux-2.6.orig/mm/mmu_notifier.c 2008-02-14 21:17:51.000000000 -0800
> +++ linux-2.6/mm/mmu_notifier.c 2008-02-14 21:21:04.000000000 -0800
> @@ -74,3 +74,37 @@ void mmu_notifier_unregister(struct mmu_
> }
> EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
>
> +#ifdef CONFIG_64BIT
> +static DEFINE_SPINLOCK(mmu_notifier_list_lock);
> +HLIST_HEAD(mmu_rmap_notifier_list);
> +
> +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +{
> + spin_lock(&mmu_notifier_list_lock);
> + hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
> + spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_register);
> +
> +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +{
> + spin_lock(&mmu_notifier_list_lock);
> + hlist_del_rcu(&mrn->hlist);
> + spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
>
> +/*
> + * Export a page.
> + *
> + * Pagelock must be held.
> + * Must be called before a page is put on an external rmap.
> + */
> +void mmu_rmap_export_page(struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> + SetPageExternalRmap(page);
> +}
> +EXPORT_SYMBOL(mmu_rmap_export_page);
The other patch used EXPORT_SYMBOL_GPL.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-16 8:45 ` Avi Kivity
2008-02-16 8:56 ` Andrew Morton
2008-02-16 10:41 ` Brice Goglin
` (2 subsequent siblings)
3 siblings, 1 reply; 116+ messages in thread
From: Avi Kivity @ 2008-02-16 8:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
Andrew Morton wrote:
> How important is this feature to KVM?
>
Very. kvm pins pages that are referenced by the guest; a 64-bit guest
will easily pin its entire memory with the kernel map. So this is
critical for guest swapping to actually work.
Other nice features like page migration are also enabled by this patch.
--
Any sufficiently difficult bug is indistinguishable from a feature.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 8:45 ` Avi Kivity
@ 2008-02-16 8:56 ` Andrew Morton
2008-02-16 9:21 ` Avi Kivity
0 siblings, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 8:56 UTC (permalink / raw)
To: Avi Kivity
Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Sat, 16 Feb 2008 10:45:50 +0200 Avi Kivity <avi@qumranet.com> wrote:
> Andrew Morton wrote:
> > How important is this feature to KVM?
> >
>
> Very. kvm pins pages that are referenced by the guest;
hm. Why does it do that?
> a 64-bit guest
> will easily pin its entire memory with the kernel map.
> So this is
> critical for guest swapping to actually work.
Curious. If KVM can release guest pages at the request of this notifier so
that they can be swapped out, why can't it release them by default, and
allow swapping to proceed?
>
> Other nice features like page migration are also enabled by this patch.
>
We already have page migration. Do you mean page-migration-when-using-kvm?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 8:56 ` Andrew Morton
@ 2008-02-16 9:21 ` Avi Kivity
0 siblings, 0 replies; 116+ messages in thread
From: Avi Kivity @ 2008-02-16 9:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
Andrew Morton wrote:
>> Very. kvm pins pages that are referenced by the guest;
>>
>
> hm. Why does it do that?
>
>
It was deemed best not to allow the guest to write to a page that has
been swapped out and assigned to an unrelated host process.
One way to view the kvm shadow page tables is as hardware dma
descriptors. kvm pins pages for the same reason that drivers pin pages
that are being dma'ed. It's also the reason why mmu notifiers are useful
for such a wide range of dma capable hardware.
>> a 64-bit guest
>> will easily pin its entire memory with the kernel map.
>>
>
>
>> So this is
>> critical for guest swapping to actually work.
>>
>
> Curious. If KVM can release guest pages at the request of this notifier so
> that they can be swapped out, why can't it release them by default, and
> allow swapping to proceed?
>
>
If kvm releases a page, it must also zap any shadow ptes pointing at the
page and flush the tlb. If you do that for all of memory you can't
reference any of it.
Releasing a page has costs, both at the time of the release and when the
guest eventually refers to the page again.
>> Other nice features like page migration are also enabled by this patch.
>>
>>
>
> We already have page migration. Do you mean page-migration-when-using-kvm?
>
Yes, I'm obviously writing from a kvm-centric point of view. This is an
important feature, as the virtualization future seems to be NUMA hosts
(2- or 4- way, 4 cores per socket) running moderately sized guests. The
ability to load-balance guests among the NUMA nodes is important for
performance.
(btw, I'm also looking forward to memory defragmentation. large pages
are important for virtualization workloads and mmu notifiers are again
critical to getting it to work while running kvm).
--
Any sufficiently difficult bug is indistinguishable from a feature.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 3:37 ` Andrew Morton
2008-02-16 8:45 ` Avi Kivity
@ 2008-02-16 10:41 ` Brice Goglin
2008-02-16 10:58 ` Andrew Morton
2008-02-16 19:21 ` Christoph Lameter
2008-02-17 5:04 ` Doug Maxey
3 siblings, 1 reply; 116+ messages in thread
From: Brice Goglin @ 2008-02-16 10:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm
Andrew Morton wrote:
> What is the status of getting infiniband to use this facility?
>
> How important is this feature to KVM?
>
> To xpmem?
>
> Which other potential clients have been identified and how important it it
> to those?
>
As I said when Andrea posted the first patch series, I used something
very similar for non-RDMA-based HPC about 4 years ago. I haven't had
time yet to look in depth and try the latest proposed API but my feeling
is that it looks good.
Brice
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH] KVM swapping with MMU Notifiers V7
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
` (5 preceding siblings ...)
2008-02-15 6:49 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
@ 2008-02-16 10:48 ` Andrea Arcangeli
2008-02-16 11:08 ` Andrew Morton
2008-02-16 11:51 ` Robin Holt
6 siblings, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-16 10:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
The two patches below enable KVM to swap the guest physical memory
through Christoph's V7.
There's one last _purely_theoretical_ race condition I figured out and
that I'm wondering how to best fix. The race condition worst case is
that a few guest physical pages could remain pinned by sptes. The race
can materialize if the linux pte is zapped after get_user_pages
returns but before the page is mapped by the spte and tracked by
rmap. The invalidate_ calls can also likely be optimized further but
it's not a fast path so it's not urgent.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 41962e7..e1287ab 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
tristate "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM && EXPERIMENTAL
select PREEMPT_NOTIFIERS
+ select MMU_NOTIFIER
select ANON_INODES
---help---
Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fd39cd1..b56e388 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -533,6 +533,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
kvm_flush_remote_tlbs(kvm);
}
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+ struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+ get_page(page);
+ rmap_remove(kvm, spte);
+ set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+ kvm_flush_remote_tlbs(kvm);
+ __free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+ u64 *spte, *curr_spte;
+
+ spte = rmap_next(kvm, rmapp, NULL);
+ while (spte) {
+ BUG_ON(!(*spte & PT_PRESENT_MASK));
+ rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+ curr_spte = spte;
+ spte = rmap_next(kvm, rmapp, spte);
+ kvm_unmap_spte(kvm, curr_spte);
+ }
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+ int i;
+
+ /*
+ * If mmap_sem isn't taken, we can look up the memslots with only
+ * the mmu_lock by skipping over the slots with userspace_addr == 0.
+ */
+ spin_lock(&kvm->mmu_lock);
+ for (i = 0; i < kvm->nmemslots; i++) {
+ struct kvm_memory_slot *memslot = &kvm->memslots[i];
+ unsigned long start = memslot->userspace_addr;
+ unsigned long end;
+
+ /* mmu_lock protects userspace_addr */
+ if (!start)
+ continue;
+
+ end = start + (memslot->npages << PAGE_SHIFT);
+ if (hva >= start && hva < end) {
+ gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+ kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]);
+ }
+ }
+ spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+ u64 *spte;
+ int young = 0;
+
+ spte = rmap_next(kvm, rmapp, NULL);
+ while (spte) {
+ int _young;
+ u64 _spte = *spte;
+ BUG_ON(!(_spte & PT_PRESENT_MASK));
+ _young = _spte & PT_ACCESSED_MASK;
+ if (_young) {
+ young = !!_young;
+ set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+ }
+ spte = rmap_next(kvm, rmapp, spte);
+ }
+ return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+ int i;
+ int young = 0;
+
+ /*
+ * If mmap_sem isn't taken, we can look up the memslots with only
+ * the mmu_lock by skipping over the slots with userspace_addr == 0.
+ */
+ spin_lock(&kvm->mmu_lock);
+ for (i = 0; i < kvm->nmemslots; i++) {
+ struct kvm_memory_slot *memslot = &kvm->memslots[i];
+ unsigned long start = memslot->userspace_addr;
+ unsigned long end;
+
+ /* mmu_lock protects userspace_addr */
+ if (!start)
+ continue;
+
+ end = start + (memslot->npages << PAGE_SHIFT);
+ if (hva >= start && hva < end) {
+ gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+ young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+ }
+ }
+ spin_unlock(&kvm->mmu_lock);
+
+ if (young)
+ kvm_flush_remote_tlbs(kvm);
+
+ return young;
+}
+
#ifdef MMU_DEBUG
static int is_empty_shadow_page(u64 *spt)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c910c7..2b2398f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3185,6 +3185,46 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
free_page((unsigned long)vcpu->arch.pio_data);
}
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+ struct kvm_arch *kvm_arch;
+ kvm_arch = container_of(mn, struct kvm_arch, mmu_notifier);
+ return container_of(kvm_arch, struct kvm, arch);
+}
+
+void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ BUG_ON(mm != kvm->mm);
+ kvm_unmap_hva(kvm, address);
+}
+
+int kvm_mmu_notifier_age_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ BUG_ON(mm != kvm->mm);
+ return kvm_age_hva(kvm, address);
+}
+
+void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int lock)
+{
+ for (; start < end; start += PAGE_SIZE)
+ kvm_mmu_notifier_invalidate_page(mn, mm, start);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+ .invalidate_page = kvm_mmu_notifier_invalidate_page,
+ .age_page = kvm_mmu_notifier_age_page,
+ .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
+};
+
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
@@ -3194,6 +3234,9 @@ struct kvm *kvm_arch_create_vm(void)
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+
return kvm;
}
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index da61255..11976c8 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -13,6 +13,7 @@
#include <linux/types.h>
#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
#include <linux/kvm.h>
#include <linux/kvm_para.h>
@@ -287,6 +288,8 @@ struct kvm_arch{
int round_robin_prev_vcpu;
unsigned int tss_addr;
struct page *apic_access_page;
+
+ struct mmu_notifier mmu_notifier;
};
struct kvm_vm_stat {
@@ -404,6 +407,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
int kvm_mmu_setup(struct kvm_vcpu *vcpu);
void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte);
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
+int kvm_age_hva(struct kvm *kvm, unsigned long hva);
int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
void kvm_mmu_zap_all(struct kvm *kvm);
This allows browsing the memslots with only the mmu_lock held, and it
should be applied along with the above patch:
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c910c7..80b719d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3245,16 +3245,23 @@ int kvm_arch_set_memory_region(struct kvm *kvm,
*/
if (!user_alloc) {
if (npages && !old.rmap) {
+ unsigned long userspace_addr;
+
down_write(¤t->mm->mmap_sem);
- memslot->userspace_addr = do_mmap(NULL, 0,
- npages * PAGE_SIZE,
- PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_ANONYMOUS,
- 0);
+ userspace_addr = do_mmap(NULL, 0,
+ npages * PAGE_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS,
+ 0);
up_write(¤t->mm->mmap_sem);
- if (IS_ERR((void *)memslot->userspace_addr))
- return PTR_ERR((void *)memslot->userspace_addr);
+ if (IS_ERR((void *)userspace_addr))
+ return PTR_ERR((void *)userspace_addr);
+
+ /* set userspace_addr atomically for kvm_hva_to_rmapp */
+ spin_lock(&kvm->mmu_lock);
+ memslot->userspace_addr = userspace_addr;
+ spin_unlock(&kvm->mmu_lock);
} else {
if (!old.user_alloc && old.rmap) {
int ret;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf6df51..743c5c5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -299,7 +299,15 @@ int __kvm_set_memory_region(struct kvm *kvm,
memset(new.rmap, 0, npages * sizeof(*new.rmap));
new.user_alloc = user_alloc;
- new.userspace_addr = mem->userspace_addr;
+ /*
+ * hva_to_rmmap() serializes with the mmu_lock and to be
+ * safe it has to ignore memslots with !user_alloc &&
+ * !userspace_addr.
+ */
+ if (user_alloc)
+ new.userspace_addr = mem->userspace_addr;
+ else
+ new.userspace_addr = 0;
}
/* Allocate page dirty bitmap if needed */
@@ -312,14 +320,18 @@ int __kvm_set_memory_region(struct kvm *kvm,
memset(new.dirty_bitmap, 0, dirty_bytes);
}
+ spin_lock(&kvm->mmu_lock);
if (mem->slot >= kvm->nmemslots)
kvm->nmemslots = mem->slot + 1;
*memslot = new;
+ spin_unlock(&kvm->mmu_lock);
r = kvm_arch_set_memory_region(kvm, mem, old, user_alloc);
if (r) {
+ spin_lock(&kvm->mmu_lock);
*memslot = old;
+ spin_unlock(&kvm->mmu_lock);
goto out_free;
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 10:41 ` Brice Goglin
@ 2008-02-16 10:58 ` Andrew Morton
2008-02-16 19:31 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 10:58 UTC (permalink / raw)
To: Brice Goglin; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm
On Sat, 16 Feb 2008 11:41:35 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote:
> Andrew Morton wrote:
> > What is the status of getting infiniband to use this facility?
> >
> > How important is this feature to KVM?
> >
> > To xpmem?
> >
> > Which other potential clients have been identified and how important it it
> > to those?
> >
>
> As I said when Andrea posted the first patch series, I used something
> very similar for non-RDMA-based HPC about 4 years ago. I haven't had
> time yet to look in depth and try the latest proposed API but my feeling
> is that it looks good.
>
"looks good" maybe. But it's in the details where I fear this will come
unstuck. The likelihood is that some callbacks really will want to be able to
block in places where this interface doesn't permit that - either to wait
for IO to complete or to wait for other threads to clear critical regions.
From that POV it doesn't look like a sufficiently general and useful
design. Looks like it was grafted onto the current VM implementation in a
way which just about suits two particular clients if they try hard enough.
Which is all perfectly understandable - it would be hard to rework core MM
to be able to make this interface more general. But I do think it's
half-baked and there is a decent risk that future (or present) code which
_could_ use something like this won't be able to use this one, and will
continue to futz with mlock, page-pinning, etc.
Not that I know what the fix to that is..
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-16 11:07 ` Andrea Arcangeli
2008-02-16 19:22 ` Christoph Lameter
2008-02-18 1:51 ` Nick Piggin
2 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-16 11:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, Feb 15, 2008 at 07:37:36PM -0800, Andrew Morton wrote:
> The "|" is obviously deliberate. But no explanation is provided telling us
> why we still call the callback if ptep_clear_flush_young() said the page
> was recently referenced. People who read your code will want to understand
> this.
This is to clear the young bit in every pte and spte pointing to such a
physical page before backing off because some young bit was on. So if any
young bit is on in the next scan, we're guaranteed the page has been
touched recently and not ages before (otherwise it would take a worst
case of N rounds of the lru before the page can be freed, where N is the
number of ptes or sptes pointing to the page).
> I just don't see how this can be done if the callee has another thread in
> the middle of establishing IO against this region of memory.
> ->invalidate_page() _has_ to be able to block. Confused.
invalidate_page marking the spte invalid and flushing the asid/tlb
doesn't need to block, the same way ptep_clear_flush doesn't need to
block for the main linux pte. In fact before invalidate_page and
ptep_clear_flush can touch anything at all, they have to take their own
spinlocks (mmu_lock for the former, and PT lock for the latter).
The only sleeping trouble is for network-driven message passing,
where they want to schedule while they wait for the message to arrive, or
it'd hang the whole cpu to spin for so long.
sptes are cpu-clocked entities like ptes so scheduling there is by far
not necessary because there's zero delay in invalidating them and
flushing their tlbs. GRU is similar. Because we boost the reference
count of the pages for every spte mapping, only implementing
invalidate_range_end is enough, but I need to figure out the
get_user_pages->rmap_add window too. Because get_user_pages can
schedule, if I want to add a critical section around it to avoid
calling get_user_pages twice during the kvm page fault, a mutex would
be the only way (it sure can't be a spinlock). But a mutex can't be
taken by invalidate_page to stop it. So that leaves me with the idea
of adding a get_user_pages variant that returns the page locked. So
instead of calling get_user_pages a second time after rmap_add
returns, I will only need to call unlock_page which should be faster
than a follow_page. And setting the PG_lock before dropping the PT
lock in follow_page, should be fast enough too.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] KVM swapping with MMU Notifiers V7
2008-02-16 10:48 ` [PATCH] KVM swapping with MMU Notifiers V7 Andrea Arcangeli
@ 2008-02-16 11:08 ` Andrew Morton
2008-02-18 12:17 ` Andrea Arcangeli
2008-02-16 11:51 ` Robin Holt
1 sibling, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2008-02-16 11:08 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Sat, 16 Feb 2008 11:48:27 +0100 Andrea Arcangeli <andrea@qumranet.com> wrote:
> +void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int lock)
> +{
> + for (; start < end; start += PAGE_SIZE)
> + kvm_mmu_notifier_invalidate_page(mn, mm, start);
> +}
> +
> +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> + .invalidate_page = kvm_mmu_notifier_invalidate_page,
> + .age_page = kvm_mmu_notifier_age_page,
> + .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
> +};
So this doesn't implement ->invalidate_range_start().
By what means does it prevent new mappings from being established in the
range after core mm has tried to call ->invalidate_range_start()?
mmap_sem, I assume?
> + /* set userspace_addr atomically for kvm_hva_to_rmapp */
> + spin_lock(&kvm->mmu_lock);
> + memslot->userspace_addr = userspace_addr;
> + spin_unlock(&kvm->mmu_lock);
are you sure? kvm_unmap_hva() and kvm_age_hva() read ->userspace_addr a
single time and it doesn't immediately look like there's a need to take the
lock here?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] KVM swapping with MMU Notifiers V7
2008-02-16 10:48 ` [PATCH] KVM swapping with MMU Notifiers V7 Andrea Arcangeli
2008-02-16 11:08 ` Andrew Morton
@ 2008-02-16 11:51 ` Robin Holt
2008-02-18 12:35 ` Andrea Arcangeli
1 sibling, 1 reply; 116+ messages in thread
From: Robin Holt @ 2008-02-16 11:51 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, akpm, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Sat, Feb 16, 2008 at 11:48:27AM +0100, Andrea Arcangeli wrote:
> Those below two patches enable KVM to swap the guest physical memory
> through Christoph's V7.
>
> There's one last _purely_theoretical_ race condition I figured out and
> that I'm wondering how to best fix. The race condition worst case is
> that a few guest physical pages could remain pinned by sptes. The race
> can materialize if the linux pte is zapped after get_user_pages
> returns but before the page is mapped by the spte and tracked by
> rmap. The invalidate_ calls can also likely be optimized further but
> it's not a fast path so it's not urgent.
I am doing this in xpmem with a stack-based structure in the function
calling get_user_pages. That structure describes the start and
end address of the range we are doing the get_user_pages on. If an
invalidate_range_begin comes in while we are off to the kernel doing
the get_user_pages, the invalidate_range_begin marks that structure
indicating an invalidate came in. When the get_user_pages gets the
structures relocked, it checks that flag (really a generation counter)
and if it is set, retries the get_user_pages. After 3 retries, it
returns -EAGAIN and the fault is started over from the remote side.
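In rough code, that scheme looks something like the sketch below (illustration
only; struct gup_window, my_seg and fault_in_window are made-up names, not
xpmem's actual code; get_user_pages() and release_pages() are the stock
kernel calls):

struct gup_window {
	unsigned long start, end;
	unsigned int invalidate_seq;	/* bumped by invalidate_range_begin */
};

/* sketch: pin a range, retrying if an invalidate raced with us */
static int fault_in_window(struct my_seg *seg, struct gup_window *w,
			   struct page **pages, int npages)
{
	unsigned int seq;
	int tries, got;

	for (tries = 0; tries < 3; tries++) {
		spin_lock(&seg->lock);
		seq = w->invalidate_seq;
		spin_unlock(&seg->lock);

		/* may sleep; invalidate_range_begin can run meanwhile
		 * and bump w->invalidate_seq under seg->lock */
		got = get_user_pages(current, current->mm, w->start,
				     npages, 1, 0, pages, NULL);
		if (got != npages) {
			if (got > 0)
				release_pages(pages, got, 0);
			return -EFAULT;
		}

		spin_lock(&seg->lock);
		if (seq == w->invalidate_seq) {
			/* no invalidate came in: keep the pages */
			spin_unlock(&seg->lock);
			return 0;
		}
		spin_unlock(&seg->lock);
		release_pages(pages, npages, 0);	/* drop and retry */
	}
	return -EAGAIN;		/* remote side restarts the fault */
}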
Thanks,
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 3:37 ` Andrew Morton
2008-02-16 8:45 ` Avi Kivity
2008-02-16 10:41 ` Brice Goglin
@ 2008-02-16 19:21 ` Christoph Lameter
2008-02-17 3:01 ` Andrea Arcangeli
2008-02-17 5:04 ` Doug Maxey
3 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 15 Feb 2008, Andrew Morton wrote:
> What is the status of getting infiniband to use this facility?
Well we are talking about this it seems.
>
> How important is this feature to KVM?
Andrea can answer this.
> To xpmem?
Without this feature we are stuck with page pinning by increasing
refcounts which leads to endless lru scanning and other misbehavior. Also
applications that use XPmem will not be able to swap or to use
things like remap.
> Which other potential clients have been identified and how important is it
> to those?
It is likely important to various DMA engines, framebuffer devices etc
etc. Seems to be a generally useful feature.
> > +The notifier chains provide two callback mechanisms. The
> > +first one is required for any device that establishes external mappings.
> > +The second (rmap) mechanism is required if a device needs to be
> > +able to sleep when invalidating references. Sleeping may be necessary
> > +if we are mapping across a network or to different Linux instances
> > +in the same address space.
>
> I'd have thought that a major reason for sleeping would be to wait for IO
> to complete. Worth mentioning here?
Right.
> Why is that "easy"? I's have thought that it would only be easy if the
> driver happened to be using those same locks for its own purposes.
> Otherwise it is "awkward"?
Its relatively easy because it is tied directly to a process and can use
external tlb shootdown / external page table clearing directly. The other
method requires an rmap in the device driver where it can lookup the
processes that are mapping the page.
> > +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> > +called most of the time without any locks held. It is only called with
> > +locks held for file backed mappings that are truncated. A flag indicates
> > +in which mode we are. A driver can use that mechanism to f.e.
> > +delay the freeing of the pages during truncate until no locks are held.
>
> That sucks big time. What do we need to do to get the callback
> functions called in non-atomic context?
We would have to drop the inode_mmap_lock. Could be done with some minor
work.
> > +Pages must be marked dirty if dirty bits are found to be set in
> > +the external ptes during unmap.
>
> That sentence is too vague. Define "marked dirty"?
Call set_page_dirty().
> > +The *release* method is called when a Linux process exits. It is run before
>
> We'd conventionally use a notation such as "->release()" here, rather than
> the asterisks.
Ok.
>
> > +the pages and mappings of a process are torn down and gives the device driver
> > +a chance to zap all the external mappings in one go.
>
> I assume what you mean here is that ->release() is called during exit()
> when the final reference to an mm is being dropped.
Right.
> > +An example for a code that can be used to build a notifier mechanism into
> > +a device driver can be found in the file
> > +Documentation/mmu_notifier/skeleton.c
>
> Should that be in samples/?
Oh. We have that?
> > +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> > +*before* the Linux rmaps are walked. At that point only the page lock is
> > +held. The invalidate_page() function must walk the driver rmaps and evict
> > +all the references to the page.
>
> What happens if it cannot do so?
The page is not reclaimed if we were called from try_to_unmap(). From
page_mkclean() we must always evict the page to switch off the write
protect bit.
> > +There is no process information available before the rmaps are consulted.
>
> Not sure what that sentence means. I guess "available to the core VM"?
At that point we only have the page. We do not know which processes map
the page. In order to find out we need to take a spinlock.
> > +The notifier mechanism can therefore not be attached to an mm_struct. Instead
> > +it is a global callback list. Having to perform a callback for each and every
> > +page that is reclaimed would be inefficient. Therefore we add an additional
> > +page flag: PageRmapExternal().
>
> How many page flags are left?
30 or so. It's only available on 64-bit.
> Is this feature important enough to justify consumption of another one?
>
> > Only pages that are marked with this bit can
> > +be exported and the rmap callbacks will only be performed for pages marked
> > +that way.
>
> "exported": new term, unclear what it means.
Something external to the kernel references the page.
> > +The required additional Page flag is only available in 64 bit mode and
> > +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
>
> whoa. Is that good? You just made your feature unavailable on the great
> majority of Linux systems.
rmaps are usually used by complex drivers that are typically used in large
systems.
> > + * Notifier functions for hardware and software that establishes external
> > + * references to pages of a Linux system. The notifier calls ensure that
> > + * external mappings are removed when the Linux VM removes memory ranges
> > + * or individual pages from a process.
>
> So the callee cannot fail. hm. If it can't block, it's likely screwed in
> that case. In other cases it might be screwed anyway. I suspect we'll
> need to be able to handle callee failure.
Probably.
>
> > + * These fall into two classes:
> > + *
> > + * 1. mmu_notifier
> > + *
> > + * These are callbacks registered with an mm_struct. If pages are
> > + * removed from an address space then callbacks are performed.
>
> "to be removed", I guess. It's called before the page is actually removed?
It's called after the pte was cleared while holding the pte lock.
> > + * The invalidate_range_start/end callbacks can be performed in contexts
> > + * where sleeping is allowed or in atomic contexts. A flag is passed
> > + * to indicate an atomic context.
>
> We generally would prefer separate callbacks, rather than a unified
> callback with a mode flag.
We could drop the inode_mmap_lock when doing truncate. That would make
this work but it's kind of an invasive thing for the VM.
> > +struct mmu_notifier_ops {
> > + /*
> > + * The release notifier is called when no other execution threads
> > + * are left. Synchronization is not necessary.
>
> "and the mm is about to be destroyed"?
Right.
> > + /*
> > + * invalidate_range_begin() and invalidate_range_end() must be paired.
> > + *
> > + * Multiple invalidate_range_begin/ends may be nested or called
> > + * concurrently.
>
> Under what circumstances would they be nested?
Hmmmm.. Right they cannot be nested. Multiple processors can have
invalidates() concurrently in progress.
> > That is legit. However, no new external references
>
> references to what?
To the ranges that are in the process of being invalidated.
> > + * invalidate_range_begin() must clear all references in the range
> > + * and stop the establishment of new references.
>
> and stop the establishment of new references within the range, I assume?
Right.
> If so, that's putting a heck of a lot of complexity into the driver, isn't
> it? It needs to temporarily remember an arbitrarily large number of
> regions in this mm against which references may not be taken?
That is one implementation (XPmem does that). The other is to simply stop
all references when any invalidate_range is in progress (KVM and GRU do
that).
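The second, simpler strategy can be as little as a counter that the driver's
own fault path checks. A sketch under assumed names (my_dev, invalidate_count
and the my_dev_* helpers are all made up; this is not KVM's or GRU's actual
code):

static void my_invalidate_range_begin(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      int atomic)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	atomic_inc(&dev->invalidate_count);
	/* drops the external ptes for the range under dev->lock */
	my_dev_zap_mappings(dev, start, end);
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end,
				    int atomic)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	atomic_dec(&dev->invalidate_count);
}

/*
 * Device fault path: refuse to establish a new external mapping while
 * any invalidate is in flight. The check and the mapping are done under
 * the same lock the zap path takes, so a mapping established just before
 * the counter goes up is still torn down by the following zap.
 */
static int my_dev_fault(struct my_dev *dev, unsigned long address)
{
	int ret = -EAGAIN;	/* caller retries later */

	spin_lock(&dev->lock);
	if (!atomic_read(&dev->invalidate_count))
		ret = my_dev_map_page(dev, address);
	spin_unlock(&dev->lock);
	return ret;
}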
> > + * invalidate_range_end() reenables the establishment of references.
>
> within the range?
Right.
> > +extern void mmu_notifier_release(struct mm_struct *mm);
> > +extern int mmu_notifier_age_page(struct mm_struct *mm,
> > + unsigned long address);
>
> There's the mysterious age_page again.
Andrea put this in to check the reference status of a page. It functions
like the accessed bit.
> > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> > +{
> > + INIT_HLIST_HEAD(&mnh->head);
> > +}
> > +
> > +#define mmu_notifier(function, mm, args...) \
> > + do { \
> > + struct mmu_notifier *__mn; \
> > + struct hlist_node *__n; \
> > + \
> > + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> > + rcu_read_lock(); \
> > + hlist_for_each_entry_rcu(__mn, __n, \
> > + &(mm)->mmu_notifier.head, \
> > + hlist) \
> > + if (__mn->ops->function) \
> > + __mn->ops->function(__mn, \
> > + mm, \
> > + args); \
> > + rcu_read_unlock(); \
> > + } \
> > + } while (0)
>
> The macro references its args more than once. Anyone who does
>
> mmu_notifier(function, some_function_which_has_side_effects())
>
> will get a surprise. Use temporaries.
Ok.
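One possible shape of the fix, purely as a sketch (not the change that was
actually merged): evaluate mm once into a local so a side-effecting argument
is only expanded a single time.

#define mmu_notifier(function, mm, args...)				\
	do {								\
		struct mm_struct *__mm = (mm);	/* evaluate once */	\
		struct mmu_notifier *__mn;				\
		struct hlist_node *__n;					\
									\
		if (unlikely(!hlist_empty(&__mm->mmu_notifier.head))) {	\
			rcu_read_lock();				\
			hlist_for_each_entry_rcu(__mn, __n,		\
					&__mm->mmu_notifier.head,	\
					hlist)				\
				if (__mn->ops->function)		\
					__mn->ops->function(__mn, __mm,	\
							    args);	\
			rcu_read_unlock();				\
		}							\
	} while (0)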
> > +#define mmu_notifier(function, mm, args...) \
> > + do { \
> > + if (0) { \
> > + struct mmu_notifier *__mn; \
> > + \
> > + __mn = (struct mmu_notifier *)(0x00ff); \
> > + __mn->ops->function(__mn, mm, args); \
> > + }; \
> > + } while (0)
>
> That's a bit weird. Can't we do the old
>
> (void)function;
> (void)mm;
>
> trick? Or make it a static inline function?
Static inline won't allow the checking of the parameters.
(void) may be a good thing here.
> > +config MMU_NOTIFIER
> > + def_bool y
> > + bool "MMU notifier, for paging KVM/RDMA"
>
> Why is this not selectable? The help seems a bit brief.
>
> Does this cause 32-bit systems to drag in a bunch of code they're not
> allowed to ever use?
I have selected it a number of times. We could make the help text a bit
longer, right.
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + hlist_for_each_entry_safe(mn, n, t,
> > + &mm->mmu_notifier.head, hlist) {
> > + hlist_del_init(&mn->hlist);
> > + if (mn->ops->release)
> > + mn->ops->release(mn, mm);
>
> We do this a lot, but back in the old days people didn't like optional
> callbacks which can be NULL. If we expect that mmu_notifier_ops.release is
> usually implemented, then just unconditionally call it and require that all
> clients implement it. Perhaps provide an exported-to-modules stub in core
> kernel for clients which didn't want to implement ->release().
Ok.
> > +{
> > + struct mmu_notifier *mn;
> > + struct hlist_node *n;
> > + int young = 0;
> > +
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + rcu_read_lock();
> > + hlist_for_each_entry_rcu(mn, n,
> > + &mm->mmu_notifier.head, hlist) {
> > + if (mn->ops->age_page)
> > + young |= mn->ops->age_page(mn, mm, address);
> > + }
> > + rcu_read_unlock();
> > + }
> > +
> > + return young;
> > +}
>
> should the rcu_read_lock() cover the hlist_empty() test?
>
> This function looks like it was tossed in at the last minute. It's
> mysterious, undocumented, poorly commented, poorly named. A better name
> would be one which has some correlation with the return value.
>
> Because anyone who looks at some code which does
>
> if (mmu_notifier_age_page(mm, address))
> ...
>
> has to go and reverse-engineer the implementation of
> mmu_notifier_age_page() to work out under which circumstances the "..."
> will be executed. But this should be apparent just from reading the callee
> implementation.
>
> This function *really* does need some documentation. What does it *mean*
> when the ->age_page() from some of the notifiers returned "1" and the
> ->age_page() from some other notifiers returned zero? Dunno.
Andrea: Could you provide some more detail here?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-16 3:37 ` Andrew Morton
2008-02-16 11:07 ` Andrea Arcangeli
@ 2008-02-16 19:22 ` Christoph Lameter
2008-02-16 19:54 ` Avi Kivity
2008-02-19 8:46 ` Nick Piggin
2008-02-18 1:51 ` Nick Piggin
2 siblings, 2 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 15 Feb 2008, Andrew Morton wrote:
> > @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
> > if (vma->vm_flags & VM_LOCKED) {
> > referenced++;
> > *mapcount = 1; /* break early from loop */
> > - } else if (ptep_clear_flush_young(vma, address, pte))
> > + } else if (ptep_clear_flush_young(vma, address, pte) |
> > + mmu_notifier_age_page(mm, address))
> > referenced++;
>
> The "|" is obviously deliberate. But no explanation is provided telling us
> why we still call the callback if ptep_clear_flush_young() said the page
> was recently referenced. People who read your code will want to understand
> this.
Andrea?
> > flush_cache_page(vma, address, pte_pfn(*pte));
> > entry = ptep_clear_flush(vma, address, pte);
> > + mmu_notifier(invalidate_page, mm, address);
>
> I just don't see how this can be done if the callee has another thread in
> the middle of establishing IO against this region of memory.
> ->invalidate_page() _has_ to be able to block. Confused.
The page lock is held and that holds off I/O?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-16 19:26 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 15 Feb 2008, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:01 -0800 Christoph Lameter <clameter@sgi.com> wrote:
>
> > The invalidation of address ranges in a mm_struct needs to be
> > performed when pages are removed or permissions etc change.
>
> hm. Do they? Why? If I'm in the process of zero-copy writing a hunk of
> memory out to hardware then do I care if someone write-protects the ptes?
>
> Spose so, but some fleshing-out of the various scenarios here would clarify
> things.
You care f.e. if the VM needs to writeprotect a memory range and a write
occurs. In that case the VM needs to do proper write processing and a write
through an external pte would cause memory corruption.
> > If invalidate_range_begin() is called with locks held then we
> > pass a flag into invalidate_range() to indicate that no sleeping is
> > possible. Locks are only held for truncate and huge pages.
>
> This is so bad.
Ok so I can twiddle around with the inode_mmap_lock to drop it while this
is called?
> > In two cases we use invalidate_range_begin/end to invalidate
> > single pages because the pair allows holding off new references
> > (idea by Robin Holt).
>
> Assuming that there is a missing "within the range" in this description, I
> assume that all clients will just throw up their hands in horror and will
> disallow all references to all parts of the mm.
Right. Missing within the range. We only need to disallow creating new
ptes right? Why disallow references?
> > xip_unmap: We are not taking the PageLock so we cannot
> > use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> > stands in.
>
> What does "stands in" mean?
Use a range begin / end to invalidate a page.
> > + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
> > err = populate_range(mm, vma, start, size, pgoff);
> > + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
>
> To avoid off-by-one confusion the changelogs, documentation and comments
> should be very careful to tell the reader whether the range includes the
> byte at start+size. I don't thik that was done?
No, it was not. I assumed that the convention is always start - (end - 1)
and the byte at end is not affected by the operation.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-16 19:28 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 15 Feb 2008, Andrew Morton wrote:
> > +#define mmu_rmap_notifier(function, args...) \
> > + do { \
> > + struct mmu_rmap_notifier *__mrn; \
> > + struct hlist_node *__n; \
> > + \
> > + rcu_read_lock(); \
> > + hlist_for_each_entry_rcu(__mrn, __n, \
> > + &mmu_rmap_notifier_list, hlist) \
> > + if (__mrn->ops->function) \
> > + __mrn->ops->function(__mrn, args); \
> > + rcu_read_unlock(); \
> > + } while (0);
> > +
>
> buggy macro: use locals.
Ok. Same as the non rmap version.
> > +EXPORT_SYMBOL(mmu_rmap_export_page);
>
> The other patch used EXPORT_SYMBOL_GPL.
Ok will make that consistent.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 10:58 ` Andrew Morton
@ 2008-02-16 19:31 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: Brice Goglin, Andrea Arcangeli, linux-kernel, linux-mm
On Sat, 16 Feb 2008, Andrew Morton wrote:
> "looks good" maybe. But it's in the details where I fear this will come
> unstuck. The likelihood that some callbacks really will want to be able to
> block in places where this interface doesn't permit that - either to wait
> for IO to complete or to wait for other threads to clear critical regions.
We can get the invalidate_range to always be called without spinlocks if
we deal with the case of the inode_mmap_lock being held in the truncate case.
If you always want to be able to sleep then we could drop the
invalidate_page() that is called while pte locks held and require the use
of a device driver rmap?
> From that POV it doesn't look like a sufficiently general and useful
> design. Looks like it was grafted onto the current VM implementation in a
> way which just about suits two particular clients if they try hard enough.
You missed KVM. We did the best we could while being as minimally invasive
as possible.
> Which is all perfectly understandable - it would be hard to rework core MM
> to be able to make this interface more general. But I do think it's
> half-baked and there is a decent risk that future (or present) code which
> _could_ use something like this won't be able to use this one, and will
> continue to futz with mlock, page-pinning, etc.
>
> Not that I know what the fix to that is..
You do not see a chance of this being okay if we adopt the two measures
that I mentioned above?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-16 19:22 ` Christoph Lameter
@ 2008-02-16 19:54 ` Avi Kivity
2008-02-19 8:46 ` Nick Piggin
1 sibling, 0 replies; 116+ messages in thread
From: Avi Kivity @ 2008-02-16 19:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Andrea Arcangeli, Robin Holt, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
>
>
>>> @@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
>>> if (vma->vm_flags & VM_LOCKED) {
>>> referenced++;
>>> *mapcount = 1; /* break early from loop */
>>> - } else if (ptep_clear_flush_young(vma, address, pte))
>>> + } else if (ptep_clear_flush_young(vma, address, pte) |
>>> + mmu_notifier_age_page(mm, address))
>>> referenced++;
>>>
>> The "|" is obviously deliberate. But no explanation is provided telling us
>> why we still call the callback if ptep_clear_flush_young() said the page
>> was recently referenced. People who read your code will want to understand
>> this.
>>
>
> Andrea?
>
>
I'm not Andrea, but the way I read it, ptep_clear_flush_young() and
->age_page() each have two effects: check whether the page has been
referenced and clear the referenced bit. || would retain the semantics
of the check but lose the clearing. | does the right thing.
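In other words (illustration only, using the same calls as the hunk quoted
above):

	int young;

	/* both calls must run for their side effect of clearing the
	 * young/accessed bits, not only for their return value */
	young  = ptep_clear_flush_young(vma, address, pte);	/* Linux pte */
	young |= mmu_notifier_age_page(mm, address);		/* external ptes */

	/* with "a || b", b is skipped whenever a already returned 1, so
	 * the secondary-MMU accessed bits would never get cleared */
	if (young)
		referenced++;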
--
Any sufficiently difficult bug is indistinguishable from a feature.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 19:21 ` Christoph Lameter
@ 2008-02-17 3:01 ` Andrea Arcangeli
2008-02-17 12:24 ` Robin Holt
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-17 3:01 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
>
> > What is the status of getting infiniband to use this facility?
>
> Well we are talking about this it seems.
It seems the IB folks think allowing RDMA over virtual memory is not
interesting, their argument seem to be that RDMA is only interesting
on RAM (and they seem not interested in allowing RDMA over a ram+swap
backed _virtual_ memory allocation). They've just to decide if
ram+swap allocation for RDMA is useful or not.
> > How important is this feature to KVM?
>
> Andrea can answer this.
I think I already did in separate email.
> > That sucks big time. What do we need to do to get the callback
> > functions called in non-atomic context?
I sure agree given I also asked to drop the lock param and enforce the
invalidate_range_* to always be called in non atomic context.
> We would have to drop the inode_mmap_lock. Could be done with some minor
> work.
The invalidate may be deferred until after releasing the lock; the lock may
not have to be dropped to clean up the API (and make xpmem's life easier).
> That is one implementation (XPmem does that). The other is to simply stop
> all references when any invalidate_range is in progress (KVM and GRU do
> that).
KVM doesn't stop new references. It doesn't need to because it holds a
reference on the page (GRU doesn't). KVM can invalidate the spte and
flush the tlb only after the linux pte has been cleared and after the
page has been released by the VM (because the page doesn't go in the
freelist and it remains pinned for a little while, until the spte is
dropped too inside invalidate_range_end). GRU has to invalidate
_before_ the linux pte is cleared so it has to stop new references
from being established in the invalidate_range_start/end critical
section.
> Andrea put this in to check the reference status of a page. It functions
> like the accessed bit.
In short each pte can have some spte associated to it. So whenever we
do a ptep_clear_flush protected by the PT lock, we also have to run
invalidate_page that will internally invoke a sort-of
sptep_clear_flush protected by a kvm->mmu_lock (equivalent of
page_table_lock/PT-lock). sptes just like ptes maps virtual addresses
to physical addresses, so you can read/write to RAM either through a
pte or through a spte.
Just like it would be insane to have any requirement that
ptep_clear_flush has to run in non-atomic context (forcing a
conversion of the PT lock to a mutex), it's also weird to require the
invalidate_page/age_page to run in non-atomic context.
All troubles start with the xpmem requirements of having to schedule
in its equivalent of the sptep_clear_flush because it's not a
gigaherz-in-cpu thing but a gigabit thing where the network stack is
involved with its own software linux driven skb memory allocations,
schedules waiting for network I/O, etc... Imagine ptes allocated in a
remote node, no surprise it brings a new set of problems (assuming it
can work reliably during oom given its memory requirements in the
try_to_unmap path, no page can ever be freed until the skbs have been
allocated and sent and allocated again to receive the ack).
Furthermore xpmem doesn't associate any pte to a spte, it associates a
page_t to certain remote references, or it would be in trouble with
invalidate_page that corresponds to ptep_clear_flush on a virtual
address that exists thanks to the anon_vma/i_mmap lock held (and not
thanks to the mmap_sem like in all invalidate_range calls).
Christoph's patch is a mix of two entirely separate features. KVM can
live with V7 just fine, but it's a lot more than what is needed by KVM.
I don't think that invalidate_page/age_page must be allowed to sleep
because invalidate_range also can sleep. You just have to ask yourself
whether the VM locks shall remain spinlocks, for the VM's own good (not for
the mmu notifiers good). It'd be bad to make the VM underperform with
mutex protecting tiny critical sections to please some mmu notifier
user. But if they're spinlocks, then clearly invalidate_page/age_page
based on virtual addresses can't sleep or the virtual address wouldn't
make sense anymore by the time the spinlock is released.
> > This function looks like it was tossed in at the last minute. It's
> > mysterious, undocumented, poorly commented, poorly named. A better name
> > would be one which has some correlation with the return value.
> >
> > Because anyone who looks at some code which does
> >
> > if (mmu_notifier_age_page(mm, address))
> > ...
> >
> > has to go and reverse-engineer the implementation of
> > mmu_notifier_age_page() to work out under which circumstances the "..."
> > will be executed. But this should be apparent just from reading the callee
> > implementation.
> >
> > This function *really* does need some documentation. What does it *mean*
> > when the ->age_page() from some of the notifiers returned "1" and the
> > ->age_page() from some other notifiers returned zero? Dunno.
>
> Andrea: Could you provide some more detail here?
age_page is simply the ptep_clear_flush_young equivalent for
sptes. It's meant to provide aging to the pages mapped by secondary
mmus. Its return value is the same one of ptep_clear_flush_young but
it represents the sptes associated with the pte,
ptep_clear_flush_young instead only takes care of the pte itself.
For KVM the below would be all that is needed, the fact
invalidate_range can sleep and invalidate_page/age_page can't, is
because their users are very different. With my approach the mmu
notifiers callback are always protected by the PT lock (just like
ptep_clear_flush and the other pte+tlb manglings) and they're called
after the pte is cleared and before the VM reference on the page has
been dropped. That makes it safe for GRU too, so for my initial
approach _none_ of the callbacks was allowed to sleep, and that was a
feature that allows GRU not to block its tlb miss interrupt with any
further locking (the PT-lock taken by follow_page automatically
serialized the GRU interrupt against the MMU notifiers and the linux
page fault). For KVM the invalidate_pages of my patch is converted to
invalidate_range_end because it doesn't matter for KVM if it's called
after the PT lock has been dropped. In the try_to_unmap case
invalidate_page is called by atomic context in Christoph's patch too,
because a virtual address and in turn a pte and in turn certain sptes,
can only exist thanks to the spinlocks taken by the VM. Changing the
VM to make mmu notifiers sleepable in the try_to_unmap path sounds bad
to me, especially given not even xpmem needs this.
You can see how everything looks simpler and more symmetric by
assuming the secondary mmu-references are established and dropped like
ptes, like in the KVM case where in fact sptes are a pure cpu thing
exactly like the ptes. XPMEM adds the requirement that sptes are in fact
remote entities that are mangled by a message passing protocol over
the network, it's the same as ptep_clear_flush being required to
schedule and send skbs to be successful and allowing try_to_unmap to
do its work. Same problem. No wonder the patch gets more complicated then.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
__young = ptep_test_and_clear_young(__vma, __address, __ptep); \
if (__young) \
flush_tlb_page(__vma, __address); \
+ __young |= mmu_notifier_age_page((__vma)->vm_mm, __address); \
__young; \
})
#endif
@@ -86,6 +87,7 @@ do { \
pte_t __pte; \
__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \
flush_tlb_page(__vma, __address); \
+ mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \
__pte; \
})
#endif
diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
--- a/include/asm-s390/pgtable.h
+++ b/include/asm-s390/pgtable.h
@@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
{
pte_t pte = *ptep;
ptep_invalidate(address, ptep);
+ mmu_notifier(invalidate_page, vma->vm_mm, address);
return pte;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
+#include <linux/mmu_notifier.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -219,6 +220,8 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,132 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+ /*
+ * Called when nobody can register any more notifier in the mm
+ * and after the "mn" notifier has been disarmed already.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * invalidate_page[s] is called in atomic context
+ * after any pte has been updated and before
+ * dropping the PT lock required to update any Linux pte.
+ * Once the PT lock will be released the pte will have its
+ * final value to export through the secondary MMU.
+ * Before this is invoked any secondary MMU is still ok
+ * to read/write to the page previously pointed by the
+ * Linux pte because the old page hasn't been freed yet.
+ * If required set_page_dirty has to be called internally
+ * to this method.
+ */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+ void (*invalidate_pages)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
+ /*
+ * Age page is called in atomic context inside the PT lock
+ * right after the VM is test-and-clearing the young/accessed
+ * bitflag in the pte. This way the VM will provide proper aging
+ * to the accesses to the page through the secondary MMUs
+ * and not only to the ones through the Linux pte.
+ */
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+};
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_head {
+ struct hlist_head head;
+ spinlock_t lock;
+};
+
+#include <linux/mm_types.h>
+
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the "struct mmu_notifier" can be freed. Alternatively it
+ * can be synchronously freed inside ->release when the list can't
+ * change anymore and nobody could possibly walk it.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+ spin_lock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+struct mmu_notifier_head {};
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_age_page(mm, address) ({ 0; })
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ }; \
+ } while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -360,6 +360,7 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -756,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
if (pte_none(pte))
continue;
+ mmu_notifier(invalidate_page, mm, address);
page = pte_page(pte);
if (pte_dirty(pte))
set_page_dirty(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -494,6 +494,7 @@ static int copy_pte_range(struct mm_stru
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
int rss[2];
+ unsigned long start;
again:
rss[1] = rss[0] = 0;
@@ -505,6 +506,7 @@ again:
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
arch_enter_lazy_mmu_mode();
+ start = addr;
do {
/*
* We are holding two locks at this point - either of them
@@ -525,6 +527,8 @@ again:
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
+ if (is_cow_mapping(vma->vm_flags))
+ mmu_notifier(invalidate_pages, vma->vm_mm, start, addr);
spin_unlock(src_ptl);
pte_unmap_nested(src_pte - 1);
add_mm_rss(dst_mm, rss[0], rss[1]);
@@ -660,6 +664,7 @@ static unsigned long zap_pte_range(struc
}
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
+ mmu_notifier(invalidate_page, mm, addr);
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
@@ -1248,6 +1253,7 @@ static int remap_pte_range(struct mm_str
{
pte_t *pte;
spinlock_t *ptl;
+ unsigned long start = addr;
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
@@ -1259,6 +1265,7 @@ static int remap_pte_range(struct mm_str
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
+ mmu_notifier(invalidate_pages, mm, start, addr);
pte_unmap_unlock(pte - 1, ptl);
return 0;
}
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2044,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);
/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,73 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/rcupdate.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *tmp;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ hlist_for_each_entry_safe(mn, n, tmp,
+ &mm->mmu_notifier.head, hlist) {
+ hlist_del(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ spin_lock(&mm->mmu_notifier.lock);
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+ spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ spin_lock(&mm->mmu_notifier.lock);
+ hlist_del_rcu(&mn->hlist);
+ spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@ static void change_pte_range(struct mm_s
{
pte_t *pte, oldpte;
spinlock_t *ptl;
+ unsigned long start = addr;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
+ mmu_notifier(invalidate_pages, mm, start, addr);
pte_unmap_unlock(pte - 1, ptl);
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-16 3:37 ` Andrew Morton
` (2 preceding siblings ...)
2008-02-16 19:21 ` Christoph Lameter
@ 2008-02-17 5:04 ` Doug Maxey
3 siblings, 0 replies; 116+ messages in thread
From: Doug Maxey @ 2008-02-17 5:04 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman, Ben Herrenschmidt, Jan-Bernd Themann
On Fri, 15 Feb 2008 19:37:19 PST, Andrew Morton wrote:
> Which other potential clients have been identified and how important is it
> to those?
The powerpc ehea utilizes its own mmu. Not sure about the importance
to the driver. (But will investigate :)
++doug
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-17 3:01 ` Andrea Arcangeli
@ 2008-02-17 12:24 ` Robin Holt
0 siblings, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-17 12:24 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Andrew Morton, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Sun, Feb 17, 2008 at 04:01:20AM +0100, Andrea Arcangeli wrote:
> On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> > On Fri, 15 Feb 2008, Andrew Morton wrote:
> >
> > > What is the status of getting infiniband to use this facility?
> >
> > Well we are talking about this it seems.
>
> It seems the IB folks think allowing RDMA over virtual memory is not
> interesting, their argument seem to be that RDMA is only interesting
> on RAM (and they seem not interested in allowing RDMA over a ram+swap
> backed _virtual_ memory allocation). They've just to decide if
> ram+swap allocation for RDMA is useful or not.
I don't think that is a completely fair characterization. It would be
more fair to say that the changes required to their library/user api
would be too significant to allow an adaptation to any scheme which
allowed removal of physical memory below a virtual mapping.
I agree with the IB folks when they say it is impossible with their
current scheme. The fact that any consumer of their endpoint identifier
can use any identifier without notifying the kernel prior to its use
certainly makes any implementation under any scheme impossible.
I guess we could possibly make things work for IB if we did some heavy
work. Let's assume, instead of passing around the physical endpoint
identifiers, they passed around a handle. In order for any IB endpoint
to communicate, it would need to request the kernel translate a handle
into an endpoint identifier. In order for the kernel to put a TLB
entry into the processes address space allowing the process access to
the _CARD_, it would need to ensure all the current endpoint identifiers
for this process were "active" meaning we have verified with the other
endpoint that all pages are faulted and TLB/PFN information is in the
owning card's TLB/PFN tables. Once all of a process's endpoints are
"active" we would drop in the PFN for the adapter into the page tables.
Any time pages are being revoked from under an active handle, we would
shoot-down the IB adapter card TLB entries for all the remote users of
this handle and quiesce the card's state to ensure transfers are either
complete or terminated. When there are no active transfers, we would
respond back to the owner and they could complete the source process
page table cleaning. Any time all of the pages for a handle can not be
mapped from virtual to physical, the remote process would be SIGBUS'd
instead of having its IB adapter TLB entry installed.
This is essentially how XPMEM does it except we have the benefit of
working on individual pages.
Again, not knowing what I am talking about, but under the assumption that
MPI IB use is contained to a library, I would hope the changes could be
contained under the MPI-to-IB library interface and would not need any
changes at the MPI-user library interface.
We do keep track of the virtual address ranges within a handle that
are being used. I assume the IB folks will find that helpful as well.
Otherwise, I think they could make things operate this way. XPMEM has
the advantage of not needing to have virtual-to-physical at all times,
but otherwise it is essentially the same.
Thanks,
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-16 3:37 ` Andrew Morton
2008-02-16 11:07 ` Andrea Arcangeli
2008-02-16 19:22 ` Christoph Lameter
@ 2008-02-18 1:51 ` Nick Piggin
2 siblings, 0 replies; 116+ messages in thread
From: Nick Piggin @ 2008-02-18 1:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Saturday 16 February 2008 14:37, Andrew Morton wrote:
> On Thu, 14 Feb 2008 22:49:02 -0800 Christoph Lameter <clameter@sgi.com>
wrote:
> > Two callbacks to remove individual pages as done in rmap code
> >
> > invalidate_page()
> >
> > Called from the inner loop of rmap walks to invalidate pages.
> >
> > age_page()
> >
> > Called for the determination of the page referenced status.
> >
> > If we do not care about page referenced status then an age_page callback
> > may be omitted. PageLock and pte lock are held when either of the
> > functions is called.
>
> The age_page mystery shallows.
BTW. can this callback be called mmu_notifier_clear_flush_young? To
match the core VM.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] KVM swapping with MMU Notifiers V7
2008-02-16 11:08 ` Andrew Morton
@ 2008-02-18 12:17 ` Andrea Arcangeli
0 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-18 12:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Sat, Feb 16, 2008 at 03:08:17AM -0800, Andrew Morton wrote:
> On Sat, 16 Feb 2008 11:48:27 +0100 Andrea Arcangeli <andrea@qumranet.com> wrote:
>
> > +void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long start, unsigned long end,
> > + int lock)
> > +{
> > + for (; start < end; start += PAGE_SIZE)
> > + kvm_mmu_notifier_invalidate_page(mn, mm, start);
> > +}
> > +
> > +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> > + .invalidate_page = kvm_mmu_notifier_invalidate_page,
> > + .age_page = kvm_mmu_notifier_age_page,
> > + .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
> > +};
>
> So this doesn't implement ->invalidate_range_start().
Correct. range_start is needed by subsystems that don't pin the pages
(so they have to drop the secondary mmu mappings on the physical page
before the page is released by the linux VM).
> By what means does it prevent new mappings from being established in the
> > range after core mm has tried to call ->invalidate_range_start()?
> mmap_sem, I assume?
No, populate_range only takes the mmap_sem in read mode, and the kvm page
fault of course also takes it only in read mode.
What makes it safe is that invalidate_range_end is called _after_ the
linux pte is cleared. The kvm page fault, if it triggers, will call
into get_user_pages again to re-establish the linux pte _before_
establishing the spte.
It's the same reason why it's safe to flush the tlb after clearing the
linux pte. sptes are like a secondary tlb.
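A rough sketch of the fault-side ordering being relied on (illustration only;
kvm_fault_in_spte and kvm_install_spte are made-up names, while
get_user_pages(), gfn_t and kvm->mmu_lock come from the existing code):

static int kvm_fault_in_spte(struct kvm *kvm, unsigned long hva, gfn_t gfn)
{
	struct page *page;

	/*
	 * Re-faults the Linux pte first. If an unmap of this range just
	 * ran, this simply faults the page (or a new page) back in.
	 */
	if (get_user_pages(current, current->mm, hva, 1, 1, 0,
			   &page, NULL) != 1)
		return -EFAULT;

	/* only once the Linux pte exists again is the spte installed */
	spin_lock(&kvm->mmu_lock);
	kvm_install_spte(kvm, gfn, page);	/* hypothetical helper */
	spin_unlock(&kvm->mmu_lock);
	return 0;
}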
> > + /* set userspace_addr atomically for kvm_hva_to_rmapp */
> > + spin_lock(&kvm->mmu_lock);
> > + memslot->userspace_addr = userspace_addr;
> > + spin_unlock(&kvm->mmu_lock);
>
> are you sure? kvm_unmap_hva() and kvm_age_hva() read ->userspace_addr a
> single time and it doesn't immediately look like there's a need to take the
> lock here?
gcc will always write it with a movq, but this is to be
C-specs-compliant, and because this is by far not a performance-critical
path I thought it was simpler than some other atomic move in
a single insn.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] KVM swapping with MMU Notifiers V7
2008-02-16 11:51 ` Robin Holt
@ 2008-02-18 12:35 ` Andrea Arcangeli
0 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-18 12:35 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, akpm, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Sat, Feb 16, 2008 at 05:51:38AM -0600, Robin Holt wrote:
> I am doing this in xpmem with a stack-based structure in the function
> calling get_user_pages. That structure describes the start and
> end address of the range we are doing the get_user_pages on. If an
> invalidate_range_begin comes in while we are off to the kernel doing
> the get_user_pages, the invalidate_range_begin marks that structure
> indicating an invalidate came in. When the get_user_pages gets the
> structures relocked, it checks that flag (really a generation counter)
> and if it is set, retries the get_user_pages. After 3 retries, it
> returns -EAGAIN and the fault is started over from the remote side.
A seqlock sounds like a good optimization for the non-swapping fast path; a
per-VM-guest seqlock number can allow us to know when we need to worry
about calling get_user_pages a second time. It won't really be a retry like
in 99% of seqlock usages on the reader side, but just a second
get_user_pages to trigger a minor fault. Then if the page is different
in the second run, we'll really retry (so not as a function of the
seqlock but as a function of the get_user_pages page array), and there's
no risk of livelocks because get_user_pages returning a different page
won't be the common case. The seqlock should be increased once before
the invalidate and a second time once the invalidate is over.
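A sketch of what that could look like (the mmu_seq field and
kvm_zap_sptes_in_range are hypothetical; the seqlock primitives are the
stock kernel ones):

	/* invalidate side: bump the counter once before and once after */
	write_seqlock(&kvm->mmu_seq);
	kvm_zap_sptes_in_range(kvm, start, end);	/* hypothetical */
	write_sequnlock(&kvm->mmu_seq);

	/* fault side: if an invalidate ran while we slept inside
	 * get_user_pages, do a second get_user_pages to refault the page */
	seq = read_seqbegin(&kvm->mmu_seq);
	npages = get_user_pages(current, current->mm, hva, 1, 1, 0,
				&page, NULL);
	if (npages == 1 && read_seqretry(&kvm->mmu_seq, seq)) {
		/* an invalidate ran in between: refault, and if the page
		 * changed the caller really retries */
		put_page(page);
		npages = get_user_pages(current, current->mm, hva, 1, 1, 0,
					&page, NULL);
	}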
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-18 22:33 ` Roland Dreier
1 sibling, 0 replies; 116+ messages in thread
From: Roland Dreier @ 2008-02-18 22:33 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
It seems that we've come up with two reasonable cases where it makes
sense to use these notifiers for InfiniBand/RDMA:
First, the ability to safely DMA to/from userspace memory with the
memory regions mlock()ed but the pages not pinned. In this case the
notifiers here would seem to suit us well:
> + void (*invalidate_range_begin)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int atomic);
> +
> + void (*invalidate_range_end)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int atomic);
If I understand correctly, the IB stack would have to get the hardware
driver to shoot down translation entries and suspend access to the
region when an invalidate_range_begin notifier is called, and wait for
the invalidate_range_end notifier to repopulate the adapter
translation tables. This will probably work OK as long as the
interval between the invalidate_range_begin and invalidate_range_end
calls is not "too long."
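Roughly, the driver side would look something like this (a sketch only;
the structure and helper comments describe what our hardware driver would
have to do, and the names are invented):

#include <linux/mmu_notifier.h>	/* as introduced by this patchset */

struct rdma_reg_sketch {		/* hypothetical per-registration state */
	struct mmu_notifier	mn;
	/* ... adapter translation table handle, pinned page list, etc. ... */
};

static void rdma_sketch_range_begin(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end,
				    int atomic)
{
	struct rdma_reg_sketch *reg = container_of(mn, struct rdma_reg_sketch, mn);

	/*
	 * Suspend RDMA access to [start, end) and shoot down the adapter's
	 * translation entries for the range, so no DMA can hit the old pages.
	 */
}

static void rdma_sketch_range_end(struct mmu_notifier *mn,
				  struct mm_struct *mm,
				  unsigned long start, unsigned long end,
				  int atomic)
{
	struct rdma_reg_sketch *reg = container_of(mn, struct rdma_reg_sketch, mn);

	/*
	 * Re-fault/repopulate the adapter translation tables for the range
	 * and resume access; the window between begin and end must stay short.
	 */
}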
Also, using this effectively requires us to figure out how we want to
mlock() regions that are going to be used for RDMA. We could require
userspace to do it, but it's not clear to me that we're safe in the
case where userspace decides not to... what happens if some pages get
swapped out after the invalidate_range_begin notifier?
The second case where some form of notifier is useful is for
userspace to know when a memory registration is still valid, i.e. Pete
Wyckoff's work:
http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf
http://www.osc.edu/~pw/dreg/
however these MMU notifiers seem orthogonal to that: the registration
cache is concerned with address spaces, not page mapping, and hence
the existing vma operations seem to be a better fit.
- R.
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-16 19:22 ` Christoph Lameter
2008-02-16 19:54 ` Avi Kivity
@ 2008-02-19 8:46 ` Nick Piggin
2008-02-19 13:30 ` Andrea Arcangeli
1 sibling, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-02-19 8:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Andrea Arcangeli, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
> > > flush_cache_page(vma, address, pte_pfn(*pte));
> > > entry = ptep_clear_flush(vma, address, pte);
> > > + mmu_notifier(invalidate_page, mm, address);
> >
> > I just don't see how this can be done if the callee has another thread in
> > the middle of establishing IO against this region of memory.
> > ->invalidate_page() _has_ to be able to block. Confused.
>
> The page lock is held and that holds off I/O?
I think the actual answer is that "it doesn't matter".
ptes are not exactly the entity via which IO gets established, so
all we really care about here is that after the callback finishes,
we will not get any more reads or writes to the page via the
external mapping.
As far as holding off local IO goes, that is the job of the core
VM. (And no, page lock does not necessarily hold it off FYI -- it
can be writeback IO or even IO directly via buffers).
Holding off IO via the external references I guess is a job for
the notifier driver.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-19 8:54 ` Nick Piggin
2008-02-19 13:34 ` Andrea Arcangeli
2008-02-19 23:08 ` Nick Piggin
2 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-02-19 8:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
>
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
This whole thing would be much better if you didn't rely on the page
lock at all, but either a) used the same locking as Linux does for its
ptes/tlbs, or b) have some locking that is private to the mmu notifier
code. Then there is not all this new stuff that has to be understood in
the core VM.
Also, why do you have to "invalidate" ranges when switching to a
_more_ permissive state? This stuff should basically be the same as
(a subset of) the TLB flushing API AFAIKS. Anything more is a pretty
big burden to put in the core VM.
See my alternative patch I posted -- I can't see why it won't work
just like a TLB.
As far as sleeping inside callbacks goes... I think there are big
problems with the patch (the sleeping patch and the external rmap
patch). I don't think it is workable in its current state. Either
we have to make some big changes to the core VM, or we have to turn
some locks into sleeping locks to do it properly AFAIKS. Neither
one is good.
But anyway, I don't really think the two approaches (Andrea's
notifiers vs sleeping/xrmap) should be tangled up too much. I
think Andrea's can possibly be quite unintrusive and useful very
soon.
* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks
2008-02-19 8:46 ` Nick Piggin
@ 2008-02-19 13:30 ` Andrea Arcangeli
0 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-19 13:30 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Andrew Morton, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Tue, Feb 19, 2008 at 07:46:10PM +1100, Nick Piggin wrote:
> On Sunday 17 February 2008 06:22, Christoph Lameter wrote:
> > On Fri, 15 Feb 2008, Andrew Morton wrote:
>
> > > > flush_cache_page(vma, address, pte_pfn(*pte));
> > > > entry = ptep_clear_flush(vma, address, pte);
> > > > + mmu_notifier(invalidate_page, mm, address);
> > >
> > > I just don't see how this can be done if the callee has another thread in
> > > the middle of establishing IO against this region of memory.
> > > ->invalidate_page() _has_ to be able to block. Confused.
> >
> > The page lock is held and that holds off I/O?
>
> I think the actual answer is that "it doesn't matter".
Agreed. The PG_lock itself, taken when invalidate_page is called, is
used to serialize the VM against the VM, not the VM against I/O.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-19 8:54 ` Nick Piggin
@ 2008-02-19 13:34 ` Andrea Arcangeli
2008-02-27 22:23 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-19 13:34 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, akpm, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Tue, Feb 19, 2008 at 07:54:14PM +1100, Nick Piggin wrote:
> As far as sleeping inside callbacks goes... I think there are big
> problems with the patch (the sleeping patch and the external rmap
> patch). I don't think it is workable in its current state. Either
> we have to make some big changes to the core VM, or we have to turn
> some locks into sleeping locks to do it properly AFAIKS. Neither
> one is good.
Agreed.
The thing is quite simple: the moment we support xpmem, the complexity
in the mmu notifier patch starts, and there are hacks, duplicated
functionality through the same xpmem callbacks, etc... GRU can already
be 100% supported (in fact simpler and safer) with my patch.
> But anyway, I don't really think the two approaches (Andrea's
> notifiers vs sleeping/xrmap) should be tangled up too much. I
> think Andrea's can possibly be quite unintrusive and useful very
> soon.
Yes, that's why I kept maintaining my patch and posted the last
revision to Andrew. I use the pte/tlb locking of the core VM; it's
unintrusive and obviously safe. Furthermore, it can be extended with
Christoph's stuff in a 100% backwards-compatible fashion later if needed.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-19 8:54 ` Nick Piggin
@ 2008-02-19 23:08 ` Nick Piggin
2008-02-20 1:00 ` Andrea Arcangeli
2008-02-27 22:35 ` Christoph Lameter
2 siblings, 2 replies; 116+ messages in thread
From: Nick Piggin @ 2008-02-19 23:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> The invalidation of address ranges in a mm_struct needs to be
> performed when pages are removed or permissions etc change.
>
> If invalidate_range_begin() is called with locks held then we
> pass a flag into invalidate_range() to indicate that no sleeping is
> possible. Locks are only held for truncate and huge pages.
You can't sleep inside rcu_read_lock()!
I must say that for a patch that is up to v8 or whatever and is
posted twice a week to such a big cc list, it is kind of slack to
not even test it and expect other people to review it.
Also, what we are going to need here are not skeleton drivers
that just do all the *easy* bits (of registering their callbacks),
but actual fully working examples that do everything that any
real driver will need to do. If not for the sanity of the driver
writer, then for the sanity of the VM developers (I don't want
to have to understand xpmem or infiniband in order to understand
how the VM works).
> In two cases we use invalidate_range_begin/end to invalidate
> single pages because the pair allows holding off new references
> (idea by Robin Holt).
>
> do_wp_page(): We hold off new references while we update the pte.
>
> xip_unmap: We are not taking the PageLock so we cannot
> use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
> stands in.
>
> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
> Signed-off-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> mm/filemap_xip.c | 5 +++++
> mm/fremap.c | 3 +++
> mm/hugetlb.c | 3 +++
> mm/memory.c | 35 +++++++++++++++++++++++++++++------
> mm/mmap.c | 2 ++
> mm/mprotect.c | 3 +++
> mm/mremap.c | 7 ++++++-
> 7 files changed, 51 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c 2008-02-14 18:43:31.000000000 -0800
> +++ linux-2.6/mm/fremap.c 2008-02-14 18:45:07.000000000 -0800
> @@ -15,6 +15,7 @@
> #include <linux/rmap.h>
> #include <linux/module.h>
> #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/mmu_context.h>
> #include <asm/cacheflush.h>
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
> spin_unlock(&mapping->i_mmap_lock);
> }
>
> + mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
> err = populate_range(mm, vma, start, size, pgoff);
> + mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
> if (!err && !(flags & MAP_NONBLOCK)) {
> if (unlikely(has_write_lock)) {
> downgrade_write(&mm->mmap_sem);
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c 2008-02-14 18:43:31.000000000 -0800
> +++ linux-2.6/mm/memory.c 2008-02-14 18:45:07.000000000 -0800
> @@ -51,6 +51,7 @@
> #include <linux/init.h>
> #include <linux/writeback.h>
> #include <linux/memcontrol.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/pgalloc.h>
> #include <asm/uaccess.h>
> @@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
> if (is_vm_hugetlb_page(vma))
> return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
> +
> dst_pgd = pgd_offset(dst_mm, addr);
> src_pgd = pgd_offset(src_mm, addr);
> do {
> @@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
> vma, addr, next))
> return -ENOMEM;
> } while (dst_pgd++, src_pgd++, addr = next, addr != end);
> +
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier(invalidate_range_end, src_mm,
> + vma->vm_start, end, 0);
> +
> return 0;
> }
>
> @@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
> struct mmu_gather *tlb;
> unsigned long end = address + size;
> unsigned long nr_accounted = 0;
> + int atomic = details ? (details->i_mmap_lock != 0) : 0;
>
> lru_add_drain();
> tlb = tlb_gather_mmu(mm, 0);
> update_hiwater_rss(mm);
> + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
> end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
> if (tlb)
> tlb_finish_mmu(tlb, address, end);
> + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
> return end;
> }
>
Where do you invalidate for munmap()?
Also, how do you resolve the case where you are not allowed to sleep?
I would have thought either you have to handle it, in which case nobody
needs to sleep; or you can't handle it, in which case the code is
broken.
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-15 6:49 ` [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem) Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
@ 2008-02-19 23:55 ` Nick Piggin
2008-02-20 3:12 ` Robin Holt
2008-02-27 22:43 ` Christoph Lameter
1 sibling, 2 replies; 116+ messages in thread
From: Nick Piggin @ 2008-02-19 23:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> These special additional callbacks are required because XPmem (and likely
> other mechanisms) do use their own rmap (multiple processes on a series
> of remote Linux instances may be accessing the memory of a process).
> F.e. XPmem may have to send out notifications to remote Linux instances
> and receive confirmation before a page can be freed.
>
> So we handle this like an additional Linux reverse map that is walked after
> the existing rmaps have been walked. We leave the walking to the driver
> that is then able to use something else than a spinlock to walk its reverse
> maps. So we can actually call the driver without holding spinlocks while we
> hold the Pagelock.
I don't know how this is supposed to solve anything. The sleeping
problem happens I guess mostly in truncate. And all you are doing
is putting these rmap callbacks in page_mkclean and try_to_unmap.
> However, we cannot determine the mm_struct that a page belongs to at
> that point. The mm_struct can only be determined from the rmaps by the
> device driver.
>
> We add another pageflag (PageExternalRmap) that is set if a page has
> been remotely mapped (f.e. by a process from another Linux instance).
> We can then only perform the callbacks for pages that are actually in
> remote use.
>
> Rmap notifiers need an extra page bit and are only available
> on 64 bit platforms. This functionality is not available on 32 bit!
>
> A notifier that uses the reverse maps callbacks does not need to provide
> the invalidate_page() method that is called when locks are held.
That doesn't seem right. To start with, the new callbacks aren't
even called in the places where invalidate_page isn't allowed to
sleep.
The problem is unmap_mapping_range, right? And unmap_mapping_range
must walk the rmaps with the mmap lock held, which is why it can't
sleep. And it can't hold any mmap_sem so it cannot prevent address
space modifications of the processes in question between the time
you unmap them from the linux ptes with unmap_mapping_range, and the
time that you unmap them from your driver.
So in the meantime, you could have eg. a fault come in and set up a
new page for one of the processes, and that page might even get
exported via the same external driver. And now you have a totally
inconsistent view.
Preventing new mappings from being set up until the old mapping is
completely flushed is basically what we need to ensure for any sane
TLB as far as I can tell. To do that, you'll need to make the mmap
lock sleep, and either take mmap_sem inside it (which is a
deadlock condition at the moment), or make ptl sleep as well. These
are simply the locks we use to prevent that from happening, so I
can't see how you can possibly hope to have a coherent TLB without
invalidating inside those locks.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-19 23:08 ` Nick Piggin
@ 2008-02-20 1:00 ` Andrea Arcangeli
2008-02-20 3:00 ` Robin Holt
2008-02-27 22:39 ` Christoph Lameter
2008-02-27 22:35 ` Christoph Lameter
1 sibling, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-20 1:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, akpm, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
> You can't sleep inside rcu_read_lock()!
>
> I must say that for a patch that is up to v8 or whatever and is
> posted twice a week to such a big cc list, it is kind of slack to
> not even test it and expect other people to review it.
Well, xpmem requirements are complex. As a side effect of the
simplicity of my approach, my patch has been 100% safe since #v1. Now it
also works for GRU and it cluster-invalidates.
> Also, what we are going to need here are not skeleton drivers
> that just do all the *easy* bits (of registering their callbacks),
> but actual fully working examples that do everything that any
> real driver will need to do. If not for the sanity of the driver
I have a fully working scenario for my patch; in fact I didn't post the
mmu notifier patch until I got KVM to swap 100% reliably to be sure I
would post something that works well. mmu notifiers are already used
in KVM for:
1) 100% reliable and efficient swapping of guest physical memory
2) copy-on-writes of writeprotect faults after ksm page sharing of guest
physical memory
3) ballooning using madvise to give the guest memory back to the host
My implementation is the most handy because it requires zero changes
to the ksm code too (no explicit mmu notifier calls after
ptep_clear_flush), it's also 100% safe (no mess with schedules over
rcu_read_lock), it has no "atomic" parameters, and it doesn't open a
window where sptes have a view on older pages while the linux pte has a
view on newer pages (this can happen with remap_file_pages when my KVM
swapping patch is adapted to use Christoph's V8 patch).
> Also, how do you resolve the case where you are not allowed to sleep?
> I would have thought either you have to handle it, in which case nobody
> needs to sleep; or you can't handle it, in which case the code is
> broken.
I also asked exactly this, glad you reasked this too.
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-20 1:00 ` Andrea Arcangeli
@ 2008-02-20 3:00 ` Robin Holt
2008-02-20 3:11 ` Nick Piggin
2008-02-27 22:39 ` Christoph Lameter
1 sibling, 1 reply; 116+ messages in thread
From: Robin Holt @ 2008-02-20 3:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Christoph Lameter, akpm, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
> > You can't sleep inside rcu_read_lock()!
> >
> > I must say that for a patch that is up to v8 or whatever and is
> > posted twice a week to such a big cc list, it is kind of slack to
> > not even test it and expect other people to review it.
>
> Well, xpmem requirements are complex. As a side effect of the
> simplicity of my approach, my patch has been 100% safe since #v1. Now it
> also works for GRU and it cluster-invalidates.
>
> > Also, what we are going to need here are not skeleton drivers
> > that just do all the *easy* bits (of registering their callbacks),
> > but actual fully working examples that do everything that any
> > real driver will need to do. If not for the sanity of the driver
>
> I have a fully working scenario for my patch; in fact I didn't post the
> mmu notifier patch until I got KVM to swap 100% reliably to be sure I
> would post something that works well. mmu notifiers are already used
> in KVM for:
>
> 1) 100% reliable and efficient swapping of guest physical memory
> 2) copy-on-writes of writeprotect faults after ksm page sharing of guest
> physical memory
> 3) ballooning using madvise to give the guest memory back to the host
>
> My implementation is the most handy because it requires zero changes
> to the ksm code too (no explicit mmu notifier calls after
> ptep_clear_flush), it's also 100% safe (no mess with schedules over
> rcu_read_lock), it has no "atomic" parameters, and it doesn't open a
> window where sptes have a view on older pages while the linux pte has a
> view on newer pages (this can happen with remap_file_pages when my KVM
> swapping patch is adapted to use Christoph's V8 patch).
>
> > Also, how do you resolve the case where you are not allowed to sleep?
> > I would have thought either you have to handle it, in which case nobody
> > needs to sleep; or you can't handle it, in which case the code is
> > broken.
>
> I also asked exactly this, glad you reasked this too.
Currently, we BUG_ON having a PFN in our tables and not being able
to sleep. These are mappings which MPT has never supported in the past
and XPMEM was already not allowing page faults for VMAs which are not
anonymous so it should never happen. If the file-backed operations can
ever get changed to allow for sleeping and a customer has a need for it,
we would need to change XPMEM to allow those types of faults to succeed.
Thanks,
Robin
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-20 3:00 ` Robin Holt
@ 2008-02-20 3:11 ` Nick Piggin
2008-02-20 3:19 ` Robin Holt
0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-02-20 3:11 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Christoph Lameter, akpm, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Wednesday 20 February 2008 14:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
> > > Also, how do you resolve the case where you are not allowed to sleep?
> > > I would have thought either you have to handle it, in which case nobody
> > > needs to sleep; or you can't handle it, in which case the code is
> > > broken.
> >
> > I also asked exactly this, glad you reasked this too.
>
> Currently, we BUG_ON having a PFN in our tables and not being able
> to sleep. These are mappings which MPT has never supported in the past
> and XPMEM was already not allowing page faults for VMAs which are not
> anonymous so it should never happen. If the file-backed operations can
> ever get changed to allow for sleeping and a customer has a need for it,
> we would need to change XPMEM to allow those types of faults to succeed.
Do you really want to be able to swap, or are you just interested
in keeping track of unmaps / prot changes?
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-19 23:55 ` Nick Piggin
@ 2008-02-20 3:12 ` Robin Holt
2008-02-20 3:51 ` Nick Piggin
2008-02-27 22:43 ` Christoph Lameter
1 sibling, 1 reply; 116+ messages in thread
From: Robin Holt @ 2008-02-20 3:12 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, akpm, Andrea Arcangeli, Robin Holt,
Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general,
Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel,
linux-mm, daniel.blueman
On Wed, Feb 20, 2008 at 10:55:20AM +1100, Nick Piggin wrote:
> On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> > These special additional callbacks are required because XPmem (and likely
> > other mechanisms) do use their own rmap (multiple processes on a series
> > of remote Linux instances may be accessing the memory of a process).
> > F.e. XPmem may have to send out notifications to remote Linux instances
> > and receive confirmation before a page can be freed.
> >
> > So we handle this like an additional Linux reverse map that is walked after
> > the existing rmaps have been walked. We leave the walking to the driver
> > that is then able to use something else than a spinlock to walk its reverse
> > maps. So we can actually call the driver without holding spinlocks while we
> > hold the Pagelock.
>
> I don't know how this is supposed to solve anything. The sleeping
> problem happens I guess mostly in truncate. And all you are doing
> is putting these rmap callbacks in page_mkclean and try_to_unmap.
>
>
> > However, we cannot determine the mm_struct that a page belongs to at
> > that point. The mm_struct can only be determined from the rmaps by the
> > device driver.
> >
> > We add another pageflag (PageExternalRmap) that is set if a page has
> > been remotely mapped (f.e. by a process from another Linux instance).
> > We can then only perform the callbacks for pages that are actually in
> > remote use.
> >
> > Rmap notifiers need an extra page bit and are only available
> > on 64 bit platforms. This functionality is not available on 32 bit!
> >
> > A notifier that uses the reverse maps callbacks does not need to provide
> > the invalidate_page() method that is called when locks are held.
>
> That doesn't seem right. To start with, the new callbacks aren't
> even called in the places where invalidate_page isn't allowed to
> sleep.
>
> The problem is unmap_mapping_range, right? And unmap_mapping_range
> must walk the rmaps with the mmap lock held, which is why it can't
> sleep. And it can't hold any mmap_sem so it cannot prevent address
> space modifications of the processes in question between the time
> you unmap them from the linux ptes with unmap_mapping_range, and the
> time that you unmap them from your driver.
>
> So in the meantime, you could have eg. a fault come in and set up a
> new page for one of the processes, and that page might even get
> exported via the same external driver. And now you have a totally
> inconsistent view.
>
> Preventing new mappings from being set up until the old mapping is
> completely flushed is basically what we need to ensure for any sane
> TLB as far as I can tell. To do that, you'll need to make the mmap
> lock sleep, and either take mmap_sem inside it (which is a
> deadlock condition at the moment), or make ptl sleep as well. These
> are simply the locks we use to prevent that from happening, so I
> can't see how you can possibly hope to have a coherent TLB without
> invalidating inside those locks.
All of that is correct. For XPMEM, we do not currently allow file backed
mapping pages from being exported so we should never reach this condition.
It has been an issue since day 1. We have operated with that assumption
for 6 years and have not had issues with that assumption. The user of
xpmem is MPT and it controls the communication buffers so it is reasonable
to expect this type of behavior.
Thanks,
Robin
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-20 3:11 ` Nick Piggin
@ 2008-02-20 3:19 ` Robin Holt
0 siblings, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-20 3:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrea Arcangeli, Christoph Lameter, akpm,
Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general,
Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel,
linux-mm, daniel.blueman
On Wed, Feb 20, 2008 at 02:11:41PM +1100, Nick Piggin wrote:
> On Wednesday 20 February 2008 14:00, Robin Holt wrote:
> > On Wed, Feb 20, 2008 at 02:00:38AM +0100, Andrea Arcangeli wrote:
> > > On Wed, Feb 20, 2008 at 10:08:49AM +1100, Nick Piggin wrote:
>
> > > > Also, how do you resolve the case where you are not allowed to sleep?
> > > > I would have thought either you have to handle it, in which case nobody
> > > > needs to sleep; or you can't handle it, in which case the code is
> > > > broken.
> > >
> > > I also asked exactly this, glad you reasked this too.
> >
> > Currently, we BUG_ON having a PFN in our tables and not being able
> > to sleep. These are mappings which MPT has never supported in the past
> > and XPMEM was already not allowing page faults for VMAs which are not
> > anonymous so it should never happen. If the file-backed operations can
> > ever get changed to allow for sleeping and a customer has a need for it,
> > we would need to change XPMEM to allow those types of faults to succeed.
>
> Do you really want to be able to swap, or are you just interested
> in keeping track of unmaps / prot changes?
I would rather not swap, but we do have one customer that would like
swapout to work under certain circumstances. Additionally, we have
many customers who would rather their system not die under I/O
termination.
Thanks,
Robin
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-20 3:12 ` Robin Holt
@ 2008-02-20 3:51 ` Nick Piggin
2008-02-20 9:00 ` Robin Holt
0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-02-20 3:51 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, akpm, Andrea Arcangeli, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> For XPMEM, we do not currently allow file backed
> mapping pages from being exported so we should never reach this condition.
> It has been an issue since day 1. We have operated with that assumption
> for 6 years and have not had issues with that assumption. The user of
> xpmem is MPT and it controls the communication buffers so it is reasonable
> to expect this type of behavior.
OK, that makes things simpler.
So why can't you export a device from your xpmem driver, which
can be mmap()ed to give out "anonymous" memory pages to be used
for these communication buffers?
I guess you may also want an "munmap/mprotect" callback, which
we don't have in the kernel right now... but at least you could
prototype it easily by having an ioctl to be called before
munmapping or mprotecting (eg. the ioctl could prevent new TLB
setup for the region, and shoot down existing ones).
This is actually going to be much faster for you if you use any
threaded applications, because you will be able to do all the
shootdown round trips outside mmap_sem, and so you will be able
to have other threads faulting and even mmap()ing / munmapping
at the same time as the shootdown is happening.
I guess there is some catch...
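In userspace the prototype would be on the order of the following
(a sketch; the device fd, ioctl number and argument struct are all
made up here to illustrate the call ordering):

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ioctl.h>

struct xpmem_shootdown {		/* hypothetical ioctl argument */
	unsigned long	start;
	unsigned long	len;
};
#define XPMEM_IOC_SHOOTDOWN	_IOW('x', 0x42, struct xpmem_shootdown)

/* wrapper to be used instead of a bare munmap() on an exported range */
static int xpmem_munmap(int xpmem_fd, void *addr, size_t len)
{
	struct xpmem_shootdown arg = {
		.start	= (unsigned long)addr,
		.len	= len,
	};

	/* stop new remote TLB setup for the range and shoot down existing entries */
	if (ioctl(xpmem_fd, XPMEM_IOC_SHOOTDOWN, &arg) < 0)
		return -1;
	/* only once the remote side is quiesced is it safe to drop the mapping */
	return munmap(addr, len);
}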
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-20 3:51 ` Nick Piggin
@ 2008-02-20 9:00 ` Robin Holt
2008-02-20 9:05 ` Robin Holt
2008-02-21 4:20 ` Nick Piggin
0 siblings, 2 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-20 9:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Christoph Lameter, akpm, Andrea Arcangeli,
Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general,
Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel,
linux-mm, daniel.blueman
On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
> On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> > For XPMEM, we do not currently allow file backed
> > mapping pages from being exported so we should never reach this condition.
> > It has been an issue since day 1. We have operated with that assumption
> > for 6 years and have not had issues with that assumption. The user of
> > xpmem is MPT and it controls the communication buffers so it is reasonable
> > to expect this type of behavior.
>
> OK, that makes things simpler.
>
> So why can't you export a device from your xpmem driver, which
> can be mmap()ed to give out "anonymous" memory pages to be used
> for these communication buffers?
Because we need to have heap and stack available as well. MPT does
not control all the communication buffer areas. I haven't checked, but
this is the same problem that IB will have. I believe they are actually
allowing any memory region be accessible, but I am not sure of that.
Thanks,
Robin
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-20 9:00 ` Robin Holt
@ 2008-02-20 9:05 ` Robin Holt
2008-02-21 4:20 ` Nick Piggin
1 sibling, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-20 9:05 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Christoph Lameter, akpm, Andrea Arcangeli,
Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general,
Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel,
linux-mm, daniel.blueman
On Wed, Feb 20, 2008 at 03:00:36AM -0600, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
> > On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> > > For XPMEM, we do not currently allow file backed
> > > mapping pages from being exported so we should never reach this condition.
> > > It has been an issue since day 1. We have operated with that assumption
> > > for 6 years and have not had issues with that assumption. The user of
> > > xpmem is MPT and it controls the communication buffers so it is reasonable
> > > to expect this type of behavior.
> >
> > OK, that makes things simpler.
> >
> > So why can't you export a device from your xpmem driver, which
> > can be mmap()ed to give out "anonymous" memory pages to be used
> > for these communication buffers?
>
> Because we need to have heap and stack available as well. MPT does
> not control all the communication buffer areas. I haven't checked, but
> this is the same problem that IB will have. I believe they are actually
> allowing any memory region be accessible, but I am not sure of that.
I should have read my work email first. I had gotten an email from
one of our MPT developers saying they would love to be able to share
file-backed memory areas as well, since it would help with their MPI-IO
functions, which currently need to do multiple copy steps. Not sure how
high a priority I am going to be able to make that.
Thanks,
Robin
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-20 9:00 ` Robin Holt
2008-02-20 9:05 ` Robin Holt
@ 2008-02-21 4:20 ` Nick Piggin
2008-02-21 10:58 ` Robin Holt
1 sibling, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-02-21 4:20 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, akpm, Andrea Arcangeli, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Wednesday 20 February 2008 20:00, Robin Holt wrote:
> On Wed, Feb 20, 2008 at 02:51:45PM +1100, Nick Piggin wrote:
> > On Wednesday 20 February 2008 14:12, Robin Holt wrote:
> > > For XPMEM, we do not currently allow file backed
> > > mapping pages from being exported so we should never reach this
> > > condition. It has been an issue since day 1. We have operated with
> > > that assumption for 6 years and have not had issues with that
> > > assumption. The user of xpmem is MPT and it controls the communication
> > > buffers so it is reasonable to expect this type of behavior.
> >
> > OK, that makes things simpler.
> >
> > So why can't you export a device from your xpmem driver, which
> > can be mmap()ed to give out "anonymous" memory pages to be used
> > for these communication buffers?
>
> Because we need to have heap and stack available as well. MPT does
> not control all the communication buffer areas. I haven't checked, but
> this is the same problem that IB will have. I believe they are actually
> allowing any memory region be accessible, but I am not sure of that.
Then you should create a driver that the user program can register
and unregister regions of their memory with. The driver can do a
get_user_pages to get the pages, and then you'd just need to set up
some kind of mapping so that userspace can unmap pages / won't leak
memory (and an exit_mm notifier I guess).
Because you don't need to swap, you don't need coherency, and you
are in control of the areas, then this seems like the best choice.
It would allow you to use heap, stack, file-backed, anything.
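The kernel side of such a driver would be pretty small; roughly
(a sketch under the assumption that no coherency or swapping is
needed, and with all names invented):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pagemap.h>

struct region_pin {			/* hypothetical per-registration state */
	unsigned long	start;
	int		nr_pages;
	struct page	**pages;	/* caller allocates nr_pages entries */
};

/* register: pin the pages so they can be handed out to the remote side */
static int region_register(struct mm_struct *mm, struct region_pin *rp)
{
	int got;

	down_read(&mm->mmap_sem);
	got = get_user_pages(current, mm, rp->start, rp->nr_pages,
			     1, 0, rp->pages, NULL);
	up_read(&mm->mmap_sem);
	if (got != rp->nr_pages) {
		while (got-- > 0)
			put_page(rp->pages[got]);
		return -EFAULT;
	}
	return 0;
}

/* unregister (and exit_mm time): drop the pins so nothing leaks */
static void region_unregister(struct region_pin *rp)
{
	int i;

	for (i = 0; i < rp->nr_pages; i++)
		put_page(rp->pages[i]);
}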
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-21 4:20 ` Nick Piggin
@ 2008-02-21 10:58 ` Robin Holt
2008-02-26 6:11 ` Nick Piggin
0 siblings, 1 reply; 116+ messages in thread
From: Robin Holt @ 2008-02-21 10:58 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Christoph Lameter, akpm, Andrea Arcangeli,
Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general,
Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel,
linux-mm, daniel.blueman
On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
> > > So why can't you export a device from your xpmem driver, which
> > > can be mmap()ed to give out "anonymous" memory pages to be used
> > > for these communication buffers?
> >
> > Because we need to have heap and stack available as well. MPT does
> > not control all the communication buffer areas. I haven't checked, but
> > this is the same problem that IB will have. I believe they are actually
> > allowing any memory region be accessible, but I am not sure of that.
>
> Then you should create a driver that the user program can register
> and unregister regions of their memory with. The driver can do a
> get_user_pages to get the pages, and then you'd just need to set up
> some kind of mapping so that userspace can unmap pages / won't leak
> memory (and an exit_mm notifier I guess).
OK. You need to explain this better to me. How would this driver
supposedly work? What we have is an MPI library. It gets invoked at
process load time to establish its rank-to-rank communication regions.
It then turns control over to the processes main(). That is allowed to
run until it hits the
MPI_Init(&argc, &argv);
The process is then totally under the users control until:
MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, MPI_COMM_WORLD);
MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD, &status);
That is it. That is all our allowed interaction with the users process.
Are you saying at the time of the MPI_Send, we should:
down_write(&current->mm->mmap_sem);
Find all the VMAs that describe this region and record their
vm_ops structure.
Find all currently inserted page table information.
Create new VMAs that describe the same regions as before.
Insert our special fault handler which merely calls their old
fault handler and then exports the page then returns the page to the
kernel.
Take an extra reference count on the page for each possible
remote rank we are exporting this to.
That doesn't seem too unreasonable, except when you compare it to how the
driver currently works. Remember, this is done from a library which has
no insight into what the user has done to its own virtual address space.
As a result, each MPI_Send() would result in a system call (or we would
need to have a set of callouts for changes to a processes VMAs) which
would be a significant increase in communication overhead.
Maybe I am missing what you intend to do, but what we need is a means of
tracking one processes virtual address space changes so other processes
can do direct memory accesses without the need for a system call on each
communication event.
> Because you don't need to swap, you don't need coherency, and you
> are in control of the areas, then this seems like the best choice.
> It would allow you to use heap, stack, file-backed, anything.
You are missing one point here. The MPI specifications that have
been out there for decades do not require the process use a library
for allocating the buffer. I realize that is a horrible shortcoming,
but that is the world we live in. Even if we could change that spec,
we would still need to support the existing specs. As a result, the
user can change their virtual address space as they need and still expect
communications be cheap.
Thanks,
Robin
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-21 10:58 ` Robin Holt
@ 2008-02-26 6:11 ` Nick Piggin
2008-02-26 7:21 ` [ofa-general] " Gleb Natapov
2008-02-26 12:29 ` Robin Holt
0 siblings, 2 replies; 116+ messages in thread
From: Nick Piggin @ 2008-02-26 6:11 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, akpm, Andrea Arcangeli, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Thursday 21 February 2008 21:58, Robin Holt wrote:
> On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
> > > > So why can't you export a device from your xpmem driver, which
> > > > can be mmap()ed to give out "anonymous" memory pages to be used
> > > > for these communication buffers?
> > >
> > > Because we need to have heap and stack available as well. MPT does
> > > not control all the communication buffer areas. I haven't checked, but
> > > this is the same problem that IB will have. I believe they are
> > > actually allowing any memory region be accessible, but I am not sure of
> > > that.
> >
> > Then you should create a driver that the user program can register
> > and unregister regions of their memory with. The driver can do a
> > get_user_pages to get the pages, and then you'd just need to set up
> > some kind of mapping so that userspace can unmap pages / won't leak
> > memory (and an exit_mm notifier I guess).
>
> OK. You need to explain this better to me. How would this driver
> supposedly work? What we have is an MPI library. It gets invoked at
> process load time to establish its rank-to-rank communication regions.
> It then turns control over to the processes main(). That is allowed to
> run until it hits the
> MPI_Init(&argc, &argv);
>
> The process is then totally under the users control until:
> MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, MPI_COMM_WORLD);
> MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD,
> &status);
>
> That is it. That is all our allowed interaction with the users process.
OK, when you said something along the lines of "the MPT library has
control of the comm buffer", then I assumed it was an area of virtual
memory which is set up as part of initialization, rather than during
runtime. I guess I jumped to conclusions.
> That doesn't seem too unreasonable, except when you compare it to how the
> driver currently works. Remember, this is done from a library which has
> no insight into what the user has done to its own virtual address space.
> As a result, each MPI_Send() would result in a system call (or we would
> need to have a set of callouts for changes to a processes VMAs) which
> would be a significant increase in communication overhead.
>
> Maybe I am missing what you intend to do, but what we need is a means of
> tracking one processes virtual address space changes so other processes
> can do direct memory accesses without the need for a system call on each
> communication event.
Yeah it's tricky. BTW. what is the performance difference between
having a system call or no?
> > Because you don't need to swap, you don't need coherency, and you
> > are in control of the areas, then this seems like the best choice.
> > It would allow you to use heap, stack, file-backed, anything.
>
> You are missing one point here. The MPI specifications that have
> been out there for decades do not require the process use a library
> for allocating the buffer. I realize that is a horrible shortcoming,
> but that is the world we live in. Even if we could change that spec,
Can you change the spec? Are you working on it?
> we would still need to support the existing specs. As a result, the
> user can change their virtual address space as they need and still expect
> communications be cheap.
That's true. How has it been supported up to now? Are you using
these kind of notifiers in patched kernels?
* Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-26 6:11 ` Nick Piggin
@ 2008-02-26 7:21 ` Gleb Natapov
2008-02-26 8:52 ` Nick Piggin
2008-02-26 12:29 ` Robin Holt
1 sibling, 1 reply; 116+ messages in thread
From: Gleb Natapov @ 2008-02-26 7:21 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm,
Izik Eidus, Kanoj Sarcar, Roland Dreier, linux-kernel,
Avi Kivity, kvm-devel, daniel.blueman, general, akpm,
Christoph Lameter
On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > You are missing one point here. The MPI specifications that have
> > been out there for decades do not require the process use a library
> > for allocating the buffer. I realize that is a horrible shortcoming,
> > but that is the world we live in. Even if we could change that spec,
>
> Can you change the spec?
Not really. It will break all existing codes. MPI-2 provides a call for
memory allocation (and it's beneficial to use this call for some interconnects),
but many (most?) applications are still written for MPI-1 and those that
are written for MPI-2 mostly use the old habit of allocating memory by malloc(),
or even use stack or BSS memory for communication buffer purposes.
--
Gleb.
* Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-26 7:21 ` [ofa-general] " Gleb Natapov
@ 2008-02-26 8:52 ` Nick Piggin
2008-02-26 9:38 ` Gleb Natapov
2008-02-26 12:28 ` Robin Holt
0 siblings, 2 replies; 116+ messages in thread
From: Nick Piggin @ 2008-02-26 8:52 UTC (permalink / raw)
To: Gleb Natapov
Cc: Robin Holt, steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm,
Izik Eidus, Kanoj Sarcar, Roland Dreier, linux-kernel,
Avi Kivity, kvm-devel, daniel.blueman, general, akpm,
Christoph Lameter
On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
> On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > > You are missing one point here. The MPI specifications that have
> > > been out there for decades do not require the process use a library
> > > for allocating the buffer. I realize that is a horrible shortcoming,
> > > but that is the world we live in. Even if we could change that spec,
> >
> > Can you change the spec?
>
> Not really. It will break all existing codes.
I meant as in eg. submit changes to MPI-3
> MPI-2 provides a call for
> memory allocation (and it's beneficial to use this call for some
> interconnects), but many (most?) applications are still written for MPI-1
> and those that are written for MPI-2 mostly use the old habit of
> allocating memory by malloc(), or even use stack or BSS memory for
> communication buffer purposes.
OK, so MPI-2 already has some way to do that... I'm not saying that we
can now completely dismiss the idea of using notifiers for this, but it
is just a good data point to know.
Thanks,
Nick
* Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-26 8:52 ` Nick Piggin
@ 2008-02-26 9:38 ` Gleb Natapov
2008-02-26 9:52 ` KOSAKI Motohiro
2008-02-26 12:28 ` Robin Holt
1 sibling, 1 reply; 116+ messages in thread
From: Gleb Natapov @ 2008-02-26 9:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, steiner, Andrea Arcangeli, Peter Zijlstra, linux-mm,
Izik Eidus, Kanoj Sarcar, Roland Dreier, linux-kernel,
Avi Kivity, kvm-devel, daniel.blueman, general, akpm,
Christoph Lameter
On Tue, Feb 26, 2008 at 07:52:41PM +1100, Nick Piggin wrote:
> On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
> > On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > > > You are missing one point here. The MPI specifications that have
> > > > been out there for decades do not require the process use a library
> > > > for allocating the buffer. I realize that is a horrible shortcoming,
> > > > but that is the world we live in. Even if we could change that spec,
> > >
> > > Can you change the spec?
> >
> > Not really. It will break all existing codes.
>
> I meant as in eg. submit changes to MPI-3
The MPI spec tries to be backward compatible. And the MPI-2 spec is 10 years
old, but MPI-1 is still in wider use. HPC is moving fast in terms of HW
technology, but slow in terms of SW. Fortran is still hot there :)
--
Gleb.
* Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-26 9:38 ` Gleb Natapov
@ 2008-02-26 9:52 ` KOSAKI Motohiro
0 siblings, 0 replies; 116+ messages in thread
From: KOSAKI Motohiro @ 2008-02-26 9:52 UTC (permalink / raw)
To: Gleb Natapov
Cc: kosaki.motohiro, Nick Piggin, Robin Holt, steiner,
Andrea Arcangeli, Peter Zijlstra, linux-mm, Izik Eidus,
Kanoj Sarcar, Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
daniel.blueman, general, akpm, Christoph Lameter
> > > > Can you change the spec?
> > >
> > > Not really. It will break all existing codes.
> >
> > I meant as in eg. submit changes to MPI-3
>
> The MPI spec tries to be backward compatible. And the MPI-2 spec is 10 years
> old, but MPI-1 is still in wider use. HPC is moving fast in terms of HW
> technology, but slow in terms of SW. Fortran is still hot there :)
Agreed.
Many, many people dislike incompatible specification changes.
We should accept the real-world spec.
- kosaki
* Re: [ofa-general] Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-26 8:52 ` Nick Piggin
2008-02-26 9:38 ` Gleb Natapov
@ 2008-02-26 12:28 ` Robin Holt
1 sibling, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-26 12:28 UTC (permalink / raw)
To: Nick Piggin
Cc: Gleb Natapov, Robin Holt, steiner, Andrea Arcangeli,
Peter Zijlstra, linux-mm, Izik Eidus, Kanoj Sarcar,
Roland Dreier, linux-kernel, Avi Kivity, kvm-devel,
daniel.blueman, general, akpm, Christoph Lameter
On Tue, Feb 26, 2008 at 07:52:41PM +1100, Nick Piggin wrote:
> On Tuesday 26 February 2008 18:21, Gleb Natapov wrote:
> > On Tue, Feb 26, 2008 at 05:11:32PM +1100, Nick Piggin wrote:
> > > > You are missing one point here. The MPI specifications that have
> > > > been out there for decades do not require the process use a library
> > > > for allocating the buffer. I realize that is a horrible shortcoming,
> > > > but that is the world we live in. Even if we could change that spec,
> > >
> > > Can you change the spec?
> >
> > Not really. It will break all existing codes.
>
> I meant as in eg. submit changes to MPI-3
>
>
> > MPI-2 provides a call for
> > memory allocation (and it's beneficial to use this call for some
> > interconnects), but many (most?) applications are still written for MPI-1
> > and those that are written for MPI-2 mostly use the old habit of
> > allocating memory by malloc(), or even use stack or BSS memory for
> > communication buffer purposes.
>
> OK, so MPI-2 already has some way to do that... I'm not saying that we
> can now completely dismiss the idea of using notifiers for this, but it
> is just a good data point to know.
It is in MPI-2, but MPI-2 does not prohibit communication from regions
not allocated by the MPI call.
Thanks,
Robin
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-26 6:11 ` Nick Piggin
2008-02-26 7:21 ` [ofa-general] " Gleb Natapov
@ 2008-02-26 12:29 ` Robin Holt
1 sibling, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-26 12:29 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Christoph Lameter, akpm, Andrea Arcangeli,
Avi Kivity, Izik Eidus, kvm-devel, Peter Zijlstra, general,
Steve Wise, Roland Dreier, Kanoj Sarcar, steiner, linux-kernel,
linux-mm, daniel.blueman
> > That is it. That is all our allowed interaction with the users process.
>
> OK, when you said something along the lines of "the MPT library has
> control of the comm buffer", then I assumed it was an area of virtual
> memory which is set up as part of initialization, rather than during
> runtime. I guess I jumped to conclusions.
There are six regions the MPT library typically makes. The most basic
one is a fixed size. It describes the MPT internal buffers, the stack,
the heap, the application text, and finally the entire address space.
That last region is seldom used. MPT only has control over the first
two.
> > That doesn't seem too unreasonable, except when you compare it to how the
> > driver currently works. Remember, this is done from a library which has
> > no insight into what the user has done to its own virtual address space.
> > As a result, each MPI_Send() would result in a system call (or we would
> > need to have a set of callouts for changes to a processes VMAs) which
> > would be a significant increase in communication overhead.
> >
> > Maybe I am missing what you intend to do, but what we need is a means of
> > tracking one processes virtual address space changes so other processes
> > can do direct memory accesses without the need for a system call on each
> > communication event.
>
> Yeah it's tricky. BTW. what is the performance difference between
> having a system call or no?
The system call takes many microseconds, and the same communication
latency is still incurred on top of it. Without it, our latency is
usually below two microseconds.
> > > Because you don't need to swap, you don't need coherency, and you
> > > are in control of the areas, then this seems like the best choice.
> > > It would allow you to use heap, stack, file-backed, anything.
> >
> > You are missing one point here. The MPI specifications that have
> > been out there for decades do not require the process use a library
> > for allocating the buffer. I realize that is a horrible shortcoming,
> > but that is the world we live in. Even if we could change that spec,
>
> Can you change the spec? Are you working on it?
Even if we changed the spec, the old specs would continue to be
supported. I personally am not involved; not sure if anybody else is
working on this issue.
> > we would still need to support the existing specs. As a result, the
> > user can change their virtual address space as they need and still expect
> > communications be cheap.
>
> That's true. How has it been supported up to now? Are you using
> these kind of notifiers in patched kernels?
At fault time, we check to see if it is an anon or mspec vma. We pin
the page and insert it. The remote OS then loses synchronicity with
the owning process's page tables. If an unmap, madvise, etc. occurs, the
page tables are updated without regard to our references. Fork or exit
(fork is caught using an LD_PRELOAD library) causes the user pages to be
recalled from the remote side, and put_page returns them to the kernel.
We have documented that this loss of synchronicity is due to their
action and not supported. Essentially, we rely upon the application
being well behaved. To this point, that has remained true.
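A rough sketch of that fault-time pinning, for illustration only (the
vma check and the helper names are hypothetical, not the real driver
code):

/*
 * Pin a user page at remote-fault time by elevating its refcount.
 * xpmem_vma_allowed() stands in for the "anon or mspec vma" check.
 */
static struct page *xpmem_pin_page(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	struct page *page;
	int ret;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (!vma || !xpmem_vma_allowed(vma)) {
		up_read(&mm->mmap_sem);
		return NULL;
	}
	/* get_user_pages() takes a reference; the page stays allocated
	 * until the remote side is recalled and put_page() is called. */
	ret = get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
	up_read(&mm->mmap_sem);

	return ret == 1 ? page : NULL;
}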
Thanks,
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-19 13:34 ` Andrea Arcangeli
@ 2008-02-27 22:23 ` Christoph Lameter
2008-02-27 23:57 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:23 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Tue, 19 Feb 2008, Andrea Arcangeli wrote:
> Yes, that's why I kept maintaining my patch and I posted the last
> revision to Andrew. I use pte/tlb locking of the core VM, it's
> unintrusive and obviously safe. Furthermore it can be extended with
> Christoph's stuff in a 100% backwards compatible fashion later if needed.
How would that work? You rely on the pte locking. Thus calls are all in an
atomic context. I think we need a general scheme that allows sleeping when
references are invalidated. Even the GRU has performance issues when using
the KVM patch.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-19 23:08 ` Nick Piggin
2008-02-20 1:00 ` Andrea Arcangeli
@ 2008-02-27 22:35 ` Christoph Lameter
2008-02-27 22:42 ` Jack Steiner
` (3 more replies)
1 sibling, 4 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:35 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, 20 Feb 2008, Nick Piggin wrote:
> On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> > The invalidation of address ranges in a mm_struct needs to be
> > performed when pages are removed or permissions etc change.
> >
> > If invalidate_range_begin() is called with locks held then we
> > pass a flag into invalidate_range() to indicate that no sleeping is
> > possible. Locks are only held for truncate and huge pages.
>
> You can't sleep inside rcu_read_lock()!
Could you be specific? This refers to page migration? Hmmm... Guess we
would need to inc the refcount there instead?
> I must say that for a patch that is up to v8 or whatever and is
> posted twice a week to such a big cc list, it is kind of slack to
> not even test it and expect other people to review it.
It was tested with the GRU and XPmem. Andrea also reported success.
> Also, what we are going to need here are not skeleton drivers
> that just do all the *easy* bits (of registering their callbacks),
> but actual fully working examples that do everything that any
> real driver will need to do. If not for the sanity of the driver
> writer, then for the sanity of the VM developers (I don't want
> to have to understand xpmem or infiniband in order to understand
> how the VM works).
There are 3 different drivers that can already use it, but the code is
complex and not easy to review. Skeletons make it easy for people to get
started with it.
> > lru_add_drain();
> > tlb = tlb_gather_mmu(mm, 0);
> > update_hiwater_rss(mm);
> > + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
> > end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
> > if (tlb)
> > tlb_finish_mmu(tlb, address, end);
> > + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
> > return end;
> > }
> >
>
> Where do you invalidate for munmap()?
zap_page_range() called from unmap_vmas().
> Also, how to you resolve the case where you are not allowed to sleep?
> I would have thought either you have to handle it, in which case nobody
> needs to sleep; or you can't handle it, in which case the code is
> broken.
That can be done in a variety of ways:
1. Change VM locking
2. Not handle file backed mappings (XPmem could work mostly in such a
config)
3. Keep the refcount elevated until pages are freed in another execution
context.
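A rough sketch of option 3, assuming the driver holds an extra page
reference in the atomic callback and defers the sleeping part to a
workqueue (all names here are illustrative, not from any posted patch):

struct deferred_unmap {
	struct work_struct work;
	struct page *page;
};

static void deferred_unmap_fn(struct work_struct *work)
{
	struct deferred_unmap *d = container_of(work, struct deferred_unmap, work);

	drv_wait_for_remote_ack(d->page);	/* may sleep, we are outside the VM locks */
	put_page(d->page);			/* drop the reference taken at export time */
	kfree(d);
}

static void drv_invalidate_one(struct page *page)
{
	struct deferred_unmap *d = kmalloc(sizeof(*d), GFP_ATOMIC);

	if (!d)
		return;		/* error handling elided in this sketch */

	get_page(page);		/* keep the page alive past the atomic callback */
	d->page = page;
	INIT_WORK(&d->work, deferred_unmap_fn);
	schedule_work(&d->work);
}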
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-20 1:00 ` Andrea Arcangeli
2008-02-20 3:00 ` Robin Holt
@ 2008-02-27 22:39 ` Christoph Lameter
2008-02-28 0:38 ` Andrea Arcangeli
1 sibling, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:39 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, 20 Feb 2008, Andrea Arcangeli wrote:
> Well, xpmem requirements are complex. As a side effect of the
> simplicity of my approach, my patch is 100% safe since #v1. Now it
> also works for GRU and it cluster invalidates.
The patch has to satisfy RDMA, XPMEM, GRU and KVM. I keep hearing that we
have a KVM only solution that works 100% (which makes me just
ignore the rest of the argument because 100% solutions usually do not
exist).
> rcu_read_lock), no "atomic" parameters, and it doesn't open a window
> where sptes have a view on older pages and linux pte has view on newer
> pages (this can happen with remap_file_pages with my KVM swapping
> patch to use V8 Christoph's patch).
Ok so you are now getting away from keeping the refcount elevated? That
was your design decision....
> > Also, how to you resolve the case where you are not allowed to sleep?
> > I would have thought either you have to handle it, in which case nobody
> > needs to sleep; or you can't handle it, in which case the code is
> > broken.
>
> I also asked exactly this, glad you reasked this too.
It would have helped if you had repeated my answers that you had
already gotten before. You knew I was on vacation....
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-27 22:35 ` Christoph Lameter
@ 2008-02-27 22:42 ` Jack Steiner
2008-02-28 0:10 ` Christoph Lameter
` (2 subsequent siblings)
3 siblings, 0 replies; 116+ messages in thread
From: Jack Steiner @ 2008-02-27 22:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Andrea Arcangeli, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, linux-kernel, linux-mm,
daniel.blueman
>
> > Also, what we are going to need here are not skeleton drivers
> > that just do all the *easy* bits (of registering their callbacks),
> > but actual fully working examples that do everything that any
> > real driver will need to do. If not for the sanity of the driver
> > writer, then for the sanity of the VM developers (I don't want
> > to have to understand xpmem or infiniband in order to understand
> > how the VM works).
>
> There are 3 different drivers that can already use it, but the code is
> complex and not easy to review. Skeletons make it easy for people to get
> started with it.
I posted the full GRU driver late last week. It is a lot of
code & somewhat difficult to understand w/o access to full chip
specs (sorry). The code is fairly well commented & the
parts related to TLB management should be understandable.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-19 23:55 ` Nick Piggin
2008-02-20 3:12 ` Robin Holt
@ 2008-02-27 22:43 ` Christoph Lameter
2008-02-28 0:42 ` Andrea Arcangeli
1 sibling, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:43 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, 20 Feb 2008, Nick Piggin wrote:
> I don't know how this is supposed to solve anything. The sleeping
> problem happens I guess mostly in truncate. And all you are doing
> is putting these rmap callbacks in page_mkclean and try_to_unmap.
truncate is handled by the range invalidates. This is special code to deal
with the unmap/clean of an individual page.
> That doesn't seem right. To start with, the new callbacks aren't
> even called in the places where invalidate_page isn't allowed to
> sleep.
>
> The problem is unmap_mapping_range, right? And unmap_mapping_range
> must walk the rmaps with the mmap lock held, which is why it can't
> sleep. And it can't hold any mmap_sem so it cannot prevent address
Nope. unmap_mapping_range is already handled by the range callbacks.
> So in the meantime, you could have eg. a fault come in and set up a
> new page for one of the processes, and that page might even get
> exported via the same external driver. And now you have a totally
> inconsistent view.
The situation that you are imagining has already been dealt with by the
earlier patches. This is only to allow sleeping while unmapping individual
pages.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-27 22:23 ` Christoph Lameter
@ 2008-02-27 23:57 ` Andrea Arcangeli
0 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-27 23:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 27, 2008 at 02:23:29PM -0800, Christoph Lameter wrote:
> How would that work? You rely on the pte locking. Thus calls are all in an
I don't rely on the pte locking in #v7, exactly to satisfy GRU's
(so far purely theoretical) performance complaints.
> atomic context. I think we need a general scheme that allows sleeping when
Calls are still in atomic context until we change the i_mmap_lock to a
mutex under a CONFIG_XPMEM, or unless we boost mm_users, drop the lock
and restart the loop at every different mm. In any case those changes
should be under CONFIG_XPMEM IMHO given desktop users definitely don't
need this (regular non-blocking mmu notifiers in my patch are all
a desktop user needs as far as I can tell).
> references are invalidates. Even the GRU has performance issues when using
> the KVM patch.
GRU will perform the same with #v7 or V8.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-27 22:35 ` Christoph Lameter
2008-02-27 22:42 ` Jack Steiner
@ 2008-02-28 0:10 ` Christoph Lameter
2008-02-28 0:11 ` Andrea Arcangeli
2008-03-03 5:11 ` Nick Piggin
3 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-28 0:10 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, 27 Feb 2008, Christoph Lameter wrote:
> Could you be specific? This refers to page migration? Hmmm... Guess we
> would need to inc the refcount there instead?
Argh. No, it's the callback list scanning. Yuck. No one noticed.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-27 22:35 ` Christoph Lameter
2008-02-27 22:42 ` Jack Steiner
2008-02-28 0:10 ` Christoph Lameter
@ 2008-02-28 0:11 ` Andrea Arcangeli
2008-02-28 0:14 ` Christoph Lameter
2008-03-03 5:11 ` Nick Piggin
3 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 0:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 27, 2008 at 02:35:59PM -0800, Christoph Lameter wrote:
> Could you be specific? This refers to page migration? Hmmm... Guess we
If the reader schedules, synchronize_rcu() will return on the other
cpu and the objects in the list will be freed and overwritten, and
when the task is scheduled back in, it'll follow dangling pointers...
You can't use RCU if you want any of your invalidate methods to
schedule. Otherwise it's like having zero locking.
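To spell out the hazard, here is what the notifier list walk looks like
with a classic RCU read side (list and field names are only illustrative
of the posted patches, and the atomic flag is omitted):

static void __mmu_notifier_invalidate_range_begin(struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
{
	struct mmu_notifier *mn;
	struct hlist_node *n;

	rcu_read_lock();
	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist)
		/*
		 * If this callback sleeps, another cpu can complete
		 * synchronize_rcu() in unregister and free the entries;
		 * when this task runs again it walks freed memory.
		 */
		mn->ops->invalidate_range_begin(mn, mm, start, end);
	rcu_read_unlock();
}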
> 2. Not handle file backed mappings (XPmem could work mostly in such a
> config)
IMHO that fits under your definition of "hacking something in now and
then having to modify it later".
> 3. Keep the refcount elevated until pages are freed in another execution
> context.
Page refcount is not enough (mmu_notifier_release will run on
another cpu the moment after i_mmap_lock is unlocked), but boosting
mm_users may spare us from changing the i_mmap_lock to a mutex; it'll
slow down truncate though, as it'll have to drop the lock and restart
the radix tree walk every time, so a change like this better fits as a
separate CONFIG_XPMEM IMHO.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 0:11 ` Andrea Arcangeli
@ 2008-02-28 0:14 ` Christoph Lameter
2008-02-28 0:52 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-28 0:14 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:
> > 3. Keep the refcount elevated until pages are freed in another execution
> > context.
>
> Page refcount is not enough (mmu_notifier_release will run on
> another cpu the moment after i_mmap_lock is unlocked), but boosting
> mm_users may spare us from changing the i_mmap_lock to a mutex; it'll
> slow down truncate though, as it'll have to drop the lock and restart
> the radix tree walk every time, so a change like this better fits as a
> separate CONFIG_XPMEM IMHO.
Erm. This would also be needed by RDMA etc.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-27 22:39 ` Christoph Lameter
@ 2008-02-28 0:38 ` Andrea Arcangeli
0 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 0:38 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 27, 2008 at 02:39:46PM -0800, Christoph Lameter wrote:
> On Wed, 20 Feb 2008, Andrea Arcangeli wrote:
>
> > Well, xpmem requirements are complex. As a side effect of the
> > simplicity of my approach, my patch is 100% safe since #v1. Now it
> > also works for GRU and it cluster invalidates.
>
> The patch has to satisfy RDMA, XPMEM, GRU and KVM. I keep hearing that we
> have a KVM only solution that works 100% (which makes me just
> ignore the rest of the argument because 100% solutions usually do not
> exist).
I only said 100% safe, I didn't imply anything other than it won't
crash the kernel ;).
#v6 and #v7 only leave XPMEM out AFAIK, and that can be supported
later with a CONFIG_XPMEM that purely changes some VM locking. #v7
also provides maximum performance to GRU.
> > rcu_read_lock), no "atomic" parameters, and it doesn't open a window
> > where sptes have a view on older pages and linux pte has view on newer
> > pages (this can happen with remap_file_pages with my KVM swapping
> > patch to use V8 Christoph's patch).
>
> Ok so you are now getting away from keeping the refcount elevated? That
> was your design decision....
No, I'm not getting away from it. If I got away from it, I would
be forced to implement invalidate_range_begin. However, even though I don't
get away from it, the fact that I only implement invalidate_range_end, and
that it's called after the PT lock is dropped, opens a little window of
lost coherency (which may not be detectable by userland anyway). This
little window is fine for KVM and it doesn't impose any security
risk. But clearly proving the locking safe becomes a bit more complex
in #v7 than in #v6.
> It would have helped if you had repeated my answers that you had
> already gotten before. You knew I was on vacation....
I didn't remember the BUG_ON crystal clear, sorry, but I'm not sure why you
think it was your call; this was a lowlevel XPMEM question and Robin
promptly answered/reminded me about it in fact.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-27 22:43 ` Christoph Lameter
@ 2008-02-28 0:42 ` Andrea Arcangeli
2008-02-28 1:01 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 0:42 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 27, 2008 at 02:43:41PM -0800, Christoph Lameter wrote:
> Nope. unmap_mapping_range is already handled by the range callbacks.
But they're called with atomic=1 on anything but anonymous memory. I
understood Andrew asked to remove the atomic param and to allow
sleeping for all kinds of vmas. I also understood certain XPMEM
customers asked to use XPMEM on something more than anonymous memory.
> The situation that you are imagining has already been dealt with [..]
I guess there's some misunderstanding, I think Nick was referring to
the above problem.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 0:14 ` Christoph Lameter
@ 2008-02-28 0:52 ` Andrea Arcangeli
2008-02-28 1:03 ` Christoph Lameter
2008-02-28 10:53 ` Robin Holt
0 siblings, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 0:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
> Erm. This would also be needed by RDMA etc.
The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
to schedule inside the invalidate methods AFAIK, so I doubt the above
is true. It'd be interesting to know if IB is like Quadrics and it
also doesn't require blocking to invalidate certain remote mappings.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
2008-02-28 0:42 ` Andrea Arcangeli
@ 2008-02-28 1:01 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-28 1:01 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:
> On Wed, Feb 27, 2008 at 02:43:41PM -0800, Christoph Lameter wrote:
> > Nope. unmap_mapping_range is already handled by the range callbacks.
>
> But they're called with atomic=1 on anything but anonymous memory. I
> understood Andrew asked to remove the atomic param and to allow
> sleeping for all kinds of vmas. I also understood certain XPMEM
> customers asked to use XPMEM on something more than anonymous memory.
Yes but the patch that is discussed here does not handle that situation.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 0:52 ` Andrea Arcangeli
@ 2008-02-28 1:03 ` Christoph Lameter
2008-02-28 1:10 ` Andrea Arcangeli
2008-02-28 10:53 ` Robin Holt
1 sibling, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-28 1:03 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:
> On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
> > Erm. This would also be needed by RDMA etc.
>
> The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
> to schedule inside the invalidate methods AFAIK, so I doubt the above
> is true. It'd be interesting to know if IB is like Quadrics and it
> also doesn't require blocking to invalidate certain remote mappings.
RDMA works across a network and I would assume that it needs confirmation
that a connection has been torn down before pages can be unmapped.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 1:03 ` Christoph Lameter
@ 2008-02-28 1:10 ` Andrea Arcangeli
2008-02-28 18:43 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-28 1:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Wed, Feb 27, 2008 at 05:03:21PM -0800, Christoph Lameter wrote:
> RDMA works across a network and I would assume that it needs confirmation
> that a connection has been torn down before pages can be unmapped.
Depends on the latency of the network, for example with page pinning
it can even try to reduce the wait time, by tearing down the mapping
in range_begin and spin waiting for the ack only later in range_end.
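A hedged sketch of that split; rdma_post_unmap()/rdma_unmap_acked() and
mn_to_dev() are placeholders for a driver's own primitives, and the
atomic flag of the posted callbacks is omitted:

static void drv_invalidate_range_begin(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	/* Fire the remote teardown now; it proceeds in parallel with the
	 * Linux pte teardown that happens between begin and end. */
	rdma_post_unmap(mn_to_dev(mn), start, end);
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	/* Most of the network latency has already been overlapped;
	 * spin for whatever is left instead of sleeping. */
	while (!rdma_unmap_acked(mn_to_dev(mn), start, end))
		cpu_relax();
}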
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 0:52 ` Andrea Arcangeli
2008-02-28 1:03 ` Christoph Lameter
@ 2008-02-28 10:53 ` Robin Holt
1 sibling, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-02-28 10:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Nick Piggin, akpm, Robin Holt, Avi Kivity,
Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
daniel.blueman
On Thu, Feb 28, 2008 at 01:52:50AM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 27, 2008 at 04:14:08PM -0800, Christoph Lameter wrote:
> > Erm. This would also be needed by RDMA etc.
>
> The only RDMA I know is Quadrics, and Quadrics apparently doesn't need
> to schedule inside the invalidate methods AFAIK, so I doubt the above
> is true. It'd be interesting to know if IB is like Quadrics and it
> also doesn't require blocking to invalidate certain remote mappings.
We got an answer from the IB guys already. They do not track which of
their handles are being used by remote processes, so neither approach
will work for their purposes, with the exception of straight unmaps. In
that case, they could use the callout to remove TLB information and rely
on the lack of page table information to kill the user's process.
Without changes to their library spec, I don't believe anything further
is possible. If they did change their library spec, I believe they
could get things to work the same way that XPMEM has gotten things to
work, where a message is sent to the remote side for TLB clearing and
that will require sleeping.
Thanks,
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 1:10 ` Andrea Arcangeli
@ 2008-02-28 18:43 ` Christoph Lameter
2008-02-29 0:55 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-28 18:43 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, 28 Feb 2008, Andrea Arcangeli wrote:
> On Wed, Feb 27, 2008 at 05:03:21PM -0800, Christoph Lameter wrote:
> > RDMA works across a network and I would assume that it needs confirmation
> > that a connection has been torn down before pages can be unmapped.
>
> Depends on the latency of the network, for example with page pinning
> it can even try to reduce the wait time, by tearing down the mapping
> in range_begin and spin waiting for the ack only later in range_end.
What about invalidate_page()?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-28 18:43 ` Christoph Lameter
@ 2008-02-29 0:55 ` Andrea Arcangeli
2008-02-29 0:59 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 0:55 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, Feb 28, 2008 at 10:43:54AM -0800, Christoph Lameter wrote:
> What about invalidate_page()?
That would just spin waiting for an ack (just like the smp-tlb-flushing
invalidates in numa already do).
Thinking more about this, we could also parallelize it with an
invalidate_page_before/end. If it takes 1usec to flush remotely,
scheduling would be overkill, but spending 1usec in a while loop isn't
nice if we can parallelize that 1usec with the ipi-tlb-flush. Not sure
if it makes sense... it certainly would be quick to add it (especially
thanks to _notify ;).
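A quick sketch of what such a parallelized single-page flush could look
like, modeled on the _notify wrappers; the invalidate_page_before/end
hooks are hypothetical and not part of any posted patch:

static inline pte_t ptep_clear_flush_notify(struct vm_area_struct *vma,
					    unsigned long address,
					    pte_t *ptep)
{
	pte_t pte;

	/* Post the remote invalidate first so it overlaps the ipi flush. */
	mmu_notifier(invalidate_page_before, vma->vm_mm, address);
	pte = ptep_clear_flush(vma, address, ptep);
	/* Only now spin for the remote ack. */
	mmu_notifier(invalidate_page_end, vma->vm_mm, address);
	return pte;
}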
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 0:55 ` Andrea Arcangeli
@ 2008-02-29 0:59 ` Christoph Lameter
2008-02-29 13:13 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-29 0:59 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> On Thu, Feb 28, 2008 at 10:43:54AM -0800, Christoph Lameter wrote:
> > What about invalidate_page()?
>
> That would just spin waiting for an ack (just like the smp-tlb-flushing
> invalidates in numa already do).
And thus the device driver may stop receiving data on a UP system? It will
never get the ack.
> Thinking more about this, we could also parallelize it with an
> invalidate_page_before/end. If it takes 1usec to flush remotely,
> scheduling would be overkill, but spending 1usec in a while loop isn't
> nice if we can parallelize that 1usec with the ipi-tlb-flush. Not sure
> if it makes sense... it certainly would be quick to add it (especially
> thanks to _notify ;).
invalidate_page_before/end could be realized as an
invalidate_range_begin/end on a page sized range?
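Something like the following, give or take the exact callback signatures
of the posted patch (the atomic flag is left out of this sketch):

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
						unsigned long address)
{
	/* A page-sized begin/end pair standing in for invalidate_page(). */
	mmu_notifier(invalidate_range_begin, mm, address, address + PAGE_SIZE);
	mmu_notifier(invalidate_range_end, mm, address, address + PAGE_SIZE);
}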
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 0:59 ` Christoph Lameter
@ 2008-02-29 13:13 ` Andrea Arcangeli
2008-02-29 19:55 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 13:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Thu, Feb 28, 2008 at 04:59:59PM -0800, Christoph Lameter wrote:
> And thus the device driver may stop receiving data on a UP system? It will
> never get the ack.
Not sure I follow, sorry.
My idea was:
post the invalidate in the mmio region of the device
smp_call_function()
while (mmio device wait-bitflag is on);
Instead of the current:
smp_call_function()
post the invalidate in the mmio region of the device
while (mmio device wait-bitflag is on);
To decrease the wait loop time.
> invalidate_page_before/end could be realized as an
> invalidate_range_begin/end on a page sized range?
If we go this route, once you add support to xpmem, you'll have to
make the anon_vma lock a mutex too; that would be fine with me
though. The main reason invalidate_page exists is to allow you to
leave it non-sleep-capable even after you make invalidate_range
sleep capable, and to implement the mmu_rmap_notifiers as sleep capable
in all the paths where invalidate_page would be called. That was the
strategy you had in your patch. I'll try to drop invalidate_page. I
wonder if then you won't need the mmu_rmap_notifiers anymore.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 13:13 ` Andrea Arcangeli
@ 2008-02-29 19:55 ` Christoph Lameter
2008-02-29 20:17 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-29 19:55 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> On Thu, Feb 28, 2008 at 04:59:59PM -0800, Christoph Lameter wrote:
> > And thus the device driver may stop receiving data on a UP system? It will
> > never get the ack.
>
> Not sure I follow, sorry.
>
> My idea was:
>
> post the invalidate in the mmio region of the device
> smp_call_function()
> while (mmio device wait-bitflag is on);
So the device driver on UP can only operate through interrupts? If you are
hogging the only cpu then driver operations may not be possible.
> > invalidate_page_before/end could be realized as an
> > invalidate_range_begin/end on a page sized range?
>
> If we go this route, once you add support to xpmem, you'll have to
> make the anon_vma lock a mutex too; that would be fine with me
> though. The main reason invalidate_page exists is to allow you to
> leave it non-sleep-capable even after you make invalidate_range
> sleep capable, and to implement the mmu_rmap_notifiers as sleep capable
> in all the paths where invalidate_page would be called. That was the
> strategy you had in your patch. I'll try to drop invalidate_page. I
> wonder if then you won't need the mmu_rmap_notifiers anymore.
I am mainly concerned with making the mmu notifier a generally useful
feature for multiple users. Xpmem is just one example of a different
type of callback user.
It is not the gold standard that you make it out to be. RDMA is another and
there are likely scores of others (DMA engines etc) once it becomes clear
that such a feature is available. In general the mmu notifier will allow
us to fix the problems caused by memory pinning and mlock by various
devices and other mechanisms that need to directly access memory.
And yes I would like to get rid of the mmu_rmap_notifiers altogether. It
would be much cleaner with just one mmu_notifier that can sleep in all
functions.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 19:55 ` Christoph Lameter
@ 2008-02-29 20:17 ` Andrea Arcangeli
2008-02-29 21:03 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 20:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, Feb 29, 2008 at 11:55:17AM -0800, Christoph Lameter wrote:
> > post the invalidate in the mmio region of the device
> > smp_call_function()
> > while (mmio device wait-bitflag is on);
>
> So the device driver on UP can only operate through interrupts? If you are
> hogging the only cpu then driver operations may not be possible.
There was no irq involved in the above pseudocode; the irq, if
any, would run on the remote system. Still, irqs can run fine
during the while loop, just like they run fine on top of
smp_call_function. The send-irq and the following spin-on-a-bitflag
work exactly like smp_call_function, except that what is being
invalidated isn't a numa CPU.
> And yes I would like to get rid of the mmu_rmap_notifiers altogether. It
> would be much cleaner with just one mmu_notifier that can sleep in all
> functions.
Agreed. I just thought xpmem needed an invalidate-by-page, but
I'm glad if xpmem can go in sync with the KVM/GRU/DRI model in this
regard.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 20:17 ` Andrea Arcangeli
@ 2008-02-29 21:03 ` Christoph Lameter
2008-02-29 21:23 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:03 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> Agreed. I just thought xpmem needed an invalidate-by-page, but
> I'm glad if xpmem can go in sync with the KVM/GRU/DRI model in this
> regard.
That means we need both the anon_vma locks and the i_mmap_lock to become
semaphores. I think semaphores are better than mutexes. Rik and Lee saw
some performance improvements because list can be traversed in parallel
when the anon_vma lock is switched to be a rw lock.
Sounds like we get to a conceptually clean version here?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 21:03 ` Christoph Lameter
@ 2008-02-29 21:23 ` Andrea Arcangeli
2008-02-29 21:29 ` Christoph Lameter
2008-02-29 21:34 ` Christoph Lameter
0 siblings, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 21:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
> That means we need both the anon_vma locks and the i_mmap_lock to become
> semaphores. I think semaphores are better than mutexes. Rik and Lee saw
> some performance improvements because list can be traversed in parallel
> when the anon_vma lock is switched to be a rw lock.
The improvement was with a rw spinlock IIRC, so I don't see how it's
related to this.
Perhaps the rwlock spinlock can be changed to a rw semaphore without
measurable overscheduling in the fast path. However theoretically
speaking the rw_lock spinlock is more efficient than a rw semaphore in
case of a little contention during the page fault fast path because
the critical section is just a list_add so it'd be overkill to
schedule while waiting. That's why currently it's a spinlock (or rw
spinlock).
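For reference, the page fault side critical section being discussed is
essentially this (simplified from the anon_vma code of that era):

void anon_vma_link(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	if (anon_vma) {
		spin_lock(&anon_vma->lock);
		/* The whole critical section is a single list insertion. */
		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
		spin_unlock(&anon_vma->lock);
	}
}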
> Sounds like we get to a conceptually clean version here?
I don't have a strong opinion if it should become a semaphore
unconditionally or only with a CONFIG_XPMEM=y. But keep in mind
preempt-rt runs quite a bit slower, or we could rip spinlocks out of
the kernel in the first place ;)
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 21:23 ` Andrea Arcangeli
@ 2008-02-29 21:29 ` Christoph Lameter
2008-02-29 21:34 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> I don't have a strong opinion if it should become a semaphore
> unconditionally or only with a CONFIG_XPMEM=y. But keep in mind
> preempt-rt runs quite a bit slower, or we could rip spinlocks out of
> the kernel in the first place ;)
Do you just skip people's comments on the mmu_notifier? It took me
reminding you about Andrew's comments for you to note those. And I just
responded on the XPmem issue this morning.
Again for the gazillionth time: There will be no CONFIG_XPMEM because the
functionality needs to be generic and not XPMEM specific.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 21:23 ` Andrea Arcangeli
2008-02-29 21:29 ` Christoph Lameter
@ 2008-02-29 21:34 ` Christoph Lameter
2008-02-29 21:48 ` Andrea Arcangeli
1 sibling, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:34 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
> > That means we need both the anon_vma locks and the i_mmap_lock to become
> > semaphores. I think semaphores are better than mutexes. Rik and Lee saw
> > some performance improvements because list can be traversed in parallel
> > when the anon_vma lock is switched to be a rw lock.
>
> The improvement was with a rw spinlock IIRC, so I don't see how it's
> related to this.
AFAICT The rw semaphore fastpath is similar in performance to a rw
spinlock.
> Perhaps the rwlock spinlock can be changed to a rw semaphore without
> measurable overscheduling in the fast path. However theoretically
Overscheduling? You mean overhead?
> speaking the rw_lock spinlock is more efficient than a rw semaphore in
> case of a little contention during the page fault fast path because
> the critical section is just a list_add so it'd be overkill to
> schedule while waiting. That's why currently it's a spinlock (or rw
> spinlock).
On the other hand a semaphore puts the process to sleep and may actually
improve performance because there is less time spent in a busy loop.
Other processes may do something useful and we stay off the contended
cacheline reducing traffic on the interconnect.
> preempt-rt runs quite a bit slower, or we could rip spinlocks out of
> the kernel in the first place ;)
The question is why that is the case, and it seems that there are issues
with interrupt on/off that are important here and particularly significant
with the SLAB allocator (significant hacks there to deal with that issue).
The fastpath that we have in the works for SLUB may address a large
part of that issue because it no longer relies on disabling interrupts.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 21:34 ` Christoph Lameter
@ 2008-02-29 21:48 ` Andrea Arcangeli
2008-02-29 22:12 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 21:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, Feb 29, 2008 at 01:34:34PM -0800, Christoph Lameter wrote:
> On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
>
> > On Fri, Feb 29, 2008 at 01:03:16PM -0800, Christoph Lameter wrote:
> > > That means we need both the anon_vma locks and the i_mmap_lock to become
> > > semaphores. I think semaphores are better than mutexes. Rik and Lee saw
> > > some performance improvements because list can be traversed in parallel
> > > when the anon_vma lock is switched to be a rw lock.
> >
> > The improvement was with a rw spinlock IIRC, so I don't see how it's
> > related to this.
>
> AFAICT The rw semaphore fastpath is similar in performance to a rw
> spinlock.
read side is taken in the slow path.
write side is taken in the fast path.
pagefault is fast path, VM during swapping is slow path.
> > Perhaps the rwlock spinlock can be changed to a rw semaphore without
> > measurable overscheduling in the fast path. However theoretically
>
> Overscheduling? You mean overhead?
The only possible overhead that a rw semaphore could ever generate vs
a rw lock is overscheduling.
> > speaking the rw_lock spinlock is more efficient than a rw semaphore in
> > case of a little contention during the page fault fast path because
> > the critical section is just a list_add so it'd be overkill to
> > schedule while waiting. That's why currently it's a spinlock (or rw
> > spinlock).
>
> On the other hand a semaphore puts the process to sleep and may actually
> improve performance because there is less time spent in a busy loop.
> Other processes may do something useful and we stay off the contended
> cacheline reducing traffic on the interconnect.
Yes, that's the positive side; the negative side is that you'll put
the task in uninterruptible sleep and call schedule() and require a
wakeup, because a list_add taking <1usec is running on the
other cpu. No other downside. But that's the only reason it's a
spinlock right now, in fact there can't be any other reason.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 21:48 ` Andrea Arcangeli
@ 2008-02-29 22:12 ` Christoph Lameter
2008-02-29 22:41 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-02-29 22:12 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
> > AFAICT The rw semaphore fastpath is similar in performance to a rw
> > spinlock.
>
> read side is taken in the slow path.
Slowpath meaning VM slowpath or lock slow path? It seems that the rwsem
read side path is pretty efficient:
static inline void __down_read(struct rw_semaphore *sem)
{
	__asm__ __volatile__(
		"# beginning down_read\n\t"
		LOCK_PREFIX " incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
		" jns 1f\n"
		" call call_rwsem_down_read_failed\n"
		"1:\n\t"
		"# ending down_read\n\t"
		: "+m" (sem->count)
		: "a" (sem)
		: "memory", "cc");
}
>
> write side is taken in the fast path.
>
> pagefault is fast path, VM during swapping is slow path.
Not sure what you are saying here. A pagefault should be considered as a
fast path and swapping is not performance critical?
> > > Perhaps the rwlock spinlock can be changed to a rw semaphore without
> > > measurable overscheduling in the fast path. However theoretically
> >
> > Overscheduling? You mean overhead?
>
> The only possible overhead that a rw semaphore could ever generate vs
> a rw lock is overscheduling.
Ok too many calls to schedule() because the slow path (of the semaphore)
is taken?
> > On the other hand a semaphore puts the process to sleep and may actually
> > improve performance because there is less time spent in a busy loop.
> > Other processes may do something useful and we stay off the contended
> > cacheline reducing traffic on the interconnect.
>
> Yes, that's the positive side; the negative side is that you'll put
> the task in uninterruptible sleep and call schedule() and require a
> wakeup, because a list_add taking <1usec is running on the
> other cpu. No other downside. But that's the only reason it's a
> spinlock right now, in fact there can't be any other reason.
But that is only happening in the contended case. Certainly a spinlock is
better for a 2p system, but the more processors contend for the lock (and
the longer the holdoff is, typical for systems with 4p or 8p or
more), the better a semaphore will work.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-29 22:12 ` Christoph Lameter
@ 2008-02-29 22:41 ` Andrea Arcangeli
0 siblings, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-02-29 22:41 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, akpm, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
steiner, linux-kernel, linux-mm, daniel.blueman
On Fri, Feb 29, 2008 at 02:12:57PM -0800, Christoph Lameter wrote:
> On Fri, 29 Feb 2008, Andrea Arcangeli wrote:
>
> > > AFAICT The rw semaphore fastpath is similar in performance to a rw
> > > spinlock.
> >
> > read side is taken in the slow path.
>
> Slowpath meaning VM slowpath or lock slow path? It seems that the rwsem
With slow path I meant the VM. Sorry if that was confusing given locks
also have fast paths (no contention) and slow paths (contention).
> read side path is pretty efficient:
Yes. The assembly doesn't worry me at all.
> > pagefault is fast path, VM during swapping is slow path.
>
> Not sure what you are saying here. A pagefault should be considered as a
> fast path and swapping is not performance critical?
Yes, swapping is I/O bound and it rarely becomes a CPU hog in the common
case.
There are corner-case workloads (including OOM) where swapping can
become cpu bound (that's also where the rwlock helps). But certainly the
speed of fork() and of a page fault is critical for _everyone_, not just
a few workloads and setups.
> Ok too many calls to schedule() because the slow path (of the semaphore)
> is taken?
Yes, that's the only possible worry when converting a spinlock to
a mutex.
> But that is only happening in the contended case. Certainly a spinlock is
> better for a 2p system, but the more processors contend for the lock (and
> the longer the holdoff is, typical for systems with 4p or 8p or
> more), the better a semaphore will work.
Sure. That's also why the PT lock switches for >4way compiles. A config
option helps to keep the VM optimal for everyone. Here it is possible
it won't be necessary, but I can't be sure given both the i_mmap_lock and
the anon_vma lock are used in so many places. Some TPC comparison would
be nice before making a default switch IMHO.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-02-27 22:35 ` Christoph Lameter
` (2 preceding siblings ...)
2008-02-28 0:11 ` Andrea Arcangeli
@ 2008-03-03 5:11 ` Nick Piggin
2008-03-03 19:28 ` Christoph Lameter
3 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-03-03 5:11 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Thursday 28 February 2008 09:35, Christoph Lameter wrote:
> On Wed, 20 Feb 2008, Nick Piggin wrote:
> > On Friday 15 February 2008 17:49, Christoph Lameter wrote:
> > Also, what we are going to need here are not skeleton drivers
> > that just do all the *easy* bits (of registering their callbacks),
> > but actual fully working examples that do everything that any
> > real driver will need to do. If not for the sanity of the driver
> > writer, then for the sanity of the VM developers (I don't want
> > to have to understand xpmem or infiniband in order to understand
> > how the VM works).
>
> There are 3 different drivers that can already use it, but the code is
> complex and not easy to review. Skeletons make it easy for people to get
> started with it.
Your skeleton is just registering notifiers and saying
/* you fill the hard part in */
If somebody needs a skeleton in order just to register the notifiers,
then almost by definition they are unqualified to write the hard
part ;)
> > > lru_add_drain();
> > > tlb = tlb_gather_mmu(mm, 0);
> > > update_hiwater_rss(mm);
> > > + mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
> > > end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
> > > if (tlb)
> > > tlb_finish_mmu(tlb, address, end);
> > > + mmu_notifier(invalidate_range_end, mm, address, end, atomic);
> > > return end;
> > > }
> >
> > Where do you invalidate for munmap()?
>
> zap_page_range() called from unmap_vmas().
But it is not allowed to sleep. Where do you call the sleepable one
from?
> > Also, how to you resolve the case where you are not allowed to sleep?
> > I would have thought either you have to handle it, in which case nobody
> > needs to sleep; or you can't handle it, in which case the code is
> > broken.
>
> That can be done in a variety of ways:
>
> 1. Change VM locking
>
> 2. Not handle file backed mappings (XPmem could work mostly in such a
> config)
>
> 3. Keep the refcount elevated until pages are freed in another execution
> context.
OK, there are ways to solve it or hack around it. But this is exactly
why I think the implementations should be kept separate. Andrea's
notifiers are coherent, work on all types of mappings, and will
hopefully match closely the regular TLB invalidation sequence in the
Linux VM (at the moment it is quite close, but I hope to make it a
bit closer) so that it requires almost no changes to the mm.
All the other attempts to make it sleep amount to hacking holes
in it (eg by removing coherency). So I don't think it is reasonable to
require that any patch handle all cases. I actually think Andrea's
patch is quite nice and simple itself, whereas I am against the patches
that you posted.
What about a completely different approach... XPmem runs over NUMAlink,
right? Why not provide some non-sleeping way to basically IPI remote
nodes over the NUMAlink where they can process the invalidation? If your
intra-node cache coherency has to run over this link anyway, then
presumably it is capable.
Or another idea, why don't you LD_PRELOAD in the MPT library to also
intercept munmap, mprotect, mremap etc as well as just fork()? That
would give you similarly "good enough" coherency as the mmu notifier
patches except that you can't swap (which Robin said was not a big
problem).
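A minimal sketch of such an interposer for munmap(); mpt_recall_region()
is a hypothetical MPT library hook, everything else is standard
LD_PRELOAD plumbing:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical MPT hook that recalls the remote side's references. */
extern void mpt_recall_region(void *addr, size_t len);

int munmap(void *addr, size_t len)
{
	static int (*real_munmap)(void *, size_t);

	if (!real_munmap)
		real_munmap = (int (*)(void *, size_t))dlsym(RTLD_NEXT, "munmap");

	/* Pull back the remote mappings before the kernel tears down ours. */
	mpt_recall_region(addr, len);
	return real_munmap(addr, len);
}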
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-03-03 5:11 ` Nick Piggin
@ 2008-03-03 19:28 ` Christoph Lameter
2008-03-03 19:50 ` Nick Piggin
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-03-03 19:28 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Mon, 3 Mar 2008, Nick Piggin wrote:
> Your skeleton is just registering notifiers and saying
>
> /* you fill the hard part in */
>
> If somebody needs a skeleton in order just to register the notifiers,
> then almost by definition they are unqualified to write the hard
> part ;)
It's also providing a locking scheme.
> OK, there are ways to solve it or hack around it. But this is exactly
> why I think the implementations should be kept seperate. Andrea's
> notifiers are coherent, work on all types of mappings, and will
> hopefully match closely the regular TLB invalidation sequence in the
> Linux VM (at the moment it is quite close, but I hope to make it a
> bit closer) so that it requires almost no changes to the mm.
Then put it into the arch code for TLB invalidation. Paravirt ops gives
good examples on how to do that.
> What about a completely different approach... XPmem runs over NUMAlink,
> right? Why not provide some non-sleeping way to basically IPI remote
> nodes over the NUMAlink where they can process the invalidation? If your
> intra-node cache coherency has to run over this link anyway, then
> presumably it is capable.
There is another Linux instance at the remote end that first has to
remove its own ptes. Also would not work for Infiniband and other
solutions. All the approaches that require evictions in an atomic context
are limiting the approach and do not allow the generic functionality that
we want in order to not add alternate APIs for this.
> Or another idea, why don't you LD_PRELOAD in the MPT library to also
> intercept munmap, mprotect, mremap etc as well as just fork()? That
> would give you similarly "good enough" coherency as the mmu notifier
> patches except that you can't swap (which Robin said was not a big
> problem).
The good enough solution right now is to pin pages by elevating
refcounts.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-03-03 19:28 ` Christoph Lameter
@ 2008-03-03 19:50 ` Nick Piggin
2008-03-04 18:58 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2008-03-03 19:50 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Tuesday 04 March 2008 06:28, Christoph Lameter wrote:
> On Mon, 3 Mar 2008, Nick Piggin wrote:
> > Your skeleton is just registering notifiers and saying
> >
> > /* you fill the hard part in */
> >
> > If somebody needs a skeleton in order just to register the notifiers,
> > then almost by definition they are unqualified to write the hard
> > part ;)
>
> It's also providing a locking scheme.
Not the full locking scheme. If you have a look at the real code
required to do it, it is non-trivial.
> > OK, there are ways to solve it or hack around it. But this is exactly
> > why I think the implementations should be kept seperate. Andrea's
> > notifiers are coherent, work on all types of mappings, and will
> > hopefully match closely the regular TLB invalidation sequence in the
> > Linux VM (at the moment it is quite close, but I hope to make it a
> > bit closer) so that it requires almost no changes to the mm.
>
> Then put it into the arch code for TLB invalidation. Paravirt ops gives
> good examples on how to do that.
Put what into arch code?
> > What about a completely different approach... XPmem runs over NUMAlink,
> > right? Why not provide some non-sleeping way to basically IPI remote
> > nodes over the NUMAlink where they can process the invalidation? If your
> > intra-node cache coherency has to run over this link anyway, then
> > presumably it is capable.
>
> There is another Linux instance at the remote end that first has to
> remove its own ptes.
Yeah, what's the problem?
> Also would not work for Infiniband and other
> solutions.
Infiniband doesn't want it. "Other solutions" is just handwaving,
because if we don't know what the other solutions are, then we can't
make any sort of informed choice.
> All the approaches that require evictions in an atomic context
> are limiting the approach and do not allow the generic functionality that
> we want in order to not add alternate APIs for this.
The only generic way to do this that I have seen (and the only proposed
way that doesn't add alternate APIs for that matter) is turning VM locks
into sleeping locks. In which case, Andrea's notifiers will work just
fine (except for relatively minor details like rcu list scanning).
So I don't see what you're arguing for. There is no requirement that we
support sleeping notifiers in the same patch as non-sleeping ones.
Considering the simplicity of the non-sleeping notifiers and the
problems with sleeping ones, I think it is pretty clear that they are
different beasts (unless VM locking is changed).
> > Or another idea, why don't you LD_PRELOAD in the MPT library to also
> > intercept munmap, mprotect, mremap etc as well as just fork()? That
> > would give you similarly "good enough" coherency as the mmu notifier
> > patches except that you can't swap (which Robin said was not a big
> > problem).
>
> The good enough solution right now is to pin pages by elevating
> refcounts.
Which kind of leads to the question of why do you need any further
kernel patches if that is good enough?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-03-03 19:50 ` Nick Piggin
@ 2008-03-04 18:58 ` Christoph Lameter
2008-03-05 0:52 ` Nick Piggin
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-03-04 18:58 UTC (permalink / raw)
To: Nick Piggin
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Tue, 4 Mar 2008, Nick Piggin wrote:
> > Then put it into the arch code for TLB invalidation. Paravirt ops gives
> > good examples on how to do that.
>
> Put what into arch code?
The mmu notifier code.
> > > What about a completely different approach... XPmem runs over NUMAlink,
> > > right? Why not provide some non-sleeping way to basically IPI remote
> > > nodes over the NUMAlink where they can process the invalidation? If your
> > > intra-node cache coherency has to run over this link anyway, then
> > > presumably it is capable.
> >
> > There is another Linux instance at the remote end that first has to
> > remove its own ptes.
>
> Yeah, what's the problem?
The remote end has to invalidate the page which involves locking etc.
> > Also would not work for Infiniband and other
> > solutions.
>
> infiniband doesn't want it. Other solutions is just handwaving,
> because if we don't know what the other soloutions are, then we can't
> make any sort of informed choices.
We need a solution in general to avoid the pinning problems. Infiniband
has those too.
> > All the approaches that require evictions in an atomic context
> > are limiting the approach and do not allow the generic functionality that
> > we want in order to not add alternate APIs for this.
>
> The only generic way to do this that I have seen (and the only proposed
> way that doesn't add alternate APIs for that matter) is turning VM locks
> into sleeping locks. In which case, Andrea's notifiers will work just
> fine (except for relatively minor details like rcu list scanning).
No they won't. As you pointed out, the callbacks need RCU locking.
> > The good enough solution right now is to pin pages by elevating
> > refcounts.
>
> Which kind of leads to the question of why do you need any further
> kernel patches if that is good enough?
Well it's good enough, but with severe problems during reclaim, livelocks etc.
One could improve on that scheme through Rik's work trying to add a new
page flag that marks pinned pages and then keep them off the LRUs and
limit their number. Having pinned pages would limit the ability of the VM
to reclaim and would make page migration, memory unplug etc impossible.
It is better to have a notifier scheme that allows telling a device driver
to free up the memory it has mapped.
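Concretely, such a driver-side notifier might look like the following sketch,
written against the ops proposed in this patchset (all driver-side names such
as my_dev and my_drop_external_ptes are made up, and the API that was
eventually merged upstream differs):

    #include <linux/mmu_notifier.h>

    struct my_dev {
        struct mmu_notifier mn;
        /* driver-private tracking of the external ptes would live here */
    };

    /* Driver-specific teardown of external ptes for [start, end); not shown. */
    static void my_drop_external_ptes(struct my_dev *dev,
                                      unsigned long start, unsigned long end);

    static void my_invalidate_range_begin(struct mmu_notifier *mn,
            struct mm_struct *mm,
            unsigned long start, unsigned long end, int atomic)
    {
        struct my_dev *dev = container_of(mn, struct my_dev, mn);

        /* Drop the device's references for the range and flush its TLB. */
        my_drop_external_ptes(dev, start, end);
    }

    static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
    {
        /* Address space is going away: zap all external mappings in one go. */
        my_drop_external_ptes(container_of(mn, struct my_dev, mn), 0, ~0UL);
    }

    static const struct mmu_notifier_ops my_ops = {
        .invalidate_range_begin = my_invalidate_range_begin,
        .release                = my_release,
    };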
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
2008-03-04 18:58 ` Christoph Lameter
@ 2008-03-05 0:52 ` Nick Piggin
0 siblings, 0 replies; 116+ messages in thread
From: Nick Piggin @ 2008-03-05 0:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman
On Wednesday 05 March 2008 05:58, Christoph Lameter wrote:
> On Tue, 4 Mar 2008, Nick Piggin wrote:
> > > Then put it into the arch code for TLB invalidation. Paravirt ops gives
> > > good examples on how to do that.
> >
> > Put what into arch code?
>
> The mmu notifier code.
It isn't arch specific.
> > > > What about a completely different approach... XPmem runs over
> > > > NUMAlink, right? Why not provide some non-sleeping way to basically
> > > > IPI remote nodes over the NUMAlink where they can process the
> > > > invalidation? If your intra-node cache coherency has to run over this
> > > > link anyway, then presumably it is capable.
> > >
> > > There is another Linux instance at the remote end that first has to
> > > remove its own ptes.
> >
> > Yeah, what's the problem?
>
> The remote end has to invalidate the page which involves locking etc.
I don't see what the problem is.
> > > Also would not work for Infiniband and other
> > > solutions.
> >
> > infiniband doesn't want it. Other solutions is just handwaving,
> > because if we don't know what the other solutions are, then we can't
> > make any sort of informed choices.
>
> We need a solution in general to avoid the pinning problems. Infiniband
> has those too.
>
> > > All the approaches that require evictions in an atomic context
> > > are limiting the approach and do not allow the generic functionality
> > > that we want in order to not add alternate APIs for this.
> >
> > The only generic way to do this that I have seen (and the only proposed
> > way that doesn't add alternate APIs for that matter) is turning VM locks
> > into sleeping locks. In which case, Andrea's notifiers will work just
> > fine (except for relatively minor details like rcu list scanning).
>
> No they won't. As you pointed out, the callbacks need RCU locking.
That can be fixed easily.
> > > The good enough solution right now is to pin pages by elevating
> > > refcounts.
> >
> > Which kind of leads to the question of why do you need any further
> > kernel patches if that is good enough?
>
> Well it's good enough, but with severe problems during reclaim, livelocks etc.
> One could improve on that scheme through Rik's work trying to add a new
> page flag that marks pinned pages and then keep them off the LRUs and
> limit their number. Having pinned pages would limit the ability of the VM
> to reclaim and would make page migration, memory unplug etc impossible.
Well not impossible. You could have a callback to invalidate the remote
TLB and drop the pin on a given page.
> It is better to have a notifier scheme that allows telling a device driver
> to free up the memory it has mapped.
Yeah, it would be nice for those people with clusters of Altixes. Doesn't
mean it has to go upstream, though.
^ permalink raw reply [flat|nested] 116+ messages in thread
* [patch 1/6] mmu_notifier: Core code
2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
To: akpm
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman
[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 18134 bytes --]
MMU notifiers are used for hardware and software that establish
external references to pages managed by the Linux kernel. These are
page table entries, TLB entries or something else that allows
hardware (such as DMA engines, scatter gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (Virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.
The MMU notifier will notify the device driver that subscribes to such
a notifier that the VM is going to do something with the memory
mapped by that device. The device must then drop references for the
indicated memory area. The references may be reestablished later.
The notification scheme is much better than the current scheme of
dealing with the danger of the VM removing pages that are externally
mapped. We currently mlock pages used for RDMA, XPmem etc in memory.
Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of what POSIX
specifies for the role of mlock. Mlock does *not* pin pages in
memory. Mlock just means: do not allow the page to be moved to swap.
Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).
The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.
The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning.
This patch: Core portion
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---
Documentation/mmu_notifier/README | 99 +++++++++++++++++++++
include/linux/mm_types.h | 7 +
include/linux/mmu_notifier.h | 175 ++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2
mm/Kconfig | 4
mm/Makefile | 1
mm/mmap.c | 2
mm/mmu_notifier.c | 76 ++++++++++++++++
8 files changed, 366 insertions(+)
Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README 2008-02-08 12:30:47.000000000 -0800
@@ -0,0 +1,99 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establish
+external references to pages managed by the Linux kernel. These are
+page table entries, TLB entries or something else that allows
+hardware (such as DMA engines, scatter gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (Virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier will notify the device driver that subscribes to such
+a notifier that the VM is going to do something with the memory
+mapped by that device. The device must then drop references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current scheme of
+dealing with the danger of the VM removing pages.
+We currently mlock pages used for RDMA, XPmem etc in memory.
+
+Mlock causes problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. It is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allows
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example of code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows the device driver to implement its own rmap
+and allows the device driver to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageRmapExternal(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional page flag is only available in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build an mmu_notifier mechanism with rmap capability
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+ Christoph Lameter <clameter@sgi.com>
+
+Index: linux-2.6/include/linux/mm_types.h
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h 2008-02-08 12:30:47.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
#endif
};
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+ struct hlist_head head;
+#endif
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
#ifdef CONFIG_CGROUP_MEM_CONT
struct mem_cgroup *mem_cgroup;
#endif
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h 2008-02-08 12:35:14.000000000 -0800
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * These are callbacks registered with an mm_struct. If pages are
+ * removed from an address space then callbacks are performed.
+ *
+ * Spinlocks must be held in order to walk reverse maps. The
+ * invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * The invalidate_range_begin/end callbacks can be performed in contexts
+ * where sleeping is allowed or in atomic contexts. A flag is passed
+ * to indicate an atomic context.
+ *
+ * Pages must be marked dirty if dirty bits are found to be set in
+ * the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * The release notifier is called when no other execution threads
+ * are left. Synchronization is not necessary.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ /*
+ * age_page is called from contexts where the pte_lock is held
+ */
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /* invalidate_page is called from contexts where the pte_lock is held */
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * invalidate_range_begin() and invalidate_range_end() must be paired.
+ *
+ * Multiple invalidate_range_begin/ends may be nested or called
+ * concurrently. That is legit. However, no new external references
+ * may be established as long as any invalidate_xxx is running or
+ * any invalidate_range_begin() has not been completed through a
+ * corresponding call to invalidate_range_end().
+ *
+ * Locking within the notifier needs to serialize events correspondingly.
+ *
+ * invalidate_range_begin() must clear all references in the range
+ * and stop the establishment of new references.
+ *
+ * invalidate_range_end() reenables the establishment of references.
+ *
+ * atomic indicates that the function is called in an atomic context.
+ * We can sleep if atomic == 0.
+ */
+ void (*invalidate_range_begin)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int atomic);
+
+ void (*invalidate_range_end)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ }; \
+ } while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Kconfig 2008-02-08 12:30:47.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Makefile 2008-02-08 12:30:47.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c 2008-02-08 12:44:24.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *t;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ hlist_for_each_entry_safe(mn, n, t,
+ &mm->mmu_notifier.head, hlist) {
+ hlist_del_init(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/kernel/fork.c 2008-02-08 12:30:47.000000000 -0800
@@ -53,6 +53,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-02-08 12:43:59.000000000 -0800
@@ -26,6 +26,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
unsigned long end;
/* mm's last user has gone, and its about to be pulled down */
+ mmu_notifier_release(mm);
arch_exit_mmap(mm);
lru_add_drain();
--
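For reference, a minimal usage sketch of the registration API introduced by
this patch (driver-side names are hypothetical; per the header comments,
mmap_sem must be held for write around register/unregister, and an RCU grace
period must pass before the notifier structure may be freed):

    #include <linux/mmu_notifier.h>

    static struct mmu_notifier my_notifier;    /* hypothetical driver notifier */

    static void my_attach(struct mm_struct *mm, const struct mmu_notifier_ops *ops)
    {
        my_notifier.ops = ops;

        down_write(&mm->mmap_sem);
        mmu_notifier_register(&my_notifier, mm);
        up_write(&mm->mmap_sem);
        /* Visible to all threads only after an RCU quiescent period. */
    }

    static void my_detach(struct mm_struct *mm)
    {
        down_write(&mm->mmap_sem);
        mmu_notifier_unregister(&my_notifier, mm);
        up_write(&mm->mmap_sem);
        synchronize_rcu();    /* wait before freeing or reusing my_notifier */
    }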
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-05 18:05 ` Andy Whitcroft
2008-02-05 18:17 ` Peter Zijlstra
@ 2008-02-05 18:19 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-02-05 18:19 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Tue, 5 Feb 2008, Andy Whitcroft wrote:
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + rcu_read_lock();
> > + hlist_for_each_entry_safe_rcu(mn, n, t,
> > + &mm->mmu_notifier.head, hlist) {
> > + if (mn->ops->release)
> > + mn->ops->release(mn, mm);
>
> Does this ->release actually release the 'mn' and its associated hlist?
> I see in this thread that this ordering is deemed "use after free" which
> implies so.
Right that was fixed in a later release and discussed extensively later.
See V5.
> I am not sure it makes sense to add a _safe_rcu variant. As I understand
> things a _safe variant is used where we are going to remove the current
It was dropped in V5.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-02-05 18:05 ` Andy Whitcroft
@ 2008-02-05 18:17 ` Peter Zijlstra
2008-02-05 18:19 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2008-02-05 18:17 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt,
steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Tue, 2008-02-05 at 18:05 +0000, Andy Whitcroft wrote:
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + rcu_read_lock();
> > + hlist_for_each_entry_safe_rcu(mn, n, t,
> > + &mm->mmu_notifier.head, hlist) {
> > + if (mn->ops->release)
> > + mn->ops->release(mn, mm);
>
> Does this ->release actually release the 'mn' and its associated hlist?
> I see in this thread that this ordering is deemed "use after free" which
> implies so.
>
> If it does that seems wrong. This is an RCU hlist, therefore the list
> integrity must be maintained through the next grace period in case there
> are parallel readers using the element, in particular its forward
> pointer for traversal.
That is not quite so, list elements must be preserved, not the list
order.
>
> > + hlist_del(&mn->hlist);
>
> For this to be updating the list, you must have some form of "write-side"
> exclusion as these primitives are not "parallel write safe". It would
> be helpful for this routine to state what that write side exclusion is.
Yeah, has been noticed, read on in the thread :-)
> I am not sure it makes sense to add a _safe_rcu variant. As I understand
> things a _safe variant is used where we are going to remove the current
> list element in the middle of a list walk. However the key feature of an
> RCU data structure is that it will always be in a "safe" state until any
> parallel readers have completed. For an hlist this means that the removed
> entry and its forward link must remain valid for as long as there may be
> a parallel reader traversing this list, ie. until the next grace period.
> If this link is valid for the parallel reader, then it must be valid for
> us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
> to cope in the face of entries being unlinked as we traverse the list.
It does make sense, hlist_del_rcu() maintains the fwd reference, but it
does unlink it from the list proper. As long as there is a write side
exclusion around the actual removal as you noted.
rcu_read_lock();
hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) {
        if (foo) {
                /* write-side exclusion around the actual unlink */
                spin_lock(write_lock);
                hlist_del_rcu(pos);
                spin_unlock(write_lock);
        }
}
rcu_read_unlock();
is a safe construct in that the list itself stays a proper list, and
even items that might be caught in the to-be-deleted entries will have a
fwd way out.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
` (3 preceding siblings ...)
2008-01-29 16:07 ` Robin Holt
@ 2008-02-05 18:05 ` Andy Whitcroft
2008-02-05 18:17 ` Peter Zijlstra
2008-02-05 18:19 ` Christoph Lameter
4 siblings, 2 replies; 116+ messages in thread
From: Andy Whitcroft @ 2008-02-05 18:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote:
> Core code for mmu notifiers.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
>
> ---
> include/linux/list.h | 14 ++
> include/linux/mm_types.h | 6 +
> include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/page-flags.h | 10 ++
> kernel/fork.c | 2
> mm/Kconfig | 4
> mm/Makefile | 1
> mm/mmap.c | 2
> mm/mmu_notifier.c | 101 ++++++++++++++++++++
> 9 files changed, 350 insertions(+)
>
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/mm_types.h 2008-01-28 11:35:22.000000000 -0800
> @@ -153,6 +153,10 @@ struct vm_area_struct {
> #endif
> };
>
> +struct mmu_notifier_head {
> + struct hlist_head head;
> +};
> +
> struct mm_struct {
> struct vm_area_struct * mmap; /* list of VMAs */
> struct rb_root mm_rb;
> @@ -219,6 +223,8 @@ struct mm_struct {
> /* aio bits */
> rwlock_t ioctx_list_lock;
> struct kioctx *ioctx_list;
> +
> + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
> };
>
> #endif /* _LINUX_MM_TYPES_H */
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/include/linux/mmu_notifier.h 2008-01-28 11:43:03.000000000 -0800
> @@ -0,0 +1,210 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +/*
> + * MMU notifier
> + *
> + * Notifier functions for hardware and software that establishes external
> + * references to pages of a Linux system. The notifier calls ensure that
> + * the external mappings are removed when the Linux VM removes memory ranges
> + * or individual pages from a process.
> + *
> + * These fall into two classes
> + *
> + * 1. mmu_notifier
> + *
> + * These are callbacks registered with an mm_struct. If mappings are
> + * removed from an address space then callbacks are performed.
> + * Spinlocks must be held in order to walk the reverse maps and the
> + * notifications are performed while the spinlock is held.
> + *
> + *
> + * 2. mmu_rmap_notifier
> + *
> + * Callbacks for subsystems that provide their own rmaps. These
> + * need to walk their own rmaps for a page. The invalidate_page
> + * callback is outside of locks so that we are not in a strictly
> + * atomic context (but we may be in a PF_MEMALLOC context if the
> + * notifier is called from reclaim code) and are able to sleep.
> + * Rmap notifiers need an extra page bit and are only available
> + * on 64 bit platforms. It is up to the subsystem to mark pages
> + * as PageExternalRmap as needed to trigger the callbacks. Pages
> + * must be marked dirty if dirty bits are set in the external
> + * pte.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rcupdate.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier_ops;
> +
> +struct mmu_notifier {
> + struct hlist_node hlist;
> + const struct mmu_notifier_ops *ops;
> +};
> +
> +struct mmu_notifier_ops {
> + /*
> + * Note: The mmu_notifier structure must be released with
> + * call_rcu() since other processors are only guaranteed to
> + * see the changes after a quiescent period.
> + */
> + void (*release)(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +
> + int (*age_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> +
> + void (*invalidate_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> +
> + /*
> + * lock indicates that the function is called under spinlock.
> + */
> + void (*invalidate_range)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int lock);
> +};
> +
> +struct mmu_rmap_notifier_ops;
> +
> +struct mmu_rmap_notifier {
> + struct hlist_node hlist;
> + const struct mmu_rmap_notifier_ops *ops;
> +};
> +
> +struct mmu_rmap_notifier_ops {
> + /*
> + * Called with the page lock held after ptes are modified or removed
> + * so that a subsystem with its own rmap's can remove remote ptes
> + * mapping a page.
> + */
> + void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
> + struct page *page);
> +};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * Must hold the mmap_sem for write.
> + *
> + * RCU is used to traverse the list. A quiescent period needs to pass
> + * before the notifier is guaranteed to be visible to all threads
> + */
> +extern void __mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +/* Will acquire mmap_sem for write*/
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +/*
> + * Will acquire mmap_sem for write.
> + *
> + * A quiescent period needs to pass before the mmu_notifier structure
> + * can be released. mmu_notifier_release() will wait for a quiescent period
> + * after calling the ->release callback. So it is safe to call
> + * mmu_notifier_unregister from the ->release function.
> + */
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +
> +
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +extern int mmu_notifier_age_page(struct mm_struct *mm,
> + unsigned long address);
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> + INIT_HLIST_HEAD(&mnh->head);
> +}
> +
> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + struct mmu_notifier *__mn; \
> + struct hlist_node *__n; \
> + \
> + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> + rcu_read_lock(); \
> + hlist_for_each_entry_rcu(__mn, __n, \
> + &(mm)->mmu_notifier.head, \
> + hlist) \
> + if (__mn->ops->function) \
> + __mn->ops->function(__mn, \
> + mm, \
> + args); \
> + rcu_read_unlock(); \
> + } \
> + } while (0)
> +
> +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
> +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
> +
> +extern struct hlist_head mmu_rmap_notifier_list;
> +
> +#define mmu_rmap_notifier(function, args...) \
> + do { \
> + struct mmu_rmap_notifier *__mrn; \
> + struct hlist_node *__n; \
> + \
> + rcu_read_lock(); \
> + hlist_for_each_entry_rcu(__mrn, __n, \
> + &mmu_rmap_notifier_list, \
> + hlist) \
> + if (__mrn->ops->function) \
> + __mrn->ops->function(__mrn, args); \
> + rcu_read_unlock(); \
> + } while (0);
> +
> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...) \
> + do { \
> + if (0) { \
> + struct mmu_notifier *__mn; \
> + \
> + __mn = (struct mmu_notifier *)(0x00ff); \
> + __mn->ops->function(__mn, mm, args); \
> + }; \
> + } while (0)
> +
> +#define mmu_rmap_notifier(function, args...) \
> + do { \
> + if (0) { \
> + struct mmu_rmap_notifier *__mrn; \
> + \
> + __mrn = (struct mmu_rmap_notifier *)(0x00ff); \
> + __mrn->ops->function(__mrn, args); \
> + } \
> + } while (0);
> +
> +static inline void mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm) {}
> +static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
> + struct mm_struct *mm) {}
> +static inline void mmu_notifier_release(struct mm_struct *mm) {}
> +static inline int mmu_notifier_age_page(struct mm_struct *mm,
> + unsigned long address)
> +{
> + return 0;
> +}
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
> +
> +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> + {}
> +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> + {}
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/page-flags.h 2008-01-28 11:35:22.000000000 -0800
> @@ -105,6 +105,7 @@
> * 64 bit | FIELDS | ?????? FLAGS |
> * 63 32 0
> */
> +#define PG_external_rmap 30 /* Page has external rmap */
> #define PG_uncached 31 /* Page has been mapped as uncached */
> #endif
>
> @@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
> #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
> #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
>
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
> +#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags)
> +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
> +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
> + &(page)->flags)
> +#else
> +#define PageExternalRmap(page) 0
> +#endif
> +
> struct page; /* forward declaration */
>
> extern void cancel_dirty_page(struct page *page, unsigned int account_size);
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/Kconfig 2008-01-28 11:35:22.000000000 -0800
> @@ -193,3 +193,7 @@ config NR_QUICK
> config VIRT_TO_BUS
> def_bool y
> depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> + def_bool y
> + bool "MMU notifier, for paging KVM/RDMA"
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/Makefile 2008-01-28 11:35:22.000000000 -0800
> @@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>
> Index: linux-2.6/mm/mmu_notifier.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/mmu_notifier.c 2008-01-28 11:35:22.000000000 -0800
> @@ -0,0 +1,101 @@
> +/*
> + * linux/mm/mmu_notifier.c
> + *
> + * Copyright (C) 2008 Qumranet, Inc.
> + * Copyright (C) 2008 SGI
> + * Christoph Lameter <clameter@sgi.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n, *t;
> +
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + rcu_read_lock();
> + hlist_for_each_entry_safe_rcu(mn, n, t,
> + &mm->mmu_notifier.head, hlist) {
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
Does this ->release actually release the 'mn' and its associated hlist?
I see in this thread that this ordering is deemed "use after free" which
implies so.
If it does that seems wrong. This is an RCU hlist, therefore the list
integrity must be maintained through the next grace period in case there
are parallel readers using the element, in particular its forward
pointer for traversal.
> + hlist_del(&mn->hlist);
For this to be updating the list, you must have some form of "write-side"
exclusion as these primitives are not "parallel write safe". It would
be helpful for this routine to state what that write side exclusion is.
> + }
> + rcu_read_unlock();
> + synchronize_rcu();
> + }
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->age_page can
> + * unmap the address and return 1 or 0 depending on whether the mapping previously
> + * existed or not.
> + */
> +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n;
> + int young = 0;
> +
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mn, n,
> + &mm->mmu_notifier.head, hlist) {
> + if (mn->ops->age_page)
> + young |= mn->ops->age_page(mn, mm, address);
> + }
> + rcu_read_unlock();
> + }
> +
> + return young;
> +}
> +
> +/*
> + * Note that all notifiers use RCU. The updates are only guaranteed to be
> + * visible to other processes after a RCU quiescent period!
> + */
> +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + down_write(&mm->mmap_sem);
> + __mmu_notifier_register(mn, mm);
> + up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + down_write(&mm->mmap_sem);
> + hlist_del_rcu(&mn->hlist);
> + up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
> +
> +static DEFINE_SPINLOCK(mmu_notifier_list_lock);
> +HLIST_HEAD(mmu_rmap_notifier_list);
> +
> +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +{
> + spin_lock(&mmu_notifier_list_lock);
> + hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
> + spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_register);
> +
> +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +{
> + spin_lock(&mmu_notifier_list_lock);
> + hlist_del_rcu(&mrn->hlist);
> + spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
> +
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/kernel/fork.c 2008-01-28 11:35:22.000000000 -0800
> @@ -51,6 +51,7 @@
> #include <linux/random.h>
> #include <linux/tty.h>
> #include <linux/proc_fs.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/pgtable.h>
> #include <asm/pgalloc.h>
> @@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
> + mmu_notifier_head_init(&mm->mmu_notifier);
> return mm;
> }
> free_mm(mm);
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/mmap.c 2008-01-28 11:37:53.000000000 -0800
> @@ -26,6 +26,7 @@
> #include <linux/mount.h>
> #include <linux/mempolicy.h>
> #include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
> vm_unacct_memory(nr_accounted);
> free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
> tlb_finish_mmu(tlb, 0, end);
> + mmu_notifier_release(mm);
>
> /*
> * Walk the list again, actually closing and freeing it,
> Index: linux-2.6/include/linux/list.h
> ===================================================================
> --- linux-2.6.orig/include/linux/list.h 2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/list.h 2008-01-28 11:35:22.000000000 -0800
> @@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
> ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
> pos = pos->next)
>
> +/**
> + * hlist_for_each_entry_safe_rcu - iterate over list of given type
> + * @tpos: the type * to use as a loop cursor.
> + * @pos: the &struct hlist_node to use as a loop cursor.
> + * @n: temporary pointer
> + * @head: the head for your list.
> + * @member: the name of the hlist_node within the struct.
> + */
> +#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \
> + for (pos = (head)->first; \
> + rcu_dereference(pos) && ({ n = pos->next; 1;}) && \
> + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
> + pos = n)
> +
> #else
> #warning "don't include kernel headers in userspace"
> #endif /* __KERNEL__ */
I am not sure it makes sense to add a _safe_rcu variant. As I understand
things a _safe variant is used where we are going to remove the current
list element in the middle of a list walk. However the key feature of an
RCU data structure is that it will always be in a "safe" state until any
parallel readers have completed. For an hlist this means that the removed
entry and its forward link must remain valid for as long as there may be
a parallel reader traversing this list, ie. until the next grace period.
If this link is valid for the parallel reader, then it must be valid for
us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
to cope in the face of entries being unlinked as we traverse the list.
-apw
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 23:38 ` Andrea Arcangeli
@ 2008-01-30 23:55 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-30 23:55 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Jack Steiner, Avi Kivity, Izik Eidus, Nick Piggin,
kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Thu, 31 Jan 2008, Andrea Arcangeli wrote:
> > I think Andrea's original concept of the lock in the mmu_notifier_head
> > structure was the best. I agree with him that it should be a spinlock
> > instead of the rw_lock.
>
> BTW, I don't see the scalability concern with huge number of tasks:
> the lock is still in the mm, down_write(mm->mmap_sem); one instruction;
> up_write(mm->mmap_sem) is always going to scale worse than
> spin_lock(mm->somethingelse); one instruction;
> spin_unlock(mm->somethingelse).
If we put it elsewhere in the mm then we increase the size of the memory
used in the mm_struct.
> Furthermore if we go this route and we don't rely on implicit
> serialization of all the mmu notifier users against exit_mmap
> (i.e. the mmu notifier user must agree to stop calling
> mmu_notifier_register on a mm after the last mmput) the autodisarming
> feature will likely have to be removed or it can't possibly be safe to
> run mmu_notifier_unregister while mmu_notifier_release runs. With the
> auto-disarming feature, there is no way to safely know if
> mmu_notifier_unregister has to be called or not. I'm ok with removing
> the auto-disarming feature and to have as self-contained-as-possible
> locking. Then mmu_notifier_release can just become the
> invalidate_all_after and invalidate_all, invalidate_all_before.
Hmmmm.. exit_mmap is only called when the last reference is removed
against the mm right? So no tasks are running anymore. No pages are left.
Do we need to serialize at all for mmu_notifier_release?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 22:20 ` Robin Holt
@ 2008-01-30 23:38 ` Andrea Arcangeli
2008-01-30 23:55 ` Christoph Lameter
0 siblings, 1 reply; 116+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 23:38 UTC (permalink / raw)
To: Robin Holt
Cc: Christoph Lameter, Jack Steiner, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Wed, Jan 30, 2008 at 04:20:35PM -0600, Robin Holt wrote:
> On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote:
> > On Wed, 30 Jan 2008, Jack Steiner wrote:
> >
> > > Moving to a different lock solves the problem.
> >
> > Well it gets us back to the issue why we removed the lock. As Robin said
> > before: If it's global then we can have a huge number of tasks contending
> > for the lock on startup of a process with a large number of ranks. The
> > reason to go to mmap_sem was that it was placed in the mm_struct and so we
> > would just have a couple of contentions per mm_struct.
> >
> > I'll be looking for some other way to do this.
>
> I think Andrea's original concept of the lock in the mmu_notifier_head
> structure was the best. I agree with him that it should be a spinlock
> instead of the rw_lock.
BTW, I don't see the scalability concern with huge number of tasks:
the lock is still in the mm, down_write(mm->mmap_sem); one instruction;
up_write(mm->mmap_sem) is always going to scale worse than
spin_lock(mm->somethingelse); one instruction;
spin_unlock(mm->somethingelse).
Furthermore if we go this route and we don't rely on implicit
serialization of all the mmu notifier users against exit_mmap
(i.e. the mmu notifier user must agree to stop calling
mmu_notifier_register on a mm after the last mmput) the autodisarming
feature will likely have to be removed or it can't possibly be safe to
run mmu_notifier_unregister while mmu_notifier_release runs. With the
auto-disarming feature, there is no way to safely know if
mmu_notifier_unregister has to be called or not. I'm ok with removing
the auto-disarming feature and to have as self-contained-as-possible
locking. Then mmu_notifier_release can just become the
invalidate_all_after and invalidate_all, invalidate_all_before.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 19:19 ` Christoph Lameter
@ 2008-01-30 22:20 ` Robin Holt
2008-01-30 23:38 ` Andrea Arcangeli
0 siblings, 1 reply; 116+ messages in thread
From: Robin Holt @ 2008-01-30 22:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Jack Steiner, Andrea Arcangeli, Robin Holt, Avi Kivity,
Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt,
Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman,
Hugh Dickins
On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Jack Steiner wrote:
>
> > Moving to a different lock solves the problem.
>
> Well it gets us back to the issue why we removed the lock. As Robin said
> before: If it's global then we can have a huge number of tasks contending
> for the lock on startup of a process with a large number of ranks. The
> reason to go to mmap_sem was that it was placed in the mm_struct and so we
> would just have a couple of contentions per mm_struct.
>
> I'll be looking for some other way to do this.
I think Andrea's original concept of the lock in the mmu_notifier_head
structure was the best. I agree with him that it should be a spinlock
instead of the rw_lock.
Thanks,
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 17:10 ` Peter Zijlstra
@ 2008-01-30 19:28 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
How about just taking the mmap_sem writelock in release? We have only a
single caller of mmu_notifier_release() in mm/mmap.c and we know that we
are not holding mmap_sem at that point. So just acquire it when needed?
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-30 11:21:57.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-30 11:24:59.000000000 -0800
@@ -18,6 +19,7 @@ void mmu_notifier_release(struct mm_stru
struct hlist_node *n, *t;
if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ down_write(&mm->mmap_sem);
rcu_read_lock();
hlist_for_each_entry_safe_rcu(mn, n, t,
&mm->mmu_notifier.head, hlist) {
@@ -26,6 +28,7 @@ void mmu_notifier_release(struct mm_stru
mn->ops->release(mn, mm);
}
rcu_read_unlock();
+ up_write(&mm->mmap_sem);
synchronize_rcu();
}
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 15:53 ` Jack Steiner
2008-01-30 16:38 ` Andrea Arcangeli
@ 2008-01-30 19:19 ` Christoph Lameter
2008-01-30 22:20 ` Robin Holt
1 sibling, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:19 UTC (permalink / raw)
To: Jack Steiner
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Wed, 30 Jan 2008, Jack Steiner wrote:
> Moving to a different lock solves the problem.
Well it gets us back to the issue why we removed the lock. As Robin said
before: If it's global then we can have a huge number of tasks contending
for the lock on startup of a process with a large number of ranks. The
reason to go to mmap_sem was that it was placed in the mm_struct and so we
would just have a couple of contentions per mm_struct.
I'll be looking for some other way to do this.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 18:02 ` Robin Holt
2008-01-30 19:08 ` Christoph Lameter
@ 2008-01-30 19:14 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:14 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
Ok. So I added the following patch:
---
include/linux/mmu_notifier.h | 1 +
mm/mmu_notifier.c | 12 ++++++++++++
2 files changed, 13 insertions(+)
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h 2008-01-30 11:09:06.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-30 11:10:38.000000000 -0800
@@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini
extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_export_page(struct page *page);
extern struct hlist_head mmu_rmap_notifier_list;
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-30 11:09:01.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-30 11:12:10.000000000 -0800
@@ -99,3 +99,15 @@ void mmu_rmap_notifier_unregister(struct
}
EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
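A usage sketch for the helper above: the page is marked under the page lock,
before the external rmap entry is established, matching the rule stated in the
comment. The subsystem-side names (struct my_rmap, my_rmap_insert) are
hypothetical:

    /* Subsystem's own reverse map and insert routine; not shown here. */
    static void my_rmap_insert(struct my_rmap *rmap, struct page *page,
                               unsigned long remote_handle);

    static int my_export_page(struct my_rmap *rmap, struct page *page,
                              unsigned long remote_handle)
    {
        lock_page(page);
        mmu_rmap_export_page(page);               /* sets PageExternalRmap */
        my_rmap_insert(rmap, page, remote_handle); /* page now visible in driver rmap */
        unlock_page(page);
        return 0;
    }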
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 18:02 ` Robin Holt
@ 2008-01-30 19:08 ` Christoph Lameter
2008-01-30 19:14 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:08 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Wed, 30 Jan 2008, Robin Holt wrote:
> Index: git-linus/mm/mmu_notifier.c
> ===================================================================
> --- git-linus.orig/mm/mmu_notifier.c 2008-01-30 11:43:45.000000000 -0600
> +++ git-linus/mm/mmu_notifier.c 2008-01-30 11:56:08.000000000 -0600
> @@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct
> }
> EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
>
> +void mmu_rmap_export_page(struct page *page)
> +{
> + SetPageExternalRmap(page);
> +}
> +EXPORT_SYMBOL(mmu_rmap_export_page);
Then mmu_rmap_export_page would have to be called before the subsystem
establishes the rmap entry for the page. Could we do all PageExternalRmap
modifications under Pagelock?
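A rough sketch of what that might look like on the subsystem side, assuming
the mmu_rmap_export_page() helper quoted above (struct my_rmap and
my_rmap_insert() are made up purely for illustration):

	/*
	 * Hypothetical driver-side helper: mark the page as externally
	 * mapped while holding the page lock, then add it to the driver's
	 * own reverse map.
	 */
	static void my_export_page(struct my_rmap *rmap, struct page *page)
	{
		lock_page(page);
		mmu_rmap_export_page(page);	/* sets PageExternalRmap under the page lock */
		my_rmap_insert(rmap, page);
		unlock_page(page);
	}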
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-30 15:37 ` Andrea Arcangeli
@ 2008-01-30 18:02 ` Robin Holt
2008-01-30 19:08 ` Christoph Lameter
2008-01-30 19:14 ` Christoph Lameter
1 sibling, 2 replies; 116+ messages in thread
From: Robin Holt @ 2008-01-30 18:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
Back to one of Andrea's points from a couple days ago, I think we still
have a problem with the PageExternalRmap page flag.
If I had two drivers with external rmap implementations, there is no way
I can think of for a simple flag to coordinate a single page being
exported and maintained by the two.
Since the intended use seems to point in the direction that the external
rmap must be maintained consistent with all the pages the driver has
exported, and the driver will already need to handle cases where the page
does not appear in its rmap, I would propose that the setting and clearing
be handled in the mmu_notifier code.
This is the first of two patches. This one is intended as an addition
to patch 1/6. I will post the other shortly under the patch 3/6 thread.
Index: git-linus/include/linux/mmu_notifier.h
===================================================================
--- git-linus.orig/include/linux/mmu_notifier.h 2008-01-30 11:43:45.000000000 -0600
+++ git-linus/include/linux/mmu_notifier.h 2008-01-30 11:44:35.000000000 -0600
@@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini
extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_export_page(struct page *page);
extern struct hlist_head mmu_rmap_notifier_list;
Index: git-linus/mm/mmu_notifier.c
===================================================================
--- git-linus.orig/mm/mmu_notifier.c 2008-01-30 11:43:45.000000000 -0600
+++ git-linus/mm/mmu_notifier.c 2008-01-30 11:56:08.000000000 -0600
@@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct
}
EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+void mmu_rmap_export_page(struct page *page)
+{
+ SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 15:37 ` Andrea Arcangeli
2008-01-30 15:53 ` Jack Steiner
@ 2008-01-30 17:10 ` Peter Zijlstra
2008-01-30 19:28 ` Christoph Lameter
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2008-01-30 17:10 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Wed, 2008-01-30 at 16:37 +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > + struct mmu_notifier *mn;
> > + struct hlist_node *n, *t;
> > +
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + rcu_read_lock();
> > + hlist_for_each_entry_safe_rcu(mn, n, t,
> > + &mm->mmu_notifier.head, hlist) {
> > + hlist_del_rcu(&mn->hlist);
>
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.
Agreed, sorry for this oversight.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 15:53 ` Jack Steiner
@ 2008-01-30 16:38 ` Andrea Arcangeli
2008-01-30 19:19 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 16:38 UTC (permalink / raw)
To: Jack Steiner
Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Wed, Jan 30, 2008 at 09:53:06AM -0600, Jack Steiner wrote:
> That will also resolve the problem we discussed yesterday.
> I want to unregister my mmu_notifier when a GRU segment is
> unmapped. This would not necessarily be at task termination.
My proof that there is something wrong in the smp locking of the
current code is very simple: it can't be right to use
hlist_for_each_entry_safe_rcu and rcu_read_lock inside
mmu_notifier_release, and then to call hlist_del_rcu without any
spinlock or semaphore. If we walk the list with
hlist_for_each_entry_safe_rcu (and not with
hlist_for_each_entry_safe), it means the list _can_ change from under
us, and in turn the hlist_del_rcu must be surrounded by a spinlock or
semaphore too!
If by design the list _can't_ change from under us and calling
hlist_del_rcu was safe w/o locks, then hlist_for_each_entry_safe is
_sure_ enough for mmu_notifier_release, and rcu_read_lock most
certainly can be removed too.
To construct a usage case where the race could trigger, I was thinking of
somebody bumping the mm_count (not mm_users) and registering a notifier
while mmu_notifier_release runs, relying on ->release to know whether it
still has to run mmu_notifier_unregister. However I now started wondering
how it can rely on ->release for that, given ->release is called after
hlist_del_rcu and with the latest changes ->release will also allow the mn
to release itself ;). It's unsafe to call hlist_del_rcu twice (the second
will crash on a poisoned entry).
This starts to make me think we should remove the auto-disarming
feature and require the notifier-user to have its ->release call
mmu_notifier_unregister first, and to free the "mn" inside ->release
too if needed. Or alternatively the notifier-user can bump mm_count and
call mmu_notifier_unregister before calling mmdrop (like KVM could do).
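A minimal sketch of that second alternative (the my_notifier wrapper and the
start/stop helpers are made up for illustration; mmu_notifier_register/
unregister, mm_count and mmdrop are the interfaces actually being discussed):

	struct my_notifier {
		struct mmu_notifier mn;
		struct mm_struct *mm;
	};

	static void my_notifier_start(struct my_notifier *m, struct mm_struct *mm)
	{
		m->mm = mm;
		atomic_inc(&mm->mm_count);	/* pin the mm_struct, not the address space */
		mmu_notifier_register(&m->mn, mm);
	}

	static void my_notifier_stop(struct my_notifier *m)
	{
		/* always tear down explicitly, never rely on auto-disarming */
		mmu_notifier_unregister(&m->mn, m->mm);
		mmdrop(m->mm);			/* drop the mm_count pin */
	}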
Another approach is to simply define mmu_notifier_release as implicitly
serialized by other code design, with a real lock (not RCU) protecting the
whole register/unregister operations, so as to guarantee the notifier list
can't change from under us while mmu_notifier_release runs. If we go this
route, yes, the auto-disarming hlist_del can be kept and the current code
would have been safe, but to avoid confusion mmu_notifier_release should
become this:
void mmu_notifier_release(struct mm_struct *mm)
{
struct mmu_notifier *mn;
struct hlist_node *n, *t;
if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
hlist_for_each_entry_safe(mn, n, t,
&mm->mmu_notifier.head, hlist) {
hlist_del(&mn->hlist);
if (mn->ops->release)
mn->ops->release(mn, mm);
}
}
}
> However, the mmap_sem is already held for write by the core
> VM at the point I would call the unregister function.
> Currently, there is no __mmu_notifier_unregister() defined.
>
> Moving to a different lock solves the problem.
Unless mmu_notifier_release becomes like the above and we rely on the
user of the mmu notifiers to implement a high-level external lock (in which
case we definitely forbid bumping the mm_count of the mm and calling
register/unregister while mmu_notifier_release could run), 1) moving to a
different lock and 2) removing the auto-disarming hlist_del_rcu from
mmu_notifier_release sounds like the only possible SMP-safe way.
As far as KVM is concerned, mmu_notifier_release could be changed to
the version I wrote above and everything should be ok. For KVM the
mm_count bump is done by the task that also holds an mm_user, so when
exit_mmap runs I don't think the list could possibly change anymore.
Anyway those are details that can be perfected after mainline merging,
so this isn't something to worry about too much right now. My idea is
to keep working to perfect it while I hope progress is being made by
Christoph to merge the mmu notifiers V3 patchset in mainline ;).
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 15:37 ` Andrea Arcangeli
@ 2008-01-30 15:53 ` Jack Steiner
2008-01-30 16:38 ` Andrea Arcangeli
2008-01-30 19:19 ` Christoph Lameter
2008-01-30 17:10 ` Peter Zijlstra
1 sibling, 2 replies; 116+ messages in thread
From: Jack Steiner @ 2008-01-30 15:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
On Wed, Jan 30, 2008 at 04:37:49PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > + struct mmu_notifier *mn;
> > + struct hlist_node *n, *t;
> > +
> > + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > + rcu_read_lock();
> > + hlist_for_each_entry_safe_rcu(mn, n, t,
> > + &mm->mmu_notifier.head, hlist) {
> > + hlist_del_rcu(&mn->hlist);
>
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.
Agree.
That will also resolve the problem we discussed yesterday.
I want to unregister my mmu_notifier when a GRU segment is
unmapped. This would not necessarily be at task termination.
However, the mmap_sem is already held for write by the core
VM at the point I would call the unregister function.
Currently, there is no __mmu_notifier_unregister() defined.
Moving to a different lock solves the problem.
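For reference, such a helper would presumably just mirror the existing
__mmu_notifier_register(); a sketch only, it is not part of the posted patch:

	/*
	 * Hypothetical counterpart to __mmu_notifier_register(), for callers
	 * that already hold mmap_sem for write.
	 */
	void __mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		hlist_del_rcu(&mn->hlist);
	}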
-- jack
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-30 15:37 ` Andrea Arcangeli
2008-01-30 15:53 ` Jack Steiner
2008-01-30 17:10 ` Peter Zijlstra
2008-01-30 18:02 ` Robin Holt
1 sibling, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 15:37 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n, *t;
> +
> + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> + rcu_read_lock();
> + hlist_for_each_entry_safe_rcu(mn, n, t,
> + &mm->mmu_notifier.head, hlist) {
> + hlist_del_rcu(&mn->hlist);
This will race and kernel crash against mmu_notifier_register in
SMP. You should resurrect the per-mmu_notifier_head lock in my last
patch (except it can be converted from a rwlock_t to a regular
spinlock_t) and drop the mmap_sem from
mmu_notifier_register/unregister.
^ permalink raw reply [flat|nested] 116+ messages in thread
* [patch 1/6] mmu_notifier: Core code
2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
@ 2008-01-30 2:29 ` Christoph Lameter
2008-01-30 15:37 ` Andrea Arcangeli
2008-01-30 18:02 ` Robin Holt
0 siblings, 2 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-30 2:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 15337 bytes --]
Core code for mmu notifiers.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---
include/linux/list.h | 14 ++
include/linux/mm_types.h | 6 +
include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 10 ++
kernel/fork.c | 2
mm/Kconfig | 4
mm/Makefile | 1
mm/mmap.c | 2
mm/mmu_notifier.c | 101 ++++++++++++++++++++
9 files changed, 350 insertions(+)
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h 2008-01-29 16:56:36.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
#endif
};
+struct mmu_notifier_head {
+ struct hlist_head head;
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-29 16:56:36.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * These are callbacks registered with an mm_struct. If mappings are
+ * removed from an address space then callbacks are performed.
+ * Spinlocks must be held in order to walk the reverse maps and the
+ * notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ * Callbacks for subsystems that provide their own rmaps. These
+ * need to walk their own rmaps for a page. The invalidate_page
+ * callback is outside of locks so that we are not in a strictly
+ * atomic context (but we may be in a PF_MEMALLOC context if the
+ * notifier is called from reclaim code) and are able to sleep.
+ * Rmap notifiers need an extra page bit and are only available
+ * on 64 bit platforms. It is up to the subsystem to mark pages
+ * as PageExternalRmap as needed to trigger the callbacks. Pages
+ * must be marked dirty if dirty bits are set in the external
+ * pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * Note: The mmu_notifier structure must be released with
+ * call_rcu() since other processors are only guaranteed to
+ * see the changes after a quiescent period.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * lock indicates that the function is called under spinlock.
+ */
+ void (*invalidate_range)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int lock);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+ struct hlist_node hlist;
+ const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+ /*
+ * Called with the page lock held after ptes are modified or removed
+ * so that a subsystem with its own rmap's can remove remote ptes
+ * mapping a page.
+ */
+ void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+ struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/* Will acquire mmap_sem for write*/
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ struct mmu_rmap_notifier *__mrn; \
+ struct hlist_node *__n; \
+ \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mrn, __n, \
+ &mmu_rmap_notifier_list, \
+ hlist) \
+ if (__mrn->ops->function) \
+ __mrn->ops->function(__mrn, args); \
+ rcu_read_unlock(); \
+ } while (0);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ }; \
+ } while (0)
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ if (0) { \
+ struct mmu_rmap_notifier *__mrn; \
+ \
+ __mrn = (struct mmu_rmap_notifier *)(0x00ff); \
+ __mrn->ops->function(__mrn, args); \
+ } \
+ } while (0);
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+ {}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+ {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h 2008-01-29 16:56:36.000000000 -0800
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_external_rmap 30 /* Page has external rmap */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif
@@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+ &(page)->flags)
+#else
+#define PageExternalRmap(page) 0
+#endif
+
struct page; /* forward declaration */
extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/Kconfig 2008-01-29 16:56:36.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/Makefile 2008-01-29 16:56:36.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c 2008-01-29 16:57:26.000000000 -0800
@@ -0,0 +1,101 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *t;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_safe_rcu(mn, n, t,
+ &mm->mmu_notifier.head, hlist) {
+ hlist_del_rcu(&mn->hlist);
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ }
+ rcu_read_unlock();
+ synchronize_rcu();
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ down_write(&mm->mmap_sem);
+ __mmu_notifier_register(mn, mm);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ down_write(&mm->mmap_sem);
+ hlist_del_rcu(&mn->hlist);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_del_rcu(&mrn->hlist);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/kernel/fork.c 2008-01-29 16:56:36.000000000 -0800
@@ -52,6 +52,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-01-29 16:56:36.000000000 -0800
@@ -26,6 +26,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);
/*
* Walk the list again, actually closing and freeing it,
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h 2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/list.h 2008-01-29 16:56:36.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
pos = pos->next)
+/**
+ * hlist_for_each_entry_safe_rcu - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_node to use as a loop cursor.
+ * @n: temporary pointer
+ * @head: the head for your list.
+ * @member: the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \
+ for (pos = (head)->first; \
+ rcu_dereference(pos) && ({ n = pos->next; 1;}) && \
+ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = n)
+
#else
#warning "don't include kernel headers in userspace"
#endif /* __KERNEL__ */
--
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-29 19:49 ` Christoph Lameter
@ 2008-01-29 20:41 ` Avi Kivity
0 siblings, 0 replies; 116+ messages in thread
From: Avi Kivity @ 2008-01-29 20:41 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
>
>>> + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>>> };
>>>
>> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
>> regression (a minor one though).
>>
>
> Andrew does not like #ifdefs and it makes it possible to verify calling
> conventions if !CONFIG_MMU_NOTIFIER.
>
>
You could define mmu_notifier_head as an empty struct in that case.
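Something along these lines; a sketch of the suggestion, not code from the
posted patch:

	#ifdef CONFIG_MMU_NOTIFIER
	struct mmu_notifier_head {
		struct hlist_head head;
	};
	#else
	struct mmu_notifier_head {
	};	/* zero-sized when notifiers are compiled out */
	#endif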
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-29 13:59 ` Andrea Arcangeli
2008-01-29 14:34 ` Andrea Arcangeli
@ 2008-01-29 19:49 ` Christoph Lameter
2008-01-29 20:41 ` Avi Kivity
1 sibling, 1 reply; 116+ messages in thread
From: Christoph Lameter @ 2008-01-29 19:49 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
> > + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
> > };
>
> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
> regression (a minor one though).
Andrew does not like #ifdefs and it makes it possible to verify calling
conventions if !CONFIG_MMU_NOTIFIER.
> It's out of my reach how you can be ok with lock=1. You said you have
> to block, if you can deal with lock=1 once, why can't you deal with
> lock=1 _always_?
Not sure yet. We may have to do more in that area. Need to have feedback
from Robin.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
` (2 preceding siblings ...)
2008-01-29 13:59 ` Andrea Arcangeli
@ 2008-01-29 16:07 ` Robin Holt
2008-02-05 18:05 ` Andy Whitcroft
4 siblings, 0 replies; 116+ messages in thread
From: Robin Holt @ 2008-01-29 16:07 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
I am going to separate my comments into individual replies to help
reduce the chance they are lost.
> +void mmu_notifier_release(struct mm_struct *mm)
...
> + hlist_for_each_entry_safe_rcu(mn, n, t,
> + &mm->mmu_notifier.head, hlist) {
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
> + hlist_del(&mn->hlist);
This is a use-after-free issue. The hlist_del_rcu needs to be done before
the callout as the structure containing the mmu_notifier structure will
need to be freed from within the ->release callout.
Thanks,
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-29 13:59 ` Andrea Arcangeli
@ 2008-01-29 14:34 ` Andrea Arcangeli
2008-01-29 19:49 ` Christoph Lameter
1 sibling, 0 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 14:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Tue, Jan 29, 2008 at 02:59:14PM +0100, Andrea Arcangeli wrote:
> The down_write is garbage. The caller should put it around
> mmu_notifier_register if something. The same way the caller should
> call synchronize_rcu after mmu_notifier_register if it needs
> synchronous behavior from the notifiers. The default version of
> mmu_notifier_register shouldn't be cluttered with unnecessary locking.
Oops, my spinlock was gone from the notifier head... so the above
comment is wrong, sorry! I thought down_write was needed to serialize
against some _external_ event, not to serialize the list updates in
place of my explicit lock. The critical section is so small that a
semaphore is the wrong locking choice, that's why I assumed it was for
an external event. Anyway RCU won't be optimal for a huge flood of
register/unregister, I agree the down_write shouldn't create much
contention and it saves 4 bytes from each mm_struct, and we can always
change it to a proper spinlock later if needed.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-28 22:06 ` Christoph Lameter
2008-01-29 0:05 ` Robin Holt
@ 2008-01-29 13:59 ` Andrea Arcangeli
2008-01-29 14:34 ` Andrea Arcangeli
2008-01-29 19:49 ` Christoph Lameter
2008-01-29 16:07 ` Robin Holt
2008-02-05 18:05 ` Andy Whitcroft
4 siblings, 2 replies; 116+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 13:59 UTC (permalink / raw)
To: Christoph Lameter
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote:
> +struct mmu_notifier_head {
> + struct hlist_head head;
> +};
> +
> struct mm_struct {
> struct vm_area_struct * mmap; /* list of VMAs */
> struct rb_root mm_rb;
> @@ -219,6 +223,8 @@ struct mm_struct {
> /* aio bits */
> rwlock_t ioctx_list_lock;
> struct kioctx *ioctx_list;
> +
> + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
> };
Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
regression (a minor one though).
> + /*
> + * lock indicates that the function is called under spinlock.
> + */
> + void (*invalidate_range)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + int lock);
> +};
It's out of my reach how you can be ok with lock=1. You said you have
to block, if you can deal with lock=1 once, why can't you deal with
lock=1 _always_?
> +/*
> + * Note that all notifiers use RCU. The updates are only guaranteed to be
> + * visible to other processes after a RCU quiescent period!
> + */
> +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + down_write(&mm->mmap_sem);
> + __mmu_notifier_register(mn, mm);
> + up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
The down_write is garbage. The caller should put it around
mmu_notifier_register if something. The same way the caller should
call synchronize_rcu after mmu_notifier_register if it needs
synchronous behavior from the notifiers. The default version of
mmu_notifier_register shouldn't be cluttered with unnecessary locking.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-29 0:05 ` Robin Holt
@ 2008-01-29 1:19 ` Christoph Lameter
0 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-29 1:19 UTC (permalink / raw)
To: Robin Holt
Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
On Mon, 28 Jan 2008, Robin Holt wrote:
> USE_AFTER_FREE!!! I made this same comment as well as other relevant
> comments last week.
Must have slipped somehow. Patch needs to be applied after the rcu fix.
Please repeat the other relevant comments if they are still relevant.... I
thought I had worked through them.
mmu_notifier_release: remove mmu_notifier struct from list before calling ->release
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/mmu_notifier.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-28 17:17:05.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-28 17:17:10.000000000 -0800
@@ -21,9 +21,9 @@ void mmu_notifier_release(struct mm_stru
rcu_read_lock();
hlist_for_each_entry_safe_rcu(mn, n, t,
&mm->mmu_notifier.head, hlist) {
+ hlist_del_rcu(&mn->hlist);
if (mn->ops->release)
mn->ops->release(mn, mm);
- hlist_del_rcu(&mn->hlist);
}
rcu_read_unlock();
synchronize_rcu();
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-28 22:06 ` Christoph Lameter
@ 2008-01-29 0:05 ` Robin Holt
2008-01-29 1:19 ` Christoph Lameter
2008-01-29 13:59 ` Andrea Arcangeli
` (2 subsequent siblings)
4 siblings, 1 reply; 116+ messages in thread
From: Robin Holt @ 2008-01-29 0:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins
> +void mmu_notifier_release(struct mm_struct *mm)
...
> + hlist_for_each_entry_safe_rcu(mn, n, t,
> + &mm->mmu_notifier.head, hlist) {
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
> + hlist_del(&mn->hlist);
USE_AFTER_FREE!!! I made this same comment as well as other relevant
comments last week.
Robin
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [patch 1/6] mmu_notifier: Core code
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-28 22:06 ` Christoph Lameter
2008-01-29 0:05 ` Robin Holt
` (3 subsequent siblings)
4 siblings, 0 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-28 22:06 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
mmu core: Need to use hlist_del_rcu
Wrong type of list del in mmu_notifier_release()
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
mm/mmu_notifier.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c 2008-01-28 14:02:18.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c 2008-01-28 14:02:30.000000000 -0800
@@ -23,7 +23,7 @@ void mmu_notifier_release(struct mm_stru
&mm->mmu_notifier.head, hlist) {
if (mn->ops->release)
mn->ops->release(mn, mm);
- hlist_del(&mn->hlist);
+ hlist_del_rcu(&mn->hlist);
}
rcu_read_unlock();
synchronize_rcu();
^ permalink raw reply [flat|nested] 116+ messages in thread
* [patch 1/6] mmu_notifier: Core code
2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
2008-01-28 22:06 ` Christoph Lameter
` (4 more replies)
0 siblings, 5 replies; 116+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
linux-mm, daniel.blueman, Hugh Dickins
[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 15333 bytes --]
Core code for mmu notifiers.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---
include/linux/list.h | 14 ++
include/linux/mm_types.h | 6 +
include/linux/mmu_notifier.h | 210 +++++++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 10 ++
kernel/fork.c | 2
mm/Kconfig | 4
mm/Makefile | 1
mm/mmap.c | 2
mm/mmu_notifier.c | 101 ++++++++++++++++++++
9 files changed, 350 insertions(+)
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h 2008-01-28 11:35:22.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
#endif
};
+struct mmu_notifier_head {
+ struct hlist_head head;
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+
+ struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h 2008-01-28 11:43:03.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * These are callbacks registered with an mm_struct. If mappings are
+ * removed from an address space then callbacks are performed.
+ * Spinlocks must be held in order to walk the reverse maps and the
+ * notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ * Callbacks for subsystems that provide their own rmaps. These
+ * need to walk their own rmaps for a page. The invalidate_page
+ * callback is outside of locks so that we are not in a strictly
+ * atomic context (but we may be in a PF_MEMALLOC context if the
+ * notifier is called from reclaim code) and are able to sleep.
+ * Rmap notifiers need an extra page bit and are only available
+ * on 64 bit platforms. It is up to the subsystem to mark pages
+ * as PageExternalRmap as needed to trigger the callbacks. Pages
+ * must be marked dirty if dirty bits are set in the external
+ * pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+ struct hlist_node hlist;
+ const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+ /*
+ * Note: The mmu_notifier structure must be released with
+ * call_rcu() since other processors are only guaranteed to
+ * see the changes after a quiescent period.
+ */
+ void (*release)(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+ int (*age_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ void (*invalidate_page)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address);
+
+ /*
+ * lock indicates that the function is called under spinlock.
+ */
+ void (*invalidate_range)(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ int lock);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+ struct hlist_node hlist;
+ const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+ /*
+ * Called with the page lock held after ptes are modified or removed
+ * so that a subsystem with its own rmap's can remove remote ptes
+ * mapping a page.
+ */
+ void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+ struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/* Will acquire mmap_sem for write*/
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+ INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ struct mmu_notifier *__mn; \
+ struct hlist_node *__n; \
+ \
+ if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mn, __n, \
+ &(mm)->mmu_notifier.head, \
+ hlist) \
+ if (__mn->ops->function) \
+ __mn->ops->function(__mn, \
+ mm, \
+ args); \
+ rcu_read_unlock(); \
+ } \
+ } while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ struct mmu_rmap_notifier *__mrn; \
+ struct hlist_node *__n; \
+ \
+ rcu_read_lock(); \
+ hlist_for_each_entry_rcu(__mrn, __n, \
+ &mmu_rmap_notifier_list, \
+ hlist) \
+ if (__mrn->ops->function) \
+ __mrn->ops->function(__mrn, args); \
+ rcu_read_unlock(); \
+ } while (0);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...) \
+ do { \
+ if (0) { \
+ struct mmu_notifier *__mn; \
+ \
+ __mn = (struct mmu_notifier *)(0x00ff); \
+ __mn->ops->function(__mn, mm, args); \
+ }; \
+ } while (0)
+
+#define mmu_rmap_notifier(function, args...) \
+ do { \
+ if (0) { \
+ struct mmu_rmap_notifier *__mrn; \
+ \
+ __mrn = (struct mmu_rmap_notifier *)(0x00ff); \
+ __mrn->ops->function(__mrn, args); \
+ } \
+ } while (0);
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+ struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+ unsigned long address)
+{
+ return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+ {}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+ {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h 2008-01-28 11:35:22.000000000 -0800
@@ -105,6 +105,7 @@
* 64 bit | FIELDS | ?????? FLAGS |
* 63 32 0
*/
+#define PG_external_rmap 30 /* Page has external rmap */
#define PG_uncached 31 /* Page has been mapped as uncached */
#endif
@@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page) test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+ &(page)->flags)
+#else
+#define PageExternalRmap(page) 0
+#endif
+
struct page; /* forward declaration */
extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/Kconfig 2008-01-28 11:35:22.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+ def_bool y
+ bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/Makefile 2008-01-28 11:35:22.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c 2008-01-28 11:35:22.000000000 -0800
@@ -0,0 +1,101 @@
+/*
+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ * Christoph Lameter <clameter@sgi.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n, *t;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_safe_rcu(mn, n, t,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->release)
+ mn->ops->release(mn, mm);
+ hlist_del(&mn->hlist);
+ }
+ rcu_read_unlock();
+ synchronize_rcu();
+ }
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+ struct mmu_notifier *mn;
+ struct hlist_node *n;
+ int young = 0;
+
+ if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(mn, n,
+ &mm->mmu_notifier.head, hlist) {
+ if (mn->ops->age_page)
+ young |= mn->ops->age_page(mn, mm, address);
+ }
+ rcu_read_unlock();
+ }
+
+ return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ down_write(&mm->mmap_sem);
+ __mmu_notifier_register(mn, mm);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+ down_write(&mm->mmap_sem);
+ hlist_del_rcu(&mn->hlist);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+ spin_lock(&mmu_notifier_list_lock);
+ hlist_del_rcu(&mrn->hlist);
+ spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/kernel/fork.c 2008-01-28 11:35:22.000000000 -0800
@@ -51,6 +51,7 @@
#include <linux/random.h>
#include <linux/tty.h>
#include <linux/proc_fs.h>
+#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+ mmu_notifier_head_init(&mm->mmu_notifier);
return mm;
}
free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/mmap.c 2008-01-28 11:37:53.000000000 -0800
@@ -26,6 +26,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
vm_unacct_memory(nr_accounted);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+ mmu_notifier_release(mm);
/*
* Walk the list again, actually closing and freeing it,
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h 2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/list.h 2008-01-28 11:35:22.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
pos = pos->next)
+/**
+ * hlist_for_each_entry_safe_rcu - iterate over list of given type
+ * @tpos: the type * to use as a loop cursor.
+ * @pos: the &struct hlist_node to use as a loop cursor.
+ * @n: temporary pointer
+ * @head: the head for your list.
+ * @member: the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) \
+ for (pos = (head)->first; \
+ rcu_dereference(pos) && ({ n = pos->next; 1;}) && \
+ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+ pos = n)
+
#else
#warning "don't include kernel headers in userspace"
#endif /* __KERNEL__ */
--
^ permalink raw reply [flat|nested] 116+ messages in thread
end of thread, other threads:[~2008-03-05 0:53 UTC | newest]
Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-15 6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
2008-02-15 6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-16 8:45 ` Avi Kivity
2008-02-16 8:56 ` Andrew Morton
2008-02-16 9:21 ` Avi Kivity
2008-02-16 10:41 ` Brice Goglin
2008-02-16 10:58 ` Andrew Morton
2008-02-16 19:31 ` Christoph Lameter
2008-02-16 19:21 ` Christoph Lameter
2008-02-17 3:01 ` Andrea Arcangeli
2008-02-17 12:24 ` Robin Holt
2008-02-17 5:04 ` Doug Maxey
2008-02-18 22:33 ` Roland Dreier
2008-02-15 6:49 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-16 19:26 ` Christoph Lameter
2008-02-19 8:54 ` Nick Piggin
2008-02-19 13:34 ` Andrea Arcangeli
2008-02-27 22:23 ` Christoph Lameter
2008-02-27 23:57 ` Andrea Arcangeli
2008-02-19 23:08 ` Nick Piggin
2008-02-20 1:00 ` Andrea Arcangeli
2008-02-20 3:00 ` Robin Holt
2008-02-20 3:11 ` Nick Piggin
2008-02-20 3:19 ` Robin Holt
2008-02-27 22:39 ` Christoph Lameter
2008-02-28 0:38 ` Andrea Arcangeli
2008-02-27 22:35 ` Christoph Lameter
2008-02-27 22:42 ` Jack Steiner
2008-02-28 0:10 ` Christoph Lameter
2008-02-28 0:11 ` Andrea Arcangeli
2008-02-28 0:14 ` Christoph Lameter
2008-02-28 0:52 ` Andrea Arcangeli
2008-02-28 1:03 ` Christoph Lameter
2008-02-28 1:10 ` Andrea Arcangeli
2008-02-28 18:43 ` Christoph Lameter
2008-02-29 0:55 ` Andrea Arcangeli
2008-02-29 0:59 ` Christoph Lameter
2008-02-29 13:13 ` Andrea Arcangeli
2008-02-29 19:55 ` Christoph Lameter
2008-02-29 20:17 ` Andrea Arcangeli
2008-02-29 21:03 ` Christoph Lameter
2008-02-29 21:23 ` Andrea Arcangeli
2008-02-29 21:29 ` Christoph Lameter
2008-02-29 21:34 ` Christoph Lameter
2008-02-29 21:48 ` Andrea Arcangeli
2008-02-29 22:12 ` Christoph Lameter
2008-02-29 22:41 ` Andrea Arcangeli
2008-02-28 10:53 ` Robin Holt
2008-03-03 5:11 ` Nick Piggin
2008-03-03 19:28 ` Christoph Lameter
2008-03-03 19:50 ` Nick Piggin
2008-03-04 18:58 ` Christoph Lameter
2008-03-05 0:52 ` Nick Piggin
2008-02-15 6:49 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-16 11:07 ` Andrea Arcangeli
2008-02-16 19:22 ` Christoph Lameter
2008-02-16 19:54 ` Avi Kivity
2008-02-19 8:46 ` Nick Piggin
2008-02-19 13:30 ` Andrea Arcangeli
2008-02-18 1:51 ` Nick Piggin
2008-02-15 6:49 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
2008-02-15 6:49 ` [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem) Christoph Lameter
2008-02-16 3:37 ` Andrew Morton
2008-02-16 19:28 ` Christoph Lameter
2008-02-19 23:55 ` Nick Piggin
2008-02-20 3:12 ` Robin Holt
2008-02-20 3:51 ` Nick Piggin
2008-02-20 9:00 ` Robin Holt
2008-02-20 9:05 ` Robin Holt
2008-02-21 4:20 ` Nick Piggin
2008-02-21 10:58 ` Robin Holt
2008-02-26 6:11 ` Nick Piggin
2008-02-26 7:21 ` [ofa-general] " Gleb Natapov
2008-02-26 8:52 ` Nick Piggin
2008-02-26 9:38 ` Gleb Natapov
2008-02-26 9:52 ` KOSAKI Motohiro
2008-02-26 12:28 ` Robin Holt
2008-02-26 12:29 ` Robin Holt
2008-02-27 22:43 ` Christoph Lameter
2008-02-28 0:42 ` Andrea Arcangeli
2008-02-28 1:01 ` Christoph Lameter
2008-02-15 6:49 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
2008-02-16 10:48 ` [PATCH] KVM swapping with MMU Notifiers V7 Andrea Arcangeli
2008-02-16 11:08 ` Andrew Morton
2008-02-18 12:17 ` Andrea Arcangeli
2008-02-16 11:51 ` Robin Holt
2008-02-18 12:35 ` Andrea Arcangeli
-- strict thread matches above, loose matches on Subject: below --
2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-30 2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
2008-01-30 2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-30 15:37 ` Andrea Arcangeli
2008-01-30 15:53 ` Jack Steiner
2008-01-30 16:38 ` Andrea Arcangeli
2008-01-30 19:19 ` Christoph Lameter
2008-01-30 22:20 ` Robin Holt
2008-01-30 23:38 ` Andrea Arcangeli
2008-01-30 23:55 ` Christoph Lameter
2008-01-30 17:10 ` Peter Zijlstra
2008-01-30 19:28 ` Christoph Lameter
2008-01-30 18:02 ` Robin Holt
2008-01-30 19:08 ` Christoph Lameter
2008-01-30 19:14 ` Christoph Lameter
2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-28 22:06 ` Christoph Lameter
2008-01-29 0:05 ` Robin Holt
2008-01-29 1:19 ` Christoph Lameter
2008-01-29 13:59 ` Andrea Arcangeli
2008-01-29 14:34 ` Andrea Arcangeli
2008-01-29 19:49 ` Christoph Lameter
2008-01-29 20:41 ` Avi Kivity
2008-01-29 16:07 ` Robin Holt
2008-02-05 18:05 ` Andy Whitcroft
2008-02-05 18:17 ` Peter Zijlstra
2008-02-05 18:19 ` Christoph Lameter