LKML Archive on lore.kernel.org
* [patch 0/6] MMU Notifiers V6
@ 2008-02-08 22:06 Christoph Lameter
  2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
                   ` (7 more replies)
  0 siblings, 8 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what is tracked by the kernel's rmaps (an external MMU). MMU
notifiers allow us to get rid of page pinning for RDMA and various
other purposes. They also get rid of the broken use of mlock for page pinning.
(mlock really does *not* pin pages....)

More information on the rationale and the technical details can be found in
the first patch and the README provided by that patch in
Documentation/mmu_notifiers.

The known immediate users are

KVM
- Establishes a refcount on the page via get_user_pages().
- External references are called sptes.
- Has page tables to track pages whose refcount was elevated but
  no reverse maps.

GRU
- Simple additional hardware TLB (possibly covering multiple instances of
  Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can
  fall back to get_user_pages().
- No page reference possible since no page status is kept.

XPmem
- Allows use of a process's memory by remote instances of Linux.
- Provides its own reverse mappings to track remote ptes.
- Establishes refcounts on the exported pages.
- Must sleep in order to wait for remote acks of ptes that are being
  cleared.
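
For a GRU/KVM style user, subscribing boils down to filling in an ops
structure and registering it against the mm. The following is only a rough
driver-side sketch based on the API introduced in patch 1; all names other
than the mmu_notifier_* interfaces are made up and the callbacks are left
empty:

	/* Needs <linux/mm.h> and <linux/mmu_notifier.h>. */
	static void example_invalidate_range_begin(struct mmu_notifier *mn,
			struct mm_struct *mm,
			unsigned long start, unsigned long end, int atomic)
	{
		/* Drop external references for [start, end) and hold off
		 * new ones until the matching invalidate_range_end(). */
	}

	static void example_invalidate_range_end(struct mmu_notifier *mn,
			struct mm_struct *mm,
			unsigned long start, unsigned long end, int atomic)
	{
		/* Reenable the establishment of external references. */
	}

	static const struct mmu_notifier_ops example_ops = {
		.invalidate_range_begin	= example_invalidate_range_begin,
		.invalidate_range_end	= example_invalidate_range_end,
		/* .release, .age_page, .invalidate_page as needed */
	};

	static struct mmu_notifier example_notifier = { .ops = &example_ops };

	static void example_attach(struct mm_struct *mm)
	{
		/* mmap_sem must be held for write (see patch 1). */
		down_write(&mm->mmap_sem);
		mmu_notifier_register(&example_notifier, mm);
		up_write(&mm->mmap_sem);
	}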

Andrea's mmu_notifier #4 -> RFC V1

- Merge the subsystem-rmap-based approach with the Linux-rmap-based approach
- Move the Linux-rmap-based notifiers out of the macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range to indicate whether a spinlock
  is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fix up aging and move invalidate_range() in do_wp_page
  and sys_remap_file_pages() after the pte clearing.

V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
  we are the only executing thread. This is also true for invalidate_all() so
  we could drop off the mmu_notifier there early. Use hlist_del_init instead
  of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
  holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip cc list down a bit.
- Drop Peter's new RCU list macro
- Add description to the core patch

V4->V5:
- Provide missing callouts for mremap.
- Provide missing callouts for copy_page_range.
- Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out
  structure contents.
- Get rid of the invalidate_all() callback by moving ->release in place
  of invalidate_all.
- Require holding mmap_sem on register/unregister instead of acquiring it
  ourselves. In some contexts where we want to register/unregister we are
  already holding mmap_sem.
- Split out the rmap support patch so that there is no need to apply
  all patches for KVM and GRU.

V5->V6:
- Provide missing range callouts for mprotect
- Fix do_wp_page control path sequencing
- Clarify locking conventions
- GRU and XPmem confirmed to work with this patchset.
- Provide skeleton code for GRU/KVM type callback and for XPmem type.
- Rework documentation and put it into Documentation/mmu_notifier.

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 18134 bytes --]

MMU notifiers are used for hardware and software that establish
external references to pages managed by the Linux kernel. These
references are page table entries, TLB entries or something else that allows
hardware (such as DMA engines, scatter/gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.

The MMU notifier notifies a device driver that has subscribed to it
when the VM is about to do something with the memory
mapped by that device. The driver must then drop its references for the
indicated memory area. The references may be reestablished later.

The notification scheme is much better than the current scheme, which
avoids the danger of the VM removing externally mapped pages by
mlocking the pages used for RDMA, XPmem etc in memory.

Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of what POSIX
specifies about the role of mlock. Mlock does *not* pin pages in
memory; it just means that the page must not be moved to swap.

Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).
The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.

The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning.

This patch: Core portion

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 Documentation/mmu_notifier/README |   99 +++++++++++++++++++++
 include/linux/mm_types.h          |    7 +
 include/linux/mmu_notifier.h      |  175 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                     |    2 
 mm/Kconfig                        |    4 
 mm/Makefile                       |    1 
 mm/mmap.c                         |    2 
 mm/mmu_notifier.c                 |   76 ++++++++++++++++
 8 files changed, 366 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README	2008-02-08 12:30:47.000000000 -0800
@@ -0,0 +1,99 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establish
+external references to pages managed by the Linux kernel. These
+references are page table entries, TLB entries or something else that allows
+hardware (such as DMA engines, scatter/gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier notifies a device driver that has subscribed to it
+when the VM is about to do something with the memory
+mapped by that device. The driver must then drop its references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current scheme, which
+deals with the danger of the VM removing pages by mlocking the
+pages used for RDMA, XPmem etc in memory.
+
+Mlock causes problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. It is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allows
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example of code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows a device driver to implement its own rmap
+and to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageExternalRmap(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional page flag is only available in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build an mmu_notifier mechanism with rmap capability
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+	Christoph Lameter <clameter@sgi.com>
+
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-02-08 12:30:47.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-02-08 12:35:14.000000000 -0800
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If pages are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * 	The invalidate_range_start/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/* invalidate_page is called from contexts where the pte_lock is held */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legitimate. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() has not yet been completed through a
+	 * corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Stubs that use the parameters they are passed so that the
+ * compiler does not complain about unused variables but still performs
+ * proper parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * The macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-02-08 12:30:47.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-02-08 12:30:47.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-02-08 12:44:24.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after an RCU quiescent period!
+ *
+ * Must hold mmap_sem for write when calling the registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-02-08 12:30:47.000000000 -0800
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-08 12:43:59.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 11235 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or permissions etc change.

If invalidate_range_begin() is called with locks held then we
pass a flag into invalidate_range_begin/end() to indicate that no sleeping is
possible. Locks are only held for truncate and huge pages.

In two cases we use invalidate_range_begin/end to invalidate
single pages because the pair allows holding off new references
(idea by Robin Holt).

do_wp_page(): We hold off new references while we update the pte.

xip_unmap: We are not taking the PageLock so we cannot
use the invalidate_page mmu_rmap_notifier. invalidate_range_begin/end
stands in.
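
To make the pattern concrete, the hunks below bracket the pte-modifying
operations like this (illustrative fragment only; "clear_or_update_ptes"
stands in for the existing VM code such as unmap_vmas(), change_protection()
or move_ptes()):

	mmu_notifier(invalidate_range_begin, mm, start, end, atomic);
	clear_or_update_ptes(mm, start, end);	/* existing VM work */
	mmu_notifier(invalidate_range_end, mm, start, end, atomic);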

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/filemap_xip.c |    5 +++++
 mm/fremap.c      |    3 +++
 mm/hugetlb.c     |    3 +++
 mm/memory.c      |   35 +++++++++++++++++++++++++++++------
 mm/mmap.c        |    2 ++
 mm/mprotect.c    |    3 +++
 mm/mremap.c      |    7 ++++++-
 7 files changed, 51 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-02-08 13:18:58.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-02-08 13:25:22.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_begin, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier(invalidate_range_end, mm, start, start + size, 0);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-02-08 13:22:14.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-02-08 13:25:22.000000000 -0800
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -611,6 +612,9 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_begin, src_mm, addr, end, 0);
+
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
@@ -621,6 +625,11 @@ int copy_page_range(struct mm_struct *ds
 						vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_range_end, src_mm,
+						vma->vm_start, end, 0);
+
 	return 0;
 }
 
@@ -893,13 +902,16 @@ unsigned long zap_page_range(struct vm_a
 	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
+	int atomic = details ? (details->i_mmap_lock != 0) : 0;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, address, end, atomic);
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range_end, mm, address, end, atomic);
 	return end;
 }
 
@@ -1337,7 +1349,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1371,6 +1383,7 @@ int remap_pfn_range(struct vm_area_struc
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1378,6 +1391,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1461,10 +1475,11 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1472,6 +1487,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1612,8 +1628,10 @@ static int do_wp_page(struct mm_struct *
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
 			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
-				goto unlock;
+			if (!pte_same(*page_table, orig_pte)) {
+				pte_unmap_unlock(page_table, ptl);
+				goto check_dirty;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -1629,7 +1647,8 @@ static int do_wp_page(struct mm_struct *
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, entry);
 		ret |= VM_FAULT_WRITE;
-		goto unlock;
+		pte_unmap_unlock(page_table, ptl);
+		goto check_dirty;
 	}
 
 	/*
@@ -1651,6 +1670,8 @@ gotten:
 	if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
 		goto oom_free_new;
 
+	mmu_notifier(invalidate_range_begin, mm, address,
+				address + PAGE_SIZE, 0);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1689,8 +1710,10 @@ gotten:
 		page_cache_release(new_page);
 	if (old_page)
 		page_cache_release(old_page);
-unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range_end, mm,
+				address, address + PAGE_SIZE, 0);
+check_dirty:
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-08 13:25:21.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-08 13:25:22.000000000 -0800
@@ -1748,11 +1748,13 @@ static void unmap_region(struct mm_struc
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-02-08 13:22:14.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-02-08 13:25:22.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -753,6 +754,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_begin, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -773,6 +775,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range_end, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-02-08 13:22:14.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-02-08 13:25:22.000000000 -0800
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
 
@@ -190,6 +191,8 @@ __xip_unmap (struct address_space * mapp
 		address = vma->vm_start +
 			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+		mmu_notifier(invalidate_range_begin, mm, address,
+					address + PAGE_SIZE, 1);
 		pte = page_check_address(page, mm, address, &ptl);
 		if (pte) {
 			/* Nuke the page table entry. */
@@ -201,6 +204,8 @@ __xip_unmap (struct address_space * mapp
 			pte_unmap_unlock(pte, ptl);
 			page_cache_release(page);
 		}
+		mmu_notifier(invalidate_range_end, mm,
+				address, address + PAGE_SIZE, 1);
 	}
 	spin_unlock(&mapping->i_mmap_lock);
 }
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c	2008-02-08 13:18:58.000000000 -0800
+++ linux-2.6/mm/mremap.c	2008-02-08 13:25:22.000000000 -0800
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -124,12 +125,15 @@ unsigned long move_page_tables(struct vm
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len)
 {
-	unsigned long extent, next, old_end;
+	unsigned long extent, next, old_start, old_end;
 	pmd_t *old_pmd, *new_pmd;
 
+	old_start = old_addr;
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
+	mmu_notifier(invalidate_range_begin, vma->vm_mm,
+					old_addr, old_end, 0);
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -150,6 +154,7 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 				new_vma, new_pmd, new_addr);
 	}
+	mmu_notifier(invalidate_range_end, vma->vm_mm, old_start, old_end, 0);
 
 	return len + old_addr - old_end;	/* how much done */
 }
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c	2008-02-08 13:18:58.000000000 -0800
+++ linux-2.6/mm/mprotect.c	2008-02-08 13:25:22.000000000 -0800
@@ -21,6 +21,7 @@
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
@@ -198,10 +199,12 @@ success:
 		dirty_accountable = 1;
 	}
 
+	mmu_notifier(invalidate_range_begin, mm, start, end, 0);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	mmu_notifier(invalidate_range_end, mm, start, end, 0);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 3/6] mmu_notifier: invalidate_page callbacks
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
  2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-02-08 22:06 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_invalidate_page --]
[-- Type: text/plain, Size: 3026 bytes --]

Two callbacks to remove individual pages as done in rmap code

	invalidate_page()

Called from the inner loop of rmap walks to invalidate pages.

	age_page()

Called for the determination of the page referenced status.

If we do not care about the page referenced status then the age_page callback
may be omitted. PageLock and pte lock are held when either of the
functions is called.
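
For reference, the aging pattern applied in the rmap walks below looks like
this. Note the bitwise | rather than ||: mmu_notifier_age_page() must always
be evaluated so that external mappings are aged even when the Linux pte was
already young.

	if (ptep_clear_flush_young(vma, address, pte) |
	    mmu_notifier_age_page(mm, address))
		referenced++;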

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/rmap.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-02-07 16:49:32.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-02-07 17:25:25.000000000 -0800
@@ -49,6 +49,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -287,7 +288,8 @@ static int page_referenced_one(struct pa
 	if (vma->vm_flags & VM_LOCKED) {
 		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+	} else if (ptep_clear_flush_young(vma, address, pte) |
+		   mmu_notifier_age_page(mm, address))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -455,6 +457,7 @@ static int page_mkclean_one(struct page 
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -712,7 +715,8 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) |
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -720,6 +724,7 @@ static int try_to_unmap_one(struct page 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -844,12 +849,14 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte) |
+		    mmu_notifier_age_page(mm, address))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-02-08 22:06 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem) Christoph Lameter
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_skeleton --]
[-- Type: text/plain, Size: 7362 bytes --]

This is example code for a simple device driver interface to unmap
pages that were externally mapped.

Locking is simple: a single lock is used to protect the
device driver's data structures as well as a counter that tracks the
active invalidates on a single address space.

The invalidation of external ptes must be possible with code that does
not require sleeping. The lock is taken for all driver operations on
the mmu that the driver manages. Locking could be made more sophisticated
but I think this is going to be okay for most uses.
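
As a usage illustration (not part of the patch), a hypothetical caller of the
populate path in the skeleton could retry while a range invalidate is in
progress; m, vma, address, atomic and write come from the driver's fault
handling context:

	struct page *page;

	do {
		page = my_mmu_populate_page(m, vma, address, atomic, write);
	} while (!atomic && IS_ERR(page) && PTR_ERR(page) == -EAGAIN);
	/* On success the external pte was inserted via my_mmu_insert_page();
	 * otherwise the external fault has to fail or be retried later. */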

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/mmu_notifier/skeleton.c |  239 ++++++++++++++++++++++++++++++++++
 1 file changed, 239 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/skeleton.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/skeleton.c	2008-02-08 13:14:16.000000000 -0800
@@ -0,0 +1,239 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+
+/*
+ * Skeleton for an mmu notifier without rmap callbacks and no need to sleep
+ * during invalidate_page().
+ *
+ * (C) 2008 Silicon Graphics, Inc.
+ * 		Christoph Lameter <clameter@sgi.com>
+ *
+ * Note that the locking is fairly basic. One can add various optimizations
+ * here and there. There is a single lock for an address space which should be
+ * satisfactory for most cases. If not then the lock can be split like the
+ * pte_lock in Linux. It is most likely best to place the locks in the
+ * page table structure or into whatever the external mmu uses to
+ * track the mappings.
+ */
+
+struct my_mmu {
+	/* MMU notifier specific fields */
+	struct mmu_notifier notifier;
+	spinlock_t lock;	/* Protects counter and individual zaps */
+	int invalidates;	/* Number of active range_invalidates */
+};
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_insert_page(struct my_mmu *m,
+		unsigned long address, unsigned long pfn)
+{
+	/* Must be provided */
+	printk(KERN_INFO "insert page %p address=%lx pfn=%ld\n",
+							m, address, pfn);
+}
+
+/*
+ * Called with m->lock held (optional but usually required to
+ * protect data structures of the driver).
+ */
+static void my_mmu_zap_page(struct my_mmu *m, unsigned long address)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap page %p address=%lx\n", m, address);
+}
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_zap_range(struct my_mmu *m,
+	unsigned long start, unsigned long end, int atomic)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap range %p address=%lx-%lx atomic=%d\n",
+						m, start, end, atomic);
+}
+
+/*
+ * Zap an individual page.
+ *
+ * The page must be locked and a refcount on the page must
+ * be held when this function is called. The page lock is also
+ * acquired when new references are established and the
+ * page lock effectively takes on the role of synchronization.
+ *
+ * The m->lock is only taken to preserve the integrity of the
+ * driver's data structures since we may also race with
+ * invalidate_range() which will likely access the same mmu
+ * control structures.
+ * m->lock is therefore optional here.
+ */
+static void my_mmu_invalidate_page(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long address)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	spin_lock(&m->lock);
+	my_mmu_zap_page(m, address);
+	spin_unlock(&m->lock);
+}
+
+/*
+ * Increment and decrement of the number of range invalidates
+ */
+static inline void inc_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates++;
+	spin_unlock(&m->lock);
+}
+
+static inline void dec_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates--;
+	spin_unlock(&m->lock);
+}
+
+static void my_mmu_invalidate_range_begin(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	inc_active(m);	/* Holds off new references */
+	my_mmu_zap_range(m, start, end, atomic);
+}
+
+static void my_mmu_invalidate_range_end(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	dec_active(m);		/* Enables new references */
+}
+
+/*
+ * Populate a page.
+ *
+ * A return value of -EAGAIN means please retry this operation.
+ *
+ * Acquisition of mmap_sem can be omitted if the caller already holds
+ * the semaphore.
+ */
+struct page *my_mmu_populate_page(struct my_mmu *m,
+	struct vm_area_struct *vma,
+	unsigned long address, int atomic, int write)
+{
+	struct page *page = ERR_PTR(-EAGAIN);
+	int err;
+
+	/* No need to do anything if a range invalidate is running */
+	if (m->invalidates)
+		goto out;
+
+	if (atomic) {
+
+		if (!down_read_trylock(&vma->vm_mm->mmap_sem))
+			goto out;
+
+		/* No concurrent invalidates */
+		page = follow_page(vma, address, FOLL_GET +
+					(write ? FOLL_WRITE : 0));
+
+		up_read(&vma->vm_mm->mmap_sem);
+		if (!page || IS_ERR(page) || TestSetPageLocked(page))
+			goto out;
+
+	} else {
+
+		down_read(&vma->vm_mm->mmap_sem);
+		err = get_user_pages(current, vma->vm_mm, address, 1,
+						write, 1, &page, NULL);
+
+		up_read(&vma->vm_mm->mmap_sem);
+		if (err < 0) {
+			page = ERR_PTR(err);
+			goto out;
+		}
+		lock_page(page);
+
+	}
+
+	/*
+	 * The page is now locked and we are holding a refcount on it.
+	 * So things are tied down. Now we can check the page status.
+	 */
+	if (page_mapped(page)) {
+		/*
+		 * Must take the m->lock here to hold off concurrent
+		 * invalidate_range_b/e. Serialization with invalidate_page()
+		 * occurs because we are holding the page lock.
+		 */
+		spin_lock(&m->lock);
+		if (!m->invalidates)
+			my_mmu_insert_page(m, address, page_to_pfn(page));
+		spin_unlock(&m->lock);
+	}
+	unlock_page(page);
+	put_page(page);
+out:
+	return page;
+}
+
+/*
+ * All other threads accessing this mm_struct must have terminated by now.
+ */
+static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	my_mmu_zap_range(m, 0, TASK_SIZE, 0);
+	kfree(m);
+	printk(KERN_INFO "MMU Notifier detaching\n");
+}
+
+static struct mmu_notifier_ops my_mmu_ops = {
+	my_mmu_release,
+	NULL,			/* No aging function */
+	my_mmu_invalidate_page,
+	my_mmu_invalidate_range_begin,
+	my_mmu_invalidate_range_end
+};
+
+/*
+ * This function must be called to activate callbacks from a process
+ */
+int my_mmu_attach_to_process(struct mm_struct *mm)
+{
+	struct my_mmu *m = kzalloc(sizeof(struct my_mmu), GFP_KERNEL);
+
+	if (!m)
+		return -ENOMEM;
+
+	m->notifier.ops = &my_mmu_ops;
+	spin_lock_init(&m->lock);
+
+	/*
+	 * mmap_sem handling can be omitted if it is guaranteed that
+	 * the context from which my_mmu_attach_to_process is called
+	 * is already holding a writelock on mmap_sem.
+	 */
+	down_write(&mm->mmap_sem);
+	mmu_notifier_register(&m->notifier, mm);
+	up_write(&mm->mmap_sem);
+
+	/*
+	 * RCU sync is expensive but necessary if we need to guarantee
+	 * that multiple threads running on other cpus have seen the
+	 * notifier changes.
+	 */
+	synchronize_rcu();
+	return 0;
+}
+

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem)
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-02-08 22:06 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:06 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_rmap_support --]
[-- Type: text/plain, Size: 8290 bytes --]

These special additional callbacks are required because XPmem (and likely
other mechanisms) do use their own rmap (multiple processes on a series
of remote Linux instances may be accessing the memory of a process).
F.e. XPmem may have to send out notifications to remote Linux instances
and receive confirmation before a page can be freed.

So we handle this like an additional Linux reverse map that is walked after
the existing rmaps have been walked. We leave the walking to the driver, which
is then able to use something other than a spinlock to walk its reverse
maps. So we can actually call the driver without holding spinlocks while
we hold the page lock.

However, we cannot determine the mm_struct that a page belongs to at
that point. The mm_struct can only be determined from the rmaps by the
device driver.

We add another pageflag (PageExternalRmap) that is set if a page has
been remotely mapped (f.e. by a process from another Linux instance).
We can then only perform the callbacks for pages that are actually in
remote use.

Rmap notifiers need an extra page bit and are only available
on 64 bit platforms. This functionality is not available on 32 bit!

A notifier that uses the reverse map callbacks does not need to provide
the invalidate_page() method that is called when locks are held.
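
A rough driver-side sketch of how these interfaces fit together follows. The
xpmem_style_* names are made up for illustration; only the ops structure,
mmu_rmap_notifier_register() and mmu_rmap_export_page() come from this patch:

	static void xpmem_style_invalidate_page(struct mmu_rmap_notifier *mrn,
						struct page *page)
	{
		/* Walk the driver's own rmap and remove all remote ptes
		 * mapping this page. May sleep, f.e. to wait for remote
		 * acknowledgements. */
	}

	static const struct mmu_rmap_notifier_ops example_rmap_ops = {
		.invalidate_page = xpmem_style_invalidate_page,
	};

	static struct mmu_rmap_notifier example_rmap_notifier = {
		.ops = &example_rmap_ops,
	};

	static void xpmem_style_init(void)
	{
		mmu_rmap_notifier_register(&example_rmap_notifier);
	}

	static void xpmem_style_export(struct page *page)
	{
		/* Mark the page before putting it on the driver's rmap so
		 * that the global invalidate_page() callbacks fire for it. */
		lock_page(page);
		mmu_rmap_export_page(page);
		unlock_page(page);
	}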

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mmu_notifier.h |   65 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   11 +++++++
 mm/mmu_notifier.c            |   34 ++++++++++++++++++++++
 mm/rmap.c                    |    9 +++++
 4 files changed, 119 insertions(+)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-02-08 12:35:14.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-02-08 12:44:33.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -296,6 +297,16 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define ClearPageExternalRmap(page) do {} while (0)
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-02-08 12:35:14.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-02-08 12:44:33.000000000 -0800
@@ -23,6 +23,18 @@
  * 	where sleeping is allowed or in atomic contexts. A flag is passed
  * 	to indicate an atomic context.
  *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms.
+ *
  *	Pages must be marked dirty if dirty bits are found to be set in
  *	the external ptes.
  */
@@ -89,6 +101,23 @@ struct mmu_notifier_ops {
 				 int atomic);
 };
 
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmaps can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -139,6 +168,27 @@ static inline void mmu_notifier_head_ini
 		}							\
 	} while (0)
 
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+/* Must hold PageLock */
+extern void mmu_rmap_export_page(struct page *page);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, hlist)		\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 /*
@@ -157,6 +207,16 @@ static inline void mmu_notifier_head_ini
 		};							\
 	} while (0)
 
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0);
+
 static inline void mmu_notifier_register(struct mmu_notifier *mn,
 						struct mm_struct *mm) {}
 static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
@@ -170,6 +230,11 @@ static inline int mmu_notifier_age_page(
 
 static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
 
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-02-08 12:44:24.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-02-08 12:44:33.000000000 -0800
@@ -74,3 +74,37 @@ void mmu_notifier_unregister(struct mmu_
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
+#ifdef CONFIG_64BIT
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
+
+#endif
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-02-08 12:44:30.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-02-08 12:44:33.000000000 -0800
@@ -497,6 +497,10 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+			if (unlikely(PageExternalRmap(page))) {
+				mmu_rmap_notifier(invalidate_page, page);
+				ClearPageExternalRmap(page);
+			}
 			if (page_test_dirty(page)) {
 				page_clear_dirty(page);
 				ret = 1;
@@ -1013,6 +1017,11 @@ int try_to_unmap(struct page *page, int 
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page))) {
+		mmu_rmap_notifier(invalidate_page, page);
+		ClearPageExternalRmap(page);
+	}
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-02-08 22:06 ` [patch 5/6] mmu_notifier: Support for drivers with reverse maps (f.e. for XPmem) Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  2008-02-08 22:23 ` [patch 0/6] MMU Notifiers V6 Andrew Morton
  2008-02-13 14:31 ` Jack Steiner
  7 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_rmap_skeleton --]
[-- Type: text/plain, Size: 7832 bytes --]

The skeleton for the rmap notifier leaves the invalidate_page method of
the mmu_notifier empty and hooks a new invalidate_page callback into the
global chain for mmu_rmap_notifiers.

There are several simplifications in here to avoid making this too complex.
The reverse maps would, for example, need to consist of references to vmas.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/mmu_notifier/skeleton_rmap.c |  265 +++++++++++++++++++++++++++++
 1 file changed, 265 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/skeleton_rmap.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/skeleton_rmap.c	2008-02-08 13:25:28.000000000 -0800
@@ -0,0 +1,265 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+
+/*
+ * Skeleton for an mmu notifier with rmap callbacks and sleeping during
+ * invalidate_page.
+ *
+ * (C) 2008 Silicon Graphics, Inc.
+ * 		Christoph Lameter <clameter@sgi.com>
+ *
+ * Note that the locking is fairly basic. One can add various optimizations
+ * here and there. There is a single lock for an address space which should be
+ * satisfactory for most cases. If not then the lock can be split like the
+ * pte_lock in Linux. It is most likely best to place the locks in the
+ * page table structure or into whatever the external mmu uses to
+ * track the mappings.
+ */
+
+struct my_mmu {
+	/* MMU notifier specific fields */
+	struct mmu_notifier notifier;
+	spinlock_t lock;	/* Protects counter and individual zaps */
+	int invalidates;	/* Number of active range_invalidate */
+
+	/* Rmap support */
+	struct list_head list;	/* rmap list of my_mmu structs */
+	unsigned long base;
+};
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_insert_page(struct my_mmu *m,
+		unsigned long address, unsigned long pfn)
+{
+	/* Must be provided */
+	printk(KERN_INFO "insert page %p address=%lx pfn=%ld\n",
+							m, address, pfn);
+}
+
+/*
+ * Called with m->lock held
+ */
+static void my_mmu_zap_range(struct my_mmu *m,
+	unsigned long start, unsigned long end, int atomic)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap range %p address=%lx-%lx atomic=%d\n",
+						m, start, end, atomic);
+}
+
+/*
+ * Called with m->lock held (optional but usually required to
+ * protect data structures of the driver).
+ */
+static void my_mmu_zap_page(struct my_mmu *m, unsigned long address)
+{
+	/* Must be provided */
+	printk(KERN_INFO "zap page %p address=%lx\n", m, address);
+}
+
+/*
+ * Increment and decrement of the number of range invalidates
+ */
+static inline void inc_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates++;
+	spin_unlock(&m->lock);
+}
+
+static inline void dec_active(struct my_mmu *m)
+{
+	spin_lock(&m->lock);
+	m->invalidates--;
+	spin_unlock(&m->lock);
+}
+
+static void my_mmu_invalidate_range_begin(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	inc_active(m);	/* Holds off new references */
+	my_mmu_zap_range(m, start, end, atomic);
+}
+
+static void my_mmu_invalidate_range_end(struct mmu_notifier *mn,
+	struct mm_struct *mm, unsigned long start, unsigned long end,
+	int atomic)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	dec_active(m);	/* Enables new references */
+}
+
+/*
+ * Populate a page.
+ *
+ * A return value of -EAGAIN means please retry this operation.
+ *
+ * Acquisition of mmap_sem can be omitted if the caller already holds
+ * the semaphore.
+ */
+struct page *my_mmu_populate_page(struct my_mmu *m,
+	struct vm_area_struct *vma,
+	unsigned long address, int write)
+{
+	struct page *page = ERR_PTR(-EAGAIN);
+	int err;
+
+	/*
+	 * No need to do anything if a range invalidate is running
+	 * Could use a wait queue here to avoid returning -EAGAIN.
+	 */
+	if (m->invalidates)
+		goto out;
+
+	down_read(&vma->vm_mm->mmap_sem);
+	err = get_user_pages(current, vma->vm_mm, address, 1,
+						write, 1, &page, NULL);
+
+	up_read(&vma->vm_mm->mmap_sem);
+	if (err < 0) {
+		page = ERR_PTR(err);
+		goto out;
+	}
+	lock_page(page);
+
+	/*
+	 * The page is now locked and we are holding a refcount on it.
+	 * So things are tied down. Now we can check the page status.
+	 */
+	if (page_mapped(page)) {
+		/* Could do some preprocessing here. Can sleep */
+		spin_lock(&m->lock);
+		if (!m->invalidates)
+			my_mmu_insert_page(m, address, page_to_pfn(page));
+		spin_unlock(&m->lock);
+		/* Could do some postprocessing here. Can sleep */
+	}
+	unlock_page(page);
+	put_page(page);
+out:
+	return page;
+}
+
+/*
+ * All other threads accessing this mm_struct must have terminated by now.
+ */
+static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct my_mmu *m = container_of(mn, struct my_mmu, notifier);
+
+	my_mmu_zap_range(m, 0, TASK_SIZE, 0);
+	/* No concurrent processes thus no worries about RCU */
+	list_del(&m->list);
+	kfree(m);
+	printk(KERN_INFO "MMU Notifier terminating\n");
+}
+
+static struct mmu_notifier_ops my_mmu_ops = {
+	my_mmu_release,
+	NULL,		/* No aging function */
+	NULL,		/* No atomic invalidate_page function */
+	my_mmu_invalidate_range_begin,
+	my_mmu_invalidate_range_end
+};
+
+/* Rmap specific fields */
+static LIST_HEAD(my_mmu_list);
+static DECLARE_RWSEM(listlock);
+
+/*
+ * This function must be called to activate callbacks from a process
+ */
+int my_mmu_attach_to_process(struct mm_struct *mm)
+{
+	struct my_mmu *m = kzalloc(sizeof(struct my_mmu), GFP_KERNEL);
+
+	if (!m)
+		return -ENOMEM;
+
+	m->notifier.ops = &my_mmu_ops;
+	spin_lock_init(&m->lock);
+
+	/*
+	 * mmap_sem handling can be omitted if it is guaranteed that
+	 * the context from which my_mmu_attach_to_process is called
+	 * is already holding a writelock on mmap_sem.
+	 */
+	down_write(&mm->mmap_sem);
+	mmu_notifier_register(&m->notifier, mm);
+	up_write(&mm->mmap_sem);
+	down_write(&listlock);
+	list_add(&m->list, &my_mmu_list);
+	up_write(&listlock);
+
+	/*
+	 * RCU sync is expensive but necessary if we need to guarantee
+	 * that multiple threads running on other cpus have seen the
+	 * notifier changes.
+	 */
+	synchronize_rcu();
+	return 0;
+}
+
+
+static void my_sleeping_invalidate_page(struct my_mmu *m, unsigned long address)
+{
+	/* Must be provided */
+	spin_lock(&m->lock);	/* Only taken to ensure mmu data integrity */
+	my_mmu_zap_page(m, address);
+	spin_unlock(&m->lock);
+	printk(KERN_INFO "Sleeping invalidate_page %p address=%lx\n",
+							m, address);
+}
+
+static unsigned long my_mmu_find_addr(struct my_mmu *m, struct page *page)
+{
+	/* Determine the address of a page in a mmu segment */
+	return -EFAULT;
+}
+
+/*
+ * A reference must be held on the page passed and the page passed
+ * must be locked. No spinlocks are held. invalidate_page() is held
+ * off by us holding the page lock.
+ */
+static void my_mmu_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
+							struct page *page)
+{
+	struct my_mmu *m;
+
+	BUG_ON(!PageLocked(page));
+	down_read(&listlock);
+	list_for_each_entry(m, &my_mmu_list, list) {
+		unsigned long address = my_mmu_find_addr(m, page);
+
+		if (address != -EFAULT)
+			my_sleeping_invalidate_page(m, address);
+	}
+	up_read(&listlock);
+}
+
+static struct mmu_rmap_notifier_ops my_mmu_rmap_ops = {
+	.invalidate_page = my_mmu_rmap_invalidate_page
+};
+
+static struct mmu_rmap_notifier my_mmu_rmap_notifier = {
+	.ops = &my_mmu_rmap_ops
+};
+
+static int __init my_mmu_init(void)
+{
+	mmu_rmap_notifier_register(&my_mmu_rmap_notifier);
+	return 0;
+}
+
+late_initcall(my_mmu_init);
+
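
A usage sketch (not part of the patch above): this is how a driver built on
the skeleton might export a page into its own reverse map, so that the rmap
hooks from the previous patch start delivering invalidate_page() callouts
for it.  my_mmu_rmap_insert() is a made-up placeholder for the driver's
private rmap insertion.

static void my_mmu_export_page(struct my_mmu *m, struct page *page,
						unsigned long address)
{
	lock_page(page);
	mmu_rmap_export_page(page);	/* sets PageExternalRmap */
	my_mmu_rmap_insert(m, page, address);	/* hypothetical driver rmap */
	unlock_page(page);
	/*
	 * From here on page_mkclean()/try_to_unmap() will invoke the
	 * mmu_rmap_notifier invalidate_page() chain for this page, which
	 * ends up in my_mmu_rmap_invalidate_page() above.
	 */
}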

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-02-08 22:06 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
@ 2008-02-08 22:23 ` Andrew Morton
  2008-02-08 23:32   ` Christoph Lameter
  2008-02-13 14:31 ` Jack Steiner
  7 siblings, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2008-02-08 22:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: andrea, holt, avi, izike, kvm-devel, a.p.zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, general

On Fri, 08 Feb 2008 14:06:16 -0800
Christoph Lameter <clameter@sgi.com> wrote:

> This is a patchset implementing MMU notifier callbacks based on Andrea's
> earlier work. These are needed if Linux pages are referenced from something
> else than tracked by the rmaps of the kernel (an external MMU). MMU
> notifiers allow us to get rid of the page pinning for RDMA and various
> other purposes. It gets rid of the broken use of mlock for page pinning.
> (mlock really does *not* pin pages....)
> 
> More information on the rationale and the technical details can be found in
> the first patch and the README provided by that patch in
> Documentation/mmu_notifiers.
> 
> The known immediate users are
> 
> KVM
> - Establishes a refcount to the page via get_user_pages().
> - External references are called spte.
> - Has page tables to track pages whose refcount was elevated but
>   no reverse maps.
> 
> GRU
> - Simple additional hardware TLB (possibly covering multiple instances of
>   Linux)
> - Needs TLB shootdown when the VM unmaps pages.
> - Determines page address via follow_page (from interrupt context) but can
>   fall back to get_user_pages().
> - No page reference possible since no page status is kept..
> 
> XPmem
> - Allows use of a processes memory by remote instances of Linux.
> - Provides its own reverse mappings to track remote pte.
> - Established refcounts on the exported pages.
> - Must sleep in order to wait for remote acks of ptes that are being
>   cleared.
> 

What about ib_umem_get()?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 22:23 ` [patch 0/6] MMU Notifiers V6 Andrew Morton
@ 2008-02-08 23:32   ` Christoph Lameter
  2008-02-08 23:36     ` Robin Holt
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 23:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrea, holt, avi, izike, kvm-devel, a.p.zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, general

On Fri, 8 Feb 2008, Andrew Morton wrote:

> What about ib_umem_get()?

Ok. It pins using an elevated refcount. Same as XPmem right now. With that 
we effectively pin a page (page migration will fail) but we will 
continually be reclaiming the page and may repeatedly try to move it. We 
have issues with XPmem causing too many pages to be pinned and thus the 
OOM getting into weird behavior modes (OOM or stop lru scanning due to 
all_reclaimable set).

An elevated refcount will also not be noticed by any of the schemes under 
consideration to improve LRU scanning performance.

 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 23:32   ` Christoph Lameter
@ 2008-02-08 23:36     ` Robin Holt
  2008-02-08 23:41       ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Robin Holt @ 2008-02-08 23:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, andrea, holt, avi, izike, kvm-devel, a.p.zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, general

On Fri, Feb 08, 2008 at 03:32:19PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Andrew Morton wrote:
> 
> > What about ib_umem_get()?
> 
> Ok. It pins using an elevated refcount. Same as XPmem right now. With that 
> we effectively pin a page (page migration will fail) but we will 
> continually be reclaiming the page and may repeatedly try to move it. We 
> have issues with XPmem causing too many pages to be pinned and thus the 
> OOM getting into weird behavior modes (OOM or stop lru scanning due to 
> all_reclaimable set).
> 
> An elevated refcount will also not be noticed by any of the schemes under 
> consideration to improve LRU scanning performance.

Christoph, I am not sure what you are saying here.  With v4 and later,
I thought we were able to use the rmap invalidation to remove the ref
count that XPMEM was holding and therefore be able to swapout.  Did I miss
something?  I agree the existing XPMEM does pin.  I hope we are not saying
the XPMEM based upon these patches will not be able to swap/migrate.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 23:36     ` Robin Holt
@ 2008-02-08 23:41       ` Christoph Lameter
  2008-02-08 23:43         ` Robin Holt
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-08 23:41 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrew Morton, andrea, avi, izike, kvm-devel, a.p.zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, general

On Fri, 8 Feb 2008, Robin Holt wrote:

> > > What about ib_umem_get()?
> > 
> > Ok. It pins using an elevated refcount. Same as XPmem right now. With that 
> > we effectively pin a page (page migration will fail) but we will 
> > continually be reclaiming the page and may repeatedly try to move it. We 
> > have issues with XPmem causing too many pages to be pinned and thus the 
> > OOM getting into weird behavior modes (OOM or stop lru scanning due to 
> > all_reclaimable set).
> > 
> > An elevated refcount will also not be noticed by any of the schemes under 
> > consideration to improve LRU scanning performance.
> 
> Christoph, I am not sure what you are saying here.  With v4 and later,
> I thought we were able to use the rmap invalidation to remove the ref
> count that XPMEM was holding and therefore be able to swapout.  Did I miss
> something?  I agree the existing XPMEM does pin.  I hope we are not saying
> the XPMEM based upon these patches will not be able to swap/migrate.

Correct.

You missed the turn of the conversation to how ib_umem_get() works. 
Currently it seems to pin the same way that the SLES10 XPmem works.




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 23:41       ` Christoph Lameter
@ 2008-02-08 23:43         ` Robin Holt
  2008-02-08 23:56           ` Andrew Morton
  0 siblings, 1 reply; 119+ messages in thread
From: Robin Holt @ 2008-02-08 23:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Andrew Morton, andrea, avi, izike, kvm-devel,
	a.p.zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman,
	general

On Fri, Feb 08, 2008 at 03:41:24PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Robin Holt wrote:
> 
> > > > What about ib_umem_get()?
> 
> Correct.
> 
> You missed the turn of the conversation to how ib_umem_get() works. 
> Currently it seems to pin the same way that the SLES10 XPmem works.

Ah.  I took Andrew's question as more of a probe about whether we had
worked with the IB folks to ensure this fits the ib_umem_get needs
as well.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 23:43         ` Robin Holt
@ 2008-02-08 23:56           ` Andrew Morton
  2008-02-09  0:05             ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2008-02-08 23:56 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, andrea, avi, izike, kvm-devel, a.p.zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, general

On Fri, 8 Feb 2008 17:43:02 -0600 Robin Holt <holt@sgi.com> wrote:

> On Fri, Feb 08, 2008 at 03:41:24PM -0800, Christoph Lameter wrote:
> > On Fri, 8 Feb 2008, Robin Holt wrote:
> > 
> > > > > What about ib_umem_get()?
> > 
> > Correct.
> > 
> > You missed the turn of the conversation to how ib_umem_get() works. 
> > Currently it seems to pin the same way that the SLES10 XPmem works.
> 
> Ah.  I took Andrew's question as more of a probe about whether we had
> worked with the IB folks to ensure this fits the ib_umem_get needs
> as well.
> 

You took it correctly, and I didn't understand the answer ;)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 23:56           ` Andrew Morton
@ 2008-02-09  0:05             ` Christoph Lameter
  2008-02-09  0:12               ` [ofa-general] " Roland Dreier
  2008-02-09  0:12               ` [patch 0/6] MMU Notifiers V6 Andrew Morton
  0 siblings, 2 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09  0:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robin Holt, andrea, avi, izike, kvm-devel, a.p.zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, general

On Fri, 8 Feb 2008, Andrew Morton wrote:

> You took it correctly, and I didn't understand the answer ;)

We have done several rounds of discussion on linux-kernel about this so 
far and the IB folks have not shown up to join in. I have tried to make 
this as general as possible.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:05             ` Christoph Lameter
@ 2008-02-09  0:12               ` Roland Dreier
  2008-02-09  0:16                 ` Christoph Lameter
  2008-02-09  0:12               ` [patch 0/6] MMU Notifiers V6 Andrew Morton
  1 sibling, 1 reply; 119+ messages in thread
From: Roland Dreier @ 2008-02-09  0:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, andrea, a.p.zijlstra, linux-mm, izike, steiner,
	linux-kernel, avi, kvm-devel, daniel.blueman, Robin Holt,
	general

 > We have done several rounds of discussion on linux-kernel about this so 
 > far and the IB folks have not shown up to join in. I have tried to make 
 > this as general as possible.

Sorry, this has been on my "things to look at" list for a while, but I
haven't gotten a chance to really understand where things are yet.

In general, this MMU notifier stuff will only be useful to a subset of
InfiniBand/RDMA hardware.  Some adapters are smart enough to handle
changing the IO virtual -> bus/physical mapping on the fly, but some
aren't.  For the dumb adapters, I think the current ib_umem_get() is
pretty close to as good as we can get: we have to keep the physical
pages pinned for as long as the adapter is allowed to DMA into the
memory region.

For the smart adapters, we just need a chance to change the adapter's
page table when the kernel/CPU's mapping changes, and naively, this
stuff looks like it would work.

Andrew, does that help?

- R.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:05             ` Christoph Lameter
  2008-02-09  0:12               ` [ofa-general] " Roland Dreier
@ 2008-02-09  0:12               ` Andrew Morton
  2008-02-09  0:18                 ` Christoph Lameter
  1 sibling, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2008-02-09  0:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, andrea, avi, izike, kvm-devel, a.p.zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, general

On Fri, 8 Feb 2008 16:05:00 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 8 Feb 2008, Andrew Morton wrote:
> 
> > You took it correctly, and I didn't understand the answer ;)
> 
> We have done several rounds of discussion on linux-kernel about this so 
> far and the IB folks have not shown up to join in. I have tried to make 
> this as general as possible.

infiniband would appear to be the major present in-kernel client of this new
interface.  So as a part of proving its usefulness, correctness, etc we
should surely work on converting infiniband to use it, and prove its
goodness.

Quite possibly none of the infiniband developers even know about it..

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:12               ` [ofa-general] " Roland Dreier
@ 2008-02-09  0:16                 ` Christoph Lameter
  2008-02-09  0:22                   ` Roland Dreier
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09  0:16 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Andrew Morton, andrea, a.p.zijlstra, linux-mm, izike, steiner,
	linux-kernel, avi, kvm-devel, daniel.blueman, Robin Holt,
	general

On Fri, 8 Feb 2008, Roland Dreier wrote:

> In general, this MMU notifier stuff will only be useful to a subset of
> InfiniBand/RDMA hardware.  Some adapters are smart enough to handle
> changing the IO virtual -> bus/physical mapping on the fly, but some
> aren't.  For the dumb adapters, I think the current ib_umem_get() is
> pretty close to as good as we can get: we have to keep the physical
> pages pinned for as long as the adapter is allowed to DMA into the
> memory region.

I thought the adaptor can always remove the mapping by renegotiating 
with the remote side? Even if it's dumb, a callback could notify the 
driver that it may be required to tear down the mapping. We then hold the 
pages until we get the okay from the driver that the mapping has been removed.

We could also let the unmapping fail if the driver indicates that the 
mapping must stay.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:12               ` [patch 0/6] MMU Notifiers V6 Andrew Morton
@ 2008-02-09  0:18                 ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09  0:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robin Holt, andrea, avi, izike, kvm-devel, a.p.zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, general

On Fri, 8 Feb 2008, Andrew Morton wrote:

> Quite possibly none of the infiniband developers even know about it..

Well Andrea's initial approach was even featured on LWN a couple of 
weeks back.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:16                 ` Christoph Lameter
@ 2008-02-09  0:22                   ` Roland Dreier
  2008-02-09  0:36                     ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Roland Dreier @ 2008-02-09  0:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: andrea, a.p.zijlstra, izike, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

 > I thought the adaptor can always remove the mapping by renegotiating 
 > with the remote side? Even if it's dumb, a callback could notify the 
 > driver that it may be required to tear down the mapping. We then hold the 
 > pages until we get the okay from the driver that the mapping has been removed.

Of course we can always destroy the memory region but that would break
the semantics that applications expect.  Basically an application can
register some chunk of its memory and get a key that it can pass to a
remote peer to let the remote peer operate on its memory via RDMA.
And that memory region/key is expected to stay valid until there is an
application-level operation to destroy it (or until the app crashes or
gets killed, etc).
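
(For readers not familiar with the verbs API, a minimal userspace sketch of
the semantics described above, using the standard libibverbs calls.  pd, buf
and len are assumed to already exist and error handling is omitted; this is
illustration only, not part of the patchset.)

#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t export_region(struct ibv_pd *pd, void *buf, size_t len,
			      struct ibv_mr **mr)
{
	*mr = ibv_reg_mr(pd, buf, len,
			 IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!*mr)
		return 0;
	/*
	 * The rkey is handed to the remote peer, which may then RDMA into
	 * buf at any time until ibv_dereg_mr(*mr) is called.  That is the
	 * lifetime for which ib_umem_get() currently pins the pages.
	 */
	return (*mr)->rkey;
}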

 > We could also let the unmapping fail if the driver indicates that the 
 > mapping must stay.

That would of course work -- dumb adapters would just always fail,
which might be inefficient.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:22                   ` Roland Dreier
@ 2008-02-09  0:36                     ` Christoph Lameter
  2008-02-09  1:24                       ` Andrea Arcangeli
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09  0:36 UTC (permalink / raw)
  To: Roland Dreier
  Cc: andrea, a.p.zijlstra, izike, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel, Rik van Riel

On Fri, 8 Feb 2008, Roland Dreier wrote:

> That would of course work -- dumb adapters would just always fail,
> which might be inefficient.

Hmmmm.. that means we need something that actually pins pages for good so 
that the VM can avoid reclaiming them and so that page migration can avoid 
trying to migrate them. Something like yet another page flag.

Ccing Rik.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  0:36                     ` Christoph Lameter
@ 2008-02-09  1:24                       ` Andrea Arcangeli
  2008-02-09  1:27                         ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Andrea Arcangeli @ 2008-02-09  1:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Roland Dreier, a.p.zijlstra, izike, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel, Rik van Riel

On Fri, Feb 08, 2008 at 04:36:16PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Roland Dreier wrote:
> 
> > That would of course work -- dumb adapters would just always fail,
> > which might be inefficient.
> 
> Hmmmm.. that means we need something that actually pins pages for good so 
> > that the VM can avoid reclaiming them and so that page migration can avoid 
> trying to migrate them. Something like yet another page flag.

What's wrong with pinning with the page count like now? Dumb adapters
would simply not register themselves in the mmu notifier list, no?

> 
> Ccing Rik.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  1:24                       ` Andrea Arcangeli
@ 2008-02-09  1:27                         ` Christoph Lameter
  2008-02-09  1:56                           ` Andrea Arcangeli
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09  1:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Roland Dreier, a.p.zijlstra, izike, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel, Rik van Riel

On Sat, 9 Feb 2008, Andrea Arcangeli wrote:

> > Hmmmm.. that means we need something that actually pins pages for good so 
> > that the VM can avoid reclaiming them and so that page migration can avoid 
> > trying to migrate them. Something like yet another page flag.
> 
> What's wrong with pinning with the page count like now? Dumb adapters
> would simply not register themselves in the mmu notifier list, no?

Pages will still be on the LRU and cycle through rmap again and again. 
If page migration is used on those pages then the code may make repeated 
attempts to migrate the page, thinking that the page count must at some 
point drop.

I do not think that the page count was intended to be used to pin pages 
permanently. If we had a marker on such pages then we could take them off 
the LRU and not try to migrate them.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  1:27                         ` Christoph Lameter
@ 2008-02-09  1:56                           ` Andrea Arcangeli
  2008-02-09  2:16                             ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Andrea Arcangeli @ 2008-02-09  1:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Roland Dreier, a.p.zijlstra, izike, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel, Rik van Riel

On Fri, Feb 08, 2008 at 05:27:03PM -0800, Christoph Lameter wrote:
> Pages will still be on the LRU and cycle through rmap again and again. 
> If page migration is used on those pages then the code may make repeated 
> attempts to migrate the page, thinking that the page count must at some 
> point drop.
>
> I do not think that the page count was intended to be used to pin pages 
> permanently. If we had a marker on such pages then we could take them off 
> the LRU and not try to migrate them.

The VM shouldn't break if try_to_unmap doesn't actually make the page
freeable for whatever reason. Permanent pins shouldn't happen anyway,
so defining an ad-hoc API for that doesn't sound too appealing. Not
sure if old hardware deserves those special lru-size-reduction
optimizations but it's not my call (certainly swapoff/mlock would get
higher priority in that lru-size-reduction area).

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  1:56                           ` Andrea Arcangeli
@ 2008-02-09  2:16                             ` Christoph Lameter
  2008-02-09 12:55                               ` Rik van Riel
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09  2:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Roland Dreier, a.p.zijlstra, izike, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel, Rik van Riel

On Sat, 9 Feb 2008, Andrea Arcangeli wrote:

> The VM shouldn't break if try_to_unmap doesn't actually make the page
> freeable for whatever reason. Permanent pins shouldn't happen anyway,

VM is livelocking if too many pages are pinned that way right now. The 
more processors per node, the higher the risk of livelock, because 
more processors are in the process of cycling through pages that have an 
elevated refcount.

> so defining an ad-hoc API for that doesn't sound too appealing. Not
> sure if old hardware deserves those special lru-size-reduction
> optimizations but it's not my call (certainly swapoff/mlock would get
> higher priority in that lru-size-reduction area).

Rik has a patchset under development that addresses issues like this. The 
elevated refcount pin problem is not really relevant to the patchset we 
are discussing here.
 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09  2:16                             ` Christoph Lameter
@ 2008-02-09 12:55                               ` Rik van Riel
  2008-02-09 21:46                                 ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Rik van Riel @ 2008-02-09 12:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Roland Dreier, a.p.zijlstra, izike, steiner,
	linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
	Andrew Morton, kvm-devel

On Fri, 8 Feb 2008 18:16:16 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sat, 9 Feb 2008, Andrea Arcangeli wrote:
> 
> > The VM shouldn't break if try_to_unmap doesn't actually make the page
> > freeable for whatever reason. Permanent pins shouldn't happen anyway,
> 
> VM is livelocking if too many pages are pinned that way right now.

> Rik has a patchset under development that addresses issues like this

PG_mlock is on the way and can easily be reused for this, too.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6
  2008-02-09 12:55                               ` Rik van Riel
@ 2008-02-09 21:46                                 ` Christoph Lameter
  2008-02-11 22:40                                   ` Demand paging for memory regions (was Re: MMU Notifiers V6) Roland Dreier
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-09 21:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Roland Dreier, a.p.zijlstra, izike, steiner,
	linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
	Andrew Morton, kvm-devel

On Sat, 9 Feb 2008, Rik van Riel wrote:

> PG_mlock is on the way and can easily be reused for this, too.

Note that a pinned page is different from an mlocked page. An mlocked page 
can be moved through page migration and/or memory hotplug. A pinned page 
must make both fail.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Demand paging for memory regions (was Re: MMU Notifiers V6)
  2008-02-09 21:46                                 ` Christoph Lameter
@ 2008-02-11 22:40                                   ` Roland Dreier
  2008-02-12 22:01                                     ` Steve Wise
  0 siblings, 1 reply; 119+ messages in thread
From: Roland Dreier @ 2008-02-11 22:40 UTC (permalink / raw)
  To: general
  Cc: Christoph Lameter, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, steiner, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

[Adding general@lists.openfabrics.org to get the IB/RDMA people involved]

This thread has patches that add support for notifying drivers when a
process's memory map changes.  The hope is that this is useful for
letting RDMA devices handle registered memory without pinning the
underlying pages, by updating the RDMA device's translation tables
whenever the host kernel's tables change.

Is anyone interested in working on using this for drivers/infiniband?
I am interested in participating, but I don't think I have enough time
to do this by myself.

Also, at least naively it seems that this is only useful for hardware
that has support for this type of demand paging, and can handle
not-present pages, generating interrupts for page faults, etc.  I know
that Mellanox HCAs should have this support; are there any other
devices that can do this?

The beginning of this thread is at <http://lkml.org/lkml/2008/2/8/458>.

 - R.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Demand paging for memory regions (was Re: MMU Notifiers V6)
  2008-02-11 22:40                                   ` Demand paging for memory regions (was Re: MMU Notifiers V6) Roland Dreier
@ 2008-02-12 22:01                                     ` Steve Wise
  2008-02-12 22:10                                       ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Steve Wise @ 2008-02-12 22:01 UTC (permalink / raw)
  To: Roland Dreier
  Cc: general, Christoph Lameter, Rik van Riel, Andrea Arcangeli,
	a.p.zijlstra, izike, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, Andrew Morton, kvm-devel

Roland Dreier wrote:
> [Adding general@lists.openfabrics.org to get the IB/RDMA people involved]
> 
> This thread has patches that add support for notifying drivers when a
> process's memory map changes.  The hope is that this is useful for
> letting RDMA devices handle registered memory without pinning the
> underlying pages, by updating the RDMA device's translation tables
> whenever the host kernel's tables change.
> 
> Is anyone interested in working on using this for drivers/infiniband?
> I am interested in participating, but I don't think I have enough time
> to do this by myself.

I don't have time, although it would be interesting work!

> 
> Also, at least naively it seems that this is only useful for hardware
> that has support for this type of demand paging, and can handle
> not-present pages, generating interrupts for page faults, etc.  I know
> that Mellanox HCAs should have this support; are there any other
> devices that can do this?
>

Chelsio's T3 HW doesn't support this.


Steve.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Demand paging for memory regions (was Re: MMU Notifiers V6)
  2008-02-12 22:01                                     ` Steve Wise
@ 2008-02-12 22:10                                       ` Christoph Lameter
  2008-02-12 22:41                                         ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-12 22:10 UTC (permalink / raw)
  To: Steve Wise
  Cc: Roland Dreier, general, Rik van Riel, Andrea Arcangeli,
	a.p.zijlstra, izike, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Steve Wise wrote:

> Chelsio's T3 HW doesn't support this.

Not so far I guess but it could be equipped with these features right? 

Having the VM manage the memory area for Infiniband allows more reliable 
system operations and enables the sharing of large memory areas via 
Infiniband without the risk of livelocks or OOMs.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 22:10                                       ` Christoph Lameter
@ 2008-02-12 22:41                                         ` Roland Dreier
  2008-02-12 23:14                                           ` Felix Marti
                                                             ` (3 more replies)
  0 siblings, 4 replies; 119+ messages in thread
From: Roland Dreier @ 2008-02-12 22:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steve Wise, Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike,
	steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt,
	general, Andrew Morton, kvm-devel

 > > Chelsio's T3 HW doesn't support this.

 > Not so far I guess but it could be equipped with these features right? 

I don't know anything about the T3 internals, but it's not clear that
you could do this without a new chip design in general.  Lots of RDMA
devices were designed expecting that when a packet arrives, the HW can
look up the bus address for a given memory region/offset and place the
packet immediately.  It seems like a major change to be able to
generate a "page fault" interrupt when a page isn't present, or even
just wait to scatter some data until the host finishes updating page
tables when the HW needs the translation.

 - R.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 22:41                                         ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
@ 2008-02-12 23:14                                           ` Felix Marti
  2008-02-13  0:57                                             ` Christoph Lameter
  2008-02-14 15:09                                             ` Steve Wise
  2008-02-12 23:23                                           ` Jason Gunthorpe
                                                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 119+ messages in thread
From: Felix Marti @ 2008-02-12 23:14 UTC (permalink / raw)
  To: Roland Dreier, Christoph Lameter
  Cc: Rik van Riel, steiner, Andrea Arcangeli, a.p.zijlstra, izike,
	linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
	Andrew Morton, kvm-devel



>  > > Chelsio's T3 HW doesn't support this.
> 
>  > Not so far I guess but it could be equipped with these features
> right?
> 
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general.  Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately.  It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

That is correct, not a change we can make for T3. We could, in theory,
deal with changing mappings, though. The change would need to be
synchronized: the VM would need to tell us which mappings were
about to change, and the driver would then need to disable DMA to/from
them, do the change, and resume DMA.

> 
>  - R.
> 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 22:41                                         ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
  2008-02-12 23:14                                           ` Felix Marti
@ 2008-02-12 23:23                                           ` Jason Gunthorpe
  2008-02-13  1:01                                             ` Christoph Lameter
  2008-02-13  0:56                                           ` Christoph Lameter
  2008-02-13 12:11                                           ` Christoph Raisch
  3 siblings, 1 reply; 119+ messages in thread
From: Jason Gunthorpe @ 2008-02-12 23:23 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Christoph Lameter, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, Feb 12, 2008 at 02:41:48PM -0800, Roland Dreier wrote:
>  > > Chelsio's T3 HW doesn't support this.
> 
>  > Not so far I guess but it could be equipped with these features right? 
> 
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general.  Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place
> the

Well, certainly today the memfree IB devices store the page tables in
host memory so they are already designed to hang onto packets during
the page lookup over PCIE, adding in faulting makes this time
larger.

But this is not a good thing at all, IB's congestion model is based on
the notion that end ports can always accept packets without making
input contingent on output. If you take a software interrupt to fill in
the page pointer then you could potentially deadlock on the
fabric. For example using this mechanism to allow swap-in of RDMA target
pages and then putting the storage over IB would be deadlock
prone. Even without deadlock slowing down the input path will cause
network congestion and poor performance for other nodes. It is not a
desirable thing to do..

I expect that iwarp running over flow controlled ethernet has similar
kinds of problems for similar reasons..

In general the best I think you can hope for with RDMA hardware is
page migration using some atomic operations with the adaptor and a cpu
page copy with retry sort of scheme - but is pure page migration
interesting at all?

Jason

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 22:41                                         ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
  2008-02-12 23:14                                           ` Felix Marti
  2008-02-12 23:23                                           ` Jason Gunthorpe
@ 2008-02-13  0:56                                           ` Christoph Lameter
  2008-02-13 12:11                                           ` Christoph Raisch
  3 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13  0:56 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Steve Wise, Rik van Riel, Andrea Arcangeli, a.p.zijlstra, izike,
	steiner, linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt,
	general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Roland Dreier wrote:

> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general.  Lots of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately.  It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

Well if the VM wants to invalidate a page then the remote end first has to 
remove its mapping. 

If a page has been removed then the remote end would encounter a fault and 
then would have to wait for the local end to reestablish its mapping 
before proceeding.

So the packet would only be generated when both ends are in sync.



^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 23:14                                           ` Felix Marti
@ 2008-02-13  0:57                                             ` Christoph Lameter
  2008-02-14 15:09                                             ` Steve Wise
  1 sibling, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13  0:57 UTC (permalink / raw)
  To: Felix Marti
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Felix Marti wrote:

> > I don't know anything about the T3 internals, but it's not clear that
> > you could do this without a new chip design in general.  Lots of RDMA
> > devices were designed expecting that when a packet arrives, the HW can
> > look up the bus address for a given memory region/offset and place the
> > packet immediately.  It seems like a major change to be able to
> > generate a "page fault" interrupt when a page isn't present, or even
> > just wait to scatter some data until the host finishes updating page
> > tables when the HW needs the translation.
> 
> That is correct, not a change we can make for T3. We could, in theory,
> deal with changing mappings, though. The change would need to be
> synchronized: the VM would need to tell us which mappings were
> about to change, and the driver would then need to disable DMA to/from
> them, do the change, and resume DMA.

Right. That is the intent of the patchset.
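
A rough sketch of how that could look for an adapter that cannot fault
(struct t3_dev, t3_stop_dma() and t3_resume_dma() are made up for
illustration; the callback signatures follow the skeleton in patch 6):

static void t3_invalidate_range_begin(struct mmu_notifier *mn,
	struct mm_struct *mm, unsigned long start, unsigned long end,
	int atomic)
{
	struct t3_dev *dev = container_of(mn, struct t3_dev, notifier);

	/* The VM is about to change ptes in [start, end): quiesce DMA */
	t3_stop_dma(dev, start, end);
}

static void t3_invalidate_range_end(struct mmu_notifier *mn,
	struct mm_struct *mm, unsigned long start, unsigned long end,
	int atomic)
{
	struct t3_dev *dev = container_of(mn, struct t3_dev, notifier);

	/* New mappings are in place: re-derive bus addresses and resume */
	t3_resume_dma(dev, start, end);
}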

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 23:23                                           ` Jason Gunthorpe
@ 2008-02-13  1:01                                             ` Christoph Lameter
  2008-02-13  1:26                                               ` Jason Gunthorpe
  2008-02-13  1:55                                               ` Christian Bell
  0 siblings, 2 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13  1:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> Well, certainly today the memfree IB devices store the page tables in
> host memory so they are already designed to hang onto packets during
> the page lookup over PCIE, adding in faulting makes this time
> larger.

You really do not need a page table to use it. What needs to be maintained 
is knowledge on both sides about what pages are currently shared across 
RDMA. If the VM decides to reclaim a page then the notification is used to 
remove the remote entry. If the remote side then tries to access the page 
again then the page fault on the remote side will stall until the local 
page has been brought back. RDMA can proceed after both sides again agree 
on that page now being sharable.
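
In rough pseudo-code the sequence looks like this (every function and type
below is made up purely to illustrate the handshake; none of it is an API
from the patchset):

/* Local node: called from the mmu_notifier invalidate callouts */
static void local_invalidate(struct export_region *r, unsigned long addr)
{
	send_recall(r->peer, addr);		/* remote drops its entry */
	wait_for_recall_ack(r->peer, addr);	/* both sides now agree */
	/* the VM may now reclaim or migrate the local page */
}

/* Remote node: fault handler, runs when RDMA hits a recalled address */
static int remote_fault(struct export_region *r, unsigned long addr)
{
	unsigned long dma_addr;

	/* ask the exporting node to fault the page back in; may sleep */
	dma_addr = request_translation(r->peer, addr);
	if (!dma_addr)
		return -EFAULT;
	install_remote_entry(r, addr, dma_addr);
	return 0;				/* transfer can be restarted */
}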

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  1:01                                             ` Christoph Lameter
@ 2008-02-13  1:26                                               ` Jason Gunthorpe
  2008-02-13  1:45                                                 ` Steve Wise
  2008-02-13  2:35                                                 ` Christoph Lameter
  2008-02-13  1:55                                               ` Christian Bell
  1 sibling, 2 replies; 119+ messages in thread
From: Jason Gunthorpe @ 2008-02-13  1:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> 
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
> 
> You really do not need a page table to use it. What needs to be maintained 
> is knowledge on both sides about what pages are currently shared across 
> RDMA. If the VM decides to reclaim a page then the notification is used to 
> remove the remote entry. If the remote side then tries to access the page 
> again then the page fault on the remote side will stall until the local 
> page has been brought back. RDMA can proceed after both sides again agree 
> on that page now being sharable.

The problem is that the existing wire protocols do not have a
provision for doing an 'are you ready' or 'I am not ready' exchange
and they are not designed to store page tables on both sides as you
propose. The remote side can send RDMA WRITE traffic at any time after
the RDMA region is established. The local side must be able to handle
it. There is no way to signal that a page is not ready and the remote
should not send.

This means the only possible implementation is to stall/discard at the
local adaptor when an RDMA WRITE is received for a page that has been
reclaimed. This is what leads to deadlock/poor performance..

Jason

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  1:26                                               ` Jason Gunthorpe
@ 2008-02-13  1:45                                                 ` Steve Wise
  2008-02-13  2:35                                                 ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Steve Wise @ 2008-02-13  1:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Lameter, Roland Dreier, Rik van Riel, steiner,
	Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

Jason Gunthorpe wrote:
> On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote:
>> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>>
>>> Well, certainly today the memfree IB devices store the page tables in
>>> host memory so they are already designed to hang onto packets during
>>> the page lookup over PCIE, adding in faulting makes this time
>>> larger.
>> You really do not need a page table to use it. What needs to be maintained 
>> is knowledge on both sides about what pages are currently shared across 
>> RDMA. If the VM decides to reclaim a page then the notification is used to 
>> remove the remote entry. If the remote side then tries to access the page 
>> again then the page fault on the remote side will stall until the local 
>> page has been brought back. RDMA can proceed after both sides again agree 
>> on that page now being sharable.
> 
> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
> 
> This means the only possible implementation is to stall/discard at the
> local adaptor when an RDMA WRITE is received for a page that has been
> reclaimed. This is what leads to deadlock/poor performance..
>

If the events are few and far between then this model is probably ok. 
For iWARP, it means TCP retransmit and slow start and all that, but if 
it's an infrequent event, then it's ok if it helps the host better manage 
memory.

Maybe... ;-)


Steve.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  1:01                                             ` Christoph Lameter
  2008-02-13  1:26                                               ` Jason Gunthorpe
@ 2008-02-13  1:55                                               ` Christian Bell
  2008-02-13  2:19                                                 ` Christoph Lameter
  1 sibling, 1 reply; 119+ messages in thread
From: Christian Bell @ 2008-02-13  1:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> 
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
> 
> You really do not need a page table to use it. What needs to be maintained 
> is knowledge on both sides about what pages are currently shared across 
> RDMA. If the VM decides to reclaim a page then the notification is used to 
> remove the remote entry. If the remote side then tries to access the page 
> again then the page fault on the remote side will stall until the local 
> page has been brought back. RDMA can proceed after both sides again agree 
> on that page now being sharable.

HPC environments won't be amenable to a pessimistic approach of
synchronizing before every data transfer.  RDMA is assumed to be a
low-level data movement mechanism that has no implied
synchronization.  In some parallel programming models, it's not
uncommon to use RDMA to send 8-byte messages.  It can be difficult to
make and hold guarantees about in-memory pages when many concurrent
RDMA operations are in flight (not uncommon in reasonably large
machines).  Some of the in-memory page information could be shared
with some form of remote caching strategy but then it's a different
problem with its own scalability challenges.

I think there are very potential clients of the interface when an
optimistic approach is used.  Part of the trick, however, has to do
with being able to re-start transfers instead of buffering the data
or making guarantees about delivery that could cause deadlock (as was
alluded to earlier in this thread).  InfiniBand is constrained in
this regard since it requires message-ordering between endpoints (or
queue pairs).  One could argue that this is still possible with IB,
at the cost of throwing more packets away when a referenced page is
not in memory.  With this approach, the worst-case demand paging
scenario is met when the active working set of referenced pages is
larger than the amount of physical memory -- but HPC applications are
already bound by this anyway.

You'll find that Quadrics has the most experience in this area and
that their entire architecture is adapted to being optimistic about
demand paging in RDMA transfers -- they've been maintaining a patchset
to do this for years.

    . . christian


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  1:55                                               ` Christian Bell
@ 2008-02-13  2:19                                                 ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13  2:19 UTC (permalink / raw)
  To: Christian Bell
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Christian Bell wrote:

> I think there are very real potential clients of the interface when an
> optimistic approach is used.  Part of the trick, however, has to do
> with being able to re-start transfers instead of buffering the data
> or making guarantees about delivery that could cause deadlock (as was
> alluded to earlier in this thread).  InfiniBand is constrained in
> this regard since it requires message-ordering between endpoints (or
> queue pairs).  One could argue that this is still possible with IB,
> at the cost of throwing more packets away when a referenced page is
> not in memory.  With this approach, the worst-case demand paging
> scenario is met when the active working set of referenced pages is
> larger than the amount of physical memory -- but HPC applications are
> already bound by this anyway.
> 
> You'll find that Quadrics has the most experience in this area and
> that their entire architecture is adapted to being optimistic about
> demand paging in RDMA transfers -- they've been maintaining a patchset
> to do this for years.

The notifier patchset that we are discussing here was mostly inspired by 
their work. 

There is no need to restart transfers that you have never started in the 
first place. The remote side would never start a transfer if the page 
reference has been torn down. In order to start the transfer a fault 
handler on the remote side would have to set up the association between the 
memory on both ends again.




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  1:26                                               ` Jason Gunthorpe
  2008-02-13  1:45                                                 ` Steve Wise
@ 2008-02-13  2:35                                                 ` Christoph Lameter
  2008-02-13  3:25                                                   ` Jason Gunthorpe
  2008-02-13  4:09                                                   ` Christian Bell
  1 sibling, 2 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13  2:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
> 
> This means the only possible implementation is to stall/discard at the
> local adaptor when a RDMA WRITE is received for a page that has been 
> reclaimed. This is what leads to deadlock/poor performance..

You would only use the wire protocols *after* having established the RDMA 
region. The notifier chains allow an RDMA region (or parts thereof) to be 
torn down on demand by the VM. The region can be reestablished if one of 
the sides accesses it. I hope I got that right. Not much exposure to 
Infiniband so far.

Let's say you have two systems A and B. Each has its memory region, MemA 
and MemB. Each side also has page tables for this region, PtA and PtB.

Now you establish an RDMA connection between both sides. The pages in both
MemB and MemA are present and so are entries in PtA and PtB. RDMA 
traffic can proceed.

The VM on system A now gets into a situation in which memory becomes 
heavily used by another (maybe non-RDMA) process and after checking that 
there was no recent reference to MemA and MemB (via a notifier aging 
callback) decides to reclaim the memory from MemA.

In that case it will notify the RDMA subsystem on A that it is trying to
reclaim a certain page.

The RDMA subsystem on A will then send a message to B notifying it that 
the memory will be going away. B now has to remove its corresponding page 
from memory (and drop the entry in PtB) and confirm to A that this has 
happened. RDMA traffic is then stopped for this page. Then A can also 
remove its page, the corresponding entry in PtA and the page is reclaimed 
or pushed out to swap completing the page reclaim.
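
Roughly, node A's half of that exchange could look like the sketch below.
This is only an illustration of the flow, not code from the patchset;
rdma_send_teardown(), rdma_wait_for_ack() and pt_clear_entry() are made-up
helpers standing in for whatever the RDMA subsystem and driver provide:

    /* Sketch only: node A reacting to a notifier invalidate for one page
     * of MemA.  All rdma_* and pt_* helpers below are hypothetical. */
    struct rdma_region;

    extern int  rdma_send_teardown(struct rdma_region *r, unsigned long addr);
    extern int  rdma_wait_for_ack(struct rdma_region *r, unsigned long addr);
    extern void pt_clear_entry(struct rdma_region *r, unsigned long addr);

    /* Called from the VM (via the notifier) before the page is reclaimed. */
    static void region_invalidate_page(struct rdma_region *r, unsigned long addr)
    {
            rdma_send_teardown(r, addr);   /* tell B the page is going away    */
            rdma_wait_for_ack(r, addr);    /* B drops its PtB entry, confirms  */
            pt_clear_entry(r, addr);       /* A drops PtA; reclaim can proceed */
    }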

If either side then accesses the page again then the reverse process 
happens. If B accesses the page then it will first of all incur a page 
fault because the entry in PtB is missing. The fault will then cause a 
message to be sent to A to establish the page again. A will create an 
entry in PtA and will then confirm to B that the page was established. At 
that point RDMA operations can occur again.

So the whole scheme does not really need a hardware page table in the RDMA 
hardware. The page tables of the two systems A and B are sufficient.

The scheme can also be applied to a larger range than only a single page. 
The RDMA subsystem could tear down a large section when reclaim is 
pushing on it and then reestablish it as needed.

Swapping and page reclaim is certainly not something that improves the 
speed of the application affected by swapping and page reclaim but it 
allows the VM to manage memory effectively if multiple loads are running on 
a system.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  2:35                                                 ` Christoph Lameter
@ 2008-02-13  3:25                                                   ` Jason Gunthorpe
  2008-02-13  3:56                                                     ` Patrick Geoffray
  2008-02-13 18:51                                                     ` Christoph Lameter
  2008-02-13  4:09                                                   ` Christian Bell
  1 sibling, 2 replies; 119+ messages in thread
From: Jason Gunthorpe @ 2008-02-13  3:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, Feb 12, 2008 at 06:35:09PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> 
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> > 
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when a RDMA WRITE is received for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..
> 
> You would only use the wire protocols *after* having established the RDMA 
> region. The notifier chains allow an RDMA region (or parts thereof) to be 
> torn down on demand by the VM. The region can be reestablished if one of 
> the sides accesses it. I hope I got that right. Not much exposure to 
> Infiniband so far.

[clip explanation]

But this isn't how IB or iwarp work at all. What you describe is a
significant change to the general RDMA operation and requires changes to
both sides of the connection and the wire protocol.

A few comments on RDMA operation that might clarify things a little
bit more:
 - In RDMA (iwarp and IB versions) the hardware page tables exist to
   linearize the local memory so the remote does not need to be aware
   of non-linearities in the physical address space. The main
   motivation for this is kernel bypass where the user space app wants
   to instruct the remote side to DMA into memory using user space
   addresses. Hardware provides the page tables to switch from
   incoming user space virtual addresses to physical addresses.

   This greatly simplifies the user space programming model since you
   don't need to pass around or create s/g lists for memory that is
   already virtually contiguous.

   Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
   for access control and enforcing the lifetime of the mapping.

   The page tables in the RDMA hardware exist primarily to support
   this, and not for other reasons. The pinning of pages is one part
   to support the HW page tables and one part to support the RDMA
   lifetime rules; the lifetime rules are what cause problems for
   the VM.
 - The wire protocol consists of packets that say 'Write XXX bytes to
   offset YY in Region RRR'. Creating a region produces the RRR label
   and currently pins the pages. So long as the RRR label is valid the
   remote side can issue write packets at any time without any
   further synchronization (a rough sketch of what such a write request
   carries follows this list). There are no wire-level events associated
   with creating RRR. You can pass RRR to the other machine in any
   fashion, even using carrier pigeons :)
 - The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
   are built on top of it and they specify the lifetime rules and
   protocol for exchanging RRR.

   Every protocol is different. In-kernel protocols like SRP and NFS
   RDMA seem to have very short lifetimes for RRR and work more like
   pci_map_* in real SCSI hardware.
 - HPC userspace apps, like MPI apps, have different lifetime rules
   and tend to be really long lived. These people will not want
   anything that makes their OPs more expensive and also probably
   don't care too much about the VM problems you are looking at (?)
 - There is no protocol support to exchange RRR. This is all done
   by upper level protocols (ala HTTP vs TCP). You cannot assert
   and revoke RRR in a general way. Every protocol is different
   and optimized.

   This is your step 'A will then send a message to B notifying..'.
   It simply does not exist in the protocol specifications
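
To make the wire-protocol bullet above concrete, here is a rough sketch of
the information such a write request conceptually carries. This is not the
actual IB or iWARP wire encoding, only an illustration of why no further
handshake is needed once the RRR label has been handed out:

    #include <stdint.h>

    /* Sketch only -- not the real IB/iWARP packet layout. */
    struct rdma_write_request {
            uint32_t rkey;     /* the "RRR" label produced at registration */
            uint64_t offset;   /* "offset YY" inside the registered region */
            uint32_t length;   /* "XXX bytes" of payload that follow       */
            /* payload follows; the target applies it without any reply    */
    };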

I don't know much about Quadrics, but I would be hesitant to lump it
in too much with these RDMA semantics. Christian's comments sound like
they operate closer to what you described and that is why they have an
existing patch set. I don't know :)

What it boils down to is that to implement true removal of pages in a
general way the kernel and HCA must either drop packets or stall
incoming packets, both are big performance problems - and I can't see
many users wanting this. Enterprise style people using SCSI, NFS, etc
already have short pin periods and HPC MPI users probably won't care
about the VM issues enough to warrant the performance overhead.

Regards,
Jason

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  3:25                                                   ` Jason Gunthorpe
@ 2008-02-13  3:56                                                     ` Patrick Geoffray
  2008-02-13  4:26                                                       ` Jason Gunthorpe
  2008-02-13 18:51                                                     ` Christoph Lameter
  1 sibling, 1 reply; 119+ messages in thread
From: Patrick Geoffray @ 2008-02-13  3:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Lameter, Roland Dreier, Rik van Riel, steiner,
	Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel, avi,
	linux-mm, daniel.blueman

Jason,

Jason Gunthorpe wrote:
> I don't know much about Quadrics, but I would be hesitant to lump it
> in too much with these RDMA semantics. Christian's comments sound like
> they operate closer to what you described and that is why they have an
> existing patch set. I don't know :)

The Quadrics folks have been doing RDMA for 10 years, there is a reason 
why they maintained a patch.

> What it boils down to is that to implement true removal of pages in a
> general way the kernel and HCA must either drop packets or stall
> incoming packets, both are big performance problems - and I can't see
> many users wanting this. Enterprise style people using SCSI, NFS, etc
> already have short pin periods and HPC MPI users probably won't care
> about the VM issues enough to warrant the performance overhead.

This is not true; HPC people do care about the VM issues a lot. Memory 
registration (pinning and translating) is usually too expensive to be 
performed in the critical path before and after each send or receive. So 
they factor it out by registering a buffer the first time it is used, 
and keeping it registered in a registration cache. However, the 
application may free() a buffer that is in the registration cache, so 
HPC people provide their own malloc to catch free(). They also try to 
catch sbrk() and munmap() to deregister memory before it is released to 
the OS. This is a major pain that a VM notifier would easily solve. 
Being able to swap registered pages to disk or migrate them in a NUMA 
system is a welcome bonus.

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  2:35                                                 ` Christoph Lameter
  2008-02-13  3:25                                                   ` Jason Gunthorpe
@ 2008-02-13  4:09                                                   ` Christian Bell
  2008-02-13 19:00                                                     ` Christoph Lameter
  2008-02-13 23:23                                                     ` Pete Wyckoff
  1 sibling, 2 replies; 119+ messages in thread
From: Christian Bell @ 2008-02-13  4:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> 
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> > 
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when a RDMA WRITE is received for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..

You're arguing that a HW page table is not needed by describing a use
case that is essentially what all RDMA solutions already do above the
wire protocols (all solutions except Quadrics, of course).

> You would only use the wire protocols *after* having established the RDMA 
> region. The notifier chains allow an RDMA region (or parts thereof) to be 
> torn down on demand by the VM. The region can be reestablished if one of 
> the sides accesses it. I hope I got that right. Not much exposure to 
> Infiniband so far.

RDMA is already always used *after* memory regions are set up --
they are set up out-of-band w.r.t RDMA but essentially this is the
"before" part.

> Let's say you have two systems A and B. Each has its memory region, MemA 
> and MemB. Each side also has page tables for this region, PtA and PtB.
> 
> Now you establish an RDMA connection between both sides. The pages in both
> MemB and MemA are present and so are entries in PtA and PtB. RDMA 
> traffic can proceed.
> 
> The VM on system A now gets into a situation in which memory becomes 
> heavily used by another (maybe non-RDMA) process and after checking that 
> there was no recent reference to MemA and MemB (via a notifier aging 
> callback) decides to reclaim the memory from MemA.
> 
> In that case it will notify the RDMA subsystem on A that it is trying to
> reclaim a certain page.
> 
> The RDMA subsystem on A will then send a message to B notifying it that 
> the memory will be going away. B now has to remove its corresponding page 
> from memory (and drop the entry in PtB) and confirm to A that this has 
> happened. RDMA traffic is then stopped for this page. Then A can also 
> remove its page, the corresponding entry in PtA and the page is reclaimed 
> or pushed out to swap completing the page reclaim.
> 
> If either side then accesses the page again then the reverse process 
> happens. If B accesses the page then it will first of all incur a page 
> fault because the entry in PtB is missing. The fault will then cause a 
> message to be sent to A to establish the page again. A will create an 
> entry in PtA and will then confirm to B that the page was established. At 
> that point RDMA operations can occur again.

The notifier-reclaim cycle you describe is akin to the out-of-band
pin-unpin control messages used by existing communication libraries.
Also, I think what you are proposing can have problems at scale -- A
must keep track of all of the (potentially many) systems accessing memA and
cooperatively get an agreement from all these systems before reclaiming
the page.

When messages are sufficiently large, the control messaging necessary
to setup/teardown the regions is relatively small.  This is not
always the case however -- in programming models that employ smaller
messages, the one-sided nature of RDMA is the most attractive part of
it.  

> So the whole scheme does not really need a hardware page table in the RDMA 
> hardware. The page tables of the two systems A and B are sufficient.
> 
> The scheme can also be applied to a larger range than only a single page. 
> The RDMA subsystem could tear down a large section when reclaim is 
> pushing on it and then reestablish it as needed.

Nothing any communication/runtime system can't already do today.  The
point of RDMA demand paging is enabling the possibility of using RDMA
without the implied synchronization -- the optimistic part.  Using
the notifiers to duplicate existing memory region handling for RDMA
hardware that doesn't have HW page tables is possible but undermines
the more important consumer of your patches in my opinion.

One other area that has not been brought up yet (I think) is the
applicability of notifiers in letting users know when pinned memory
is reclaimed by the kernel.  This is useful when a lower-level
library employs lazy deregistration strategies on memory regions that
are subsequently released to the kernel via the application's use of
munmap or sbrk.  Ohio Supercomputing Center has work in this area but
a generalized approach in the kernel would certainly be welcome.


    . . christian

-- 
christian.bell@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  3:56                                                     ` Patrick Geoffray
@ 2008-02-13  4:26                                                       ` Jason Gunthorpe
  2008-02-13  4:47                                                         ` Patrick Geoffray
  0 siblings, 1 reply; 119+ messages in thread
From: Jason Gunthorpe @ 2008-02-13  4:26 UTC (permalink / raw)
  To: Patrick Geoffray
  Cc: Christoph Lameter, Roland Dreier, linux-kernel, linux-mm, general

[mangled CC list trimmed]
On Tue, Feb 12, 2008 at 10:56:26PM -0500, Patrick Geoffray wrote:

> Jason Gunthorpe wrote:
>> I don't know much about Quadrics, but I would be hesitant to lump it
>> in too much with these RDMA semantics. Christian's comments sound like
>> they operate closer to what you described and that is why the have an
>> existing patch set. I don't know :)
>
> The Quadrics folks have been doing RDMA for 10 years, there is a reason why 
> they maintained a patch.

This wasn't meant as a slight against Quadrics, only to point out that
the specific wire protocols used by IB and iwarp are what cause this
limitation; it would be easy to imagine that Quadrics has some
additional twist that can make this easier..

>> What it boils down to is that to implement true removal of pages in a
>> general way the kernel and HCA must either drop packets or stall
>> incoming packets, both are big performance problems - and I can't see
>> many users wanting this. Enterprise style people using SCSI, NFS, etc
>> already have short pin periods and HPC MPI users probably won't care
>> about the VM issues enough to warrent the performance overhead.
>
> This is not true, HPC people do care about the VM issues a lot. Memory
> registration (pinning and translating) is usually too expensive to

I meant that HPC users are unlikely to want to swap active RDMA pages
if this causes a performance cost on normal operations. None of my
comments are meant to imply that lazy de-registration or page migration
are not good things.

Regards,
Jason

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  4:26                                                       ` Jason Gunthorpe
@ 2008-02-13  4:47                                                         ` Patrick Geoffray
  0 siblings, 0 replies; 119+ messages in thread
From: Patrick Geoffray @ 2008-02-13  4:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Lameter, Roland Dreier, linux-kernel, linux-mm, general

Jason Gunthorpe wrote:
> [mangled CC list trimmed]
Thanks, noticed that afterwards.

> This wasn't meant as a slight against Quadrics, only to point out that
> the specific wire protocols used by IB and iwarp are what cause this
> limitation; it would be easy to imagine that Quadrics has some
> additional twist that can make this easier..

The wire protocols are similar, nothing fancy. The specificity of 
Quadrics (and many others) is that they can change the behavior of the 
NIC in firmware, so they adapt to what the OS offers. They had the VM 
notifier support in Tru64 back in the days, they just ported the 
functionality to Linux.

> I meant that HPC users are unlikely to want to swap active RDMA pages
> if this causes a performance cost on normal operations. None of my

Swapping to disk is not a normal operation in HPC; it's going to be 
slow anyway. The main problem for HPC users is not swapping, it's that 
they do not know when a registered page is released to the OS through 
free(), sbrk() or munmap(). Like swapping, they don't expect that it 
will happen often, but they have to handle it gracefully.

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 22:41                                         ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
                                                             ` (2 preceding siblings ...)
  2008-02-13  0:56                                           ` Christoph Lameter
@ 2008-02-13 12:11                                           ` Christoph Raisch
  2008-02-13 19:02                                             ` Christoph Lameter
  3 siblings, 1 reply; 119+ messages in thread
From: Christoph Raisch @ 2008-02-13 12:11 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Andrew Morton, Andrea Arcangeli, avi, a.p.zijlstra,
	Christoph Lameter, daniel.blueman, general, general-bounces,
	Robin Holt, izike, kvm-devel, linux-kernel, linux-mm,
	Rik van Riel, steiner


>  > > Chelsio's T3 HW doesn't support this.


For ehca we currently can't modify a large MR when it has been allocated.
EHCA Hardware expects the pages to be there (MRs must not have "holes").
This is also true for the global MR covering all kernel space.
Therefore we still need the memory to be "pinned" if ib_umem_get() is
called.

So with the current implementation we don't have much use for a notifier.


"It is difficult to make predictions, especially about the future"
Gruss / Regards
Christoph Raisch + Hoang-Nam Nguyen




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 0/6] MMU Notifiers V6
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
                   ` (6 preceding siblings ...)
  2008-02-08 22:23 ` [patch 0/6] MMU Notifiers V6 Andrew Morton
@ 2008-02-13 14:31 ` Jack Steiner
  7 siblings, 0 replies; 119+ messages in thread
From: Jack Steiner @ 2008-02-13 14:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, linux-kernel, linux-mm,
	daniel.blueman

> GRU
> - Simple additional hardware TLB (possibly covering multiple instances of
>   Linux)
> - Needs TLB shootdown when the VM unmaps pages.
> - Determines page address via follow_page (from interrupt context) but can
>   fall back to get_user_pages().
> - No page reference possible since no page status is kept..

I applied the latest mmuops patch to a 2.6.24 kernel & updated the
GRU driver to use it. As far as I can tell, everything works ok.
Although more testing is needed, all current tests of driver functionality
are working on both a system simulator and a hardware simulator.

The driver itself is still a few weeks from being ready to post but I can
send code fragments of the portions related to mmuops or external TLB
management if anyone is interested.


--- jack

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  3:25                                                   ` Jason Gunthorpe
  2008-02-13  3:56                                                     ` Patrick Geoffray
@ 2008-02-13 18:51                                                     ` Christoph Lameter
  2008-02-13 19:51                                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13 18:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> But this isn't how IB or iwarp work at all. What you describe is a
> significant change to the general RDMA operation and requires changes to
> both sides of the connection and the wire protocol.

Yes it may require a separate connection between both sides where a 
kind of VM notification protocol is established to tear these things down and 
set them up again. That is, if there is nothing in the RDMA protocol that
allows a notification to the other side that the mapping is being torn
down.

>  - In RDMA (iwarp and IB versions) the hardware page tables exist to
>    linearize the local memory so the remote does not need to be aware
>    of non-linearities in the physical address space. The main
>    motivation for this is kernel bypass where the user space app wants
>    to instruct the remote side to DMA into memory using user space
>    addresses. Hardware provides the page tables to switch from
>    incoming user space virtual addresses to physical addresses.

s/switch/translate I guess. That is good and those page tables could be 
used for the notification scheme to enable reclaim. But they are optional 
and are maintaining the driver state. The linearization could be 
reconstructed from the kernel page tables on demand.

>    Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
>    for access control and enforcing the lifetime of the mapping.

Well the mapping would have to be on demand to avoid the issues that we 
currently have with pinning. The user API could stay the same. If the 
driver tracks the mappings using the notifier then the VM can make sure 
that the right things happen on exit etc etc.

>    The page tables in the RDMA hardware exist primarily to support
>    this, and not for other reasons. The pinning of pages is one part
>    to support the HW page tables and one part to support the RDMA
>    lifetime rules; the lifetime rules are what cause problems for
>    the VM.

So the driver software can tear down and establish page table 
entries at will? I do not see the problem. The RDMA hardware is one thing, 
the way things are visible to the user another. If the driver can 
establish and remove mappings as needed via RDMA then the user can have 
the illusion of persistent RDMA memory. This is the same as virtual memory 
providing the illusion of a process having lots of memory all for itself.


>  - The wire protocol consists of packets that say 'Write XXX bytes to
>    offset YY in Region RRR'. Creating a region produces the RRR label
>    and currently pins the pages. So long as the RRR label is valid the
>    remote side can issue write packets at any time without any
>    further synchronization. There is no wire level events associated
>    with creating RRR. You can pass RRR to the other machine in any
>    fashion, even using carrier pigeons :)
>  - The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
>    are built on top of it and they specify the lifetime rules and
>    protocol for exchanging RRR.

Well yes of course. What is proposed here is an additional notification 
mechanism (could even be via tcp/udp to simplify things) that would manage 
the mappings at a higher level. The writes would not occur if the mapping 
has not been established.
 
>    This is your step 'A will then send a message to B notifying..'.
>    It simply does not exist in the protocol specifications

Of course. You need to create an additional communication layer to get 
that.

> What it boils down to is that to implement true removal of pages in a
> general way the kernel and HCA must either drop packets or stall
> incoming packets, both are big performance problems - and I can't see
> many users wanting this. Enterprise style people using SCSI, NFS, etc
> already have short pin periods and HPC MPI users probably won't care
> about the VM issues enough to warrent the performance overhead.

True, maybe you cannot do this by simply staying within the bounds of an 
RDMA protocol that is based on page pinning, if the protocol does not 
support a notification to the other side that the mapping is going away. 

If RDMA cannot do this then you would need additional ways of notifying 
the remote side that pages/mappings are invalidated.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  4:09                                                   ` Christian Bell
@ 2008-02-13 19:00                                                     ` Christoph Lameter
  2008-02-13 19:46                                                       ` Christian Bell
  2008-02-13 23:23                                                     ` Pete Wyckoff
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13 19:00 UTC (permalink / raw)
  To: Christian Bell
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Tue, 12 Feb 2008, Christian Bell wrote:

> You're arguing that a HW page table is not needed by describing a use
> case that is essentially what all RDMA solutions already do above the
> wire protocols (all solutions except Quadrics, of course).

The HW page table is not essential to the notification scheme. That the 
RDMA uses the page table for linearization is another issue. A chip could 
just have a TLB cache and look up the entries using the OS page table, f.e.

> > Let's say you have two systems A and B. Each has its memory region, MemA 
> > and MemB. Each side also has page tables for this region, PtA and PtB.
> > If either side then accesses the page again then the reverse process 
> > happens. If B accesses the page then it will first of all incur a page 
> > fault because the entry in PtB is missing. The fault will then cause a 
> > message to be sent to A to establish the page again. A will create an 
> > entry in PtA and will then confirm to B that the page was established. At 
> > that point RDMA operations can occur again.
> 
> The notifier-reclaim cycle you describe is akin to the out-of-band
> pin-unpin control messages used by existing communication libraries.
> Also, I think what you are proposing can have problems at scale -- A
> must keep track of all of the (potentially many systems) of memA and
> cooperatively get an agreement from all these systems before reclaiming
> the page.

Right. We (SGI) have done something like this for a long time with XPmem 
and it scales ok.

> When messages are sufficiently large, the control messaging necessary
> to setup/teardown the regions is relatively small.  This is not
> always the case however -- in programming models that employ smaller
> messages, the one-sided nature of RDMA is the most attractive part of
> it.  

The messaging would only be needed if a process comes under memory 
pressure. As long as there is enough memory nothing like this will occur.

> Nothing any communication/runtime system can't already do today.  The
> point of RDMA demand paging is enabling the possibility of using RDMA
> without the implied synchronization -- the optimistic part.  Using
> the notifiers to duplicate existing memory region handling for RDMA
> hardware that doesn't have HW page tables is possible but undermines
> the more important consumer of your patches in my opinion.

The notifier scheme should integrate into existing memory region 
handling and not cause a duplication. If you already have library layers 
that do this then it should be possible to integrate it.

> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel.  This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk.  Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The driver gets the notifications about memory being reclaimed. The driver 
could then notify user code about the release as well.

Pinned memory currently *cannot* be reclaimed by the kernel. The refcount is 
elevated. This means that the VM tries to remove the mappings and then 
sees that it was not able to remove all references. Then it gives up and 
tries again and again and again.... Thus the potential for livelock.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 12:11                                           ` Christoph Raisch
@ 2008-02-13 19:02                                             ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13 19:02 UTC (permalink / raw)
  To: Christoph Raisch
  Cc: Roland Dreier, Andrew Morton, Andrea Arcangeli, avi,
	a.p.zijlstra, daniel.blueman, general, general-bounces,
	Robin Holt, izike, kvm-devel, linux-kernel, linux-mm,
	Rik van Riel, steiner

On Wed, 13 Feb 2008, Christoph Raisch wrote:

> For ehca we currently can't modify a large MR when it has been allocated.
> EHCA Hardware expects the pages to be there (MRs must not have "holes").
> This is also true for the global MR covering all kernel space.
> Therefore we still need the memory to be "pinned" if ib_umem_get() is
> called.

It cannot be freed and then reallocated? What happens when a process 
exits?


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 19:00                                                     ` Christoph Lameter
@ 2008-02-13 19:46                                                       ` Christian Bell
  2008-02-13 20:32                                                         ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Christian Bell @ 2008-02-13 19:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Wed, 13 Feb 2008, Christoph Lameter wrote:

> Right. We (SGI) have done something like this for a long time with XPmem 
> and it scales ok.

I'd dispute this based on experience developing PGAS language support
on the Altix but more importantly (and less subjectively), I think
that "scales ok" refers to a very specific case.  Sure, pages (and/or
regions) can be large on some systems and the number of systems may
not always be in the thousands but you're still claiming scalability
for a mechanism that essentially logs who accesses the regions.  Then
there's the fact that reclaim becomes a collective communication
operation over all region accessors.  Makes me nervous.

> > When messages are sufficiently large, the control messaging necessary
> > to setup/teardown the regions is relatively small.  This is not
> > always the case however -- in programming models that employ smaller
> > messages, the one-sided nature of RDMA is the most attractive part of
> > it.  
> 
> The messaging would only be needed if a process comes under memory 
> pressure. As long as there is enough memory nothing like this will occur.
> 
> > Nothing any communication/runtime system can't already do today.  The
> > point of RDMA demand paging is enabling the possibility of using RDMA
> > without the implied synchronization -- the optimistic part.  Using
> > the notifiers to duplicate existing memory region handling for RDMA
> > hardware that doesn't have HW page tables is possible but undermines
> > the more important consumer of your patches in my opinion.
> 

> The notifier scheme should integrate into existing memory region 
> handling and not cause a duplication. If you already have library layers 
> that do this then it should be possible to integrate it.

I appreciate that you're trying to make a general case for the
applicability of notifiers on all types of existing RDMA hardware and
wire protocols.  Also, I'm not disagreeing whether a HW page table
is required or not: clearly it's not required to make *some* use of
the notifier scheme.

However, short of providing user-level notifications for pinned pages
that are inadvertently released to the O/S, I don't believe that the
patchset provides any significant added value for the HPC community
that can't optimistically do RDMA demand paging.


    . . christian


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 18:51                                                     ` Christoph Lameter
@ 2008-02-13 19:51                                                       ` Jason Gunthorpe
  2008-02-13 20:36                                                         ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Jason Gunthorpe @ 2008-02-13 19:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Wed, Feb 13, 2008 at 10:51:58AM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> 
> > But this isn't how IB or iwarp work at all. What you describe is a
> > significant change to the general RDMA operation and requires changes to
> > both sides of the connection and the wire protocol.
> 
> Yes it may require a separate connection between both sides where a 
> kind of VM notification protocol is established to tear these things down and 
> set them up again. That is if there is nothing in the RDMA protocol that
> allows a notification to the other side that the mapping is being down 
> down.

Well, yes, you could build this thing you are describing on top of the
RDMA protocol and get some support from some of the hardware - but it
is a new set of protocols and they would need to be implemented in
several places. It is not transparent to userspace and it is not
compatible with existing implementations.

Unfortunately it really has little to do with the drivers - changes,
for instance, need to be made to support this in the user space MPI
libraries. The RDMA ops do not pass through the kernel, userspace
talks directly to the hardware which complicates building any sort of
abstraction.

That is where I think you run into trouble, if you ask the MPI people
to add code to their critical path to support swapping they probably
will not be too interested. At a minimum to support your idea you need
to check on every RDMA if the remote page is mapped... Plus the
overheads Christian was talking about in the OOB channel(s).

Jason

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 19:46                                                       ` Christian Bell
@ 2008-02-13 20:32                                                         ` Christoph Lameter
  2008-02-13 22:44                                                           ` Kanoj Sarcar
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13 20:32 UTC (permalink / raw)
  To: Christian Bell
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Wed, 13 Feb 2008, Christian Bell wrote:

> not always be in the thousands but you're still claiming scalability
> for a mechanism that essentially logs who accesses the regions.  Then
> there's the fact that reclaim becomes a collective communication
> operation over all region accessors.  Makes me nervous.

Well reclaim is not a very fast process (and we usually try to avoid it 
as much as possible for our HPC). Essentially it's only there to allow 
shifts of processing loads and to allow efficient caching of application 
data.

> However, short of providing user-level notifications for pinned pages
> that are inadvertently released to the O/S, I don't believe that the
> patchset provides any significant added value for the HPC community
> that can't optimistically do RDMA demand paging.

We currently also run XPmem with pinning. It's great as long as you just 
run one load on the system. No reclaim ever occurs.

However, if you do things that require lots of allocations etc etc then 
the page pinning can easily lead to livelock if reclaim is finally 
triggered and also strange OOM situations since the VM cannot free any 
pages. So the main issue that is addressed here is reliability of pinned 
page operations. Better VM integration avoids these issues because we can 
unpin on request to deal with memory shortages.



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 19:51                                                       ` Jason Gunthorpe
@ 2008-02-13 20:36                                                         ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13 20:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Roland Dreier, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, linux-kernel, avi, linux-mm, daniel.blueman,
	Robin Holt, general, Andrew Morton, kvm-devel

On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

> Unfortunately it really has little to do with the drivers - changes,
> for instance, need to be made to support this in the user space MPI
> libraries. The RDMA ops do not pass through the kernel, userspace
> talks directly to the hardware which complicates building any sort of
> abstraction.

Ok so the notifiers have to be handed over to the user space library that 
has the function of the device driver here...

> That is where I think you run into trouble, if you ask the MPI people
> to add code to their critical path to support swapping they probably
> will not be too interested. At a minimum to support your idea you need
> to check on every RDMA if the remote page is mapped... Plus the
> overheads Christian was talking about in the OOB channel(s).

You only need to check if a handle has been receiving invalidates. If not 
then you can just go ahead as now. You can use the notifier to take down 
the whole region if any reclaim occurs against it (probably the best and 
simplest approach to implement). Then you mark the handle so that the 
mapping is reestablished before the next operation.
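
A minimal sketch of that fast path, with rdma_handle, the invalidated flag,
rdma_reestablish_region() and rdma_hw_post_write() all made up for
illustration -- the point is only that the common case stays a flag test:

    #include <stddef.h>

    /* Sketch only: all names below are hypothetical. */
    struct rdma_handle {
            int invalidated;     /* set by the notifier when reclaim hit us */
            /* ... existing per-region library/driver state ...             */
    };

    extern void rdma_reestablish_region(struct rdma_handle *h);
    extern int  rdma_hw_post_write(struct rdma_handle *h,
                                   const void *buf, size_t len);

    static int rdma_post_write(struct rdma_handle *h, const void *buf, size_t len)
    {
            if (h->invalidated) {                 /* region was torn down  */
                    rdma_reestablish_region(h);   /* remap before next op  */
                    h->invalidated = 0;
            }
            return rdma_hw_post_write(h, buf, len);  /* unchanged fast path */
    }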



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 20:32                                                         ` Christoph Lameter
@ 2008-02-13 22:44                                                           ` Kanoj Sarcar
  2008-02-13 23:02                                                             ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Kanoj Sarcar @ 2008-02-13 22:44 UTC (permalink / raw)
  To: Christoph Lameter, Christian Bell
  Cc: Jason Gunthorpe, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel


--- Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 13 Feb 2008, Christian Bell wrote:
> 
> > not always be in the thousands but you're still claiming scalability
> > for a mechanism that essentially logs who accesses the regions.  Then
> > there's the fact that reclaim becomes a collective communication
> > operation over all region accessors.  Makes me nervous.
> 
> Well reclaim is not a very fast process (and we usually try to avoid it
> as much as possible for our HPC). Essentially it's only there to allow
> shifts of processing loads and to allow efficient caching of application
> data.
> 
> > However, short of providing user-level notifications for pinned pages
> > that are inadvertently released to the O/S, I don't believe that the
> > patchset provides any significant added value for the HPC community
> > that can't optimistically do RDMA demand paging.
> 
> We currently also run XPmem with pinning. It's great as long as you just
> run one load on the system. No reclaim ever occurs.
> 
> However, if you do things that require lots of allocations etc etc then
> the page pinning can easily lead to livelock if reclaim is finally
> triggered and also strange OOM situations since the VM cannot free any
> pages. So the main issue that is addressed here is reliability of pinned
> page operations. Better VM integration avoids these issues because we can
> unpin on request to deal with memory shortages.

I have a question on the basic need for the mmu
notifier stuff wrt rdma hardware and pinning memory.

It seems that the need is to solve potential memory
shortage and overcommit issues by being able to
reclaim pages pinned by rdma driver/hardware. Is my
understanding correct?

If I do understand correctly, then why is rdma page
pinning any different than eg mlock pinning? I imagine
Oracle pins lots of memory (using mlock), how come
they do not run into vm overcommit issues?

Are we up against some kind of breaking c-o-w issue
here that is different between mlock and rdma pinning?

Asked another way, why should effort be spent on a
notifier scheme, and rather not on fixing any memory
accounting problems and unifying how pinned pages are
accounted for that get pinned via mlock() or rdma
drivers?

Startup benefits are well understood with the notifier
scheme (ie, not all pages need to be faulted in at
memory region creation time), especially when most of
the memory region is not accessed at all. I would
imagine most of HPC does not work this way though.
Then again, as rdma hardware is applied
(increasingly?) towards apps with short lived
connections, the notifier scheme will help with
startup times.

Kanoj





^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 22:44                                                           ` Kanoj Sarcar
@ 2008-02-13 23:02                                                             ` Christoph Lameter
  2008-02-13 23:43                                                               ` Kanoj Sarcar
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-13 23:02 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Christian Bell, Jason Gunthorpe, Rik van Riel, Andrea Arcangeli,
	a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

> It seems that the need is to solve potential memory
> shortage and overcommit issues by being able to
> reclaim pages pinned by rdma driver/hardware. Is my
> understanding correct?

Correct.

> If I do understand correctly, then why is rdma page
> pinning any different than eg mlock pinning? I imagine
> Oracle pins lots of memory (using mlock), how come
> they do not run into vm overcommit issues?

Mlocked pages are not pinned. They are movable by f.e. page migration and 
will potentially be moved by future memory defrag approaches. Currently 
we have the same issues with mlocked pages as with pinned pages. There is 
work in progress to put mlocked pages onto a different lru so that reclaim 
exempts these pages and more work on limiting the percentage of memory 
that can be mlocked.

> Are we up against some kind of breaking c-o-w issue
> here that is different between mlock and rdma pinning?

Not that I know.

> Asked another way, why should effort be spent on a
> notifier scheme, and rather not on fixing any memory
> accounting problems and unifying how pinned pages are
> accounted for that get pinned via mlock() or rdma
> drivers?

There are efforts underway to account for and limit mlocked pages as 
described above. Page pinning the way it is done by Infiniband through
increasing the page refcount is treated by the VM as a temporary 
condition not as a permanent pin. The VM will continually try to reclaim 
these pages thinking that the temporary usage of the page must cease 
soon. This is why the use of large amounts of pinned pages can lead to 
livelock situations.

If we want to have pinning behavior then we could mark pinned pages 
specially so that the VM will not continually try to evict these pages. We 
could manage them similar to mlocked pages but just not allow page 
migration, memory unplug and defrag to occur on pinned memory. All of 
these would have to fail. With the notifier scheme the device driver 
could be told to get rid of the pinned memory. This would make these 3 
techniques work despite having an RDMA memory section.

> Startup benefits are well understood with the notifier
> scheme (ie, not all pages need to be faulted in at
> memory region creation time), especially when most of
> the memory region is not accessed at all. I would
> imagine most of HPC does not work this way though.

No, for optimal performance you would want to prefault all pages like 
it is now. The notifier scheme would only become relevant in memory 
shortage situations.

> Then again, as rdma hardware is applied (increasingly?) towards apps 
> with short lived connections, the notifier scheme will help with startup 
> times.

The main use of the notifier scheme is for stability and reliability. The 
"pinned" pages become unpinnable on request by the VM. So the VM can work 
itself out of memory shortage situations in cooperation with the 
RDMA logic instead of simply failing.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13  4:09                                                   ` Christian Bell
  2008-02-13 19:00                                                     ` Christoph Lameter
@ 2008-02-13 23:23                                                     ` Pete Wyckoff
  2008-02-14  0:01                                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 119+ messages in thread
From: Pete Wyckoff @ 2008-02-13 23:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christian Bell, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

christian.bell@qlogic.com wrote on Tue, 12 Feb 2008 20:09 -0800:
> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel.  This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk.  Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain.  There is no
motivating application need for it---it is simply a hack around
virtual memory and the lack of full VM support in current hardware.
There are real hardware issues that interact poorly with virtual
memory, as discussed previously in this thread.

The way a messaging cycle goes in IB is:

    register buf
    post send from buf
    wait for completion
    deregister buf
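
In libibverbs terms that cycle looks roughly like the sketch below. The
protection domain, queue pair and completion queue setup are omitted and
assumed to already exist, and error handling is dropped; this is only an
illustration of the reg/post/wait/dereg pattern, not a complete program:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    static void send_once(struct ibv_pd *pd, struct ibv_qp *qp,
                          struct ibv_cq *cq, void *buf, size_t len)
    {
            struct ibv_mr *mr;
            struct ibv_sge sge;
            struct ibv_send_wr wr = { 0 }, *bad_wr;
            struct ibv_wc wc;

            mr = ibv_reg_mr(pd, buf, len,          /* register buf         */
                            IBV_ACCESS_LOCAL_WRITE);  /* (pins the pages)  */

            sge.addr   = (uintptr_t)buf;
            sge.length = len;
            sge.lkey   = mr->lkey;

            wr.sg_list    = &sge;
            wr.num_sge    = 1;
            wr.opcode     = IBV_WR_SEND;
            wr.send_flags = IBV_SEND_SIGNALED;

            ibv_post_send(qp, &wr, &bad_wr);       /* post send from buf   */
            while (ibv_poll_cq(cq, 1, &wc) == 0)   /* wait for completion  */
                    ;
            ibv_dereg_mr(mr);                      /* deregister (unpin)   */
    }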

This tends to get hidden via userspace software libraries into
a single call:

    MPI_send(buf)

Now if you actually do the reg/dereg every time, things are very
slow.  So userspace library writers came up with the idea of caching
registrations:

    if buf is not registered:
	register buf
    post send from buf
    wait for completion

The second time that the app happens to do a send from the same
buffer, it proceeds much faster.  Spatial locality applies here, and
this caching is generally worth it.  Some libraries have schemes to
limit the size of the registration cache too.

But there are plenty of ways to hurt yourself with such a scheme.
The first being a huge pool of unused but registered memory, as the
library doesn't know the app patterns, and it doesn't know the VM
pressure level in the kernel.

There are plenty of subtle ways that this breaks too.  If the
registered buf is removed from the address space via munmap() or
sbrk() or other ways, the mapping and registration are gone, but the
library has no way of knowing that the app just did this.  Sure the
physical page is still there and pinned, but the app cannot get at
it.  Later if new address space arrives at the same virtual address
but a different physical page, the library will mistakenly think it
already has it registered properly, and data is transferred from
this old now-unmapped physical page.

The whole situation is rather ridiculous, but we are quite stuck 
with it for current generation IB and iWarp hardware.  If we can't
have the kernel interact with the device directly, we could at least
manage state in these multiple userspace registration caches.  The
VM could ask for certain (or any) pages to be released, and the
library would respond if they are indeed not in use by the device.
The app itself does not know about pinned regions, and the library
is aware of exactly which regions are potentially in use.
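
Something like the sketch below is what such an interface might let a
registration cache do; the vm_release_notify() entry point and the
dreg_cache_* helpers are entirely hypothetical and exist only to
illustrate the idea:

    #include <stddef.h>

    /* Sketch only: a userspace registration cache reacting to a
     * hypothetical "please release this range" request from the kernel. */
    struct reg_entry;

    extern struct reg_entry *dreg_cache_lookup(void *addr, size_t len);
    extern int  dreg_entry_busy(struct reg_entry *e);   /* in use by the HCA? */
    extern void dreg_cache_evict(struct reg_entry *e);  /* dereg and forget   */

    static int vm_release_notify(void *addr, size_t len)
    {
            struct reg_entry *e = dreg_cache_lookup(addr, len);

            if (e && !dreg_entry_busy(e)) {
                    dreg_cache_evict(e);
                    return 0;     /* pages may now be unpinned and unmapped */
            }
            return -1;            /* still in use; kernel must skip or wait */
    }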

Since the great majority of userspace messaging over IB goes through
middleware like MPI or PGAS languages, and they all have the same
approach to registration caching, this approach could fix the
problem for a big segment of use cases.

More text on the registration caching problem is here:

    http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

with an approach using vm_ops open and close operations in a kernel
module here:

    http://www.osc.edu/~pw/dreg/

There is a place for VM notifiers in RDMA messaging, but not in
talking to devices, at least not the current set.  If you can define
a reasonable userspace interface for VM notifiers, libraries can
manage registration caches more efficiently, letting the kernel
unmap pinned pages as it likes.

		-- Pete


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 23:02                                                             ` Christoph Lameter
@ 2008-02-13 23:43                                                               ` Kanoj Sarcar
  2008-02-13 23:48                                                                 ` Jesse Barnes
                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 119+ messages in thread
From: Kanoj Sarcar @ 2008-02-13 23:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christian Bell, Jason Gunthorpe, Rik van Riel, Andrea Arcangeli,
	a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel


--- Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 13 Feb 2008, Kanoj Sarcar wrote:
> 
> > It seems that the need is to solve potential memory shortage and
> > overcommit issues by being able to reclaim pages pinned by rdma
> > driver/hardware. Is my understanding correct?
> 
> Correct.
> 
> > If I do understand correctly, then why is rdma page pinning any
> > different than eg mlock pinning? I imagine Oracle pins lots of memory
> > (using mlock), how come they do not run into vm overcommit issues?
> 
> Mlocked pages are not pinned. They are movable by f.e. page migration
> and will potentially be moved by future memory defrag approaches.
> Currently we have the same issues with mlocked pages as with pinned
> pages. There is work in progress to put mlocked pages onto a different
> lru so that reclaim exempts these pages and more work on limiting the
> percentage of memory that can be mlocked.
> 
> > Are we up against some kind of breaking c-o-w issue here that is
> > different between mlock and rdma pinning?
> 
> Not that I know.
> 
> > Asked another way, why should effort be spent on a notifier scheme,
> > and rather not on fixing any memory accounting problems and unifying
> > how pinned pages are accounted for that get pinned via mlock() or
> > rdma drivers?
> 
> There are efforts underway to account for and limit mlocked pages as
> described above. Page pinning the way it is done by Infiniband through
> increasing the page refcount is treated by the VM as a temporary
> condition, not as a permanent pin. The VM will continually try to
> reclaim these pages thinking that the temporary usage of the page must
> cease soon. This is why the use of large amounts of pinned pages can
> lead to livelock situations.

Oh ok, yes, I did see the discussion on this; sorry I
missed it. I do see what notifiers bring to the table
now (without endorsing it :-)).

An orthogonal question is this: is IB/rdma the only
"culprit" that elevates page refcounts? Are there no
other subsystems which do a similar thing?

The example I am thinking about is rawio (Oracle's
mlock'ed SHM regions are handed to rawio, isn't it?).
My understanding of how rawio works in Linux is quite
dated though ...

Kanoj

> 
> If we want to have pinning behavior then we could mark pinned pages
> specially so that the VM will not continually try to evict these pages.
> We could manage them similar to mlocked pages but just not allow page
> migration, memory unplug and defrag to occur on pinned memory. All of
> these would have to fail. With the notifier scheme the device driver
> could be told to get rid of the pinned memory. This would make these 3
> techniques work despite having an RDMA memory section.
> 
> > Startup benefits are well understood with the notifier scheme (ie,
> > not all pages need to be faulted in at memory region creation time),
> > especially when most of the memory region is not accessed at all. I
> > would imagine most of HPC does not work this way though.
> 
> No, for optimal performance you would want to prefault all pages like
> it is now. The notifier scheme would only become relevant in memory
> shortage situations.
> 
> > Then again, as rdma hardware is applied (increasingly?) towards apps
> > with short lived connections, the notifier scheme will help with
> > startup times.
> 
> The main use of the notifier scheme is for stability and reliability.
> The "pinned" pages become unpinnable on request by the VM. So the VM
> can work itself out of memory shortage situations in cooperation with
> the RDMA logic instead of simply failing.




^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: Demand paging for memory regions
  2008-02-13 23:43                                                               ` Kanoj Sarcar
@ 2008-02-13 23:48                                                                 ` Jesse Barnes
  2008-02-14  0:56                                                                 ` [ofa-general] " Andrea Arcangeli
  2008-02-14 19:35                                                                 ` Christoph Lameter
  2 siblings, 0 replies; 119+ messages in thread
From: Jesse Barnes @ 2008-02-13 23:48 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Christoph Lameter, Christian Bell, Jason Gunthorpe, Rik van Riel,
	Andrea Arcangeli, a.p.zijlstra, izike, Roland Dreier, steiner,
	linux-kernel, avi, linux-mm, daniel.blueman, Robin Holt, general,
	Andrew Morton, kvm-devel, Dave Airlie

On Wednesday, February 13, 2008 3:43 pm Kanoj Sarcar wrote:
> Oh ok, yes, I did see the discussion on this; sorry I
> missed it. I do see what notifiers bring to the table
> now (without endorsing it :-)).
>
> An orthogonal question is this: is IB/rdma the only
> "culprit" that elevates page refcounts? Are there no
> other subsystems which do a similar thing?
>
> The example I am thinking about is rawio (Oracle's
> mlock'ed SHM regions are handed to rawio, isn't it?).
> My understanding of how rawio works in Linux is quite
> dated though ...

We're doing something similar in the DRM these days...  We need big chunks of 
memory to be pinned so that the GPU can operate on them, but when the 
operation completes we can allow them to be swappable again.  I think with 
the current implementation, allocations are always pinned, but we'll 
definitely want to change that soon.

Dave?

Jesse

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 23:23                                                     ` Pete Wyckoff
@ 2008-02-14  0:01                                                       ` Jason Gunthorpe
  2008-02-27 22:11                                                         ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Jason Gunthorpe @ 2008-02-14  0:01 UTC (permalink / raw)
  To: Pete Wyckoff
  Cc: Christoph Lameter, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Wed, Feb 13, 2008 at 06:23:08PM -0500, Pete Wyckoff wrote:
> christian.bell@qlogic.com wrote on Tue, 12 Feb 2008 20:09 -0800:
> > One other area that has not been brought up yet (I think) is the
> > applicability of notifiers in letting users know when pinned memory
> > is reclaimed by the kernel.  This is useful when a lower-level
> > library employs lazy deregistration strategies on memory regions that
> > are subsequently released to the kernel via the application's use of
> > munmap or sbrk.  Ohio Supercomputing Center has work in this area but
> > a generalized approach in the kernel would certainly be welcome.
> 
> The whole need for memory registration is a giant pain.  There is no
> motivating application need for it---it is simply a hack around
> virtual memory and the lack of full VM support in current hardware.
> There are real hardware issues that interact poorly with virtual
> memory, as discussed previously in this thread.

Well, the registrations also exist to provide protection against
rogue/faulty remotes, but for the purposes of MPI that is probably not
important.

Here is a thought.. Some RDMA hardware can change the page tables on
the fly. What if the kernel had a mechanism to dynamically maintain a
full registration of the process's entire address space ('mlocked' but
able to be migrated)? MPI would never need to register a buffer, and
all the messy cases with munmap/sbrk/etc go away - the risk is that
other MPI nodes can randomly scribble all over the process :)

Christoph: It seemed to me you were first talking about
freeing/swapping/faulting RDMA'able pages - but would pure migration
as a special hardware-supported case be useful, like Caitlin suggested?

Regards,
Jason

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 23:43                                                               ` Kanoj Sarcar
  2008-02-13 23:48                                                                 ` Jesse Barnes
@ 2008-02-14  0:56                                                                 ` Andrea Arcangeli
  2008-02-14 19:35                                                                 ` Christoph Lameter
  2 siblings, 0 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2008-02-14  0:56 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Christoph Lameter, Christian Bell, Jason Gunthorpe, Rik van Riel,
	a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

Hi Kanoj,

On Wed, Feb 13, 2008 at 03:43:17PM -0800, Kanoj Sarcar wrote:
> Oh ok, yes, I did see the discussion on this; sorry I
> missed it. I do see what notifiers bring to the table
> now (without endorsing it :-)).

I'm not really sure livelocks are the big issue here.

I'm running N 1G VMs on a 1G ram system, with (N-1)G swapped
out. Combining this with auto-ballooning, rss limiting, and ksm ram
sharing provides really advanced and lowlevel virtualization VM
capabilities to the linux kernel, while at the same time guaranteeing
no oom failures as long as the guest pages are lower than ram+swap
(just slower runtime if too many pages are unshared or if the balloons
are deflated etc..).

Swapping the virtual machine in the host may be more efficient than
having the guest swapping over a virtual swap paravirt storage for
example. As more management features are added admins will gain more
experience in handling those new features and they'll find what's best
for them. mmu notifiers and real reliable swapping are the enabler for
those more advanced VM features.

oom livelocks wouldn't happen anyway with KVM as long as the maximal
number of guest physical pages is lower than RAM.

> An orthogonal question is this: is IB/rdma the only
> "culprit" that elevates page refcounts? Are there no
> other subsystems which do a similar thing?
> 
> The example I am thinking about is rawio (Oracle's
> mlock'ed SHM regions are handed to rawio, isn't it?).
> My understanding of how rawio works in Linux is quite
> dated though ...

rawio in flight I/O shall be limited. As long as each task can't pin
more than X ram, and the ram is released when the task is oom killed,
and the first get_user_pages/alloc_pages/slab_alloc that returns
-ENOMEM takes an oom fail path that returns failure to userland,
everything is ok.

Even with IB deadlock could only happen if IB would allow unlimited
memory to be pinned down by unprivileged users.

If IB is insecure and DoSable without mmu notifiers, then I'm not sure
how enabling swapping of the IB memory could be enough to fix the
DoS. Keep in mind that even tmpfs can't safely allow all of ram+swap
to be allocated in a tmpfs file (even though tmpfs file storage
includes swap and not only ram). Pinning the whole ram+swap with tmpfs
livelocks the same way as pinning the whole ram with ramfs. So if you
add mmu notifier support to IB, you only need to RDMA an area as large
as ram+swap to livelock again as before... no difference at all.

I don't think livelocks have anything to do with mmu notifiers (other
than deferring the livelock to the "swap+ram" point of no return
instead of the current "ram" point of no return). Livelocks have to be
solved the usual way: handling alloc_pages/get_user_pages/slab
allocation failures with a fail path that returns to userland and
allows the ram to be released if the task was selected for
oom-killage.

The real benefit of the mmu notifiers for IB would be to allow the
rdma region to be larger than RAM without triggering the oom
killer (or without triggering a livelock if it's DoSable, but then the
livelock would need fixing, i.e. converting it into a regular oom-kill
by some other means not related to the mmu-notifier; it's really an
orthogonal problem).

So suppose you have an MPI simulation that requires a 10G array and
you have only 1G of ram; then you can rdma over 10G as if you had 10G
of ram. Things will perform ok only if there's some huge locality in
the computations. For virtualization it's orders of magnitude more
useful than for computer clusters, but certain simulations really do
swap, so I don't exclude that certain RDMA apps will also need this
(dunno about IB).

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-12 23:14                                           ` Felix Marti
  2008-02-13  0:57                                             ` Christoph Lameter
@ 2008-02-14 15:09                                             ` Steve Wise
  2008-02-14 15:53                                               ` Robin Holt
  2008-02-14 19:39                                               ` Christoph Lameter
  1 sibling, 2 replies; 119+ messages in thread
From: Steve Wise @ 2008-02-14 15:09 UTC (permalink / raw)
  To: Felix Marti
  Cc: Roland Dreier, Christoph Lameter, Rik van Riel, steiner,
	Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

Felix Marti wrote:

> 
> That is correct, not a change we can make for T3. We could, in theory,
> deal with changing mappings though. The change would need to be
> synchronized though: the VM would need to tell us which mappings were
> about to change and the driver would then need to disable DMA to/from
> it, do the change and resume DMA.
> 

Note that for T3, this involves suspending _all_ rdma connections that 
are in the same PD as the MR being remapped.  This is because the driver 
doesn't know who the application advertised the rkey/stag to.  So 
without that knowledge, all connections that _might_ rdma into the MR 
must be suspended.  If the MR was only setup for local access, then the 
driver could track the connections with references to the MR and only 
quiesce those connections.

Point being, it will stop probably all connections that an application 
is using (assuming the application uses a single PD).


Steve.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 15:09                                             ` Steve Wise
@ 2008-02-14 15:53                                               ` Robin Holt
  2008-02-14 16:23                                                 ` Steve Wise
  2008-02-14 19:39                                               ` Christoph Lameter
  1 sibling, 1 reply; 119+ messages in thread
From: Robin Holt @ 2008-02-14 15:53 UTC (permalink / raw)
  To: Steve Wise
  Cc: Felix Marti, Roland Dreier, Christoph Lameter, Rik van Riel,
	steiner, Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel,
	avi, linux-mm, daniel.blueman, Robin Holt, general,
	Andrew Morton, kvm-devel

On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
> Note that for T3, this involves suspending _all_ rdma connections that are 
> in the same PD as the MR being remapped.  This is because the driver 
> doesn't know who the application advertised the rkey/stag to.  So without 

Is there a reason the driver cannot track these?

> Point being, it will stop probably all connections that an application is 
> using (assuming the application uses a single PD).

It seems like the need to not stop all would be a compelling enough reason
to modify the driver to track which processes have received the rkey/stag.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 15:53                                               ` Robin Holt
@ 2008-02-14 16:23                                                 ` Steve Wise
  2008-02-14 17:48                                                   ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Steve Wise @ 2008-02-14 16:23 UTC (permalink / raw)
  To: Robin Holt
  Cc: Felix Marti, Roland Dreier, Christoph Lameter, Rik van Riel,
	steiner, Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel,
	avi, linux-mm, daniel.blueman, general, Andrew Morton, kvm-devel

Robin Holt wrote:
> On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
>> Note that for T3, this involves suspending _all_ rdma connections that are 
>> in the same PD as the MR being remapped.  This is because the driver 
>> doesn't know who the application advertised the rkey/stag to.  So without 
> 
> Is there a reason the driver can not track these.
> 

Because advertising of a MR (ie telling the peer about your rkey/stag, 
offset and length) is application-specific and can be done out of band, 
or in band as simple SEND/RECV payload. Either way, the driver has no 
way of tracking this because the protocol used is application-specific.

>> Point being, it will stop probably all connections that an application is 
>> using (assuming the application uses a single PD).
> 
> It seems like the need to not stop all would be a compelling enough reason
> to modify the driver to track which processes have received the rkey/stag.
> 

Yes, _if_ the driver could track this.

And _if_ the rdma API and paradigm were such that the kernel/driver could 
keep track, then remote revocations of MR tags could be supported.

Stevo

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 16:23                                                 ` Steve Wise
@ 2008-02-14 17:48                                                   ` Caitlin Bestler
  0 siblings, 0 replies; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-14 17:48 UTC (permalink / raw)
  To: Steve Wise
  Cc: Robin Holt, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, kvm-devel,
	linux-mm, daniel.blueman, general, Andrew Morton,
	Christoph Lameter

On Thu, Feb 14, 2008 at 8:23 AM, Steve Wise <swise@opengridcomputing.com> wrote:
> Robin Holt wrote:
>  > On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
>  >> Note that for T3, this involves suspending _all_ rdma connections that are
>  >> in the same PD as the MR being remapped.  This is because the driver
>  >> doesn't know who the application advertised the rkey/stag to.  So without
>  >
>  > Is there a reason the driver can not track these.
>  >
>
>  Because advertising of a MR (ie telling the peer about your rkey/stag,
>  offset and length) is application-specific and can be done out of band,
>  or in band as simple SEND/RECV payload. Either way, the driver has no
>  way of tracking this because the protocol used is application-specific.
>
>

I fully agree. If there is one important thing about RDMA and other fastpath
solutions that must be understood, it is that the driver does not see the
payload. This is a fundamental strength, but it means that you have
to identify what, if any, intercept points there are in advance.

You also raise a good point on the scope of any suspend/resume API.
Device reporting of this capability would not be a simple boolean, but
more of a suspend/resume scope. A minimal scope would be any
connection that actually attempts to use the suspended MR. Slightly
wider would be any connection *allowed* to use the MR, which could
expand all the way to any connection under the same PD. Conceivably
I could imagine an RDMA device reporting that it could support suspend/
resume, but only at the scope of the entire device.

But even at such a wide scope, suspend/resume could be useful to
a Memory Manager. The pages could be fully migrated to the new
location, and the only work that was still required during the critical
suspend/resume region was to actually shift to the new map. That
might be short enough that not accepting *any* incoming RDMA
packet would be acceptable.

And if the goal is to replace a memory card the alternative might
be migrating the applications to other physical servers, which would
mean a much longer period of not accepting incoming RDMA packets.

But the broader question is what the goal is here. Allowing memory to
be shuffled is valuable, and perhaps even ultimately a requirement for
high availability systems. RDMA and other direct-access APIs should
be evolving their interfaces to accommodate these needs.

Oversubscribing memory is a totally different matter. If an application
is working with memory that is oversubscribed by a factor of 2 or more
can it really benefit from zero-copy direct placement? At first glance I
can't see what RDMA could be bringing of value when the overhead of
swapping is going to be that large.

If it really does make sense, then explicitly registering the portion of
memory that should be enabled to receive incoming traffic while the
application is swapped out actually makes sense.

Current Memory Registration methods force applications to either
register too much or too often. They register too much when the cost
of registration is high, and the application responds by registering its
entire buffer pool permanently. This is a problem when it overstates
the amount of memory that the application needs to have resident,
or when the device imposes limits on the size of memory maps that
it can know. The alternative is to register too often, that is on a
per-operation basis.

To me that suggests the solutions lie in making it more reasonable
to register more memory, or in making it practical to register memory
on-the-fly on a per-operation basis with low enough overhead that
applications don't feel the need to build elaborate registration caching
schemes.

As has been pointed out a few times in this thread, the RDMA and
transport layers simply do not have enough information to know which
portion of registered memory *really* had to be registered. So any
back-pressure scheme where the Memory Manager is asking for
pinned memory to be "given back" would have to go all the way to
the application. Only the application knows what it is "really" using.

I also suspect that most applications that are interested in using
RDMA would rather be told they can allocate 200M indefinitely
(and with real memory backing it) than be given 1GB of virtual
memory that is backed by 200-300M of physical memory,
especially if it meant dealing with memory pressure upcalls.

>  >> Point being, it will stop probably all connections that an application is
>  >> using (assuming the application uses a single PD).
>  >
>  > It seems like the need to not stop all would be a compelling enough reason
>  > to modify the driver to track which processes have received the rkey/stag.
>  >
>
>  Yes, _if_ the driver could track this.
>
>  And _if_ the rdma API and paradigm was such that the kernel/driver could
>  keep track, then remote revokations of MR tags could be supported.
>
>  Stevo
>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-13 23:43                                                               ` Kanoj Sarcar
  2008-02-13 23:48                                                                 ` Jesse Barnes
  2008-02-14  0:56                                                                 ` [ofa-general] " Andrea Arcangeli
@ 2008-02-14 19:35                                                                 ` Christoph Lameter
  2 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-14 19:35 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Christian Bell, Jason Gunthorpe, Rik van Riel, Andrea Arcangeli,
	a.p.zijlstra, izike, Roland Dreier, steiner, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

> Oh ok, yes, I did see the discussion on this; sorry I
> missed it. I do see what notifiers bring to the table
> now (without endorsing it :-)).
> 
> An orthogonal question is this: is IB/rdma the only
> "culprit" that elevates page refcounts? Are there no
> other subsystems which do a similar thing?

Yes there are actually two projects by SGI that also ran into the same 
issue that motivated the work on this. One is XPmem which allows 
sharing of process memory between different Linux instances and then 
there is the GRU which is a kind of DMA engine. Then there is KVM and 
probably multiple other drivers.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 15:09                                             ` Steve Wise
  2008-02-14 15:53                                               ` Robin Holt
@ 2008-02-14 19:39                                               ` Christoph Lameter
  2008-02-14 20:17                                                 ` Caitlin Bestler
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-14 19:39 UTC (permalink / raw)
  To: Steve Wise
  Cc: Felix Marti, Roland Dreier, Rik van Riel, steiner,
	Andrea Arcangeli, a.p.zijlstra, izike, linux-kernel, avi,
	linux-mm, daniel.blueman, Robin Holt, general, Andrew Morton,
	kvm-devel

On Thu, 14 Feb 2008, Steve Wise wrote:

> Note that for T3, this involves suspending _all_ rdma connections that are in
> the same PD as the MR being remapped.  This is because the driver doesn't know
> who the application advertised the rkey/stag to.  So without that knowledge,
> all connections that _might_ rdma into the MR must be suspended.  If the MR
> was only setup for local access, then the driver could track the connections
> with references to the MR and only quiesce those connections.
> 
> Point being, it will stop probably all connections that an application is
> using (assuming the application uses a single PD).

Right but if the system starts reclaiming pages of the application then we 
have a memory shortage. So the user should address that by not running 
other apps concurrently. The stopping of all connections is still better 
than the VM getting into major trouble. And the stopping of connections in 
order to move the process memory into a more advantageous memory location 
(f.e. using page migration) or stopping of connections in order to be able 
to move the process memory out of a range of failing memory is certainly 
good.



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 19:39                                               ` Christoph Lameter
@ 2008-02-14 20:17                                                 ` Caitlin Bestler
  2008-02-14 20:20                                                   ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-14 20:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steve Wise, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Thu, Feb 14, 2008 at 11:39 AM, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 14 Feb 2008, Steve Wise wrote:
>
>  > Note that for T3, this involves suspending _all_ rdma connections that are in
>  > the same PD as the MR being remapped.  This is because the driver doesn't know
>  > who the application advertised the rkey/stag to.  So without that knowledge,
>  > all connections that _might_ rdma into the MR must be suspended.  If the MR
>  > was only setup for local access, then the driver could track the connections
>  > with references to the MR and only quiesce those connections.
>  >
>  > Point being, it will stop probably all connections that an application is
>  > using (assuming the application uses a single PD).
>
>  Right but if the system starts reclaiming pages of the application then we
>  have a memory shortage. So the user should address that by not running
>  other apps concurrently. The stopping of all connections is still better
>  than the VM getting into major trouble. And the stopping of connections in
>  order to move the process memory into a more advantageous memory location
>  (f.e. using page migration) or stopping of connections in order to be able
>  to move the process memory out of a range of failing memory is certainly
>  good.
>

In that spirit, there are two important aspects of a suspend/resume API that
would enable the memory manager to solve problems most effectively:

1) The device should be allowed flexibility to extend the scope of the
   suspend to what it is capable of implementing -- rather than being
   forced to say that it does not support suspend/resume merely because
   it does so at a different granularity.

2) It is very important that users of this API understand that it is
   only the RDMA device handling of incoming packets and WQEs that is
   being suspended. The peers are not suspended by this API, or even
   told that this end is suspending. Unless the suspend is kept
   *extremely* short there will be adverse impacts. And "short" here is
   measured in network terms, not human terms. The blink of an eye is
   *way* too long. Any external dependencies between "suspend" and
   "resume" will probably mean that things will not work, especially if
   the external entities involve a disk drive.

So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover
swapping out pages so they can be reallocated is an exercise in futility. By the
time you resume the connections will be broken or at the minimum damaged.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 20:17                                                 ` Caitlin Bestler
@ 2008-02-14 20:20                                                   ` Christoph Lameter
  2008-02-14 22:43                                                     ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-14 20:20 UTC (permalink / raw)
  To: Caitlin Bestler
  Cc: Steve Wise, Rik van Riel, steiner, Andrea Arcangeli,
	a.p.zijlstra, izike, Roland Dreier, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Thu, 14 Feb 2008, Caitlin Bestler wrote:

> So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover
> swapping out pages so they can be reallocated is an exercise in futility. By the
> time you resume the connections will be broken or at the minimum damaged.

The connections would then have to be torn down before swap out and would 
have to be reestablished after the pages have been brought back from swap.
 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 20:20                                                   ` Christoph Lameter
@ 2008-02-14 22:43                                                     ` Caitlin Bestler
  2008-02-14 22:48                                                       ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-14 22:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

On Thu, Feb 14, 2008 at 12:20 PM, Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 14 Feb 2008, Caitlin Bestler wrote:
>
>  > So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover
>  > swapping out pages so they can be reallocated is an exercise in futility. By the
>  > time you resume the connections will be broken or at the minimum damaged.
>
>  The connections would then have to be torn down before swap out and would
>  have to be reestablished after the pages have been brought back from swap.
>
>
I have no problem with that, as long as the application layer is responsible for
tearing down and re-establishing the connections. The RDMA/transport layers
are incapable of tearing down and re-establishing a connection transparently
because connections need to be approved above the RDMA layer.

Further, the teardown will have visible artifacts that the application
must deal with, such as flushed Recv WQEs.

The model is still: the RDMA device will do X and will not worry about Y.
The reasons for not worrying about Y could be that the suspend will be very
short, or that other mechanisms have taken care of all the Ys independently.

For example, an HPC cluster that suspended the *entire* cluster would not
have to worry about dropped packets.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 22:43                                                     ` Caitlin Bestler
@ 2008-02-14 22:48                                                       ` Christoph Lameter
  2008-02-15  1:26                                                         ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-14 22:48 UTC (permalink / raw)
  To: Caitlin Bestler; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

On Thu, 14 Feb 2008, Caitlin Bestler wrote:

> I have no problem with that, as long as the application layer is responsible for
> tearing down and re-establishing the connections. The RDMA/transport layers
> are incapable of tearing down and re-establishing a connection transparently
> because connections need to be approved above the RDMA layer.

I am not that familiar with the RDMA layers but it seems that RDMA has 
a library that does device driver like things right? So the logic would 
best fit in there I guess.

If you combine mlock with the mmu notifier then you can actually 
guarantee that a certain memory range will not be swapped out. The 
notifier will then only be called if the memory range will need to be 
moved for page migration, memory unplug etc etc. There may be a limit on 
the percentage of memory that you can mlock in the future. This may be 
done to guarantee that the VM still has memory to work with.
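
In userspace terms that combination would look roughly like the sketch
below; mlock() expresses the "keep this range resident" requirement, and
the commented-out register_for_rdma() is a hypothetical stand-in for
whatever registration call the RDMA library provides:

    #include <sys/mman.h>

    /* Allocate a buffer that is guaranteed to stay resident (but not
     * guaranteed to keep the same physical pages). */
    void *setup_buffer(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;

        /* The VM keeps the range resident; the pages may still be *moved*
         * (migration, defrag, unplug), which is exactly what the mmu
         * notifier would report to the RDMA driver. */
        if (mlock(buf, len)) {
            munmap(buf, len);
            return NULL;
        }

        /* register_for_rdma(buf, len);   hypothetical registration call */
        return buf;
    }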



^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-14 22:48                                                       ` Christoph Lameter
@ 2008-02-15  1:26                                                         ` Caitlin Bestler
  2008-02-15  2:37                                                           ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-15  1:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel



> -----Original Message-----
> From: Christoph Lameter [mailto:clameter@sgi.com]
> Sent: Thursday, February 14, 2008 2:49 PM
> To: Caitlin Bestler
> Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org;
> general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net
> Subject: Re: [ofa-general] Re: Demand paging for memory regions
> 
> On Thu, 14 Feb 2008, Caitlin Bestler wrote:
> 
> > I have no problem with that, as long as the application layer is
> > responsible for tearing down and re-establishing the connections.
> > The RDMA/transport layers are incapable of tearing down and
> > re-establishing a connection transparently because connections need
> > to be approved above the RDMA layer.
> 
> I am not that familiar with the RDMA layers but it seems that RDMA has
> a library that does device driver like things right? So the logic would
> best fit in there I guess.
> 
> If you combine mlock with the mmu notifier then you can actually
> guarantee that a certain memory range will not be swapped out. The
> notifier will then only be called if the memory range will need to be
> moved for page migration, memory unplug etc etc. There may be a limit
> on the percentage of memory that you can mlock in the future. This may
> be done to guarantee that the VM still has memory to work with.
> 

The problem is that with existing APIs, or even slightly modified APIs,
the RDMA layer will not be able to figure out which connections need to
be "interrupted" in order to deal with which memory suspensions.

Further, because any request for a new connection will be handled by
the remote *application layer* peer, there is no way for the two RDMA
layers to agree to covertly tear down and re-establish the connection.
Nor really should there be; connections should be approved by OS layer
networking controls. RDMA should not be able to tell the network stack,
"trust me, you don't have to check if this connection is legitimate".

Another example: if you terminate a connection, pending receive
operations complete *to the user* in a Completion Queue. Those
completions are NOT seen by the RDMA layer, and especially not by the
Connection Manager. It has absolutely no way to repost them
transparently to the same connection when the connection is
re-established.

Even worse, some portions of a receive operation might have been placed
in the receive buffer and acknowledged to the remote peer. But there is
no mechanism to report this fact in the CQE. A receive operation that
is aborted is aborted. There is no concept of partial success. Therefore
you cannot covertly terminate a connection mid-operation and covertly
re-establish it later. Data will be lost, it will no longer be a
reliable connection, and therefore it needs to be torn down anyway.

The RDMA layers also cannot tell the other side not to transmit. Flow
control is the responsibility of the application layer, not RDMA.

What the RDMA layer could do is this: once you tell it to suspend a
given memory region, it can either tell you that it doesn't know how to
do that, or it can instruct the device to stop processing a set of
connections so that all access to the given Memory Region ceases. When
you resume, it can guarantee that it is no longer using any cached
older mappings for the memory region (assuming it was capable of doing
the suspend), and then, because RDMA connections are reliable,
everything will recover unless the connection timed out. The chance
that it will time out is probably low, but the chance that the
underlying connection will be in slow start or equivalent is much
higher.

So any solution that requires the upper layers to suspend operations
for a brief bit will require explicit interaction with those layers.
No RDMA layer can perform the sleight of hand tricks that you seem
to want it to perform.

At the RDMA layer the best you could get is very brief suspensions
for the purpose of *re-arranging* memory, not of reducing the amount
of registered memory. If you need to reduce the amount of registered
memory then you have to talk to the application. Discussions on making
it easier for the application to trim a memory region dynamically might
be in order, but you will not work around the fact that the application
layer needs to determine what pages are registered. And they would
really prefer just to be told how much memory they can have up front;
they can figure out how to deal with that amount of memory on their own.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15  1:26                                                         ` Caitlin Bestler
@ 2008-02-15  2:37                                                           ` Christoph Lameter
  2008-02-15 18:09                                                             ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-15  2:37 UTC (permalink / raw)
  To: Caitlin Bestler; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

On Thu, 14 Feb 2008, Caitlin Bestler wrote:

> So any solution that requires the upper layers to suspend operations
> for a brief bit will require explicit interaction with those layers.
> No RDMA layer can perform the sleight of hand tricks that you seem
> to want it to perform.

Looks like it has to be up there right.
 
> At the RDMA layer the best you could get is very brief suspensions for 
> the purpose of *re-arranging* memory, not of reducing the amount of 
> registered memory. If you need to reduce the amount of registered memory 
> then you have to talk to the application. Discussions on making it 
> easier for the application to trim a memory region dynamically might be 
> in order, but you will not work around the fact that the application 
> layer needs to determine what pages are registered. And they would 
> really prefer just to be told how much memory they can have up front, 
> they can figure out how to deal with that amount of memory on their own.

What does it mean that the "application layer has to determine what 
pages are registered"? The application does not know which of its pages 
are currently in memory. It can only force these pages to stay in memory 
if they are mlocked.
 
 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15  2:37                                                           ` Christoph Lameter
@ 2008-02-15 18:09                                                             ` Caitlin Bestler
  2008-02-15 18:45                                                               ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-15 18:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

Christoph Lameter asked:
> 
> What does it mean that the "application layer has to determine what
> pages are registered"? The application does not know which of its
> pages are currently in memory. It can only force these pages to stay
> in memory if they are mlocked.
> 

An application that advertises an RDMA accessible buffer
to a remote peer *does* have to know that its pages *are*
currently in memory.

The application does *not* need for the virtual-to-physical
mapping of those pages to be frozen for the lifespan of the
Memory Region. But it is issuing an invitation to its peer
to perform direct writes to the advertised buffer. When the
peer decides to exercise that invitation the pages have to
be there.

An analogy: when you write a check for $100 you do not have
to identify the serial numbers of ten $10 bills, but you are
expected to have the funds in your account.

Issuing a buffer advertisement for memory you do not have
is the network equivalent of writing a check that you do
not have funds for.

Now, just as your bank may offer overdraft protection, an
RDMA device could merely report a page fault rather than
tearing down the connection itself. But that does not grant
permission for applications to advertise buffer space that
they do not have committed, it  merely helps recovery from
a programming fault.

A suspend/resume interface between the Virtual Memory Manager
and the RDMA layer allows pages to be re-arranged at the 
convenience of the Virtual Memory Manager without breaking
the application layer peer-to-peer contract. The current
interfaces that pin exact pages are really the equivalent
of having to tell the bank that when Joe cashes this $100
check, you should give him *these* ten $10 bills. It
works, but it adds too much overhead and is very inflexible.
So there are a lot of good reasons to evolve this interface
to better deal with these issues. Other areas of possible
evolution include allowing growing or trimming of Memory
Regions without invalidating their advertised handles.

But the more fundamental issue is recognizing that applications
that use direct interfaces need to know that buffers that they
enable truly have committed resources. They need a way to
ask for twenty *real* pages, not twenty pages of address
space. And they need to do it in a way that allows memory
to be rearranged or even migrated with them to a new host.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15 18:09                                                             ` Caitlin Bestler
@ 2008-02-15 18:45                                                               ` Christoph Lameter
  2008-02-15 18:53                                                                 ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-15 18:45 UTC (permalink / raw)
  To: Caitlin Bestler; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

On Fri, 15 Feb 2008, Caitlin Bestler wrote:

> > What does it mean that the "application layer has to determine what
> > pages are registered"? The application does not know which of its
> > pages are currently in memory. It can only force these pages to stay
> > in memory if they are mlocked.
> > 
> 
> An application that advertises an RDMA accessible buffer
> to a remote peer *does* have to know that its pages *are*
> currently in memory.

Ok that would mean it needs to inform the VM of that issue by mlocking 
these pages.
 
> But the more fundamental issue is recognizing that applications
> that use direct interfaces need to know that buffers that they
> enable truly have committed resources. They need a way to
> ask for twenty *real* pages, not twenty pages of address
> space. And they need to do it in a way that allows memory
> to be rearranged or even migrated with them to a new host.

mlock will force the pages to stay in memory without requiring the OS to 
keep them where they are.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15 18:45                                                               ` Christoph Lameter
@ 2008-02-15 18:53                                                                 ` Caitlin Bestler
  2008-02-15 20:02                                                                   ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-15 18:53 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel



> -----Original Message-----
> From: Christoph Lameter [mailto:clameter@sgi.com]
> Sent: Friday, February 15, 2008 10:46 AM
> To: Caitlin Bestler
> Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org;
> general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net
> Subject: RE: [ofa-general] Re: Demand paging for memory regions
> 
> On Fri, 15 Feb 2008, Caitlin Bestler wrote:
> 
> > > What does it mean that the "application layer has to determine what
> > > pages are registered"? The application does not know which of its
> > > pages are currently in memory. It can only force these pages to
> > > stay in memory if they are mlocked.
> > >
> >
> > An application that advertises an RDMA accessible buffer
> > to a remote peer *does* have to know that its pages *are*
> > currently in memory.
> 
> Ok that would mean it needs to inform the VM of that issue by mlocking
> these pages.
> 
> > But the more fundamental issue is recognizing that applications
> > that use direct interfaces need to know that buffers that they
> > enable truly have committed resources. They need a way to
> > ask for twenty *real* pages, not twenty pages of address
> > space. And they need to do it in a way that allows memory
> > to be rearranged or even migrated with them to a new host.
> 
> mlock will force the pages to stay in memory without requiring the OS
> to keep them where they are.

So that would mean that mlock is used by the application before it 
registers memory for direct access, and then it is up to the RDMA
layer and the OS to negotiate actual pinning of the addresses for
whatever duration is required.

There is no *protocol* barrier to replacing pages within a Memory
Region as long as it is done in a way that keeps the content of
those pages coherent. But existing devices have their own ideas
on how this is done and existing devices are notoriously poor at
learning new tricks.

Merely mlocking pages deals with the end-to-end RDMA semantics.
What still needs to be addressed is how a fastpath interface
would dynamically pin and unpin. Yielding pins for short-term
suspensions (and flushing cached translations) deals with the
rest. Understanding the range of support that existing devices
could provide with software updates would be the next step if
you wanted to pursue this.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15 18:53                                                                 ` Caitlin Bestler
@ 2008-02-15 20:02                                                                   ` Christoph Lameter
  2008-02-15 20:14                                                                     ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-15 20:02 UTC (permalink / raw)
  To: Caitlin Bestler; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

On Fri, 15 Feb 2008, Caitlin Bestler wrote:

> So that would mean that mlock is used by the application before it 
> registers memory for direct access, and then it is up to the RDMA
> layer and the OS to negotiate actual pinning of the addresses for
> whatever duration is required.

Right.
 
> There is no *protocol* barrier to replacing pages within a Memory
> Region as long as it is done in a way that keeps the content of
> those page coherent. But existing devices have their own ideas
> on how this is done and existing devices are notoriously poor at
> learning new tricks.

Hmmmm.. Okay. But that is mainly a device driver maintenance issue.

> Merely mlocking pages deals with the end-to-end RDMA semantics.
> What still needs to be addressed is how a fastpath interface
> would dynamically pin and unpin. Yielding pins for short-term
> suspensions (and flushing cached translations) deals with the
> rest. Understanding the range of support that existing devices
> could provide with software updates would be the next step if
> you wanted to pursue this.

That is addressed on the VM level by the mmu_notifier which started this 
whole thread. The RDMA layers need to subscribe to this notifier and then 
do whatever the hardware requires to unpin and pin memory. I can only go 
as far as dealing with the VM layer. If you have any issues there I'd be 
glad to help.
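
Roughly, such a subscription could look like the sketch below. The
callback names follow the invalidate begin/end pairs described for this
patchset, but the exact prototypes (e.g. an additional atomic flag)
differ between patch revisions, so treat this as a sketch of the shape
of the hook-up rather than a drop-in driver change:

    /* Sketch only: callback names follow the proposed mmu_notifier patches;
     * exact signatures differ between patch revisions. */
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>

    struct my_rdma_mm {
        struct mmu_notifier mn;
        /* ... driver state: MR table, cached DMA translations, ... */
    };

    static void my_invalidate_range_begin(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
        /* Quiesce device access to [start, end): stop posting new DMA and
         * drop (or suspend) cached translations covering the range. */
    }

    static void my_invalidate_range_end(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start,
                                        unsigned long end)
    {
        /* Re-fault/re-pin the new pages and resume DMA on the range. */
    }

    static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_begin = my_invalidate_range_begin,
        .invalidate_range_end   = my_invalidate_range_end,
    };

    /* Per the cover letter, register/unregister run under mmap_sem. */
    static void my_subscribe(struct my_rdma_mm *r, struct mm_struct *mm)
    {
        r->mn.ops = &my_mn_ops;
        mmu_notifier_register(&r->mn, mm);
    }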

^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15 20:02                                                                   ` Christoph Lameter
@ 2008-02-15 20:14                                                                     ` Caitlin Bestler
  2008-02-15 22:50                                                                       ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-15 20:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

Christoph Lameter wrote
> 
> > Merely mlocking pages deals with the end-to-end RDMA semantics.
> > What still needs to be addressed is how a fastpath interface
> > would dynamically pin and unpin. Yielding pins for short-term
> > suspensions (and flushing cached translations) deals with the
> > rest. Understanding the range of support that existing devices
> > could provide with software updates would be the next step if
> > you wanted to pursue this.
> 
> That is addressed on the VM level by the mmu_notifier which started
> this whole thread. The RDMA layers need to subscribe to this notifier
> and then do whatever the hardware requires to unpin and pin memory.
> I can only go as far as dealing with the VM layer. If you have any
> issues there I'd be glad to help.

There isn't much point in the RDMA layer subscribing to mmu
notifications if the specific RDMA device will not be able to react
appropriately when the notification occurs. I don't see how you get
around needing to know which devices are capable of supporting page
migration (via suspend/resume or other mechanisms) and which can only
respond to a page migration by aborting connections.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15 20:14                                                                     ` Caitlin Bestler
@ 2008-02-15 22:50                                                                       ` Christoph Lameter
  2008-02-15 23:50                                                                         ` Caitlin Bestler
  0 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-15 22:50 UTC (permalink / raw)
  To: Caitlin Bestler; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel

On Fri, 15 Feb 2008, Caitlin Bestler wrote:

> There isn't much point in the RDMA layer subscribing to mmu
> notifications if the specific RDMA device will not be able to react
> appropriately when the notification occurs. I don't see how you get
> around needing to know which devices are capable of supporting page
> migration (via suspend/resume or other mechanisms) and which can only
> respond to a page migration by aborting connections.

You either register callbacks if the device can react properly or you
don't. If you don't, then the device will continue to have the problem
with page pinning etc. until someone comes around and implements the
mmu callbacks to fix these issues.

I have doubts regarding the claim that some devices just cannot be made
to suspend and resume appropriately. They obviously can be shut down, so
it's a matter of sequencing things the right way: i.e. stop the app,
wait for a quiet period, then release resources, etc.




^ permalink raw reply	[flat|nested] 119+ messages in thread

* RE: [ofa-general] Re: Demand paging for memory regions
  2008-02-15 22:50                                                                       ` Christoph Lameter
@ 2008-02-15 23:50                                                                         ` Caitlin Bestler
  0 siblings, 0 replies; 119+ messages in thread
From: Caitlin Bestler @ 2008-02-15 23:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, avi, linux-mm, general, kvm-devel



> -----Original Message-----
> From: Christoph Lameter [mailto:clameter@sgi.com]
> Sent: Friday, February 15, 2008 2:50 PM
> To: Caitlin Bestler
> Cc: linux-kernel@vger.kernel.org; avi@qumranet.com; linux-mm@kvack.org;
> general@lists.openfabrics.org; kvm-devel@lists.sourceforge.net
> Subject: RE: [ofa-general] Re: Demand paging for memory regions
> 
> On Fri, 15 Feb 2008, Caitlin Bestler wrote:
> 
> > There isn't much point in the RDMA layer subscribing to mmu
> > notifications if the specific RDMA device will not be able to react
> > appropriately when the notification occurs. I don't see how you get
> > around needing to know which devices are capable of supporting page
> > migration (via suspend/resume or other mechanisms) and which can only
> > respond to a page migration by aborting connections.
> 
> You either register callbacks if the device can react properly or you
> don't. If you don't, then the device will continue to have the problem
> with page pinning etc. until someone comes around and implements the
> mmu callbacks to fix these issues.
> 
> I have doubts regarding the claim that some devices just cannot be made
> to suspend and resume appropriately. They obviously can be shut down, so
> it's a matter of sequencing things the right way: i.e. stop the app,
> wait for a quiet period, then release resources, etc.
> 
> 

That is true. What some devices will be unable to do is suspend
and resume in a manner that is transparent to the application.
However, for the duration required to re-arrange pages it is 
definitely feasible to do so transparently to the application.

Presumably the Virtual Memory Manager would be more willing to
take an action that is transparent to the user than one that is
disruptive, although obviously as the owner of the physical memory
it has the right to do either.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [ofa-general] Re: Demand paging for memory regions
  2008-02-14  0:01                                                       ` Jason Gunthorpe
@ 2008-02-27 22:11                                                         ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pete Wyckoff, Rik van Riel, Andrea Arcangeli, a.p.zijlstra,
	izike, Roland Dreier, steiner, linux-kernel, avi, linux-mm,
	daniel.blueman, Robin Holt, general, Andrew Morton, kvm-devel

On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

> Christoph: It seemed to me you were first talking about
> freeing/swapping/faulting RDMA'able pages - but would pure migration
> as a special hardware supported case be useful like Caitlin suggested?

That is a special case of the proposed solution. You could mlock the 
regions of interest. Those can then only be migrated but not swapped out.
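
For example, from userspace this boils down to (illustrative sketch only):

	#include <sys/mman.h>
	#include <stdio.h>

	/* Keep the RDMA buffer resident (not swappable); it can still be migrated. */
	static int pin_against_swap(void *buf, size_t length)
	{
		if (mlock(buf, length) != 0) {
			perror("mlock");
			return -1;
		}
		return 0;
	}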

However, I think we need some limit on the number of pages one can mlock. 
Otherwise the VM can get into a situation where reclaim is not possible 
because the majority of memory is either mlocked or pinned by I/O etc.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-15  6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-02-16  3:37   ` Andrew Morton
@ 2008-02-18 22:33   ` Roland Dreier
  1 sibling, 0 replies; 119+ messages in thread
From: Roland Dreier @ 2008-02-18 22:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

It seems that we've come up with two reasonable cases where it makes
sense to use these notifiers for InfiniBand/RDMA:

First, the ability to safely DMA to/from userspace memory with the
memory regions mlock()ed but the pages not pinned.  In this case the
notifiers here would seem to suit us well:

 > +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 > +				 struct mm_struct *mm,
 > +				 unsigned long start, unsigned long end,
 > +				 int atomic);
 > +
 > +	void (*invalidate_range_end)(struct mmu_notifier *mn,
 > +				 struct mm_struct *mm,
 > +				 unsigned long start, unsigned long end,
 > +				 int atomic);

If I understand correctly, the IB stack would have to get the hardware
driver to shoot down translation entries and suspend access to the
region when an invalidate_range_begin notifier is called, and wait for
the invalidate_range_end notifier to repopulate the adapter
translation tables.  This will probably work OK as long as the
interval between the invalidate_range_begin and invalidate_range_end
calls is not "too long."
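
As a rough sketch, the driver side of such a pairing might look something
like this (the mydrv_* names are hypothetical placeholders, not an existing
IB driver API):

	#include <linux/mmu_notifier.h>

	struct mydrv_region;		/* hypothetical per-registration state */

	/* Hypothetical hardware helpers provided elsewhere in the driver. */
	extern void mydrv_suspend_access(struct mydrv_region *r,
					 unsigned long start, unsigned long end);
	extern void mydrv_flush_hw_tlb(struct mydrv_region *r,
				       unsigned long start, unsigned long end);
	extern void mydrv_resume_access(struct mydrv_region *r,
					unsigned long start, unsigned long end);

	struct mydrv_notifier {
		struct mmu_notifier mn;
		struct mydrv_region *region;
	};

	static void mydrv_invalidate_range_begin(struct mmu_notifier *mn,
						 struct mm_struct *mm,
						 unsigned long start,
						 unsigned long end, int atomic)
	{
		struct mydrv_notifier *d = container_of(mn, struct mydrv_notifier, mn);

		/* Quiesce DMA and shoot down adapter translations for the range. */
		mydrv_suspend_access(d->region, start, end);
		mydrv_flush_hw_tlb(d->region, start, end);
	}

	static void mydrv_invalidate_range_end(struct mmu_notifier *mn,
					       struct mm_struct *mm,
					       unsigned long start,
					       unsigned long end, int atomic)
	{
		struct mydrv_notifier *d = container_of(mn, struct mydrv_notifier, mn);

		/* Let the adapter fault translations back in and resume DMA. */
		mydrv_resume_access(d->region, start, end);
	}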

Also, using this effectively requires us to figure out how we want to
mlock() regions that are going to be used for RDMA.  We could require
userspace to do it, but it's not clear to me that we're safe in the
case where userspace decides not to... what happens if some pages get
swapped out after the invalidate_range_begin notifier?

The second case where some form of notifier is useful is for
userspace to know when a memory registration is still valid, i.e. Pete
Wyckoff's work:

    http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf
    http://www.osc.edu/~pw/dreg/

however these MMU notifiers seem orthogonal to that: the registration
cache is concerned with address spaces, not page mapping, and hence
the existing vma operations seem to be a better fit.

 - R.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-17  3:01       ` Andrea Arcangeli
@ 2008-02-17 12:24         ` Robin Holt
  0 siblings, 0 replies; 119+ messages in thread
From: Robin Holt @ 2008-02-17 12:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Andrew Morton, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman

On Sun, Feb 17, 2008 at 04:01:20AM +0100, Andrea Arcangeli wrote:
> On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> > On Fri, 15 Feb 2008, Andrew Morton wrote:
> > 
> > > What is the status of getting infiniband to use this facility?
> > 
> > Well we are talking about this it seems.
> 
> It seems the IB folks think allowing RDMA over virtual memory is not
> interesting; their argument seems to be that RDMA is only interesting
> on RAM (and they seem not interested in allowing RDMA over a ram+swap
> backed _virtual_ memory allocation). They just have to decide whether
> ram+swap allocation for RDMA is useful or not.

I don't think that is a completely fair characterization.  It would be
more fair to say that the changes required to their library/user api
would be too significant to allow an adaptation to any scheme which
allowed removal of physical memory below a virtual mapping.

I agree with the IB folks when they say it is impossible with their
current scheme.  The fact that any consumer of their endpoint identifier
can use any identifier without notifying the kernel prior to its use
certainly makes any implementation under any scheme impossible.

I guess we could possibly make things work for IB if we did some heavy
work.  Let's assume, instead of passing around the physical endpoint
identifiers, they passed around a handle.  In order for any IB endpoint
to communicate, it would need to request that the kernel translate a handle
into an endpoint identifier.  In order for the kernel to put a TLB
entry into the process's address space allowing the process access to
the _CARD_, it would need to ensure all the current endpoint identifiers
for this process were "active", meaning we have verified with the other
endpoint that all pages are faulted and TLB/PFN information is in the
owning card's TLB/PFN tables.  Once all of a process's endpoints are
"active" we would drop the PFN for the adapter into the page tables.
Any time pages are being revoked from under an active handle, we would
shoot down the IB adapter card TLB entries for all the remote users of
this handle and quiesce the card's state to ensure transfers are either
complete or terminated.  When there are no active transfers, we would
respond back to the owner and they could complete the source process
page table cleaning.  Any time all of the pages for a handle cannot be
mapped from virtual to physical, the remote process would be SIGBUS'd
instead of having its IB adapter TLB entry installed.

This is essentially how XPMEM does it except we have the benefit of
working on individual pages.
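
To make that concrete, a very rough sketch of the data structures involved
(all names made up; nothing here is existing IB or XPMEM code):

	/* Hypothetical kernel-side state for one exported handle. */
	struct rdma_handle {
		u64			id;	/* opaque token passed between nodes */
		struct mm_struct	*mm;	/* owning address space */
		unsigned long		start;	/* virtual range backing the handle */
		unsigned long		len;
		atomic_t		active;	/* fully faulted and loaded in card TLB? */
	};

	/* Hypothetical helpers implemented by the adapter driver. */
	extern int rdma_fault_and_load(struct mm_struct *mm,
				       unsigned long start, unsigned long len);
	extern void rdma_shootdown_remote_tlbs(struct rdma_handle *h,
					       unsigned long start,
					       unsigned long end);

	/* Translate a handle into a usable endpoint: fault and load everything. */
	int rdma_handle_activate(struct rdma_handle *h)
	{
		int ret = rdma_fault_and_load(h->mm, h->start, h->len);

		if (ret)
			return ret;	/* caller gets an error (or SIGBUS) instead */
		atomic_set(&h->active, 1);
		return 0;
	}

	/* Called when pages under an active handle are being revoked. */
	void rdma_handle_revoke(struct rdma_handle *h, unsigned long start,
				unsigned long end)
	{
		atomic_set(&h->active, 0);
		/* Quiesce the card and wait for remote acks before the pages go away. */
		rdma_shootdown_remote_tlbs(h, start, end);
	}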

Again, not knowing what I am talking about, but under the assumption that
MPI IB use is contained to a library, I would hope the changes could be
contained under the MPI-to-IB library interface and would not need any
changes at the MPI-user library interface.

We do keep track of the virtual address ranges within a handle that
are being used.  I assume the IB folks will find that helpful as well.
Otherwise, I think they could make things operate this way.  XPMEM has
the advantage of not needing to have virtual-to-physical at all times,
but otherwise it is essentially the same.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code 
  2008-02-16  3:37   ` Andrew Morton
                       ` (2 preceding siblings ...)
  2008-02-16 19:21     ` Christoph Lameter
@ 2008-02-17  5:04     ` Doug Maxey
  3 siblings, 0 replies; 119+ messages in thread
From: Doug Maxey @ 2008-02-17  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman, Ben Herrenschmidt, Jan-Bernd Themann


On Fri, 15 Feb 2008 19:37:19 PST, Andrew Morton wrote:
> Which other potential clients have been identified and how important is it
> to those?

The powerpc ehea utilizes its own mmu.  Not sure about the importance 
to the driver. (But will investigate :)

++doug


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16 19:21     ` Christoph Lameter
@ 2008-02-17  3:01       ` Andrea Arcangeli
  2008-02-17 12:24         ` Robin Holt
  0 siblings, 1 reply; 119+ messages in thread
From: Andrea Arcangeli @ 2008-02-17  3:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
> 
> > What is the status of getting infiniband to use this facility?
> 
> Well we are talking about this it seems.

It seems the IB folks think allowing RDMA over virtual memory is not
interesting; their argument seems to be that RDMA is only interesting
on RAM (and they seem not interested in allowing RDMA over a ram+swap
backed _virtual_ memory allocation). They just have to decide whether
ram+swap allocation for RDMA is useful or not.

> > How important is this feature to KVM?
> 
> Andrea can answer this.

I think I already did in a separate email.

> > That sucks big time.  What do we need to do to get the callback
> > functions called in non-atomic context?

I sure agree, given I also asked to drop the lock param and enforce that
invalidate_range_* always be called in non-atomic context.

> We would have to drop the inode_mmap_lock. Could be done with some minor 
> work.

The invalidate may be deferred until after releasing the lock; the lock may
not have to be dropped to clean up the API (and make xpmem's life easier).

> That is one implementation (XPmem does that). The other is to simply stop 
> all references when any invalidate_range is in progress (KVM and GRU do 
> that).

KVM doesn't stop new references. It doesn't need to because it holds a
reference on the page (GRU doesn't). KVM can invalidate the spte and
flush the tlb only after the linux pte has been cleared and after the
page has been released by the VM (because the page doesn't go in the
freelist and it remains pinned for a little while, until the spte is
dropped too inside invalidate_range_end). GRU has to invalidate
_before_ the linux pte is cleared so it has to stop new references
from being established in the invalidate_range_start/end critical
section.

> Andrea put this in to check the reference status of a page. It functions 
> like the accessed bit.

In short, each pte can have some spte associated with it. So whenever we
do a ptep_clear_flush protected by the PT lock, we also have to run
invalidate_page, which will internally invoke a sort-of
sptep_clear_flush protected by a kvm->mmu_lock (equivalent of
page_table_lock/PT-lock). sptes, just like ptes, map virtual addresses
to physical addresses, so you can read/write to RAM either through a
pte or through a spte.
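
A simplified sketch of that pattern, with illustrative names (this is not
the actual KVM code):

	/* Hypothetical secondary-MMU state mirroring the structure described above. */
	struct my_kvm {
		struct mmu_notifier	mmu_notifier;
		spinlock_t		mmu_lock;	/* the spte equivalent of the PT lock */
	};

	extern void my_kvm_zap_spte(struct my_kvm *kvm, unsigned long address);
	extern void my_kvm_flush_remote_tlbs(struct my_kvm *kvm);

	/* The sort-of sptep_clear_flush run from the ->invalidate_page callback. */
	static void my_kvm_invalidate_page(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long address)
	{
		struct my_kvm *kvm = container_of(mn, struct my_kvm, mmu_notifier);

		spin_lock(&kvm->mmu_lock);
		my_kvm_zap_spte(kvm, address);		/* drop the spte(s) for this address */
		my_kvm_flush_remote_tlbs(kvm);		/* flush the secondary TLBs */
		spin_unlock(&kvm->mmu_lock);
	}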

Just like it would be insane to have any requirement that
ptep_clear_flush has to run in non-atomic context (forcing a
conversion of the PT lock to a mutex), it's also weird to require
invalidate_page/age_page to run in non-atomic context.

All troubles start with the xpmem requirement of having to schedule
in its equivalent of the sptep_clear_flush, because it's not a
gigahertz-in-cpu thing but a gigabit thing where the network stack is
involved with its own software linux driven skb memory allocations,
schedules waiting for network I/O, etc... Imagine ptes allocated on a
remote node; no surprise it brings a new set of problems (assuming it
can work reliably during oom given its memory requirements in the
try_to_unmap path: no page can ever be freed until the skbs have been
allocated and sent and allocated again to receive the ack).

Furthermore xpmem doesn't associate any pte to a spte, it associates a
page_t to certain remote references, or it would be in trouble with
invalidate_page that corresponds to ptep_clear_flush on a virtual
address that exists thanks to the anon_vma/i_mmap lock held (and not
thanks to the mmap_sem like in all invalidate_range calls).

Christoph's patch is a mix of two entirely separate features. KVM can
live with V7 just fine, but it's a lot more than what is needed by KVM.

I don't think that invalidate_page/age_page must be allowed to sleep
just because invalidate_range also can sleep. You just have to ask yourself
whether the VM locks shall remain spinlocks, for the VM's own good (not for
the mmu notifiers' good). It'd be bad to make the VM underperform with
mutexes protecting tiny critical sections just to please some mmu notifier
user. But if they're spinlocks, then clearly invalidate_page/age_page
based on virtual addresses can't sleep, or the virtual address wouldn't
make sense anymore by the time the spinlock is released.

> > This function looks like it was tossed in at the last minute.  It's
> > mysterious, undocumented, poorly commented, poorly named.  A better name
> > would be one which has some correlation with the return value.
> > 
> > Because anyone who looks at some code which does
> > 
> > 	if (mmu_notifier_age_page(mm, address))
> > 		...
> > 
> > has to go and reverse-engineer the implementation of
> > mmu_notifier_age_page() to work out under which circumstances the "..."
> > will be executed.  But this should be apparent just from reading the callee
> > implementation.
> > 
> > This function *really* does need some documentation.  What does it *mean*
> > when the ->age_page() from some of the notifiers returned "1" and the
> > ->age_page() from some other notifiers returned zero?  Dunno.
> 
> Andrea: Could you provide some more detail here?

age_page is simply the ptep_clear_flush_young equivalent for
sptes. It's meant to provide aging for the pages mapped by secondary
mmus. Its return value is the same as that of ptep_clear_flush_young, but
it represents the sptes associated with the pte;
ptep_clear_flush_young instead only takes care of the pte itself.
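
In code, a secondary MMU could implement it roughly like this (reusing the
illustrative my_kvm sketch from above; names are placeholders):

	extern int my_kvm_test_and_clear_spte_accessed(struct my_kvm *kvm,
						       unsigned long address);

	/* age_page: ptep_clear_flush_young, but for the sptes behind this pte. */
	static int my_kvm_age_page(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long address)
	{
		struct my_kvm *kvm = container_of(mn, struct my_kvm, mmu_notifier);
		int young;

		spin_lock(&kvm->mmu_lock);
		/* Test and clear the accessed bit of any spte mapping this address. */
		young = my_kvm_test_and_clear_spte_accessed(kvm, address);
		spin_unlock(&kvm->mmu_lock);

		return young;	/* 1 if a secondary MMU recently used the page */
	}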

For KVM the below would be all that is needed; the fact that
invalidate_range can sleep and invalidate_page/age_page can't is
because their users are very different. With my approach the mmu
notifier callbacks are always protected by the PT lock (just like
ptep_clear_flush and the other pte+tlb manglings) and they're called
after the pte is cleared and before the VM reference on the page has
been dropped. That makes it safe for GRU too, so in my initial
approach _none_ of the callbacks was allowed to sleep, and that was a
feature that allows GRU not to block its tlb miss interrupt with any
further locking (the PT-lock taken by follow_page automatically
serialized the GRU interrupt against the MMU notifiers and the linux
page fault). For KVM the invalidate_pages of my patch is converted to
invalidate_range_end because it doesn't matter for KVM if it's called
after the PT lock has been dropped. In the try_to_unmap case
invalidate_page is called in atomic context in Christoph's patch too,
because a virtual address, and in turn a pte, and in turn certain sptes,
can only exist thanks to the spinlocks taken by the VM. Changing the
VM to make mmu notifiers sleepable in the try_to_unmap path sounds bad
to me, especially given not even xpmem needs this.

You can see how everything looks simpler and more symmetric by
assuming the secondary mmu references are established and dropped like
ptes, as in the KVM case where in fact sptes are a pure cpu thing
exactly like the ptes. XPMEM adds the requirement that sptes are in fact
remote entities that are mangled by a message passing protocol over
the network; it's the same as ptep_clear_flush being required to
schedule and send skbs in order to be successful and allow try_to_unmap
to do its work. Same problem. No wonder the patch gets more complicated then.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
 	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
 	if (__young)							\
 		flush_tlb_page(__vma, __address);			\
+	__young |= mmu_notifier_age_page((__vma)->vm_mm, __address);	\
 	__young;							\
 })
 #endif
@@ -86,6 +87,7 @@ do {									\
 	pte_t __pte;							\
 	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
 	flush_tlb_page(__vma, __address);				\
+	mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);	\
 	__pte;								\
 })
 #endif
diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
--- a/include/asm-s390/pgtable.h
+++ b/include/asm-s390/pgtable.h
@@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
 {
 	pte_t pte = *ptep;
 	ptep_invalidate(address, ptep);
+	mmu_notifier(invalidate_page, vma->vm_mm, address);
 	return pte;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/mmu_notifier.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -219,6 +220,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,132 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+	/*
+	 * Called when nobody can register any more notifier in the mm
+	 * and after the "mn" notifier has been disarmed already.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * invalidate_page[s] is called in atomic context
+	 * after any pte has been updated and before
+	 * dropping the PT lock required to update any Linux pte.
+	 * Once the PT lock will be released the pte will have its
+	 * final value to export through the secondary MMU.
+	 * Before this is invoked any secondary MMU is still ok
+	 * to read/write to the page previously pointed by the
+	 * Linux pte because the old page hasn't been freed yet.
+	 * If required set_page_dirty has to be called internally
+	 * to this method.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+	void (*invalidate_pages)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
+
+	/*
+	 * Age page is called in atomic context inside the PT lock
+	 * right after the VM is test-and-clearing the young/accessed
+	 * bitflag in the pte. This way the VM will provide proper aging
+	 * to the accesses to the page through the secondary MMUs
+	 * and not only to the ones through the Linux pte.
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+};
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_head {
+	struct hlist_head head;
+	spinlock_t lock;
+};
+
+#include <linux/mm_types.h>
+
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the "struct mmu_notifier" can be freed. Alternatively it
+ * can be synchronously freed inside ->release when the list can't
+ * change anymore and nobody could possibly walk it.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+	spin_lock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+						 &(mm)->mmu_notifier.head, \
+						 hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+struct mmu_notifier_head {};
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_age_page(mm, address) ({ 0; })
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)			       \
+	do {							       \
+		if (0) {					       \
+			struct mmu_notifier *__mn;		       \
+								       \
+			__mn = (struct mmu_notifier *)(0x00ff);	       \
+			__mn->ops->function(__mn, mm, args);	       \
+		};						       \
+	} while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -360,6 +360,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -756,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
 		if (pte_none(pte))
 			continue;
 
+		mmu_notifier(invalidate_page, mm, address);
 		page = pte_page(pte);
 		if (pte_dirty(pte))
 			set_page_dirty(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -494,6 +494,7 @@ static int copy_pte_range(struct mm_stru
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	unsigned long start;
 
 again:
 	rss[1] = rss[0] = 0;
@@ -505,6 +506,7 @@ again:
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	arch_enter_lazy_mmu_mode();
 
+	start = addr;
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -525,6 +527,8 @@ again:
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_pages, vma->vm_mm, start, addr);
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
@@ -660,6 +664,7 @@ static unsigned long zap_pte_range(struc
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			mmu_notifier(invalidate_page, mm, addr);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1248,6 +1253,7 @@ static int remap_pte_range(struct mm_str
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
@@ -1259,6 +1265,7 @@ static int remap_pte_range(struct mm_str
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	mmu_notifier(invalidate_pages, mm, start, addr);
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2044,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,73 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *             Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/rcupdate.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *tmp;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, tmp,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					 &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mm->mmu_notifier.lock);
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+	spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mm->mmu_notifier.lock);
+	hlist_del_rcu(&mn->hlist);
+	spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@ static void change_pte_range(struct mm_s
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	mmu_notifier(invalidate_pages, mm, start, addr);
 	pte_unmap_unlock(pte - 1, ptl);
 }
 

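For completeness, subscribing to the above from a driver boils down to
roughly this (the my_* callbacks are placeholder stubs, sketch only):

	/* Placeholder callbacks; a real driver drops its external references here. */
	static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
	{
	}

	static void my_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
				       unsigned long address)
	{
	}

	static void my_invalidate_pages(struct mmu_notifier *mn, struct mm_struct *mm,
					unsigned long start, unsigned long end)
	{
	}

	static int my_age_page(struct mmu_notifier *mn, struct mm_struct *mm,
			       unsigned long address)
	{
		return 0;
	}

	static const struct mmu_notifier_ops my_notifier_ops = {
		.release		= my_release,
		.invalidate_page	= my_invalidate_page,
		.invalidate_pages	= my_invalidate_pages,
		.age_page		= my_age_page,
	};

	static struct mmu_notifier my_notifier = {
		.ops = &my_notifier_ops,
	};

	/* Attach to the address space the device will reference. */
	void my_driver_attach(struct mm_struct *mm)
	{
		mmu_notifier_register(&my_notifier, mm);
	}

	void my_driver_detach(struct mm_struct *mm)
	{
		mmu_notifier_unregister(&my_notifier, mm);
		synchronize_rcu();	/* traversal is RCU protected; wait before freeing */
	}
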

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16 10:58       ` Andrew Morton
@ 2008-02-16 19:31         ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Brice Goglin, Andrea Arcangeli, linux-kernel, linux-mm

On Sat, 16 Feb 2008, Andrew Morton wrote:

> "looks good" maybe.  But it's in the details where I fear this will come
> unstuck.  The likelihood that some callbacks really will want to be able to
> block in places where this interface doesn't permit that - either to wait
> for IO to complete or to wait for other threads to clear critical regions.

We can get the invalidate_range to always be called without spinlocks if
we deal with the case of the inode_mmap_lock being held in the truncate case.

If you always want to be able to sleep then we could drop the
invalidate_page() that is called while pte locks are held and require the use
of a device driver rmap?

> From that POV it doesn't look like a sufficiently general and useful
> design.  Looks like it was grafted onto the current VM implementation in a
> way which just about suits two particular clients if they try hard enough.

You missed KVM. We did the best we could while being as minimally invasive
as possible.

> Which is all perfectly understandable - it would be hard to rework core MM
> to be able to make this interface more general.  But I do think it's
> half-baked and there is a decent risk that future (or present) code which
> _could_ use something like this won't be able to use this one, and will
> continue to futz with mlock, page-pinning, etc.
> 
> Not that I know what the fix to that is..

You do not see a chance of this being okay if we adopt the two measures 
that I mentioned above?
 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  3:37   ` Andrew Morton
  2008-02-16  8:45     ` Avi Kivity
  2008-02-16 10:41     ` Brice Goglin
@ 2008-02-16 19:21     ` Christoph Lameter
  2008-02-17  3:01       ` Andrea Arcangeli
  2008-02-17  5:04     ` Doug Maxey
  3 siblings, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 15 Feb 2008, Andrew Morton wrote:

> What is the status of getting infiniband to use this facility?

Well we are talking about this it seems.
> 
> How important is this feature to KVM?

Andrea can answer this.

> To xpmem?

Without this feature we are stuck with page pinning by increasing
refcounts, which leads to endless lru scanning and other misbehavior. Also,
applications that use XPmem will not be able to swap or to use
things like remap.
 
> Which other potential clients have been identified and how important is it
> to those?

It is likely important to various DMA engines, framebuffer devices, etc.
Seems to be a generally useful feature.


> > +The notifier chains provide two callback mechanisms. The
> > +first one is required for any device that establishes external mappings.
> > +The second (rmap) mechanism is required if a device needs to be
> > +able to sleep when invalidating references. Sleeping may be necessary
> > +if we are mapping across a network or to different Linux instances
> > +in the same address space.
> 
> I'd have thought that a major reason for sleeping would be to wait for IO
> to complete.  Worth mentioning here?

Right.

> Why is that "easy"?  I'd have thought that it would only be easy if the
> driver happened to be using those same locks for its own purposes. 
> Otherwise it is "awkward"?

It's relatively easy because it is tied directly to a process and can use
external tlb shootdown / external page table clearing directly. The other 
method requires an rmap in the device driver where it can look up the 
processes that are mapping the page.
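
As a sketch, such a driver rmap could be as simple as the following
(hypothetical names, for illustration only):

	/* Driver-private reverse map: one head per exported page. */
	struct drv_rmap_head {
		spinlock_t		lock;
		struct list_head	mappings;	/* list of drv_rmap_entry */
	};

	struct drv_rmap_entry {
		struct list_head	list;
		struct mm_struct	*mm;		/* process that maps the page */
		unsigned long		address;	/* virtual address in that process */
	};

	extern void drv_evict_external_mapping(struct mm_struct *mm,
					       unsigned long address);

	/* Called from the driver's rmap-style invalidate_page() callout. */
	void drv_rmap_invalidate_page(struct drv_rmap_head *head)
	{
		struct drv_rmap_entry *e;

		spin_lock(&head->lock);
		list_for_each_entry(e, &head->mappings, list)
			drv_evict_external_mapping(e->mm, e->address);
		spin_unlock(&head->lock);
	}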
 
> > +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> > +called most of the time without any locks held. It is only called with
> > +locks held for file backed mappings that are truncated. A flag indicates
> > +in which mode we are. A driver can use that mechanism to f.e.
> > +delay the freeing of the pages during truncate until no locks are held.
> 
> That sucks big time.  What do we need to do to get the callback
> functions called in non-atomic context?

We would have to drop the inode_mmap_lock. Could be done with some minor 
work.

> > +Pages must be marked dirty if dirty bits are found to be set in
> > +the external ptes during unmap.
> 
> That sentence is too vague.  Define "marked dirty"?

Call set_page_dirty().

> > +The *release* method is called when a Linux process exits. It is run before
> 
> We'd conventionally use a notation such as "->release()" here, rather than
> the asterisks.

Ok.

> 
> > +the pages and mappings of a process are torn down and gives the device driver
> > +a chance to zap all the external mappings in one go.
> 
> I assume what you mean here is that ->release() is called during exit()
> when the final reference to an mm is being dropped.

Right.

> > +An example for a code that can be used to build a notifier mechanism into
> > +a device driver can be found in the file
> > +Documentation/mmu_notifier/skeleton.c
> 
> Should that be in samples/?

Oh. We have that?

> > +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> > +*before* the Linux rmaps are walked. At that point only the page lock is
> > +held. The invalidate_page() function must walk the driver rmaps and evict
> > +all the references to the page.
> 
> What happens if it cannot do so?

The page is not reclaimed if we were called from try_to_unmap(). From 
page_mkclean() we must always evict the page to switch off the write 
protect bit.

> > +There is no process information available before the rmaps are consulted.
> 
> Not sure what that sentence means.  I guess "available to the core VM"?

At that point we only have the page. We do not know which processes map 
the page. In order to find out we need to take a spinlock.


> > +The notifier mechanism can therefore not be attached to an mm_struct. Instead
> > +it is a global callback list. Having to perform a callback for each and every
> > +page that is reclaimed would be inefficient. Therefore we add an additional
> > +page flag: PageRmapExternal().
> 
> How many page flags are left?

30 or so. It's only available on 64-bit.

> Is this feature important enough to justify consumption of another one?
> 
> > Only pages that are marked with this bit can
> > +be exported and the rmap callbacks will only be performed for pages marked
> > +that way.
> 
> "exported": new term, unclear what it means.

Something external to the kernel references the page.

> > +The required additional Page flag is only available in 64 bit mode and
> > +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
> 
> whoa.  Is that good?  You just made your feature unavailable on the great
> majority of Linux systems.

rmaps are usually used by complex drivers that are typically used in large 
systems.

> > + * Notifier functions for hardware and software that establishes external
> > + * references to pages of a Linux system. The notifier calls ensure that
> > + * external mappings are removed when the Linux VM removes memory ranges
> > + * or individual pages from a process.
> 
> So the callee cannot fail.  hm.  If it can't block, it's likely screwed in
> that case.  In other cases it might be screwed anyway.  I suspect we'll
> need to be able to handle callee failure.

Probably.

> 
> > + * These fall into two classes:
> > + *
> > + * 1. mmu_notifier
> > + *
> > + * 	These are callbacks registered with an mm_struct. If pages are
> > + * 	removed from an address space then callbacks are performed.
> 
> "to be removed", I guess.  It's called before the page is actually removed?

It's called after the pte was cleared, while holding the pte lock.

> > + * 	The invalidate_range_start/end callbacks can be performed in contexts
> > + * 	where sleeping is allowed or in atomic contexts. A flag is passed
> > + * 	to indicate an atomic context.
> 
> We generally would prefer separate callbacks, rather than a unified
> callback with a mode flag.

We could drop the inode_mmap_lock when doing truncate. That would make 
this work but it's a kind of invasive thing for the VM.

> > +struct mmu_notifier_ops {
> > +	/*
> > +	 * The release notifier is called when no other execution threads
> > +	 * are left. Synchronization is not necessary.
> 
> "and the mm is about to be destroyed"?

Right.

> > +	/*
> > +	 * invalidate_range_begin() and invalidate_range_end() must be paired.
> > +	 *
> > +	 * Multiple invalidate_range_begin/ends may be nested or called
> > +	 * concurrently.
> 
> Under what circumstances would they be nested?

Hmmmm... Right, they cannot be nested. Multiple processors can have 
invalidates() concurrently in progress.

> > That is legit. However, no new external references
> 
> references to what?

To the ranges that are in the process of being invalidated.

> > +	 * invalidate_range_begin() must clear all references in the range
> > +	 * and stop the establishment of new references.
> 
> and stop the establishment of new references within the range, I assume?

Right.
 
> If so, that's putting a heck of a lot of complexity into the driver, isn't
> it?  It needs to temporarily remember an arbitrarily large number of
> regions in this mm against which references may not be taken?

That is one implementation (XPmem does that). The other is to simply stop 
all references when any invalidate_range is in progress (KVM and GRU do 
that).


> > +	 * invalidate_range_end() reenables the establishment of references.
> 
> within the range?

Right.

> > +extern void mmu_notifier_release(struct mm_struct *mm);
> > +extern int mmu_notifier_age_page(struct mm_struct *mm,
> > +				 unsigned long address);
> 
> There's the mysterious age_page again.

Andrea put this in to check the reference status of a page. It functions 
like the accessed bit.

> > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> > +{
> > +	INIT_HLIST_HEAD(&mnh->head);
> > +}
> > +
> > +#define mmu_notifier(function, mm, args...)				\
> > +	do {								\
> > +		struct mmu_notifier *__mn;				\
> > +		struct hlist_node *__n;					\
> > +									\
> > +		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> > +			rcu_read_lock();				\
> > +			hlist_for_each_entry_rcu(__mn, __n,		\
> > +					     &(mm)->mmu_notifier.head,	\
> > +					     hlist)			\
> > +				if (__mn->ops->function)		\
> > +					__mn->ops->function(__mn,	\
> > +							    mm,		\
> > +							    args);	\
> > +			rcu_read_unlock();				\
> > +		}							\
> > +	} while (0)
> 
> The macro references its args more than once.  Anyone who does
> 
> 	mmu_notifier(function, some_function_which_has_side_effects())
> 
> will get a surprise.  Use temporaries.

Ok.
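
Something along these lines, evaluating the mm argument only once
(sketch, not the final patch):

#define mmu_notifier(function, mm, args...)				\
	do {								\
		struct mm_struct *__mm = (mm);	/* evaluated once */	\
		struct mmu_notifier *__mn;				\
		struct hlist_node *__n;					\
									\
		if (unlikely(!hlist_empty(&__mm->mmu_notifier.head))) { \
			rcu_read_lock();				\
			hlist_for_each_entry_rcu(__mn, __n,		\
					&__mm->mmu_notifier.head,	\
					hlist)				\
				if (__mn->ops->function)		\
					__mn->ops->function(__mn,	\
							    __mm,	\
							    args);	\
			rcu_read_unlock();				\
		}							\
	} while (0)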

> > +#define mmu_notifier(function, mm, args...)				\
> > +	do {								\
> > +		if (0) {						\
> > +			struct mmu_notifier *__mn;			\
> > +									\
> > +			__mn = (struct mmu_notifier *)(0x00ff);		\
> > +			__mn->ops->function(__mn, mm, args);		\
> > +		};							\
> > +	} while (0)
> 
> That's a bit weird.  Can't we do the old
> 
> 	(void)function;
> 	(void)mm;
> 
> trick?  Or make it a static inline function?

A static inline won't allow the checking of the parameters.

(void) may be a good thing here.

> > +config MMU_NOTIFIER
> > +	def_bool y
> > +	bool "MMU notifier, for paging KVM/RDMA"
> 
> Why is this not selectable?  The help seems a bit brief.
> 
> Does this cause 32-bit systems to drag in a bunch of code they're not
> allowed to ever use?

I have selected it a number of times. We could make that a bit longer,
right.


> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		hlist_for_each_entry_safe(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del_init(&mn->hlist);
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> 
> We do this a lot, but back in the old days people didn't like optional
> callbacks which can be NULL.  If we expect that mmu_notifier_ops.release is
> usually implemented, then just unconditionally call it and require that all
> clients implement it.  Perhaps provide an exported-to-modules stub in core
> kernel for clients which didn't want to implement ->release().

Ok.

> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n;
> > +	int young = 0;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_rcu(mn, n,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->age_page)
> > +				young |= mn->ops->age_page(mn, mm, address);
> > +		}
> > +		rcu_read_unlock();
> > +	}
> > +
> > +	return young;
> > +}
> 
> should the rcu_read_lock() cover the hlist_empty() test?
> 
> This function looks like it was tossed in at the last minute.  It's
> mysterious, undocumented, poorly commented, poorly named.  A better name
> would be one which has some correlation with the return value.
> 
> Because anyone who looks at some code which does
> 
> 	if (mmu_notifier_age_page(mm, address))
> 		...
> 
> has to go and reverse-engineer the implementation of
> mmu_notifier_age_page() to work out under which circumstances the "..."
> will be executed.  But this should be apparent just from reading the callee
> implementation.
> 
> This function *really* does need some documentation.  What does it *mean*
> when the ->age_page() from some of the notifiers returned "1" and the
> ->age_page() from some other notifiers returned zero?  Dunno.

Andrea: Could you provide some more detail here?


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16 10:41     ` Brice Goglin
@ 2008-02-16 10:58       ` Andrew Morton
  2008-02-16 19:31         ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2008-02-16 10:58 UTC (permalink / raw)
  To: Brice Goglin; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm

On Sat, 16 Feb 2008 11:41:35 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote:

> Andrew Morton wrote:
> > What is the status of getting infiniband to use this facility?
> >
> > How important is this feature to KVM?
> >
> > To xpmem?
> >
> > Which other potential clients have been identified and how important is it
> > to those?
> >   
> 
> As I said when Andrea posted the first patch series, I used something
> very similar for non-RDMA-based HPC about 4 years ago. I haven't had
> time yet to look in depth and try the latest proposed API but my feeling
> is that it looks good.
> 

"looks good" maybe.  But it's in the details where I fear this will come
unstuck.  The likelihood that some callbacks really will want to be able to
block in places where this interface doesn't permit that - either to wait
for IO to complete or to wait for other threads to clear critical regions.

From that POV it doesn't look like a sufficiently general and useful
design.  Looks like it was grafted onto the current VM implementation in a
way which just about suits two particular clients if they try hard enough.

Which is all perfectly understandable - it would be hard to rework core MM
to be able to make this interface more general.  But I do think it's
half-baked and there is a decent risk that future (or present) code which
_could_ use something like this won't be able to use this one, and will
continue to futz with mlock, page-pinning, etc.

Not that I know what the fix to that is..

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  3:37   ` Andrew Morton
  2008-02-16  8:45     ` Avi Kivity
@ 2008-02-16 10:41     ` Brice Goglin
  2008-02-16 10:58       ` Andrew Morton
  2008-02-16 19:21     ` Christoph Lameter
  2008-02-17  5:04     ` Doug Maxey
  3 siblings, 1 reply; 119+ messages in thread
From: Brice Goglin @ 2008-02-16 10:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm

Andrew Morton wrote:
> What is the status of getting infiniband to use this facility?
>
> How important is this feature to KVM?
>
> To xpmem?
>
> Which other potential clients have been identified and how important is it
> to those?
>   

As I said when Andrea posted the first patch series, I used something
very similar for non-RDMA-based HPC about 4 years ago. I haven't had
time yet to look in depth and try the latest proposed API but my feeling
is that it looks good.

Brice


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  8:56       ` Andrew Morton
@ 2008-02-16  9:21         ` Avi Kivity
  0 siblings, 0 replies; 119+ messages in thread
From: Avi Kivity @ 2008-02-16  9:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

Andrew Morton wrote:

  

>> Very.  kvm pins pages that are referenced by the guest;
>>     
>
> hm.  Why does it do that?
>
>   

It was deemed best not to allow the guest to write to a page that has 
been swapped out and assigned to an unrelated host process.

One way to view the kvm shadow page tables is as hardware dma 
descriptors. kvm pins pages for the same reason that drivers pin pages 
that are being dma'ed. It's also the reason why mmu notifiers are useful 
for such a wide range of dma capable hardware.

>> a 64-bit guest 
>> will easily pin its entire memory with the kernel map.
>>     
>
>   
>>  So this is 
>> critical for guest swapping to actually work.
>>     
>
> Curious.  If KVM can release guest pages at the request of this notifier so
> that they can be swapped out, why can't it release them by default, and
> allow swapping to proceed?
>
>   

If kvm releases a page, it must also zap any shadow ptes pointing at the 
page and flush the tlb. If you do that for all of memory you can't 
reference any of it.

Releasing a page has costs, both at the time of the release and when the 
guest eventually refers to the page again.

>> Other nice features like page migration are also enabled by this patch.
>>
>>     
>
> We already have page migration.  Do you mean page-migration-when-using-kvm?
>   

Yes, I'm obviously writing from a kvm-centric point of view. This is an 
important feature, as the virtualization future seems to be NUMA hosts 
(2- or 4- way, 4 cores per socket) running moderately sized guests. The 
ability to load-balance guests among the NUMA nodes is important for 
performance.

(btw, I'm also looking forward to memory defragmentation. large pages 
are important for virtualization workloads and mmu notifiers are again 
critical to getting it to work while running kvm).

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  8:45     ` Avi Kivity
@ 2008-02-16  8:56       ` Andrew Morton
  2008-02-16  9:21         ` Avi Kivity
  0 siblings, 1 reply; 119+ messages in thread
From: Andrew Morton @ 2008-02-16  8:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

On Sat, 16 Feb 2008 10:45:50 +0200 Avi Kivity <avi@qumranet.com> wrote:

> Andrew Morton wrote:
> > How important is this feature to KVM?
> >   
> 
> Very.  kvm pins pages that are referenced by the guest;

hm.  Why does it do that?

> a 64-bit guest 
> will easily pin its entire memory with the kernel map.

>  So this is 
> critical for guest swapping to actually work.

Curious.  If KVM can release guest pages at the request of this notifier so
that they can be swapped out, why can't it release them by default, and
allow swapping to proceed?

> 
> Other nice features like page migration are also enabled by this patch.
> 

We already have page migration.  Do you mean page-migration-when-using-kvm?

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  3:37   ` Andrew Morton
@ 2008-02-16  8:45     ` Avi Kivity
  2008-02-16  8:56       ` Andrew Morton
  2008-02-16 10:41     ` Brice Goglin
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 119+ messages in thread
From: Avi Kivity @ 2008-02-16  8:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

Andrew Morton wrote:
> How important is this feature to KVM?
>   

Very.  kvm pins pages that are referenced by the guest; a 64-bit guest 
will easily pin its entire memory with the kernel map.  So this is 
critical for guest swapping to actually work.

Other nice features like page migration are also enabled by this patch.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-15  6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-02-16  3:37   ` Andrew Morton
  2008-02-16  8:45     ` Avi Kivity
                       ` (3 more replies)
  2008-02-18 22:33   ` Roland Dreier
  1 sibling, 4 replies; 119+ messages in thread
From: Andrew Morton @ 2008-02-16  3:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 14 Feb 2008 22:49:00 -0800 Christoph Lameter <clameter@sgi.com> wrote:

> MMU notifiers are used for hardware and software that establishes
> external references to pages managed by the Linux kernel. These are
> page table entries or tlb entries or something else that allows
> hardware (such as DMA engines, scatter gather devices, networking,
> sharing of address spaces across operating system boundaries) and
> software (Virtualization solutions such as KVM, Xen etc) to
> access memory managed by the Linux kernel.
> 
> The MMU notifier will notify the device driver that subscribes to such
> a notifier that the VM is going to do something with the memory
> mapped by that device. The device must then drop references for the
> indicated memory area. The references may be reestablished later.
> 
> The notification scheme is much better than the current schemes of
> avoiding the danger of the VM removing pages that are externally
> mapped. We currently either mlock pages used for RDMA, XPmem etc
> in memory or increase the refcount to pin the pages. Increasing
> the refcount makes it impossible for the VM to reclaim the page.
> 
> Mlock causes problems with reclaim and may lead to OOM if too many
> pages are pinned in memory. It is also incorrect in terms of what POSIX
> specifies for what role mlock should play. Mlock does *not* pin pages in
> memory. Mlock just means do not allow the page to be moved to swap.
> 
> Linux can move pages in memory (for example through the page migration
> mechanism). These pages can be moved even if they are mlocked(!!!!).
> The current approach of page pinning in use by RDMA etc is conceptually
> broken but there are currently no other easy solutions.
> 
> The alternate of increasing the page count to pin pages is also not
> that enticing since there will be continual attempts to reclaim
> or migrate these pages.
> 
> The solution here allows us to finally fix this issue by requiring
> such devices to subscribe to a notification chain that will allow
> them to work without pinning. The VM gains control of its memory again
> and the memory that has external references can be managed like regular
> memory.
> 
> This patch: Core portion
> 

What is the status of getting infiniband to use this facility?

How important is this feature to KVM?

To xpmem?

Which other potential clients have been identified and how important is it
to those?


> Index: linux-2.6/Documentation/mmu_notifier/README
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/mmu_notifier/README	2008-02-14 22:27:19.000000000 -0800
> @@ -0,0 +1,105 @@
> +Linux MMU Notifiers
> +-------------------
> +
> +MMU notifiers are used for hardware and software that establishes
> +external references to pages managed by the Linux kernel. These are
> +page table entries or tlb entries or something else that allows
> +hardware (such as DMA engines, scatter gather devices, networking,
> +sharing of address spaces across operating system boundaries) and
> +software (Virtualization solutions such as KVM, Xen etc) to
> +access memory managed by the Linux kernel.
> +
> +The MMU notifier will notify the device driver that subscribes to such
> +a notifier that the VM is going to do something with the memory
> +mapped by that device. The device must then drop references for the
> +indicated memory area. The references may be reestablished later.
> +
> +The notification scheme is much better than the current schemes of
> +dealing with the danger of the VM removing pages.
> +We currently mlock pages used for RDMA, XPmem etc in memory or
> +increase the refcount of the pages.
> +
> +Both cause problems with reclaim and may lead to OOM if too many
> +pages are pinned in memory. Mlock is also incorrect in terms of the POSIX
> +specification of the role of mlock. Mlock does *not* pin pages in
> +memory. It just does not allow the page to be moved to swap.
> +The page refcount is used to track current users of a page struct.
> +Artificially inflating the refcount means that the VM cannot track
> +down all references to a page. It will not be able to reclaim or
> +move a page. However, the core code will try again and again because
> +the assumption is that an elevated refcount is a temporary situation.
> +
> +Linux can move pages in memory (for example through the page migration
> +mechanism). These pages can be moved even if they are mlocked(!!!!).
> +So the current approach in use by RDMA etc etc is conceptually broken
> +but there are currently no other easy solutions.
> +
> +The solution here allows us to finally fix this issue by requiring
> +such devices to subscribe to a notification chain that will allow
> +them to work without pinning.
> +
> +The notifier chains provide two callback mechanisms. The
> +first one is required for any device that establishes external mappings.
> +The second (rmap) mechanism is required if a device needs to be
> +able to sleep when invalidating references. Sleeping may be necessary
> +if we are mapping across a network or to different Linux instances
> +in the same address space.

I'd have thought that a major reason for sleeping would be to wait for IO
to complete.  Worth mentioning here?

> +mmu_notifier mechanism (for KVM/GRU etc)
> +----------------------------------------
> +Callbacks are registered with an mm_struct from a device driver using
> +mmu_notifier_register(). When the VM removes pages (or changes
> +permissions on pages etc) then callbacks are triggered.
> +
> +The invalidation function for a single page (*invalidate_page)

We already have an invalidatepage.  Ho hum.

> +is called with spinlocks (in particular the pte lock) held. This allow
> +for an easy implementation of external ptes that are on the local system.
>

Why is that "easy"?  I'd have thought that it would only be easy if the
driver happened to be using those same locks for its own purposes. 
Otherwise it is "awkward"?
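
(For illustration, a minimal sketch of the "local external pte" case the
text seems to have in mind; all names here are invented.  The point is that
the driver only needs a spinlock of its own, which it can safely take while
the pte lock is held:)

	static void mydrv_invalidate_page(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long address)
	{
		/* hypothetical per-mm driver state embedding the notifier */
		struct mydrv_mm *d = container_of(mn, struct mydrv_mm, notifier);

		spin_lock(&d->shadow_lock);		/* fine in atomic context */
		mydrv_clear_shadow_pte(d, address);	/* drop the external pte */
		mydrv_flush_dev_tlb(d, address);
		spin_unlock(&d->shadow_lock);
	}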

> +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> +called most of the time without any locks held. It is only called with
> +locks held for file backed mappings that are truncated. A flag indicates
> +in which mode we are. A driver can use that mechanism to f.e.
> +delay the freeing of the pages during truncate until no locks are held.

That sucks big time.  What do we need to do to get the callback
functions called in non-atomic context?

> +Pages must be marked dirty if dirty bits are found to be set in
> +the external ptes during unmap.

That sentence is too vague.  Define "marked dirty"?
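
(Presumably something like the following is meant; this is a guess at the
intended semantics, not text from the patch: before the external reference
is dropped, a set dirty bit in the external pte is transferred to the
struct page.)

	if (mydrv_external_pte_dirty(epte))	/* hypothetical driver helper */
		set_page_dirty(page);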

> +The *release* method is called when a Linux process exits. It is run before

We'd conventionally use a notation such as "->release()" here, rather than
the asterisks.

> +the pages and mappings of a process are torn down and gives the device driver
> +a chance to zap all the external mappings in one go.

I assume what you mean here is that ->release() is called during exit()
when the final reference to an mm is being dropped.

> +An example for a code that can be used to build a notifier mechanism into
> +a device driver can be found in the file
> +Documentation/mmu_notifier/skeleton.c

Should that be in samples/?

> +mmu_rmap_notifier mechanism (XPMEM etc)
> +---------------------------------------
> +The mmu_rmap_notifier allows the device driver to implement their own rmap

s/their/its/

> +and allows the device driver to sleep during page eviction. This is necessary
> +for complex drivers that f.e. allow the sharing of memory between processes
> +running on different Linux instances (typically over a network or in a
> +partitioned NUMA system).
> +
> +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> +*before* the Linux rmaps are walked. At that point only the page lock is
> +held. The invalidate_page() function must walk the driver rmaps and evict
> +all the references to the page.

What happens if it cannot do so?

> +There is no process information available before the rmaps are consulted.

Not sure what that sentence means.  I guess "available to the core VM"?

> +The notifier mechanism can therefore not be attached to an mm_struct. Instead
> +it is a global callback list. Having to perform a callback for each and every
> +page that is reclaimed would be inefficient. Therefore we add an additional
> +page flag: PageRmapExternal().

How many page flags are left?

Is this feature important enough to justify consumption of another one?

> Only pages that are marked with this bit can
> +be exported and the rmap callbacks will only be performed for pages marked
> +that way.

"exported": new term, unclear what it means.

> +The required additional Page flag is only availabe in 64 bit mode and
> +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.

whoa.  Is that good?  You just made your feature unavailable on the great
majority of Linux systems.

> +An example of code to build a mmu_notifier mechanism with rmap capabilty
> +can be found in Documentation/mmu_notifier/skeleton_rmap.c
> +
> +February 9, 2008,
> +	Christoph Lameter <clameter@sgi.com
> +
> +Index: linux-2.6/include/linux/mm_types.h
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h	2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/include/linux/mm_types.h	2008-02-14 21:17:51.000000000 -0800
> @@ -159,6 +159,12 @@ struct vm_area_struct {
>  #endif
>  };
>  
> +struct mmu_notifier_head {
> +#ifdef CONFIG_MMU_NOTIFIER
> +	struct hlist_head head;
> +#endif
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -228,6 +234,7 @@ struct mm_struct {
>  #ifdef CONFIG_CGROUP_MEM_CONT
>  	struct mem_cgroup *mem_cgroup;
>  #endif
> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/include/linux/mmu_notifier.h	2008-02-14 22:42:28.000000000 -0800
> @@ -0,0 +1,180 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +/*
> + * MMU motifier

typo

> + * Notifier functions for hardware and software that establishes external
> + * references to pages of a Linux system. The notifier calls ensure that
> + * external mappings are removed when the Linux VM removes memory ranges
> + * or individual pages from a process.

So the callee cannot fail.  hm.  If it can't block, it's likely screwed in
that case.  In other cases it might be screwed anyway.  I suspect we'll
need to be able to handle callee failure.

> + * These fall into two classes:
> + *
> + * 1. mmu_notifier
> + *
> + * 	These are callbacks registered with an mm_struct. If pages are
> + * 	removed from an address space then callbacks are performed.

"to be removed", I guess.  It's called before the page is actually removed?

> + * 	Spinlocks must be held in order to walk reverse maps. The
> + * 	invalidate_page() callbacks are performed with spinlocks held.

hm, yes, problem.   Permitting callee failure might be good enough.

> + * 	The invalidate_range_start/end callbacks can be performed in contexts
> + * 	where sleeping is allowed or in atomic contexts. A flag is passed
> + * 	to indicate an atomic context.

We generally would prefer separate callbacks, rather than a unified
callback with a mode flag.
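
(Something like this, presumably; a sketch only, the _atomic name is
invented:)

	/* sleepable variant: never called with spinlocks held */
	void (*invalidate_range_begin)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end);

	/* atomic variant: may be called with the pte lock held */
	void (*invalidate_range_begin_atomic)(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end);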


> + *	Pages must be marked dirty if dirty bits are found to be set in
> + *	the external ptes.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rcupdate.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier_ops;
> +
> +struct mmu_notifier {
> +	struct hlist_node hlist;
> +	const struct mmu_notifier_ops *ops;
> +};
> +
> +struct mmu_notifier_ops {
> +	/*
> +	 * The release notifier is called when no other execution threads
> +	 * are left. Synchronization is not necessary.

"and the mm is about to be destroyed"?

> +	 */
> +	void (*release)(struct mmu_notifier *mn,
> +			struct mm_struct *mm);
> +
> +	/*
> +	 * age_page is called from contexts where the pte_lock is held
> +	 */
> +	int (*age_page)(struct mmu_notifier *mn,
> +			struct mm_struct *mm,
> +			unsigned long address);

This wasn't documented.

> +	/*
> +	 * invalidate_page is called from contexts where the pte_lock is held.
> +	 */
> +	void (*invalidate_page)(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long address);
> +
> +	/*
> +	 * invalidate_range_begin() and invalidate_range_end() must be paired.
> +	 *
> +	 * Multiple invalidate_range_begin/ends may be nested or called
> +	 * concurrently.

Under what circumstances would they be nested?

> That is legit. However, no new external references

references to what?

> +	 * may be established as long as any invalidate_xxx is running or
> +	 * any invalidate_range_begin() and has not been completed through a

stray "and".

> +	 * corresponding call to invalidate_range_end().
> +	 *
> +	 * Locking within the notifier needs to serialize events correspondingly.
> +	 *
> +	 * invalidate_range_begin() must clear all references in the range
> +	 * and stop the establishment of new references.

and stop the establishment of new references within the range, I assume?

If so, that's putting a heck of a lot of complexity into the driver, isn't
it?  It needs to temporarily remember an arbitrarily large number of
regions in this mm against which references may not be taken?
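
(One way a driver could avoid remembering individual regions, sketched here
with invented names: keep a per-mm count of invalidations in flight plus a
sequence number, bump them in invalidate_range_begin()/end(), and have the
path that establishes external references retry while an invalidation is
running or has completed since its snapshot was taken.)

	struct mydrv_mm {
		spinlock_t	lock;
		int		invalidate_count;	/* begin()s without a matching end() */
		unsigned long	invalidate_seq;		/* bumped by every end() */
	};

	/* called from the fault path before installing an external pte */
	static bool mydrv_may_establish(struct mydrv_mm *d, unsigned long seq_snapshot)
	{
		bool ok;

		spin_lock(&d->lock);
		ok = d->invalidate_count == 0 && d->invalidate_seq == seq_snapshot;
		spin_unlock(&d->lock);
		return ok;		/* caller retries the fault if false */
	}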

> +	 * invalidate_range_end() reenables the establishment of references.

within the range?

> +	 * atomic indicates that the function is called in an atomic context.
> +	 * We can sleep if atomic == 0.
> +	 *
> +	 * invalidate_range_begin() must remove all external references.
> +	 * There will be no retries as with invalidate_page().
> +	 */
> +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int atomic);
> +
> +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int atomic);
> +};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * Must hold the mmap_sem for write.
> + *
> + * RCU is used to traverse the list. A quiescent period needs to pass
> + * before the notifier is guaranteed to be visible to all threads
> + */
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> +				  struct mm_struct *mm);
> +
> +/*
> + * Must hold mmap_sem for write.
> + *
> + * A quiescent period needs to pass before the mmu_notifier structure
> + * can be released. mmu_notifier_release() will wait for a quiescent period
> + * after calling the ->release callback. So it is safe to call
> + * mmu_notifier_unregister from the ->release function.
> + */
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> +				    struct mm_struct *mm);
> +
> +
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +extern int mmu_notifier_age_page(struct mm_struct *mm,
> +				 unsigned long address);

There's the mysterious age_page again.

> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> +	INIT_HLIST_HEAD(&mnh->head);
> +}
> +
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		struct mmu_notifier *__mn;				\
> +		struct hlist_node *__n;					\
> +									\
> +		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> +			rcu_read_lock();				\
> +			hlist_for_each_entry_rcu(__mn, __n,		\
> +					     &(mm)->mmu_notifier.head,	\
> +					     hlist)			\
> +				if (__mn->ops->function)		\
> +					__mn->ops->function(__mn,	\
> +							    mm,		\
> +							    args);	\
> +			rcu_read_unlock();				\
> +		}							\
> +	} while (0)

The macro references its args more than once.  Anyone who does

	mmu_notifier(function, some_function_which_has_side_effects())

will get a surprise.  Use temporaries.
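
(Something along these lines for the mm argument, perhaps.  Note that the
variadic args would still be evaluated once per registered notifier, which
is an argument for moving the loop into an out-of-line function instead:)

	#define mmu_notifier(function, mm, args...)			\
		do {							\
			struct mm_struct *__mm = (mm);			\
			struct mmu_notifier *__mn;			\
			struct hlist_node *__n;				\
									\
			if (unlikely(!hlist_empty(&__mm->mmu_notifier.head))) { \
				rcu_read_lock();			\
				hlist_for_each_entry_rcu(__mn, __n,	\
						&__mm->mmu_notifier.head, \
						hlist)			\
					if (__mn->ops->function)	\
						__mn->ops->function(__mn, __mm, args); \
				rcu_read_unlock();			\
			}						\
		} while (0)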

> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		if (0) {						\
> +			struct mmu_notifier *__mn;			\
> +									\
> +			__mn = (struct mmu_notifier *)(0x00ff);		\
> +			__mn->ops->function(__mn, mm, args);		\
> +		};							\
> +	} while (0)

That's a bit weird.  Can't we do the old

	(void)function;
	(void)mm;

trick?  Or make it a static inline function?
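
(A sketch of the (void) form.  It silences the unused-variable warning for
mm but, unlike the if (0) version, no longer type-checks the remaining
arguments against the real callback:)

	#define mmu_notifier(function, mm, args...)	do { (void)(mm); } while (0)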

> +static inline void mmu_notifier_register(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_release(struct mm_struct *mm) {}
> +static inline int mmu_notifier_age_page(struct mm_struct *mm,
> +				unsigned long address)
> +{
> +	return 0;
> +}
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/mm/Kconfig	2008-02-14 21:17:51.000000000 -0800
> @@ -193,3 +193,7 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> +	def_bool y
> +	bool "MMU notifier, for paging KVM/RDMA"

Why is this not selectable?  The help seems a bit brief.

Does this cause 32-bit systems to drag in a bunch of code they're not
allowed to ever use?

> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/mm/Makefile	2008-02-14 21:17:51.000000000 -0800
> @@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>  
> Index: linux-2.6/mm/mmu_notifier.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/mmu_notifier.c	2008-02-14 22:41:55.000000000 -0800
> @@ -0,0 +1,76 @@
> +/*
> + *  linux/mm/mmu_notifier.c
> + *
> + *  Copyright (C) 2008  Qumranet, Inc.
> + *  Copyright (C) 2008  SGI
> + *  		Christoph Lameter <clameter@sgi.com>
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +
> +/*
> + * No synchronization. This function can only be called when only a single
> + * process remains that performs teardown.
> + */
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n, *t;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		hlist_for_each_entry_safe(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			hlist_del_init(&mn->hlist);
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);

We do this a lot, but back in the old days people didn't like optional
callbacks which can be NULL.  If we expect that mmu_notifier_ops.release is
usually implemented, then just unconditionally call it and require that all
clients implement it.  Perhaps provide an exported-to-modules stub in core
kernel for clients which don't want to implement ->release().

> +		}
> +	}
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->age_page can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int young = 0;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_rcu(mn, n,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->age_page)
> +				young |= mn->ops->age_page(mn, mm, address);
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	return young;
> +}

should the rcu_read_lock() cover the hlist_empty() test?

This function looks like it was tossed in at the last minute.  It's
mysterious, undocumented, poorly commented, poorly named.  A better name
would be one which has some correlation with the return value.

Because anyone who looks at some code which does

	if (mmu_notifier_age_page(mm, address))
		...

has to go and reverse-engineer the implementation of
mmu_notifier_age_page() to work out under which circumstances the "..."
will be executed.  But this should be apparent just from reading the callee
implementation.

This function *really* does need some documentation.  What does it *mean*
when the ->age_page() from some of the notifiers returned "1" and the
->age_page() from some other notifiers returned zero?  Dunno.
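
(For what it's worth, the conventional meaning of such an aging hook,
sketched here with invented names: test-and-clear the external
referenced/accessed bit and return nonzero if it was set, so that reclaim
can treat the page as recently used.  The OR in mmu_notifier_age_page()
then presumably means "at least one external user touched the page
recently".)

	/* hypothetical driver callback */
	static int mydrv_age_page(struct mmu_notifier *mn,
				  struct mm_struct *mm,
				  unsigned long address)
	{
		/* returns the previous state of the device-side accessed bit */
		return mydrv_test_and_clear_accessed(mn, address);
	}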


^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-02-15  6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
@ 2008-02-15  6:49 ` Christoph Lameter
  2008-02-16  3:37   ` Andrew Morton
  2008-02-18 22:33   ` Roland Dreier
  0 siblings, 2 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-15  6:49 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 19064 bytes --]

MMU notifiers are used for hardware and software that establishes
external references to pages managed by the Linux kernel. These are
page table entries or tlb entries or something else that allows
hardware (such as DMA engines, scatter gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (Virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.

The MMU notifier will notify the device driver that subscribes to such
a notifier that the VM is going to do something with the memory
mapped by that device. The device must then drop references for the
indicated memory area. The references may be reestablished later.

The notification scheme is much better than the current schemes of
avoiding the danger of the VM removing pages that are externally
mapped. We currently either mlock pages used for RDMA, XPmem etc
in memory or increase the refcount to pin the pages. Increasing
the refcount makes it impossible for the VM to reclaim the page.

Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of what the POSIX
specification says about the role of mlock. Mlock does *not* pin pages in
memory. Mlock just means that the page may not be moved to swap.

Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).
The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.

The alternative of increasing the page count to pin pages is also not
that enticing since there will be continual attempts to reclaim
or migrate these pages.

The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning. The VM gains control of its memory again
and the memory that has external references can be managed like regular
memory.

This patch: Core portion

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 Documentation/mmu_notifier/README |  105 ++++++++++++++++++++++
 include/linux/mm_types.h          |    7 +
 include/linux/mmu_notifier.h      |  180 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                     |    2 
 mm/Kconfig                        |    4 
 mm/Makefile                       |    1 
 mm/mmap.c                         |    2 
 mm/mmu_notifier.c                 |   76 ++++++++++++++++
 8 files changed, 377 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README	2008-02-14 22:27:19.000000000 -0800
@@ -0,0 +1,105 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establishes
+external references to pages managed by the Linux kernel. These are
+page table entriews or tlb entries or something else that allows
+hardware (such as DMA engines, scatter gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (Virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier will notify the device driver that subscribes to such
+a notifier that the VM is going to do something with the memory
+mapped by that device. The device must then drop references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current schemes of
+dealing with the danger of the VM removing pages.
+We currently mlock pages used for RDMA, XPmem etc in memory or
+increase the refcount of the pages.
+
+Both cause problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. Mlock is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+The page refcount is used to track current users of a page struct.
+Artificially inflating the refcount means that the VM cannot track
+down all references to a page. It will not be able to reclaim or
+move a page. However, the core code will try again and again because
+the assumption is that an elevated refcount is a temporary situation.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allow
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example for a code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows the device driver to implement their own rmap
+and allows the device driver to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageRmapExternal(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional Page flag is only availabe in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build a mmu_notifier mechanism with rmap capabilty
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+	Christoph Lameter <clameter@sgi.com
+
+Index: linux-2.6/include/linux/mm_types.h
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-02-14 21:17:51.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-02-14 22:42:28.000000000 -0800
@@ -0,0 +1,180 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU motifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If pages are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * 	The invalidate_range_start/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/*
+	 * invalidate_page is called from contexts where the pte_lock is held.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() and has not been completed through a
+	 * corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 *
+	 * invalidate_range_begin() must remove all external references.
+	 * There will be no retries as with invalidate_page().
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-02-14 21:17:51.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-02-14 21:17:51.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-02-14 22:41:55.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-02-14 21:17:51.000000000 -0800
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-14 22:42:02.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-05 18:05   ` Andy Whitcroft
  2008-02-05 18:17     ` Peter Zijlstra
@ 2008-02-05 18:19     ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-02-05 18:19 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Tue, 5 Feb 2008, Andy Whitcroft wrote:

> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> 
> Does this ->release actually release the 'mn' and its associated hlist?
> I see in this thread that this ordering is deemed "use after free" which
> implies so.

Right that was fixed in a later release and discussed extensively later. 
See V5.

> I am not sure it makes sense to add a _safe_rcu variant.  As I understand
> things an _safe variant is used where we are going to remove the current

It was dropped in V5.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-05 18:05   ` Andy Whitcroft
@ 2008-02-05 18:17     ` Peter Zijlstra
  2008-02-05 18:19     ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2008-02-05 18:17 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins


On Tue, 2008-02-05 at 18:05 +0000, Andy Whitcroft wrote:

> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> 
> Does this ->release actually release the 'mn' and its associated hlist?
> I see in this thread that this ordering is deemed "use after free" which
> implies so.
> 
> If it does that seems wrong.  This is an RCU hlist, therefore the list
> integrity must be maintained through the next grace period in case there
> are parallel readers using the element, in particular its forward
> pointer for traversal.

That is not quite so, list elements must be preserved, not the list
order.

> 
> > +			hlist_del(&mn->hlist);
> 
> For this to be updating the list, you must have some form of "write-side"
> exclusion as these primitives are not "parallel write safe".  It would
> be helpful for this routine to state what that write side exclusion is.

Yeah, has been noticed, read on in the thread :-)

> I am not sure it makes sense to add a _safe_rcu variant.  As I understand
> things an _safe variant is used where we are going to remove the current
> list element in the middle of a list walk.  However the key feature of an
> RCU data structure is that it will always be in a "safe" state until any
> parallel readers have completed.  For an hlist this means that the removed
> entry and its forward link must remain valid for as long as there may be
> a parallel reader traversing this list, ie. until the next grace period.
> If this link is valid for the parallel reader, then it must be valid for
> us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
> to cope in the face of entries being unlinked as we traverse the list.

It does make sense, hlist_del_rcu() maintains the fwd reference, but it
does unlink it from the list proper. As long as there is a write side
exclusion around the actual removal as you noted.

rcu_read_lock();
hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) {

	if (foo) {
		spin_lock(write_lock);
		hlist_del_rcu(pos);
		spin_unlock(write_lock);
	}
}
rcu_read_unlock();

is a safe construct in that the list itself stays a proper list, and
even readers that might be caught on the to-be-deleted entries will have a
fwd way out.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
                     ` (3 preceding siblings ...)
  2008-01-29 16:07   ` Robin Holt
@ 2008-02-05 18:05   ` Andy Whitcroft
  2008-02-05 18:17     ` Peter Zijlstra
  2008-02-05 18:19     ` Christoph Lameter
  4 siblings, 2 replies; 119+ messages in thread
From: Andy Whitcroft @ 2008-02-05 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote:
> Core code for mmu notifiers.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
> 
> ---
>  include/linux/list.h         |   14 ++
>  include/linux/mm_types.h     |    6 +
>  include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/page-flags.h   |   10 ++
>  kernel/fork.c                |    2 
>  mm/Kconfig                   |    4 
>  mm/Makefile                  |    1 
>  mm/mmap.c                    |    2 
>  mm/mmu_notifier.c            |  101 ++++++++++++++++++++
>  9 files changed, 350 insertions(+)
> 
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/mm_types.h	2008-01-28 11:35:22.000000000 -0800
> @@ -153,6 +153,10 @@ struct vm_area_struct {
>  #endif
>  };
>  
> +struct mmu_notifier_head {
> +	struct hlist_head head;
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -219,6 +223,8 @@ struct mm_struct {
>  	/* aio bits */
>  	rwlock_t		ioctx_list_lock;
>  	struct kioctx		*ioctx_list;
> +
> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
> @@ -0,0 +1,210 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +/*
> + * MMU motifier
> + *
> + * Notifier functions for hardware and software that establishes external
> + * references to pages of a Linux system. The notifier calls ensure that
> + * the external mappings are removed when the Linux VM removes memory ranges
> + * or individual pages from a process.
> + *
> + * These fall into two classes
> + *
> + * 1. mmu_notifier
> + *
> + * 	These are callbacks registered with an mm_struct. If mappings are
> + * 	removed from an address space then callbacks are performed.
> + * 	Spinlocks must be held in order to the walk reverse maps and the
> + * 	notifications are performed while the spinlock is held.
> + *
> + *
> + * 2. mmu_rmap_notifier
> + *
> + *	Callbacks for subsystems that provide their own rmaps. These
> + *	need to walk their own rmaps for a page. The invalidate_page
> + *	callback is outside of locks so that we are not in a strictly
> + *	atomic context (but we may be in a PF_MEMALLOC context if the
> + *	notifier is called from reclaim code) and are able to sleep.
> + *	Rmap notifiers need an extra page bit and are only available
> + *	on 64 bit platforms. It is up to the subsystem to mark pags
> + *	as PageExternalRmap as needed to trigger the callbacks. Pages
> + *	must be marked dirty if dirty bits are set in the external
> + *	pte.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rcupdate.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier_ops;
> +
> +struct mmu_notifier {
> +	struct hlist_node hlist;
> +	const struct mmu_notifier_ops *ops;
> +};
> +
> +struct mmu_notifier_ops {
> +	/*
> +	 * Note: The mmu_notifier structure must be released with
> +	 * call_rcu() since other processors are only guaranteed to
> +	 * see the changes after a quiescent period.
> +	 */
> +	void (*release)(struct mmu_notifier *mn,
> +			struct mm_struct *mm);
> +
> +	int (*age_page)(struct mmu_notifier *mn,
> +			struct mm_struct *mm,
> +			unsigned long address);
> +
> +	void (*invalidate_page)(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long address);
> +
> +	/*
> +	 * lock indicates that the function is called under spinlock.
> +	 */
> +	void (*invalidate_range)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int lock);
> +};
> +
> +struct mmu_rmap_notifier_ops;
> +
> +struct mmu_rmap_notifier {
> +	struct hlist_node hlist;
> +	const struct mmu_rmap_notifier_ops *ops;
> +};
> +
> +struct mmu_rmap_notifier_ops {
> +	/*
> +	 * Called with the page lock held after ptes are modified or removed
> +	 * so that a subsystem with its own rmap's can remove remote ptes
> +	 * mapping a page.
> +	 */
> +	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
> +						struct page *page);
> +};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * Must hold the mmap_sem for write.
> + *
> + * RCU is used to traverse the list. A quiescent period needs to pass
> + * before the notifier is guaranteed to be visible to all threads
> + */
> +extern void __mmu_notifier_register(struct mmu_notifier *mn,
> +				  struct mm_struct *mm);
> +/* Will acquire mmap_sem for write*/
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> +				  struct mm_struct *mm);
> +/*
> + * Will acquire mmap_sem for write.
> + *
> + * A quiescent period needs to pass before the mmu_notifier structure
> + * can be released. mmu_notifier_release() will wait for a quiescent period
> + * after calling the ->release callback. So it is safe to call
> + * mmu_notifier_unregister from the ->release function.
> + */
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> +				    struct mm_struct *mm);
> +
> +
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +extern int mmu_notifier_age_page(struct mm_struct *mm,
> +				 unsigned long address);
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> +	INIT_HLIST_HEAD(&mnh->head);
> +}
> +
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		struct mmu_notifier *__mn;				\
> +		struct hlist_node *__n;					\
> +									\
> +		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> +			rcu_read_lock();				\
> +			hlist_for_each_entry_rcu(__mn, __n,		\
> +					     &(mm)->mmu_notifier.head,	\
> +					     hlist)			\
> +				if (__mn->ops->function)		\
> +					__mn->ops->function(__mn,	\
> +							    mm,		\
> +							    args);	\
> +			rcu_read_unlock();				\
> +		}							\
> +	} while (0)
> +
> +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
> +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
> +
> +extern struct hlist_head mmu_rmap_notifier_list;
> +
> +#define mmu_rmap_notifier(function, args...)				\
> +	do {								\
> +		struct mmu_rmap_notifier *__mrn;			\
> +		struct hlist_node *__n;					\
> +									\
> +		rcu_read_lock();					\
> +		hlist_for_each_entry_rcu(__mrn, __n,			\
> +				&mmu_rmap_notifier_list, 		\
> +						hlist)			\
> +			if (__mrn->ops->function)			\
> +				__mrn->ops->function(__mrn, args);	\
> +		rcu_read_unlock();					\
> +	} while (0);
> +
> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		if (0) {						\
> +			struct mmu_notifier *__mn;			\
> +									\
> +			__mn = (struct mmu_notifier *)(0x00ff);		\
> +			__mn->ops->function(__mn, mm, args);		\
> +		};							\
> +	} while (0)
> +
> +#define mmu_rmap_notifier(function, args...)				\
> +	do {								\
> +		if (0) {						\
> +			struct mmu_rmap_notifier *__mrn;		\
> +									\
> +			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
> +			__mrn->ops->function(__mrn, args);		\
> +		}							\
> +	} while (0);
> +
> +static inline void mmu_notifier_register(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_release(struct mm_struct *mm) {}
> +static inline int mmu_notifier_age_page(struct mm_struct *mm,
> +				unsigned long address)
> +{
> +	return 0;
> +}
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
> +
> +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +									{}
> +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +									{}
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/page-flags.h	2008-01-28 11:35:22.000000000 -0800
> @@ -105,6 +105,7 @@
>   * 64 bit  |           FIELDS             | ??????         FLAGS         |
>   *         63                            32                              0
>   */
> +#define PG_external_rmap	30	/* Page has external rmap */
>  #define PG_uncached		31	/* Page has been mapped as uncached */
>  #endif
>  
> @@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
>  #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
>  #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
>  
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
> +#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
> +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
> +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
> +							&(page)->flags)
> +#else
> +#define PageExternalRmap(page)	0
> +#endif
> +
>  struct page;	/* forward declaration */
>  
>  extern void cancel_dirty_page(struct page *page, unsigned int account_size);
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/Kconfig	2008-01-28 11:35:22.000000000 -0800
> @@ -193,3 +193,7 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> +	def_bool y
> +	bool "MMU notifier, for paging KVM/RDMA"
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/Makefile	2008-01-28 11:35:22.000000000 -0800
> @@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>  
> Index: linux-2.6/mm/mmu_notifier.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/mmu_notifier.c	2008-01-28 11:35:22.000000000 -0800
> @@ -0,0 +1,101 @@
> +/*
> + *  linux/mm/mmu_notifier.c
> + *
> + *  Copyright (C) 2008  Qumranet, Inc.
> + *  Copyright (C) 2008  SGI
> + *  		Christoph Lameter <clameter@sgi.com>
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n, *t;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);

Does this ->release actually release the 'mn' and its associated hlist?
I see in this thread that this ordering is deemed "use after free" which
implies so.

If it does that seems wrong.  This is an RCU hlist, therefore the list
integrity must be maintained through the next grace period in case there
are parallel readers using the element, in particular its forward
pointer for traversal.

> +			hlist_del(&mn->hlist);

For this to be updating the list, you must have some form of "write-side"
exclusion as these primitives are not "parallel write safe".  It would
be helpful for this routine to state what that write side exclusion is.

> +		}
> +		rcu_read_unlock();
> +		synchronize_rcu();
> +	}
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->age_page can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int young = 0;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_rcu(mn, n,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->age_page)
> +				young |= mn->ops->age_page(mn, mm, address);
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	return young;
> +}
> +
> +/*
> + * Note that all notifiers use RCU. The updates are only guaranteed to be
> + * visible to other processes after a RCU quiescent period!
> + */
> +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	down_write(&mm->mmap_sem);
> +	__mmu_notifier_register(mn, mm);
> +	up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	down_write(&mm->mmap_sem);
> +	hlist_del_rcu(&mn->hlist);
> +	up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
> +
> +static DEFINE_SPINLOCK(mmu_notifier_list_lock);
> +HLIST_HEAD(mmu_rmap_notifier_list);
> +
> +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_register);
> +
> +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_del_rcu(&mrn->hlist);
> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
> +
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/kernel/fork.c	2008-01-28 11:35:22.000000000 -0800
> @@ -51,6 +51,7 @@
>  #include <linux/random.h>
>  #include <linux/tty.h>
>  #include <linux/proc_fs.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> @@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
>  
>  	if (likely(!mm_alloc_pgd(mm))) {
>  		mm->def_flags = 0;
> +		mmu_notifier_head_init(&mm->mmu_notifier);
>  		return mm;
>  	}
>  	free_mm(mm);
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/mmap.c	2008-01-28 11:37:53.000000000 -0800
> @@ -26,6 +26,7 @@
>  #include <linux/mount.h>
>  #include <linux/mempolicy.h>
>  #include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
>  	tlb_finish_mmu(tlb, 0, end);
> +	mmu_notifier_release(mm);
>  
>  	/*
>  	 * Walk the list again, actually closing and freeing it,
> Index: linux-2.6/include/linux/list.h
> ===================================================================
> --- linux-2.6.orig/include/linux/list.h	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/list.h	2008-01-28 11:35:22.000000000 -0800
> @@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
>  		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
>  	     pos = pos->next)
>  
> +/**
> + * hlist_for_each_entry_safe_rcu	- iterate over list of given type
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @n:		temporary pointer
> + * @head:	the head for your list.
> + * @member:	the name of the hlist_node within the struct.
> + */
> +#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
> +	for (pos = (head)->first;					 \
> +	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
> +		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
> +	     pos = n)
> +
>  #else
>  #warning "don't include kernel headers in userspace"
>  #endif /* __KERNEL__ */

I am not sure it makes sense to add a _safe_rcu variant.  As I understand
things a _safe variant is used where we are going to remove the current
list element in the middle of a list walk.  However the key feature of an
RCU data structure is that it will always be in a "safe" state until any
parallel readers have completed.  For an hlist this means that the removed
entry and its forward link must remain valid for as long as there may be
a parallel reader traversing this list, i.e. until the next grace period.
If this link is valid for the parallel reader, then it must be valid for
us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
to cope in the face of entries being unlinked as we traverse the list.
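
For illustration, a minimal reader-side sketch (struct foo and the walk are
hypothetical, not from the patch) of what the plain _rcu iterator already
guarantees while a writer unlinks entries concurrently:

struct foo {
	struct hlist_node link;
	int data;
};

static void walk_foos(struct hlist_head *head)
{
	struct foo *f;
	struct hlist_node *pos;

	rcu_read_lock();
	/*
	 * Even if a writer does hlist_del_rcu() on the entry we are
	 * standing on, the entry and its ->next stay valid until a
	 * grace period has elapsed, so the walk can continue safely.
	 */
	hlist_for_each_entry_rcu(f, pos, head, link)
		pr_debug("data=%d\n", f->data);
	rcu_read_unlock();
}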

-apw

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 23:38           ` Andrea Arcangeli
@ 2008-01-30 23:55             ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-30 23:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Jack Steiner, Avi Kivity, Izik Eidus, Nick Piggin,
	kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > I think Andrea's original concept of the lock in the mmu_notifier_head
> > structure was the best.  I agree with him that it should be a spinlock
> > instead of the rw_lock.
> 
> BTW, I don't see the scalability concern with huge number of tasks:
> the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction;
> up_write(mm->mmap_sem) is always going to scale worse than
> spin_lock(mm->somethingelse); oneinstruction;
> spin_unlock(mm->somethinglese).

If we put it elsewhere in the mm then we increase the size of the memory 
used in the mm_struct.

> Furthermore if we go this route and we don't relay on implicit
> serialization of all the mmu notifier users against exit_mmap
> (i.e. the mmu notifier user must agree to stop calling
> mmu_notifier_register on a mm after the last mmput) the autodisarming
> feature will likely have to be removed or it can't possibly be safe to
> run mmu_notifier_unregister while mmu_notifier_release runs. With the
> auto-disarming feature, there is no way to safely know if
> mmu_notifier_unregister has to be called or not. I'm ok with removing
> the auto-disarming feature and to have as self-contained-as-possible
> locking. Then mmu_notifier_release can just become the
> invalidate_all_after and invalidate_all, invalidate_all_before.

Hmmmm.. exit_mmap is only called when the last reference against the mm 
is removed, right? So no tasks are running anymore. No pages are left. 
Do we need to serialize at all for mmu_notifier_release?
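
(Roughly, the exit path looks like this; a simplified sketch from memory,
not the exact mainline code:)

void mmput(struct mm_struct *mm)
{
	if (atomic_dec_and_test(&mm->mm_users)) {
		exit_aio(mm);
		exit_mmap(mm);	/* with this patch: calls mmu_notifier_release(mm) */
		/* ... */
		mmdrop(mm);	/* drops mm_count; the mm may be freed here */
	}
}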

 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 22:20         ` Robin Holt
@ 2008-01-30 23:38           ` Andrea Arcangeli
  2008-01-30 23:55             ` Christoph Lameter
  0 siblings, 1 reply; 119+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 23:38 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Jack Steiner, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 04:20:35PM -0600, Robin Holt wrote:
> On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote:
> > On Wed, 30 Jan 2008, Jack Steiner wrote:
> > 
> > > Moving to a different lock solves the problem.
> > 
> > Well it gets us back to the issue why we removed the lock. As Robin said 
> > before: If its global then we can have a huge number of tasks contending 
> > for the lock on startup of a process with a large number of ranks. The 
> > reason to go to mmap_sem was that it was placed in the mm_struct and so we 
> > would just have a couple of contentions per mm_struct.
> > 
> > I'll be looking for some other way to do this.
> 
> I think Andrea's original concept of the lock in the mmu_notifier_head
> structure was the best.  I agree with him that it should be a spinlock
> instead of the rw_lock.

BTW, I don't see the scalability concern with a huge number of tasks:
the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction;
up_write(mm->mmap_sem) is always going to scale worse than
spin_lock(mm->somethingelse); oneinstruction;
spin_unlock(mm->somethingelse).

Furthermore if we go this route and we don't rely on implicit
serialization of all the mmu notifier users against exit_mmap
(i.e. the mmu notifier user must agree to stop calling
mmu_notifier_register on a mm after the last mmput) the autodisarming
feature will likely have to be removed or it can't possibly be safe to
run mmu_notifier_unregister while mmu_notifier_release runs. With the
auto-disarming feature, there is no way to safely know if
mmu_notifier_unregister has to be called or not. I'm ok with removing
the auto-disarming feature and to have as self-contained-as-possible
locking. Then mmu_notifier_release can just become the
invalidate_all_after and invalidate_all, invalidate_all_before.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 19:19       ` Christoph Lameter
@ 2008-01-30 22:20         ` Robin Holt
  2008-01-30 23:38           ` Andrea Arcangeli
  0 siblings, 1 reply; 119+ messages in thread
From: Robin Holt @ 2008-01-30 22:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jack Steiner, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman,
	Hugh Dickins

On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Jack Steiner wrote:
> 
> > Moving to a different lock solves the problem.
> 
> Well it gets us back to the issue why we removed the lock. As Robin said 
> before: If its global then we can have a huge number of tasks contending 
> for the lock on startup of a process with a large number of ranks. The 
> reason to go to mmap_sem was that it was placed in the mm_struct and so we 
> would just have a couple of contentions per mm_struct.
> 
> I'll be looking for some other way to do this.

I think Andrea's original concept of the lock in the mmu_notifier_head
structure was the best.  I agree with him that it should be a spinlock
instead of the rw_lock.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 17:10     ` Peter Zijlstra
@ 2008-01-30 19:28       ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

How about just taking the mmap_sem writelock in release? We have only a 
single caller of mmu_notifier_release() in mm/mmap.c and we know that we 
are not holding mmap_sem at that point. So just acquire it when needed?

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 11:21:57.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 11:24:59.000000000 -0800
@@ -18,6 +19,7 @@ void mmu_notifier_release(struct mm_stru
 	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		down_write(&mm->mmap_sem);
 		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
@@ -26,6 +28,7 @@ void mmu_notifier_release(struct mm_stru
 				mn->ops->release(mn, mm);
 		}
 		rcu_read_unlock();
+		up_write(&mm->mmap_sem);
 		synchronize_rcu();
 	}
 }

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:53     ` Jack Steiner
  2008-01-30 16:38       ` Andrea Arcangeli
@ 2008-01-30 19:19       ` Christoph Lameter
  2008-01-30 22:20         ` Robin Holt
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:19 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Jack Steiner wrote:

> Moving to a different lock solves the problem.

Well it gets us back to the issue of why we removed the lock. As Robin said 
before: If it's global then we can have a huge number of tasks contending 
for the lock on startup of a process with a large number of ranks. The 
reason to go to mmap_sem was that it was placed in the mm_struct and so we 
would just have a couple of contentions per mm_struct.

I'll be looking for some other way to do this.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 18:02   ` Robin Holt
  2008-01-30 19:08     ` Christoph Lameter
@ 2008-01-30 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:14 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Ok. So I added the following patch:

---
 include/linux/mmu_notifier.h |    1 +
 mm/mmu_notifier.c            |   12 ++++++++++++
 2 files changed, 13 insertions(+)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 11:09:06.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 11:10:38.000000000 -0800
@@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini
 
 extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
 extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_export_page(struct page *page);
 
 extern struct hlist_head mmu_rmap_notifier_list;
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 11:09:01.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 11:12:10.000000000 -0800
@@ -99,3 +99,15 @@ void mmu_rmap_notifier_unregister(struct
 }
 EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
 
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
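
For illustration only, a hypothetical driver-side sketch (my_rmap_insert and
struct my_remote_pte are made up) of the intended calling sequence:

static void my_export(struct page *page, struct my_remote_pte *rpte)
{
	lock_page(page);
	mmu_rmap_export_page(page);	/* mark before the external rmap entry exists */
	my_rmap_insert(page, rpte);	/* driver-private external rmap insert */
	unlock_page(page);
}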


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 18:02   ` Robin Holt
@ 2008-01-30 19:08     ` Christoph Lameter
  2008-01-30 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:08 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Robin Holt wrote:

> Index: git-linus/mm/mmu_notifier.c
> ===================================================================
> --- git-linus.orig/mm/mmu_notifier.c	2008-01-30 11:43:45.000000000 -0600
> +++ git-linus/mm/mmu_notifier.c	2008-01-30 11:56:08.000000000 -0600
> @@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct
>  }
>  EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
>  
> +void mmu_rmap_export_page(struct page *page)
> +{
> +	SetPageExternalRmap(page);
> +}
> +EXPORT_SYMBOL(mmu_rmap_export_page);

Then mmu_rmap_export_page would have to be called before the subsystem 
establishes the rmap entry for the page. Could we do all PageExternalRmap 
modifications under Pagelock?



^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30  2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-30 15:37   ` Andrea Arcangeli
@ 2008-01-30 18:02   ` Robin Holt
  2008-01-30 19:08     ` Christoph Lameter
  2008-01-30 19:14     ` Christoph Lameter
  1 sibling, 2 replies; 119+ messages in thread
From: Robin Holt @ 2008-01-30 18:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

Back to one of Andrea's points from a couple days ago, I think we still
have a problem with the PageExternalRmap page flag.

If I had two drivers with external rmap implementations, there is no way
I can think of for a simple flag to coordinate a single page being
exported and maintained by the two.

Since the intended use seems to point in the direction that the external
rmap must be kept consistent with all the pages the driver has
exported, and the driver will already need to handle cases where a page
does not appear in its rmap, I would propose that the setting and clearing
be handled in the mmu_notifier code.

This is the first of two patches.  This one is intended as an addition
to patch 1/6.  I will post the other shortly under the patch 3/6 thread.


Index: git-linus/include/linux/mmu_notifier.h
===================================================================
--- git-linus.orig/include/linux/mmu_notifier.h	2008-01-30 11:43:45.000000000 -0600
+++ git-linus/include/linux/mmu_notifier.h	2008-01-30 11:44:35.000000000 -0600
@@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini
 
 extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
 extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_export_page(struct page *page);
 
 extern struct hlist_head mmu_rmap_notifier_list;
 
Index: git-linus/mm/mmu_notifier.c
===================================================================
--- git-linus.orig/mm/mmu_notifier.c	2008-01-30 11:43:45.000000000 -0600
+++ git-linus/mm/mmu_notifier.c	2008-01-30 11:56:08.000000000 -0600
@@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct
 }
 EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
 
+void mmu_rmap_export_page(struct page *page)
+{
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:37   ` Andrea Arcangeli
  2008-01-30 15:53     ` Jack Steiner
@ 2008-01-30 17:10     ` Peter Zijlstra
  2008-01-30 19:28       ` Christoph Lameter
  1 sibling, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2008-01-30 17:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins


On Wed, 2008-01-30 at 16:37 +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n, *t;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del_rcu(&mn->hlist);
> 
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.

Agreed, sorry for this oversight.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:53     ` Jack Steiner
@ 2008-01-30 16:38       ` Andrea Arcangeli
  2008-01-30 19:19       ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 16:38 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 09:53:06AM -0600, Jack Steiner wrote:
> That will also resolve the problem we discussed yesterday. 
> I want to unregister my mmu_notifier when a GRU segment is
> unmapped. This would not necessarily be at task termination.

My proof that there is something wrong in the smp locking of the
current code is very simple: it can't be right to use
hlist_for_each_entry_safe_rcu and rcu_read_lock inside
mmu_notifier_release, and then to call hlist_del_rcu without any
spinlock or semaphore. If we walk the list with
hlist_for_each_entry_safe_rcu (and not with
hlist_for_each_entry_safe), it means the list _can_ change from under
us, and in turn the hlist_del_rcu must be surrounded by a spinlock or
semaphore too!

If by design the list _can't_ change from under us and calling
hlist_del_rcu was safe w/o locks, then hlist_for_each_entry_safe is
_surely_ enough for mmu_notifier_release, and rcu_read_lock most
certainly can be removed too.

To make a usage case where the race could trigger, I was thinking of
somebody bumping the mm_count (not mm_users) and registering a
notifier while mmu_notifier_release runs, relying on ->release to
know if it has to run mmu_notifier_unregister. However I now started
wondering how it can rely on ->release to know that, if ->release is
called after hlist_del_rcu, because with the latest changes ->release
will also allow the mn to release itself ;). It's unsafe to call
list_del_rcu twice (the second will crash on a poisoned entry).

This starts to make me think we should remove the auto-disarming
feature and require the notifier-user to have ->release call
mmu_notifier_unregister first and free the "mn" inside ->release
too if needed. Or alternatively the notifier-user can bump mm_count
and call mmu_notifier_unregister before calling mmdrop (like kvm
could do).
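
As a rough sketch of that first option (hypothetical notifier-user code;
struct my_notifier is made up for illustration):

struct my_notifier {
	struct mmu_notifier mn;
	/* ... subsystem state ... */
};

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_notifier *my = container_of(mn, struct my_notifier, mn);

	mmu_notifier_unregister(mn, mm);	/* take ourselves off the list first */
	kfree(my);	/* a real user may need to defer this with call_rcu() */
}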

Another approach is to simply define mmu_notifier_release as
implicitly serialized by other code design, with a real lock (not rcu)
against the whole register/unregister operations. So to guarantee the
notifier list can't change from under us while mmu_notifier_release
runs. If we go this route, yes, the auto-disarming hlist_del can be
kept, the current code would have been safe, but to avoid confusion
the mmu_notifier_release shall become this:

void mmu_notifier_release(struct mm_struct *mm)
{
	struct mmu_notifier *mn;
	struct hlist_node *n, *t;

	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
		hlist_for_each_entry_safe(mn, n, t,
					  &mm->mmu_notifier.head, hlist) {
			hlist_del(&mn->hlist);
			if (mn->ops->release)
				mn->ops->release(mn, mm);
		}
	}
}

> However, the mmap_sem is already held for write by the core
> VM at the point I would call the unregister function.
> Currently, there is no __mmu_notifier_unregister() defined.
> 
> Moving to a different lock solves the problem.

Unless mmu_notifier_release becomes like the above and we rely on the
user of the mmu notifiers to implement a high-level external lock that
definitely forbids bumping the mm_count of the mm and calling
register/unregister while mmu_notifier_release could run, 1) moving to a
different lock and 2) removing the auto-disarming hlist_del_rcu from
mmu_notifier_release sound like the only possible SMP-safe way.

As far as KVM is concerned mmu_notifier_release could be changed to
the version I wrote above and everything should be ok. For KVM the
mm_count bump is done by the task that also holds a mm_user, so when
exit_mmap runs I don't think the list could possibly change anymore.

Anyway those are details that can be perfected after mainline merging,
so this isn't something to worry about too much right now. My idea is
to keep working to perfect it while I hope progress is being made by
Christoph to merge the mmu notifiers V3 patchset in mainline ;).

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:37   ` Andrea Arcangeli
@ 2008-01-30 15:53     ` Jack Steiner
  2008-01-30 16:38       ` Andrea Arcangeli
  2008-01-30 19:19       ` Christoph Lameter
  2008-01-30 17:10     ` Peter Zijlstra
  1 sibling, 2 replies; 119+ messages in thread
From: Jack Steiner @ 2008-01-30 15:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 04:37:49PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n, *t;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del_rcu(&mn->hlist);
> 
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.

Agree.

That will also resolve the problem we discussed yesterday. 
I want to unregister my mmu_notifier when a GRU segment is
unmapped. This would not necessarily be at task termination.

However, the mmap_sem is already held for write by the core
VM at the point I would call the unregister function.
Currently, there is no __mmu_notifier_unregister() defined.

Moving to a different lock solves the problem.
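
For reference, a minimal sketch of what such a variant could look like
(hypothetical, simply mirroring __mmu_notifier_register; the caller would
already hold mmap_sem for write):

void __mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
	hlist_del_rcu(&mn->hlist);
}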


-- jack

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30  2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-30 15:37   ` Andrea Arcangeli
  2008-01-30 15:53     ` Jack Steiner
  2008-01-30 17:10     ` Peter Zijlstra
  2008-01-30 18:02   ` Robin Holt
  1 sibling, 2 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n, *t;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			hlist_del_rcu(&mn->hlist);

This will race and kernel crash against mmu_notifier_register in
SMP. You should resurrect the per-mmu_notifier_head lock in my last
patch (except it can be converted from a rwlock_t to a regular
spinlock_t) and drop the mmap_sem from
mmu_notifier_register/unregister.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-01-30  2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
@ 2008-01-30  2:29 ` Christoph Lameter
  2008-01-30 15:37   ` Andrea Arcangeli
  2008-01-30 18:02   ` Robin Holt
  0 siblings, 2 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-30  2:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 15337 bytes --]

Core code for mmu notifiers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/list.h         |   14 ++
 include/linux/mm_types.h     |    6 +
 include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   10 ++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |  101 ++++++++++++++++++++
 9 files changed, 350 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-29 16:56:36.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-29 16:56:36.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If mappings are
+ * 	removed from an address space then callbacks are performed.
+ * 	Spinlocks must be held in order to walk the reverse maps and the
+ * 	notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms. It is up to the subsystem to mark pages
+ *	as PageExternalRmap as needed to trigger the callbacks. Pages
+ *	must be marked dirty if dirty bits are set in the external
+ *	pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * Note: The mmu_notifier structure must be released with
+	 * call_rcu() since other processors are only guaranteed to
+	 * see the changes after a quiescent period.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * lock indicates that the function is called under spinlock.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int lock);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmap's can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/* Will acquire mmap_sem for write */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, 		\
+						hlist)			\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0);
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-29 16:56:36.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-29 16:56:36.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-29 16:56:36.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-29 16:57:26.000000000 -0800
@@ -0,0 +1,101 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_safe_rcu(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_rcu(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+		rcu_read_unlock();
+		synchronize_rcu();
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-29 16:56:36.000000000 -0800
@@ -52,6 +52,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-29 16:56:36.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/list.h	2008-01-29 16:56:36.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @n:		temporary pointer
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
+	for (pos = (head)->first;					 \
+	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */

-- 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29 19:49     ` Christoph Lameter
@ 2008-01-29 20:41       ` Avi Kivity
  0 siblings, 0 replies; 119+ messages in thread
From: Avi Kivity @ 2008-01-29 20:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
>   
>>> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>>>  };
>>>       
>> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
>> regression (a minor one though).
>>     
>
> Andrew does not like #ifdefs and it makes it possible to verify calling 
> conventions if !CONFIG_MMU_NOTIFIER.
>
>   

You could define mmu_notifier_head as an empty struct in that case.
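
A minimal sketch of that suggestion (keeping the #ifdef inside the header so
callers compile unchanged either way):

struct mmu_notifier_head {
#ifdef CONFIG_MMU_NOTIFIER
	struct hlist_head head;
#endif
};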

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29 13:59   ` Andrea Arcangeli
  2008-01-29 14:34     ` Andrea Arcangeli
@ 2008-01-29 19:49     ` Christoph Lameter
  2008-01-29 20:41       ` Avi Kivity
  1 sibling, 1 reply; 119+ messages in thread
From: Christoph Lameter @ 2008-01-29 19:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
> >  };
> 
> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
> regression (a minor one though).

Andrew does not like #ifdefs and it makes it possible to verify calling 
conventions if !CONFIG_MMU_NOTIFIER.

> It's out of my reach how can you be ok with lock=1. You said you have
> to block, if you can deal with lock=1 once, why can't you deal with
> lock=1 _always_?

Not sure yet. We may have to do more in that area. Need to have feedback 
from Robin.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
                     ` (2 preceding siblings ...)
  2008-01-29 13:59   ` Andrea Arcangeli
@ 2008-01-29 16:07   ` Robin Holt
  2008-02-05 18:05   ` Andy Whitcroft
  4 siblings, 0 replies; 119+ messages in thread
From: Robin Holt @ 2008-01-29 16:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

I am going to separate my comments into individual replies to help
reduce the chance they are lost.

> +void mmu_notifier_release(struct mm_struct *mm)
...
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);
> +			hlist_del(&mn->hlist);

This is a use-after-free issue.  The hlist_del_rcu needs to be done before
the callout as the structure containing the mmu_notifier structure will
need to be freed from within the ->release callout.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29 13:59   ` Andrea Arcangeli
@ 2008-01-29 14:34     ` Andrea Arcangeli
  2008-01-29 19:49     ` Christoph Lameter
  1 sibling, 0 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 14:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 02:59:14PM +0100, Andrea Arcangeli wrote:
> The down_write is garbage. The caller should put it around
> mmu_notifier_register if something. The same way the caller should
> call synchronize_rcu after mmu_notifier_register if it needs
> synchronous behavior from the notifiers. The default version of
> mmu_notifier_register shouldn't be cluttered with unnecessary locking.

Oops, my spinlock was gone from the notifier head.... so the above
comment is wrong, sorry! I thought down_write was needed to serialize
against some _external_ event, not to serialize the list updates in
place of my explicit lock. The critical section is so small that a
semaphore is the wrong locking choice, that's why I assumed it was for
an external event. Anyway, RCU won't be optimal for a huge flood of
register/unregister; I agree the down_write shouldn't create much
contention, it saves 4 bytes from each mm_struct, and we can always
change it to a proper spinlock later if needed.

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-28 22:06   ` Christoph Lameter
  2008-01-29  0:05   ` Robin Holt
@ 2008-01-29 13:59   ` Andrea Arcangeli
  2008-01-29 14:34     ` Andrea Arcangeli
  2008-01-29 19:49     ` Christoph Lameter
  2008-01-29 16:07   ` Robin Holt
  2008-02-05 18:05   ` Andy Whitcroft
  4 siblings, 2 replies; 119+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 13:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote:
> +struct mmu_notifier_head {
> +	struct hlist_head head;
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -219,6 +223,8 @@ struct mm_struct {
>  	/* aio bits */
>  	rwlock_t		ioctx_list_lock;
>  	struct kioctx		*ioctx_list;
> +
> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>  };

Not sure why you prefer to waste RAM when MMU_NOTIFIER=n; this is a
regression (a minor one though).

> +	/*
> +	 * lock indicates that the function is called under spinlock.
> +	 */
> +	void (*invalidate_range)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int lock);
> +};

It's out of my reach how you can be ok with lock=1. You said you have
to block; if you can deal with lock=1 once, why can't you deal with
lock=1 _always_?

> +/*
> + * Note that all notifiers use RCU. The updates are only guaranteed to be
> + * visible to other processes after a RCU quiescent period!
> + */
> +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	down_write(&mm->mmap_sem);
> +	__mmu_notifier_register(mn, mm);
> +	up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);

The down_write is garbage. The caller should put it around
mmu_notifier_register if anything. In the same way, the caller should
call synchronize_rcu after mmu_notifier_register if it needs
synchronous behavior from the notifiers. The default version of
mmu_notifier_register shouldn't be cluttered with unnecessary locking.
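
For illustration, a rough sketch of that calling convention (hypothetical
caller code; struct my_notifier is made up, and only __mmu_notifier_register
from the patch is used):

static void my_register_notifier(struct my_notifier *my, struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* caller-chosen serialization */
	__mmu_notifier_register(&my->mn, mm);
	up_write(&mm->mmap_sem);
	synchronize_rcu();	/* only if synchronous visibility is needed */
}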

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29  0:05   ` Robin Holt
@ 2008-01-29  1:19     ` Christoph Lameter
  0 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-29  1:19 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, 28 Jan 2008, Robin Holt wrote:

> USE_AFTER_FREE!!!  I made this same comment as well as other relavent
> comments last week.

Must have slipped somehow. Patch needs to be applied after the rcu fix.

Please repeat the other relevant comments if they are still relevant.... I 
thought I had worked through them.



mmu_notifier_release: remove mmu_notifier struct from list before calling ->release

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mmu_notifier.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-28 17:17:05.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 17:17:10.000000000 -0800
@@ -21,9 +21,9 @@ void mmu_notifier_release(struct mm_stru
 		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_rcu(&mn->hlist);
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
-			hlist_del_rcu(&mn->hlist);
 		}
 		rcu_read_unlock();
 		synchronize_rcu();

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-28 22:06   ` Christoph Lameter
@ 2008-01-29  0:05   ` Robin Holt
  2008-01-29  1:19     ` Christoph Lameter
  2008-01-29 13:59   ` Andrea Arcangeli
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 119+ messages in thread
From: Robin Holt @ 2008-01-29  0:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

> +void mmu_notifier_release(struct mm_struct *mm)
...
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);
> +			hlist_del(&mn->hlist);

USE_AFTER_FREE!!!  I made this same comment as well as other relevant
comments last week.


Robin

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-28 22:06   ` Christoph Lameter
  2008-01-29  0:05   ` Robin Holt
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-28 22:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

mmu core: Need to use hlist_del

Wrong type of list del in mmu_notifier_release()

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mmu_notifier.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-28 14:02:18.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 14:02:30.000000000 -0800
@@ -23,7 +23,7 @@ void mmu_notifier_release(struct mm_stru
 					  &mm->mmu_notifier.head, hlist) {
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
-			hlist_del(&mn->hlist);
+			hlist_del_rcu(&mn->hlist);
 		}
 		rcu_read_unlock();
 		synchronize_rcu();


^ permalink raw reply	[flat|nested] 119+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-28 22:06   ` Christoph Lameter
                     ` (4 more replies)
  0 siblings, 5 replies; 119+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 15333 bytes --]

Core code for mmu notifiers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/list.h         |   14 ++
 include/linux/mm_types.h     |    6 +
 include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   10 ++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |  101 ++++++++++++++++++++
 9 files changed, 350 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-28 11:35:22.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If mappings are
+ * 	removed from an address space then callbacks are performed.
+ * 	Spinlocks must be held in order to walk the reverse maps and the
+ * 	notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms. It is up to the subsystem to mark pages
+ *	as PageExternalRmap as needed to trigger the callbacks. Pages
+ *	must be marked dirty if dirty bits are set in the external
+ *	pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * Note: The mmu_notifier structure must be released with
+	 * call_rcu() since other processors are only guaranteed to
+	 * see the changes after a quiescent period.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * lock indicates that the function is called under spinlock.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int lock);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmap's can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/* Will acquire mmap_sem for write */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, 		\
+						hlist)			\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Stub macros that reference the parameters they were passed, so that the
+ * compiler does not complain about unused variables and still performs
+ * proper parameter type checks even if !CONFIG_MMU_NOTIFIER.
+ * The macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		}							\
+	} while (0)
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
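
As an illustration of the interface declared above (not part of the patch itself): a driver that mirrors an mm fills in an mmu_notifier_ops, registers the embedded notifier against that mm, and frees it through call_rcu() as the comment on struct mmu_notifier_ops requires. All my_* names below are hypothetical.

	#include <linux/mmu_notifier.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct my_mmu {
		struct mmu_notifier notifier;	/* ops point at my_ops below */
		struct rcu_head rcu;
		/* ... bookkeeping for the external TLB / sptes ... */
	};

	static void my_invalidate_range(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start, unsigned long end,
					int lock)
	{
		struct my_mmu *m = container_of(mn, struct my_mmu, notifier);

		/*
		 * Drop all external references to [start, end) and shoot
		 * down the external TLB. If lock != 0 a spinlock is held,
		 * so the callback must not sleep.
		 */
		(void)m;
	}

	static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		/* The address space is being torn down: drop all external state. */
	}

	static const struct mmu_notifier_ops my_ops = {
		.release	  = my_release,
		.invalidate_range = my_invalidate_range,
	};

	static void my_mmu_attach(struct my_mmu *m, struct mm_struct *mm)
	{
		m->notifier.ops = &my_ops;
		mmu_notifier_register(&m->notifier, mm); /* takes mmap_sem for write */
	}

	static void my_mmu_free(struct rcu_head *rcu)
	{
		kfree(container_of(rcu, struct my_mmu, rcu));
	}

	static void my_mmu_detach(struct my_mmu *m, struct mm_struct *mm)
	{
		mmu_notifier_unregister(&m->notifier, mm);
		/* wait a quiescent period before freeing the notifier */
		call_rcu(&m->rcu, my_mmu_free);
	}

Registration through mmu_notifier_register() acquires mmap_sem for write; a caller that already holds it for write would use __mmu_notifier_register() instead.
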
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-28 11:35:22.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
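
To illustrate how the new page flag ties into the rmap notifiers declared in mmu_notifier.h above (the flag only exists with CONFIG_MMU_NOTIFIER on 64 bit): a subsystem that keeps its own reverse maps, XPmem-style, marks exported pages and supplies an invalidate_page callback that tears down its remote ptes. The my_* names are hypothetical.

	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>
	#include <linux/init.h>

	static void my_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
					    struct page *page)
	{
		/*
		 * Walk the subsystem's own reverse map and clear every remote
		 * pte still mapping this page; the page must be marked dirty
		 * if a remote pte carried the dirty bit.
		 */
		ClearPageExternalRmap(page);
	}

	static const struct mmu_rmap_notifier_ops my_rmap_ops = {
		.invalidate_page = my_rmap_invalidate_page,
	};

	static struct mmu_rmap_notifier my_rmap_notifier = {
		.ops = &my_rmap_ops,
	};

	static void my_export_page(struct page *page)
	{
		/* Mark the page so the VM invokes the rmap notifiers for it. */
		SetPageExternalRmap(page);
	}

	static int __init my_subsystem_init(void)
	{
		mmu_rmap_notifier_register(&my_rmap_notifier);
		return 0;
	}
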
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-28 11:35:22.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-28 11:35:22.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 11:35:22.000000000 -0800
@@ -0,0 +1,101 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_safe_rcu(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+			hlist_del(&mn->hlist);
+		}
+		rcu_read_unlock();
+		synchronize_rcu();
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processors after an RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
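
The functions above are the receiving side; the callouts into them are added elsewhere in the series. Purely to illustrate the calling convention of the mmu_notifier() dispatch macro and of mmu_notifier_age_page(), here is a hypothetical call site, not an actual hunk of this patch:

	#include <linux/mmu_notifier.h>

	static int example_test_and_clear_young(struct mm_struct *mm,
						unsigned long address)
	{
		/* Fold the external mappings' young state into the VM's view. */
		return mmu_notifier_age_page(mm, address);
	}

	static void example_after_zap(struct mm_struct *mm,
				      unsigned long start, unsigned long end)
	{
		/* lock == 0: no spinlock is held at this call site. */
		mmu_notifier(invalidate_range, mm, start, end, 0);
	}
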
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-28 11:35:22.000000000 -0800
@@ -51,6 +51,7 @@
 #include <linux/random.h>
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-28 11:37:53.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/list.h	2008-01-28 11:35:22.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu	- iterate over an RCU-protected list of given type, safe against removal of entries
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @n:		another &struct hlist_node to use as temporary storage
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
+	for (pos = (head)->first;					 \
+	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */
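
A usage sketch for the new iterator (its user in this patch is mmu_notifier_release() above); struct item and remove_all() are hypothetical:

	#include <linux/list.h>
	#include <linux/rcupdate.h>

	struct item {
		struct hlist_node link;
		int key;
	};

	/* Empty the list; the caller must still exclude concurrent writers. */
	static void remove_all(struct hlist_head *head)
	{
		struct item *it;
		struct hlist_node *pos, *next;

		rcu_read_lock();
		hlist_for_each_entry_safe_rcu(it, pos, next, head, link)
			hlist_del_rcu(&it->link);	/* safe: next was saved first */
		rcu_read_unlock();
	}
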

-- 

end of thread, other threads:[~2008-02-27 22:11 UTC | newest]

Thread overview: 119+ messages
2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-08 22:06 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-02-08 22:06 ` [patch 3/6] mmu_notifier: invalidate_page callbacks Christoph Lameter
2008-02-08 22:06 ` [patch 4/6] mmu_notifier: Skeleton driver for a simple mmu_notifier Christoph Lameter
2008-02-08 22:06 ` [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem) Christoph Lameter
2008-02-08 22:06 ` [patch 6/6] mmu_rmap_notifier: Skeleton for complex driver that uses its own rmaps Christoph Lameter
2008-02-08 22:23 ` [patch 0/6] MMU Notifiers V6 Andrew Morton
2008-02-08 23:32   ` Christoph Lameter
2008-02-08 23:36     ` Robin Holt
2008-02-08 23:41       ` Christoph Lameter
2008-02-08 23:43         ` Robin Holt
2008-02-08 23:56           ` Andrew Morton
2008-02-09  0:05             ` Christoph Lameter
2008-02-09  0:12               ` [ofa-general] " Roland Dreier
2008-02-09  0:16                 ` Christoph Lameter
2008-02-09  0:22                   ` Roland Dreier
2008-02-09  0:36                     ` Christoph Lameter
2008-02-09  1:24                       ` Andrea Arcangeli
2008-02-09  1:27                         ` Christoph Lameter
2008-02-09  1:56                           ` Andrea Arcangeli
2008-02-09  2:16                             ` Christoph Lameter
2008-02-09 12:55                               ` Rik van Riel
2008-02-09 21:46                                 ` Christoph Lameter
2008-02-11 22:40                                   ` Demand paging for memory regions (was Re: MMU Notifiers V6) Roland Dreier
2008-02-12 22:01                                     ` Steve Wise
2008-02-12 22:10                                       ` Christoph Lameter
2008-02-12 22:41                                         ` [ofa-general] Re: Demand paging for memory regions Roland Dreier
2008-02-12 23:14                                           ` Felix Marti
2008-02-13  0:57                                             ` Christoph Lameter
2008-02-14 15:09                                             ` Steve Wise
2008-02-14 15:53                                               ` Robin Holt
2008-02-14 16:23                                                 ` Steve Wise
2008-02-14 17:48                                                   ` Caitlin Bestler
2008-02-14 19:39                                               ` Christoph Lameter
2008-02-14 20:17                                                 ` Caitlin Bestler
2008-02-14 20:20                                                   ` Christoph Lameter
2008-02-14 22:43                                                     ` Caitlin Bestler
2008-02-14 22:48                                                       ` Christoph Lameter
2008-02-15  1:26                                                         ` Caitlin Bestler
2008-02-15  2:37                                                           ` Christoph Lameter
2008-02-15 18:09                                                             ` Caitlin Bestler
2008-02-15 18:45                                                               ` Christoph Lameter
2008-02-15 18:53                                                                 ` Caitlin Bestler
2008-02-15 20:02                                                                   ` Christoph Lameter
2008-02-15 20:14                                                                     ` Caitlin Bestler
2008-02-15 22:50                                                                       ` Christoph Lameter
2008-02-15 23:50                                                                         ` Caitlin Bestler
2008-02-12 23:23                                           ` Jason Gunthorpe
2008-02-13  1:01                                             ` Christoph Lameter
2008-02-13  1:26                                               ` Jason Gunthorpe
2008-02-13  1:45                                                 ` Steve Wise
2008-02-13  2:35                                                 ` Christoph Lameter
2008-02-13  3:25                                                   ` Jason Gunthorpe
2008-02-13  3:56                                                     ` Patrick Geoffray
2008-02-13  4:26                                                       ` Jason Gunthorpe
2008-02-13  4:47                                                         ` Patrick Geoffray
2008-02-13 18:51                                                     ` Christoph Lameter
2008-02-13 19:51                                                       ` Jason Gunthorpe
2008-02-13 20:36                                                         ` Christoph Lameter
2008-02-13  4:09                                                   ` Christian Bell
2008-02-13 19:00                                                     ` Christoph Lameter
2008-02-13 19:46                                                       ` Christian Bell
2008-02-13 20:32                                                         ` Christoph Lameter
2008-02-13 22:44                                                           ` Kanoj Sarcar
2008-02-13 23:02                                                             ` Christoph Lameter
2008-02-13 23:43                                                               ` Kanoj Sarcar
2008-02-13 23:48                                                                 ` Jesse Barnes
2008-02-14  0:56                                                                 ` [ofa-general] " Andrea Arcangeli
2008-02-14 19:35                                                                 ` Christoph Lameter
2008-02-13 23:23                                                     ` Pete Wyckoff
2008-02-14  0:01                                                       ` Jason Gunthorpe
2008-02-27 22:11                                                         ` Christoph Lameter
2008-02-13  1:55                                               ` Christian Bell
2008-02-13  2:19                                                 ` Christoph Lameter
2008-02-13  0:56                                           ` Christoph Lameter
2008-02-13 12:11                                           ` Christoph Raisch
2008-02-13 19:02                                             ` Christoph Lameter
2008-02-09  0:12               ` [patch 0/6] MMU Notifiers V6 Andrew Morton
2008-02-09  0:18                 ` Christoph Lameter
2008-02-13 14:31 ` Jack Steiner
  -- strict thread matches above, loose matches on Subject: below --
2008-02-15  6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
2008-02-15  6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-16  3:37   ` Andrew Morton
2008-02-16  8:45     ` Avi Kivity
2008-02-16  8:56       ` Andrew Morton
2008-02-16  9:21         ` Avi Kivity
2008-02-16 10:41     ` Brice Goglin
2008-02-16 10:58       ` Andrew Morton
2008-02-16 19:31         ` Christoph Lameter
2008-02-16 19:21     ` Christoph Lameter
2008-02-17  3:01       ` Andrea Arcangeli
2008-02-17 12:24         ` Robin Holt
2008-02-17  5:04     ` Doug Maxey
2008-02-18 22:33   ` Roland Dreier
2008-01-30  2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
2008-01-30  2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-30 15:37   ` Andrea Arcangeli
2008-01-30 15:53     ` Jack Steiner
2008-01-30 16:38       ` Andrea Arcangeli
2008-01-30 19:19       ` Christoph Lameter
2008-01-30 22:20         ` Robin Holt
2008-01-30 23:38           ` Andrea Arcangeli
2008-01-30 23:55             ` Christoph Lameter
2008-01-30 17:10     ` Peter Zijlstra
2008-01-30 19:28       ` Christoph Lameter
2008-01-30 18:02   ` Robin Holt
2008-01-30 19:08     ` Christoph Lameter
2008-01-30 19:14     ` Christoph Lameter
2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-28 22:06   ` Christoph Lameter
2008-01-29  0:05   ` Robin Holt
2008-01-29  1:19     ` Christoph Lameter
2008-01-29 13:59   ` Andrea Arcangeli
2008-01-29 14:34     ` Andrea Arcangeli
2008-01-29 19:49     ` Christoph Lameter
2008-01-29 20:41       ` Avi Kivity
2008-01-29 16:07   ` Robin Holt
2008-02-05 18:05   ` Andy Whitcroft
2008-02-05 18:17     ` Peter Zijlstra
2008-02-05 18:19     ` Christoph Lameter
