LKML Archive on lore.kernel.org
* [patch 0/6] [RFC] MMU Notifiers V2
@ 2008-01-28 20:28 Christoph Lameter
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
                   ` (5 more replies)
  0 siblings, 6 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what is tracked by the kernel's rmaps.

Issues:

- Feedback from users of the callbacks for KVM, RDMA, XPmem and GRU

- RCU quiescent periods are required on registering and unregistering
  notifiers to guarantee visibility to other processors.
  Currently only mmu_notifier_release() does the correct thing.
  It is up to the user to provide RCU quiescent periods for the
  register/unregister functions if they are called outside of the
  ->release method (a sketch of this follows below).
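
As a minimal sketch (not part of this series) of how a hypothetical user
would provide those quiescent periods around register/unregister: only
mmu_notifier_register()/mmu_notifier_unregister() and the ops structure
come from these patches, everything named my_* is made up.

#include <linux/mmu_notifier.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical subsystem state embedding a notifier. */
struct my_ext_mmu {
	struct mmu_notifier mn;
	/* ... external TLB / rmap state ... */
};

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* A real user would tear down its external state here. */
}

static const struct mmu_notifier_ops my_ext_mmu_ops = {
	.release	= my_release,
};

static struct my_ext_mmu *my_ext_mmu_attach(struct mm_struct *mm)
{
	struct my_ext_mmu *e = kzalloc(sizeof(*e), GFP_KERNEL);

	if (!e)
		return NULL;
	e->mn.ops = &my_ext_mmu_ops;
	mmu_notifier_register(&e->mn, mm);	/* takes mmap_sem for write */
	synchronize_rcu();			/* notifier visible on all CPUs */
	return e;
}

static void my_ext_mmu_detach(struct my_ext_mmu *e, struct mm_struct *mm)
{
	mmu_notifier_unregister(&e->mn, mm);	/* takes mmap_sem for write */
	synchronize_rcu();		/* wait out concurrent list walkers */
	kfree(e);			/* only now safe to free */
}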


Andrea's mmu_notifier #4 -> RFC V1

- Merge subsystem rmap based with Linux rmap based approach
- Move Linux rmap based notifiers out of macro
- Try to account for what locks are held while the notifiers are
  called.
- Develop a patch sequence that separates out the different types of
  hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
  already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range() to indicate whether a spinlock
  is held.
- Add invalidate_all()

-- 


* [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-28 22:06   ` Christoph Lameter
                     ` (4 more replies)
  2008-01-28 20:28 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
                   ` (4 subsequent siblings)
  5 siblings, 5 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 15333 bytes --]

Core code for mmu notifiers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/list.h         |   14 ++
 include/linux/mm_types.h     |    6 +
 include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   10 ++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |  101 ++++++++++++++++++++
 9 files changed, 350 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-28 11:35:22.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If mappings are
+ * 	removed from an address space then callbacks are performed.
+ * 	Spinlocks must be held in order to walk the reverse maps and the
+ * 	notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms. It is up to the subsystem to mark pages
+ *	as PageExternalRmap as needed to trigger the callbacks. Pages
+ *	must be marked dirty if dirty bits are set in the external
+ *	pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * Note: The mmu_notifier structure must be released with
+	 * call_rcu() since other processors are only guaranteed to
+	 * see the changes after a quiescent period.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * lock indicates that the function is called under spinlock.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int lock);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmaps can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/* Will acquire mmap_sem for write*/
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, 		\
+						hlist)			\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0);
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-28 11:35:22.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-28 11:35:22.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-28 11:35:22.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 11:35:22.000000000 -0800
@@ -0,0 +1,101 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_safe_rcu(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+			hlist_del(&mn->hlist);
+		}
+		rcu_read_unlock();
+		synchronize_rcu();
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending on whether the mapping
+ * previously existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-28 11:35:22.000000000 -0800
@@ -51,6 +51,7 @@
 #include <linux/random.h>
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-28 11:37:53.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h	2008-01-28 11:35:20.000000000 -0800
+++ linux-2.6/include/linux/list.h	2008-01-28 11:35:22.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @n:		temporary pointer
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
+	for (pos = (head)->first;					 \
+	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */

-- 


* [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-29 16:20   ` Andrea Arcangeli
  2008-01-28 20:28 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_range_callbacks --]
[-- Type: text/plain, Size: 4671 bytes --]

The invalidation of address ranges in a mm_struct needs to be
performed when pages are removed or their permissions etc. change.
Most of the VM address space changes can use the range invalidate
callback.

invalidate_range() is generally called with mmap_sem held but
no spinlocks active. If invalidate_range() is called with
locks held, then we pass a flag into invalidate_range().

Comments state that mmap_sem must be held for
remap_pfn_range() but various drivers do not seem to do this.
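
As a rough sketch of how a consumer could honor that flag: only the
invalidate_range signature comes from patch 1/6, the my_ext_tlb_*
helpers are hypothetical.

#include <linux/mmu_notifier.h>

/* Hypothetical helpers provided by the external-MMU driver. */
extern void my_ext_tlb_queue_flush(struct mm_struct *mm,
				   unsigned long start, unsigned long end);
extern void my_ext_tlb_flush_range(struct mm_struct *mm,
				   unsigned long start, unsigned long end);

static void my_invalidate_range(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long start, unsigned long end,
				int lock)
{
	if (lock) {
		/* A spinlock is held by the caller: no sleeping allowed,
		 * so only queue the external flush for later. */
		my_ext_tlb_queue_flush(mm, start, end);
		return;
	}
	/* Only mmap_sem is held: a (possibly sleeping) flush of the
	 * external TLB for [start, end) is fine here. */
	my_ext_tlb_flush_range(mm, start, end);
}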

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/fremap.c  |    2 ++
 mm/hugetlb.c |    2 ++
 mm/memory.c  |   11 +++++++++--
 mm/mmap.c    |    1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-25 19:32:49.000000000 -0800
@@ -15,6 +15,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
@@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-25 19:32:49.000000000 -0800
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -891,6 +892,8 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end,
+		(details ? (details->i_mmap_lock != NULL)  : 0));
 	return end;
 }
 
@@ -1319,7 +1322,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1360,6 +1363,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1443,7 +1447,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1454,6 +1458,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1634,6 +1639,8 @@ gotten:
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
+	mmu_notifier(invalidate_range, mm, address,
+				address + PAGE_SIZE - 1, 0);
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-25 19:31:05.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-25 19:32:49.000000000 -0800
@@ -1748,6 +1748,7 @@ static void unmap_region(struct mm_struc
 	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
 	tlb_finish_mmu(tlb, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 0);
 }
 
 /*
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-25 19:33:58.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-25 19:34:13.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -763,6 +764,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier(invalidate_range, mm, start, end, 1);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);

-- 


* [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-28 20:28 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-29 16:28   ` Robin Holt
  2008-01-28 20:28 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_page_rmap_callbacks --]
[-- Type: text/plain, Size: 1416 bytes --]

Callbacks to remove individual pages if the subsystem has an
rmap capability. The pagelock is held but no spinlocks are held.
The refcount of the page is elevated so that dropping the refcount
in the subsystem will not directly free the page.

The callbacks occur after the Linux rmaps have been walked.
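
For context, a minimal sketch of the receiving side these hooks expect.
SetPageExternalRmap(), the mmu_rmap_notifier structures and
mmu_rmap_notifier_register() come from patch 1/6; the my_* names are
hypothetical.

#include <linux/mmu_notifier.h>
#include <linux/page-flags.h>

/* Hypothetical: tear down every remote pte in the subsystem's own rmap. */
extern void my_ext_rmap_zap_page(struct page *page);

static void my_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
				    struct page *page)
{
	/* Page lock held, refcount elevated: safe to walk our rmap and
	 * remove all remote mappings of this page. */
	my_ext_rmap_zap_page(page);
}

static const struct mmu_rmap_notifier_ops my_mrn_ops = {
	.invalidate_page	= my_rmap_invalidate_page,
};

static struct mmu_rmap_notifier my_mrn = {
	.ops	= &my_mrn_ops,
};

static void my_ext_export_page(struct page *page)
{
	/* Mark the page so page_mkclean()/try_to_unmap() invoke the hook. */
	SetPageExternalRmap(page);
}

/* At subsystem init: mmu_rmap_notifier_register(&my_mrn); */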

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/rmap.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-25 14:24:19.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-25 14:24:38.000000000 -0800
@@ -49,6 +49,7 @@
 #include <linux/rcupdate.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 
@@ -473,6 +474,8 @@ int page_mkclean(struct page *page)
 		struct address_space *mapping = page_mapping(page);
 		if (mapping) {
 			ret = page_mkclean_file(mapping, page);
+			if (unlikely(PageExternalRmap(page)))
+				mmu_rmap_notifier(invalidate_page, page);
 			if (page_test_dirty(page)) {
 				page_clear_dirty(page);
 				ret = 1;
@@ -971,6 +974,9 @@ int try_to_unmap(struct page *page, int 
 	else
 		ret = try_to_unmap_file(page, migration);
 
+	if (unlikely(PageExternalRmap(page)))
+		mmu_rmap_notifier(invalidate_page, page);
+
 	if (!page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;

-- 


* [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-01-28 20:28 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-29 14:03   ` Andrea Arcangeli
  2008-01-28 20:28 ` [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c Christoph Lameter
  2008-01-28 20:28 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter
  5 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_invalidate_page_callbacks --]
[-- Type: text/plain, Size: 2394 bytes --]

These notifiers use the Linux rmaps to perform the callbacks.
In order to walk the rmaps, locks must be held. Callbacks can therefore
only operate in an atomic context.
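
A sketch of what such atomic-context callbacks might look like on the
consumer side; the signatures are the ones from patch 1/6, the
my_ext_tlb_* helpers are hypothetical.

#include <linux/mmu_notifier.h>

extern void my_ext_tlb_zap_pte(struct mm_struct *mm, unsigned long address);
extern int my_ext_tlb_test_and_clear_young(struct mm_struct *mm,
					   unsigned long address);

/* Invoked under the rmap spinlocks: must not sleep, just drop the
 * single external mapping for this mm+address. */
static void my_invalidate_page(struct mmu_notifier *mn,
			       struct mm_struct *mm, unsigned long address)
{
	my_ext_tlb_zap_pte(mm, address);
}

/* Report (and clear) the external referenced state for one address. */
static int my_age_page(struct mmu_notifier *mn,
		       struct mm_struct *mm, unsigned long address)
{
	return my_ext_tlb_test_and_clear_young(mm, address);
}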

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/rmap.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2008-01-25 14:27:01.000000000 -0800
+++ linux-2.6/mm/rmap.c	2008-01-25 14:27:04.000000000 -0800
@@ -288,6 +288,9 @@ static int page_referenced_one(struct pa
 	if (ptep_clear_flush_young(vma, address, pte))
 		referenced++;
 
+	if (mmu_notifier_age_page(mm, address))
+		referenced++;
+
 	/* Pretend the page is referenced if the task has the
 	   swap token and is in the middle of a page fault. */
 	if (mm != current->mm && has_swap_token(mm) &&
@@ -435,6 +438,7 @@ static int page_mkclean_one(struct page 
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -680,7 +684,8 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young(vma, address, pte) ||
+				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}
@@ -688,6 +693,7 @@ static int try_to_unmap_one(struct page 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
+	mmu_notifier(invalidate_page, mm, address);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -815,9 +821,13 @@ static void try_to_unmap_cluster(unsigne
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
+		if (mmu_notifier_age_page(mm, address))
+			continue;
+
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
+		mmu_notifier(invalidate_page, mm, address);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))

-- 


* [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-01-28 20:28 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-28 20:28 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter
  5 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_xip --]
[-- Type: text/plain, Size: 1242 bytes --]

Problem for external rmaps: There is no pagelock held on the page.

Signed-off-by: Robin Holt <holt@sgi.com>

---
 mm/filemap_xip.c |    5 +++++
 1 file changed, 5 insertions(+)

Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2008-01-25 19:39:04.000000000 -0800
+++ linux-2.6/mm/filemap_xip.c	2008-01-25 19:39:06.000000000 -0800
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
 
@@ -183,6 +184,9 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
+	if (PageExternalRmap(page))
+		mmu_rmap_notifier(invalidate_page, page);
+
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
@@ -194,6 +198,7 @@ __xip_unmap (struct address_space * mapp
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
+			mmu_notifier(invalidate_page, mm, address);
 			page_remove_rmap(page, vma);
 			dec_mm_counter(mm, file_rss);
 			BUG_ON(pte_dirty(pteval));

-- 


* [patch 6/6] mmu_notifier: Add invalidate_all()
  2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-01-28 20:28 ` [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c Christoph Lameter
@ 2008-01-28 20:28 ` Christoph Lameter
  2008-01-29 16:31   ` Robin Holt
  5 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 20:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_all --]
[-- Type: text/plain, Size: 1542 bytes --]

When a task exits we can remove all external ptes at once. At that point the
external mmu may also unregister itself from the mmu notifier chain to avoid
future calls.

Note the complications because of RCU. Other processors may not see that the
notifier was unlinked until a quiescent period has passed!
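
One possible consumer-side shape for this hook; the signature including
the dummy argument comes from the patch below, my_ext_tlb_zap_all() is
hypothetical, and unregistering/freeing is left to ->release as patch
1/6 allows.

#include <linux/mmu_notifier.h>

extern void my_ext_tlb_zap_all(struct mm_struct *mm);

/* The dummy argument only exists to satisfy the mmu_notifier() macro. */
static void my_invalidate_all(struct mmu_notifier *mn,
			      struct mm_struct *mm, int dummy)
{
	/* The task is exiting: drop every external pte for this mm in one
	 * sweep instead of waiting for the per-range invalidates. */
	my_ext_tlb_zap_all(mm);

	/* The notifier itself is unregistered and freed from ->release,
	 * which runs later in exit_mmap() via mmu_notifier_release(). */
}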

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mmu_notifier.h |    4 ++++
 mm/mmap.c                    |    1 +
 2 files changed, 5 insertions(+)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 12:21:33.000000000 -0800
@@ -62,6 +62,10 @@ struct mmu_notifier_ops {
 				struct mm_struct *mm,
 				unsigned long address);
 
+	/* Dummy needed because the mmu_notifier() macro requires it */
+	void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
+				int dummy);
+
 	/*
 	 * lock indicates that the function is called under spinlock.
 	 */
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-28 11:47:53.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-28 11:57:45.000000000 -0800
@@ -2034,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier(invalidate_all, mm, 0);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 


* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-28 22:06   ` Christoph Lameter
  2008-01-29  0:05   ` Robin Holt
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-28 22:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

mmu core: Need to use hlist_del_rcu

Wrong type of list del in mmu_notifier_release()

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mmu_notifier.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-28 14:02:18.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 14:02:30.000000000 -0800
@@ -23,7 +23,7 @@ void mmu_notifier_release(struct mm_stru
 					  &mm->mmu_notifier.head, hlist) {
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
-			hlist_del(&mn->hlist);
+			hlist_del_rcu(&mn->hlist);
 		}
 		rcu_read_unlock();
 		synchronize_rcu();



* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-28 22:06   ` Christoph Lameter
@ 2008-01-29  0:05   ` Robin Holt
  2008-01-29  1:19     ` Christoph Lameter
  2008-01-29 13:59   ` Andrea Arcangeli
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 97+ messages in thread
From: Robin Holt @ 2008-01-29  0:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

> +void mmu_notifier_release(struct mm_struct *mm)
...
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);
> +			hlist_del(&mn->hlist);

USE_AFTER_FREE!!!  I made this same comment as well as other relevant
comments last week.


Robin


* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29  0:05   ` Robin Holt
@ 2008-01-29  1:19     ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29  1:19 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, 28 Jan 2008, Robin Holt wrote:

> USE_AFTER_FREE!!!  I made this same comment as well as other relavent
> comments last week.

Must have slipped somehow. Patch needs to be applied after the rcu fix.

Please repeat the other relevant comments if they are still relevant.... I 
thought I had worked through them.



mmu_notifier_release: remove mmu_notifier struct from list before calling ->release

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/mmu_notifier.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-28 17:17:05.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-28 17:17:10.000000000 -0800
@@ -21,9 +21,9 @@ void mmu_notifier_release(struct mm_stru
 		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_rcu(&mn->hlist);
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
-			hlist_del_rcu(&mn->hlist);
 		}
 		rcu_read_unlock();
 		synchronize_rcu();


* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-28 22:06   ` Christoph Lameter
  2008-01-29  0:05   ` Robin Holt
@ 2008-01-29 13:59   ` Andrea Arcangeli
  2008-01-29 14:34     ` Andrea Arcangeli
  2008-01-29 19:49     ` Christoph Lameter
  2008-01-29 16:07   ` Robin Holt
  2008-02-05 18:05   ` Andy Whitcroft
  4 siblings, 2 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 13:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote:
> +struct mmu_notifier_head {
> +	struct hlist_head head;
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -219,6 +223,8 @@ struct mm_struct {
>  	/* aio bits */
>  	rwlock_t		ioctx_list_lock;
>  	struct kioctx		*ioctx_list;
> +
> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>  };

Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
regression (a minor one though).

> +	/*
> +	 * lock indicates that the function is called under spinlock.
> +	 */
> +	void (*invalidate_range)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int lock);
> +};

It's beyond me how you can be ok with lock=1. You said you have
to block; if you can deal with lock=1 once, why can't you deal with
lock=1 _always_?

> +/*
> + * Note that all notifiers use RCU. The updates are only guaranteed to be
> + * visible to other processes after a RCU quiescent period!
> + */
> +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	down_write(&mm->mmap_sem);
> +	__mmu_notifier_register(mn, mm);
> +	up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);

The down_write is garbage. The caller should put it around
mmu_notifier_register if anything. In the same way, the caller should
call synchronize_rcu after mmu_notifier_register if it needs
synchronous behavior from the notifiers. The default version of
mmu_notifier_register shouldn't be cluttered with unnecessary locking.


* Re: [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps
  2008-01-28 20:28 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
@ 2008-01-29 14:03   ` Andrea Arcangeli
  2008-01-29 14:24     ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 14:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:44PM -0800, Christoph Lameter wrote:
>  	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
> -			(ptep_clear_flush_young(vma, address, pte)))) {
> +			(ptep_clear_flush_young(vma, address, pte) ||
> +				mmu_notifier_age_page(mm, address)))) {

Here is an example of how inferior and error-prone it is to have
mmu_notifier_age_page and invalidate_page outside of pgtable.h: you
just managed to break it again with the above ||, go figure. The
mmu_notifier_age_page has to be called unconditionally regardless of
the ptep_clear_flush_young return value; we want to give only one
additional LRU scan to the referenced pages, not more than that, or the
KVM guest pages will get tons more priority than the regular linux
anonymous memory.

>  		ret = SWAP_FAIL;
>  		goto out_unmap;
>  	}
> @@ -688,6 +693,7 @@ static int try_to_unmap_one(struct page 
>  	/* Nuke the page table entry. */
>  	flush_cache_page(vma, address, page_to_pfn(page));
>  	pteval = ptep_clear_flush(vma, address, pte);
> +	mmu_notifier(invalidate_page, mm, address);
>  
>  	/* Move the dirty bit to the physical page now the pte is gone. */
>  	if (pte_dirty(pteval))
> @@ -815,9 +821,13 @@ static void try_to_unmap_cluster(unsigne
>  		if (ptep_clear_flush_young(vma, address, pte))
>  			continue;
>  
> +		if (mmu_notifier_age_page(mm, address))
> +			continue;
> +

Here the same exact aging regression compared to my code.


* Re: [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps
  2008-01-29 14:03   ` Andrea Arcangeli
@ 2008-01-29 14:24     ` Andrea Arcangeli
  2008-01-29 19:51       ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 14:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

This should fix the aging bugs you introduced through the faulty cpp
expansion. This is hard for me to write, given that any time somebody does
a ptep_clear_flush_young w/o manually cpp-expanding "|
mmu_notifier_age_page" after it, it's a bug that needs fixing, and
similar bugs can emerge with time for ptep_clear_flush too. What will
happen is that somebody will clean this up in 26+ and we'll remain with a
#ifdef KERNEL_VERSION() < 2.6.26 in ksm.c to call
mmu_notifier(invalidate_page) explicitly. Performance and the
optimization of unnecessary invalidate_page calls are a red herring; it can
be fully optimized both ways. 99% of the time when somebody calls
ptep_clear_flush and ptep_clear_flush_young, the respective mmu
notifier can't be forgotten (and calling them once more, even if a
later invalidate_range is invoked, is always safer and preferable than
not calling them at all), so I fail to see how this will not be cleaned
up eventually, the same way the tlb flushes have been cleaned up
already. Nevertheless I back your implementation and I'm not even
trying to change it at the risk of slowing down merging.
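
The underlying issue is plain C short-circuit evaluation; a tiny
standalone illustration (not kernel code, hypothetical helpers):

#include <stdio.h>

static int hw_young(void)  { return 1; }	/* hardware pte was referenced */
static int ext_young(void) { puts("aged external pte"); return 0; }

int main(void)
{
	/* '||' short-circuits: ext_young() is skipped whenever hw_young()
	 * already returned true, so the external pte is never aged. */
	if (hw_young() || ext_young())
		puts("referenced (|| version)");

	/* '|' evaluates both operands unconditionally, which is what the
	 * aging logic needs. */
	if (hw_young() | ext_young())
		puts("referenced (| version)");

	return 0;
}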

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -285,10 +285,8 @@ static int page_referenced_one(struct pa
 	if (!pte)
 		goto out;
 
-	if (ptep_clear_flush_young(vma, address, pte))
-		referenced++;
-
-	if (mmu_notifier_age_page(mm, address))
+	if (ptep_clear_flush_young(vma, address, pte) |
+	    mmu_notifier_age_page(mm, address))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -684,7 +682,7 @@ static int try_to_unmap_one(struct page 
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte) ||
+			(ptep_clear_flush_young(vma, address, pte) |
 				mmu_notifier_age_page(mm, address)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
@@ -818,10 +816,8 @@ static void try_to_unmap_cluster(unsigne
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
 
-		if (ptep_clear_flush_young(vma, address, pte))
-			continue;
-
-		if (mmu_notifier_age_page(mm, address))
+		if (ptep_clear_flush_young(vma, address, pte) | 
+		    mmu_notifier_age_page(mm, address))
 			continue;
 
 		/* Nuke the page table entry. */


* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29 13:59   ` Andrea Arcangeli
@ 2008-01-29 14:34     ` Andrea Arcangeli
  2008-01-29 19:49     ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 14:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 02:59:14PM +0100, Andrea Arcangeli wrote:
> The down_write is garbage. The caller should put it around
> mmu_notifier_register if something. The same way the caller should
> call synchronize_rcu after mmu_notifier_register if it needs
> synchronous behavior from the notifiers. The default version of
> mmu_notifier_register shouldn't be cluttered with unnecessary locking.

Ooops my spinlock was gone from the notifier head.... so the above
comment is wrong sorry! I thought down_write was needed to serialize
against some _external_ event, not to serialize the list updates in
place of my explicit lock. The critical section is so small that a
semaphore is the wrong locking choice, that's why I assumed it was for
an external event. Anyway RCU won't be optimal for a huge flood of
register/unregister, I agree the down_write shouldn't create much
contention and it saves 4 bytes from each mm_struct, and we can always
change it to a proper spinlock later if needed.


* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
                     ` (2 preceding siblings ...)
  2008-01-29 13:59   ` Andrea Arcangeli
@ 2008-01-29 16:07   ` Robin Holt
  2008-02-05 18:05   ` Andy Whitcroft
  4 siblings, 0 replies; 97+ messages in thread
From: Robin Holt @ 2008-01-29 16:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

I am going to separate my comments into individual replies to help
reduce the chance they are lost.

> +void mmu_notifier_release(struct mm_struct *mm)
...
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);
> +			hlist_del(&mn->hlist);

This is a use-after-free issue.  The hlist_del_rcu needs to be done before
the callout as the structure containing the mmu_notifier structure will
need to be freed from within the ->release callout.

Thanks,
Robin


* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-28 20:28 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
@ 2008-01-29 16:20   ` Andrea Arcangeli
  2008-01-29 18:28     ` Andrea Arcangeli
  2008-01-29 19:55     ` Christoph Lameter
  0 siblings, 2 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 16:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:42PM -0800, Christoph Lameter wrote:
> Index: linux-2.6/mm/fremap.c
> ===================================================================
> --- linux-2.6.orig/mm/fremap.c	2008-01-25 19:31:05.000000000 -0800
> +++ linux-2.6/mm/fremap.c	2008-01-25 19:32:49.000000000 -0800
> @@ -15,6 +15,7 @@
>  #include <linux/rmap.h>
>  #include <linux/module.h>
>  #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/mmu_context.h>
>  #include <asm/cacheflush.h>
> @@ -211,6 +212,7 @@ asmlinkage long sys_remap_file_pages(uns
>  		spin_unlock(&mapping->i_mmap_lock);
>  	}
>  
> +	mmu_notifier(invalidate_range, mm, start, start + size, 0);
>  	err = populate_range(mm, vma, start, size, pgoff);

How can it be right to invalidate_range _before_ ptep_clear_flush?

> @@ -1634,6 +1639,8 @@ gotten:
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> +	mmu_notifier(invalidate_range, mm, address,
> +				address + PAGE_SIZE - 1, 0);
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {

What's the point of invalidate_range when the size is PAGE_SIZE? And
how can it be right to invalidate_range _before_ ptep_clear_flush?


* Re: [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap
  2008-01-28 20:28 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
@ 2008-01-29 16:28   ` Robin Holt
  0 siblings, 0 replies; 97+ messages in thread
From: Robin Holt @ 2008-01-29 16:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

I don't understand how this is intended to work.  I think the page flag
needs to be maintained by the mmu_notifier subsystem.

Let's assume we have a mapping that has a grant from xpmem and an
additional grant from kvm.  The exporters are not important, the fact
that there may be two is.

Assume that the user revokes the grant from xpmem (we call that
xpmem_remove).  As far as xpmem is concerned, there are no longer any
exports of that page so the page should no longer have its exported
flag set.  Note: This is not a process exit, but a function of xpmem.

In that case, at the remove time, we have no idea whether the flag should
be cleared.

For the invalidate_page side, I think we should have:
> @@ -473,6 +474,10 @@ int page_mkclean(struct page *page)
>  		struct address_space *mapping = page_mapping(page);
>  		if (mapping) {
>  			ret = page_mkclean_file(mapping, page);
> +			if (unlikely(PageExternalRmap(page))) {
> +				mmu_rmap_notifier(invalidate_page, page);
> +				ClearPageExternalRmap(page);
> +			}
>  			if (page_test_dirty(page)) {
>  				page_clear_dirty(page);
>  				ret = 1;

I would assume we would then want a function which sets the page flag.

Additionally, I would think we would want some intervention in the
freeing of the page side to ensure the page flag is cleared as well.

Thanks,
Robin


* Re: [patch 6/6] mmu_notifier: Add invalidate_all()
  2008-01-28 20:28 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter
@ 2008-01-29 16:31   ` Robin Holt
  2008-01-29 20:02     ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Robin Holt @ 2008-01-29 16:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

What is the status of getting invalidate_all adjusted to indicate a need
to also call _release?

Thanks,
Robin

On Mon, Jan 28, 2008 at 12:28:46PM -0800, Christoph Lameter wrote:
> when a task exits we can remove all external pts at once. At that point the
> extern mmu may also unregister itself from the mmu notifier chain to avoid
> future calls.
> 
> Note the complications because of RCU. Other processors may not see that the
> notifier was unlinked until a quiescent period has passed!
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/linux/mmu_notifier.h |    4 ++++
>  mm/mmap.c                    |    1 +
>  2 files changed, 5 insertions(+)
> 
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
> +++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 12:21:33.000000000 -0800
> @@ -62,6 +62,10 @@ struct mmu_notifier_ops {
>  				struct mm_struct *mm,
>  				unsigned long address);
>  
> +	/* Dummy needed because the mmu_notifier() macro requires it */
> +	void (*invalidate_all)(struct mmu_notifier *mn, struct mm_struct *mm,
> +				int dummy);
> +
>  	/*
>  	 * lock indicates that the function is called under spinlock.
>  	 */
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c	2008-01-28 11:47:53.000000000 -0800
> +++ linux-2.6/mm/mmap.c	2008-01-28 11:57:45.000000000 -0800
> @@ -2034,6 +2034,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unsigned long end;
>  
>  	/* mm's last user has gone, and its about to be pulled down */
> +	mmu_notifier(invalidate_all, mm, 0);
>  	arch_exit_mmap(mm);
>  
>  	lru_add_drain();
> 
> -- 


* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 16:20   ` Andrea Arcangeli
@ 2008-01-29 18:28     ` Andrea Arcangeli
  2008-01-29 20:30       ` Christoph Lameter
  2008-01-29 19:55     ` Christoph Lameter
  1 sibling, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 18:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Christoph, the below patch should fix the current leak of the pinned
pages. I hope the page-pin that should be dropped by the
invalidate_range op is enough to prevent the "physical page" mapped
on that "mm+address" from changing before invalidate_range returns. If
that ever happened, there would be a coherency loss between the
guest VM writes and the writes coming from userland on the same
mm+address from a different thread (qemu, whatever). invalidate_page
before the PT lock was obviously safe. Now we rely entirely on the pin to
prevent the page from changing before invalidate_range returns. If the pte
is unmapped and the page is mapped back in with a minor fault that's
ok, as long as the physical page remains the same for that mm+address,
until all sptes are gone.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/mm/fremap.c b/mm/fremap.c
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	err = populate_range(mm, vma, start, size, pgoff);
 	mmu_notifier(invalidate_range, mm, start, start + size, 0);
-	err = populate_range(mm, vma, start, size, pgoff);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1639,8 +1639,6 @@ gotten:
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	mmu_notifier(invalidate_range, mm, address,
-				address + PAGE_SIZE - 1, 0);
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
@@ -1676,6 +1674,8 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	mmu_notifier(invalidate_range, mm, address,
+				address + PAGE_SIZE - 1, 0);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29 13:59   ` Andrea Arcangeli
  2008-01-29 14:34     ` Andrea Arcangeli
@ 2008-01-29 19:49     ` Christoph Lameter
  2008-01-29 20:41       ` Avi Kivity
  1 sibling, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 19:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
> >  };
> 
> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
> regression (a minor one though).

Andrew does not like #ifdefs and it makes it possible to verify calling 
conventions if !CONFIG_MMU_NOTIFIER.

> It's out of my reach how can you be ok with lock=1. You said you have
> to block, if you can deal with lock=1 once, why can't you deal with
> lock=1 _always_?

Not sure yet. We may have to do more in that area. Need to have feedback 
from Robin.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps
  2008-01-29 14:24     ` Andrea Arcangeli
@ 2008-01-29 19:51       ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 19:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Thanks, I will put that into V3.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 16:20   ` Andrea Arcangeli
  2008-01-29 18:28     ` Andrea Arcangeli
@ 2008-01-29 19:55     ` Christoph Lameter
  2008-01-29 21:17       ` Andrea Arcangeli
  1 sibling, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 19:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > +	mmu_notifier(invalidate_range, mm, address,
> > +				address + PAGE_SIZE - 1, 0);
> >  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> >  	if (likely(pte_same(*page_table, orig_pte))) {
> >  		if (old_page) {
> 
> What's the point of invalidate_range when the size is PAGE_SIZE? And
> how can it be right to invalidate_range _before_ ptep_clear_flush?

I am not sure. AFAICT you wrote that code.

It seems to be okay to invalidate range if you hold mmap_sem writably. In 
that case no additional faults can happen that would create new ptes.




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 6/6] mmu_notifier: Add invalidate_all()
  2008-01-29 16:31   ` Robin Holt
@ 2008-01-29 20:02     ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 20:02 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Robin Holt wrote:

> What is the status of getting invalidate_all adjusted to indicate a need
> to also call _release?

Release is only called if the mmu_notifier is still registered. If you 
take it out on invalidate_all then there will be no call to release 
(provided you deal with the RCU issues).
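
I.e. a rough sketch of what a notifier user could do from its
invalidate_all method (my_drop_all_sptes is a placeholder for the
driver's own teardown, and I'm assuming an unregister interface along
the lines of mmu_notifier_unregister(mn, mm); untested):

	static void my_invalidate_all(struct mmu_notifier *mn,
				      struct mm_struct *mm, int dummy)
	{
		/* drop all external references/pins held for this mm */
		my_drop_all_sptes(mn);
		/* unregister so that no ->release callback follows */
		mmu_notifier_unregister(mn, mm);
	}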


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 18:28     ` Andrea Arcangeli
@ 2008-01-29 20:30       ` Christoph Lameter
  2008-01-29 21:36         ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 20:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> diff --git a/mm/fremap.c b/mm/fremap.c
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
>  		spin_unlock(&mapping->i_mmap_lock);
>  	}
>  
> +	err = populate_range(mm, vma, start, size, pgoff);
>  	mmu_notifier(invalidate_range, mm, start, start + size, 0);
> -	err = populate_range(mm, vma, start, size, pgoff);
>  	if (!err && !(flags & MAP_NONBLOCK)) {
>  		if (unlikely(has_write_lock)) {
>  			downgrade_write(&mm->mmap_sem);

We invalidate the range *after* populating it? Isn't it okay to establish 
references while populate_range() runs?

> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1639,8 +1639,6 @@ gotten:
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> -	mmu_notifier(invalidate_range, mm, address,
> -				address + PAGE_SIZE - 1, 0);
>  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {

What we did is to invalidate the page (?!) before taking the pte lock. In 
the lock we replace the pte to point to another page. This means that we 
need to clear stale information. So we zap it before. If another reference 
is established after taking the spinlock then the pte contents have 
changed and the critical section fails.

Before the critical section starts we have gotten an extra refcount on the 
original page so the page cannot vanish from under us.

> @@ -1676,6 +1674,8 @@ gotten:
>  		page_cache_release(old_page);
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
> +	mmu_notifier(invalidate_range, mm, address,
> +				address + PAGE_SIZE - 1, 0);
>  	if (dirty_page) {
>  		if (vma->vm_file)
>  			file_update_time(vma->vm_file);

Now we invalidate the page after the transaction is complete. This means 
external pte can persist while we change the pte? Possibly even dirty the 
page?




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-29 19:49     ` Christoph Lameter
@ 2008-01-29 20:41       ` Avi Kivity
  0 siblings, 0 replies; 97+ messages in thread
From: Avi Kivity @ 2008-01-29 20:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
>
>   
>>> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>>>  };
>>>       
>> Not sure why you prefer to waste ram when MMU_NOTIFIER=n, this is a
>> regression (a minor one though).
>>     
>
> Andrew does not like #ifdefs and it makes it possible to verify calling 
> conventions if !CONFIG_MMU_NOTIFIER.
>
>   

You could define mmu_notifier_head as an empty struct in that case.
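
Something along these lines (untested sketch, assuming the head only
needs to carry the list of registered notifiers when the option is on):

	#ifdef CONFIG_MMU_NOTIFIER
	struct mmu_notifier_head {
		struct hlist_head head;		/* registered notifiers */
	};
	#else
	struct mmu_notifier_head {
		/* empty: no per-mm memory cost when notifiers are compiled out */
	};
	#endif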

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 19:55     ` Christoph Lameter
@ 2008-01-29 21:17       ` Andrea Arcangeli
  2008-01-29 21:35         ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 21:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 11:55:10AM -0800, Christoph Lameter wrote:
> I am not sure. AFAICT you wrote that code.

Actually I didn't need to change a single line in do_wp_page because
ptep_clear_flush was already doing everything transparently for
me. This was the memory.c part of my last patch I posted, it only
touches zap_page_range, remap_pfn_range and apply_to_page_range.

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
+	mmu_notifier(invalidate_range, mm, address, end);
 	return end;
 }
 
@@ -1317,7 +1318,7 @@ int remap_pfn_range(struct vm_area_struc
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + PAGE_ALIGN(size);
+	unsigned long start = addr, end = addr + PAGE_ALIGN(size);
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
 
@@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1441,7 +1443,7 @@ int apply_to_page_range(struct mm_struct
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;
 
 	BUG_ON(addr >= end);
@@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier(invalidate_range, mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);

> It seems to be okay to invalidate range if you hold mmap_sem writably. In 
> that case no additional faults can happen that would create new ptes.

In that place the mmap_sem is taken but in readonly mode. I never rely
on the mmap_sem in the mmu notifier methods. Not invoking the notifier
before releasing the PT lock adds quite some uncertainty on the smp
safety of the spte invalidates, because the pte may be unmapped and
remapped by a minor fault before invalidate_range is invoked, but I
didn't figure out a kernel crashing race yet thanks to the pin we take
through get_user_pages (and only thanks to it). The requirement is
that invalidate_range is invoked after the last ptep_clear_flush or it
leaks pins that's why I had to move it at the end.
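
In code terms the ordering I'm after is simply (sketch, not the literal
patch):

	pte = ptep_clear_flush(vma, address, page_table);
	/*
	 * Only after the linux pte and the TLB entry are gone is it safe
	 * to tear down the sptes and drop the page pins they hold.
	 */
	mmu_notifier(invalidate_range, mm, address,
				address + PAGE_SIZE - 1, 0);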

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 21:17       ` Andrea Arcangeli
@ 2008-01-29 21:35         ` Christoph Lameter
  2008-01-29 22:02           ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 21:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > It seems to be okay to invalidate range if you hold mmap_sem writably. In 
> > that case no additional faults can happen that would create new ptes.
> 
> In that place the mmap_sem is taken but in readonly mode. I never rely
> on the mmap_sem in the mmu notifier methods. Not invoking the notifier

Well it seems that we have to rely on mmap_sem otherwise concurrent faults 
can occur. The mmap_sem seems to be acquired for write there.

              if (!has_write_lock) {
                        up_read(&mm->mmap_sem);
                        down_write(&mm->mmap_sem);
                        has_write_lock = 1;
                        goto retry;
                }


> before releasing the PT lock adds quite some uncertainty on the smp
> safety of the spte invalidates, because the pte may be unmapped and
> remapped by a minor fault before invalidate_range is invoked, but I
> didn't figure out a kernel crashing race yet thanks to the pin we take
> through get_user_pages (and only thanks to it). The requirement is
> that invalidate_range is invoked after the last ptep_clear_flush or it
> leaks pins that's why I had to move it at the end.
 
So "pins" means a reference count right? I still do not get why you 
have refcount problems. You take a refcount when you export the page 
through KVM and then drop the refcount in invalidate page right?

So you walk through the KVM ptes and drop the refcount for each spte you 
encounter?



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 20:30       ` Christoph Lameter
@ 2008-01-29 21:36         ` Andrea Arcangeli
  2008-01-29 21:53           ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 21:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 12:30:06PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
> 
> > diff --git a/mm/fremap.c b/mm/fremap.c
> > --- a/mm/fremap.c
> > +++ b/mm/fremap.c
> > @@ -212,8 +212,8 @@ asmlinkage long sys_remap_file_pages(uns
> >  		spin_unlock(&mapping->i_mmap_lock);
> >  	}
> >  
> > +	err = populate_range(mm, vma, start, size, pgoff);
> >  	mmu_notifier(invalidate_range, mm, start, start + size, 0);
> > -	err = populate_range(mm, vma, start, size, pgoff);
> >  	if (!err && !(flags & MAP_NONBLOCK)) {
> >  		if (unlikely(has_write_lock)) {
> >  			downgrade_write(&mm->mmap_sem);
> 
> We invalidate the range *after* populating it? Isnt it okay to establish 
> references while populate_range() runs?

It's not ok because that function can very well overwrite existing and
present ptes (it's actually the nonlinear common case fast path for
db). With your code the sptes created between invalidate_range and
populate_range will keep pointing forever to the old physical page
instead of the newly populated one.

I'm also asking myself if it's a smp race not to call
mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
in install_file_pte. Probably not because the guest VM running in a
different thread would need to serialize outside the install_file_pte
code with the task running install_file_pte, if it wants to be sure to
write either all its data to the old or the new page. Certainly doing
the invalidate_page inside the PT lock was obviously safe but I hope
this is safe and this can accommodate your needs too.
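
To spell out the interleaving I'm worried about with the notifier
invoked before populate_range (CPU1 being a kvm thread faulting through
get_user_pages):

	/*
	 * CPU0: sys_remap_file_pages          CPU1: kvm spte fault
	 *
	 * mmu_notifier(invalidate_range)      (sptes and page pins dropped)
	 *                                     get_user_pages() -> old page
	 *                                     spte rebuilt on the old page
	 * populate_range() installs new page
	 *                                     spte keeps pointing forever
	 *                                     to the old physical page
	 */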

> > diff --git a/mm/memory.c b/mm/memory.c
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1639,8 +1639,6 @@ gotten:
> >  	/*
> >  	 * Re-check the pte - we dropped the lock
> >  	 */
> > -	mmu_notifier(invalidate_range, mm, address,
> > -				address + PAGE_SIZE - 1, 0);
> >  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> >  	if (likely(pte_same(*page_table, orig_pte))) {
> >  		if (old_page) {
> 
> What we did is to invalidate the page (?!) before taking the pte lock. In 
> the lock we replace the pte to point to another page. This means that we 
> need to clear stale information. So we zap it before. If another reference 
> is established after taking the spinlock then the pte contents have 
> changed and the critical section fails.
> 
> Before the critical section starts we have gotten an extra refcount on the 
> original page so the page cannot vanish from under us.

The problem is the missing invalidate_page/range _after_
ptep_clear_flush. If a spte is built between invalidate_range and
pte_offset_map_lock, it will remain pointing to the old page
forever. Nothing will be called to invalidate that stale spte built
between invalidate_page/range and ptep_clear_flush. This is why for
the last few days I kept saying the mmu notifiers have to be invoked
_after_ ptep_clear_flush and never before (remember the export
notifier?). No idea how you can deal with this in your code, certainly
for KVM sptes that's a backwards and unworkable ordering of operations
(exactly as backwards as doing the tlb flush before pte_clear in
ptep_clear_flush; think of the spte as a tlb: you can't flush the tlb
before clearing/updating the pte or it's smp unsafe).

> > @@ -1676,6 +1674,8 @@ gotten:
> >  		page_cache_release(old_page);
> >  unlock:
> >  	pte_unmap_unlock(page_table, ptl);
> > +	mmu_notifier(invalidate_range, mm, address,
> > +				address + PAGE_SIZE - 1, 0);
> >  	if (dirty_page) {
> >  		if (vma->vm_file)
> >  			file_update_time(vma->vm_file);
> 
> Now we invalidate the page after the transaction is complete. This means 
> external pte can persist while we change the pte? Possibly even dirty the 
> page?

Yes, and the only reason this can be safe is for the reason explained
at the top of the email, if the other cpu wants to serialize to be
sure to write in the "new" page, it has to serialize with the
page-fault but to serialize it has to wait the page fault to return
(example: we're not going to call futex code until the page fault
returns).

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 21:36         ` Andrea Arcangeli
@ 2008-01-29 21:53           ` Christoph Lameter
  2008-01-29 22:35             ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 21:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> > We invalidate the range *after* populating it? Isn't it okay to establish 
> > references while populate_range() runs?
> 
> It's not ok because that function can very well overwrite existing and
> present ptes (it's actually the nonlinear common case fast path for
> db). With your code the sptes created between invalidate_range and
> > populate_range will keep pointing forever to the old physical page
> instead of the newly populated one.

Seems though that the mmap_sem is taken for regular vmas writably and will 
hold off new mappings.

> I'm also asking myself if it's a smp race not to call
> mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
> in install_file_pte. Probably not because the guest VM running in a
> different thread would need to serialize outside the install_file_pte
> code with the task running install_file_pte, if it wants to be sure to
> write either all its data to the old or the new page. Certainly doing
> the invalidate_page inside the PT lock was obviously safe but I hope
> this is safe and this can accommodate your needs too.

But that would be doing two invalidates on one pte. One range and one page 
invalidate.

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1639,8 +1639,6 @@ gotten:
> > >  	/*
> > >  	 * Re-check the pte - we dropped the lock
> > >  	 */
> > > -	mmu_notifier(invalidate_range, mm, address,
> > > -				address + PAGE_SIZE - 1, 0);
> > >  	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > >  	if (likely(pte_same(*page_table, orig_pte))) {
> > >  		if (old_page) {
> > 
> > What we did is to invalidate the page (?!) before taking the pte lock. In 
> > the lock we replace the pte to point to another page. This means that we 
> > need to clear stale information. So we zap it before. If another reference 
> > is established after taking the spinlock then the pte contents have 
> > changed and the critical section fails.
> > 
> > Before the critical section starts we have gotten an extra refcount on the 
> > original page so the page cannot vanish from under us.
> 
> The problem is the missing invalidate_page/range _after_
> ptep_clear_flush. If a spte is built between invalidate_range and
> pte_offset_map_lock, it will remain pointing to the old page
> forever. Nothing will be called to invalidate that stale spte built
> between invalidate_page/range and ptep_clear_flush. This is why for
> the last few days I kept saying the mmu notifiers have to be invoked
> _after_ ptep_clear_flush and never before (remember the export
> notifier?). No idea how you can deal with this in your code, certainly
> for KVM sptes that's a backwards and unworkable ordering of operations
> (exactly as backwards as doing the tlb flush before pte_clear in
> ptep_clear_flush; think of the spte as a tlb: you can't flush the tlb
> before clearing/updating the pte or it's smp unsafe).

Hmmm... So we could only do an invalidate_page here? Drop the strange 
invalidate_range()?

> 
> > > @@ -1676,6 +1674,8 @@ gotten:
> > >  		page_cache_release(old_page);
> > >  unlock:
> > >  	pte_unmap_unlock(page_table, ptl);
> > > +	mmu_notifier(invalidate_range, mm, address,
> > > +				address + PAGE_SIZE - 1, 0);
> > >  	if (dirty_page) {
> > >  		if (vma->vm_file)
> > >  			file_update_time(vma->vm_file);
> > 
> > Now we invalidate the page after the transaction is complete. This means 
> > external pte can persist while we change the pte? Possibly even dirty the 
> > page?
> 
> Yes, and the only reason this can be safe is for the reason explained
> at the top of the email, if the other cpu wants to serialize to be
> sure to write in the "new" page, it has to serialize with the
> page-fault but to serialize it has to wait the page fault to return
> (example: we're not going to call futex code until the page fault
> returns).

Serialize how? mmap_sem?
 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 21:35         ` Christoph Lameter
@ 2008-01-29 22:02           ` Andrea Arcangeli
  2008-01-29 22:39             ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 22:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 01:35:58PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
> 
> > > It seems to be okay to invalidate range if you hold mmap_sem writably. In 
> > > that case no additional faults can happen that would create new ptes.
> > 
> > In that place the mmap_sem is taken but in readonly mode. I never rely
> > on the mmap_sem in the mmu notifier methods. Not invoking the notifier
> 
> Well it seems that we have to rely on mmap_sem otherwise concurrent faults 
> can occur. The mmap_sem seems to be acquired for write there.
      	     	 	  	      	       	   	 ^^^^^
> 
>               if (!has_write_lock) {
>                         up_read(&mm->mmap_sem);
>                         down_write(&mm->mmap_sem);
>                         has_write_lock = 1;
>                         goto retry;
>                 }


hmm, "there" where? When I said it was taken in readonly mode I meant
for the quoted code (it would be at the top if it wasn't cut), so I
quote below again:

> > +   mmu_notifier(invalidate_range, mm, address,
> > +                           address + PAGE_SIZE - 1, 0);
> >     page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> >     if (likely(pte_same(*page_table, orig_pte))) {
> >             if (old_page) {

The "there" for me was do_wp_page.

Even for the code you quoted in fremap.c, the has_write_lock is set
to 1 _only_ for the very first time you call sys_remap_file_pages on a
VMA. Only the transition of the VMA from linear to nonlinear
requires the mmap_sem in write mode. So you can be sure all fremap code
99% of the time is populating (overwriting) already present ptes with
only the mmap_sem in readonly mode, like do_wp_page. It would be
unnecessary to populate the nonlinear range with the mmap_sem in write
mode. Only the "vma" mangling requires the mmap_sem in write mode; the
pte modifications only require the PT_lock + mmap_sem in read mode.

Effectively the first invocation of populate_range runs with the
mmap_sem in write mode; I wonder why, there seems to be no good reason
for that. I guess it's a bit that should be optimized, by calling
downgrade_write before calling populate_range even for the first time
the vma switches from linear to nonlinear (after the vma has been
fully updated to the new status). But for sure all later invocations
run populate_range with the semaphore readonly, like the rest of the
VM does when instantiating ptes in the page faults.
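
I.e. something like this in sys_remap_file_pages (untested, just to
illustrate the idea):

	if (unlikely(has_write_lock)) {
		/* the vma mangling is done: drop back to read mode before
		 * populating, like every later invocation already does */
		downgrade_write(&mm->mmap_sem);
		has_write_lock = 0;
	}
	err = populate_range(mm, vma, start, size, pgoff);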

> > before releasing the PT lock adds quite some uncertainty on the smp
> > safety of the spte invalidates, because the pte may be unmapped and
> > remapped by a minor fault before invalidate_range is invoked, but I
> > didn't figure out a kernel crashing race yet thanks to the pin we take
> > through get_user_pages (and only thanks to it). The requirement is
> > that invalidate_range is invoked after the last ptep_clear_flush or it
> > leaks pins that's why I had to move it at the end.
>  
> So "pins" means a reference count right? I still do not get why you 

Yes.

> have refcount problems. You take a refcount when you export the page 
> through KVM and then drop the refcount in invalidate page right?

Yes.

> So you walk through the KVM ptes and drop the refcount for each spte you 
> encounter?

Yes.

All pins are gone by the time invalidate_page/range returns. But there
is no critical section between invalidate_page and the _later_
ptep_clear_flush. So get_user_pages is free to run and take the PT
lock before the ptep_clear_flush, find the linux pte still
instantiated, and to create a new spte, before ptep_clear_flush runs.

Think of why the tlb flushes are being called at the end of
ptep_clear_flush. The mmu notifier invalidate has to be called after
for the exact same reason.
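
For reference, this is more or less what the generic ptep_clear_flush
does (quoting from memory, details may differ from the tree):

	pte_t ptep_clear_flush(struct vm_area_struct *vma,
			       unsigned long address, pte_t *ptep)
	{
		pte_t pte = ptep_get_and_clear(vma->vm_mm, address, ptep);
		flush_tlb_page(vma, address);	/* flush strictly after the clear */
		return pte;
	}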

Perhaps somebody else should explain this; I started exposing this
smp race the moment I saw the backwards ordering being proposed in
export-notifier-v1. Sorry if I'm not clear enough.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 21:53           ` Christoph Lameter
@ 2008-01-29 22:35             ` Andrea Arcangeli
  2008-01-29 22:55               ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 22:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 01:53:05PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
> 
> > > We invalidate the range *after* populating it? Isn't it okay to establish 
> > > references while populate_range() runs?
> > 
> > It's not ok because that function can very well overwrite existing and
> > present ptes (it's actually the nonlinear common case fast path for
> > db). With your code the sptes created between invalidate_range and
> > populate_range will keep pointing forever to the old physical page
> > instead of the newly populated one.
> 
> Seems though that the mmap_sem is taken for regular vmas writably and will 
> hold off new mappings.

It's taken writable due to the code being inefficient the first time;
all later times populate_range overwrites ptes with the mmap_sem in
readonly mode (finally rightfully so). The first remap_file_pages is, I
guess, irrelevant to optimize: the whole point of nonlinear is to call
remap_file_pages zillions of times on the same vma, overwriting present
ptes the whole time, so if the first time the semaphore is not readonly
it probably doesn't make a difference.

get_user_pages, invoked by the kvm spte fault, can happen between
invalidate_range and populate_range. If it can't happen, for sure
nobody has pointed out a good reason why it can't. The kvm page fault
as well rightfully takes the mmap_sem only in readonly mode, so
get_user_pages is only called internally to gfn_to_page with the
readonly semaphore.

With my approach ptep_clear_flush was not only invalidating sptes
after clearing the pte, but it was also invalidating them inside the
PT lock, so it was totally obvious there could be no race vs
get_user_pages.

> > I'm also asking myself if it's a smp race not to call
> > mmu_notifier(invalidate_page) between ptep_clear_flush and set_pte_at
> > in install_file_pte. Probably not because the guest VM running in a
> > different thread would need to serialize outside the install_file_pte
> > code with the task running install_file_pte, if it wants to be sure to
> > write either all its data to the old or the new page. Certainly doing
> > the invalidate_page inside the PT lock was obviously safe but I hope
> > this is safe and this can accommodate your needs too.
> 
> But that would be doing two invalidates on one pte. One range and one page 
> invalidate.

Yes, but it would have been micro-optimized later if you really cared,
by simply changing ptep_clear_flush to __ptep_clear_flush, no big
deal. Definitely all methods must be robust about being called
multiple times, even if the rmap finds no spte mapping such a host
virtual address.

> Hmmm... So we could only do an invalidate_page here? Drop the strange 
> invalidate_range()?

That's a question you should answer.

> > > > @@ -1676,6 +1674,8 @@ gotten:
> > > >  		page_cache_release(old_page);
> > > >  unlock:
> > > >  	pte_unmap_unlock(page_table, ptl);
> > > > +	mmu_notifier(invalidate_range, mm, address,
> > > > +				address + PAGE_SIZE - 1, 0);
> > > >  	if (dirty_page) {
> > > >  		if (vma->vm_file)
> > > >  			file_update_time(vma->vm_file);
> > > 
> > > Now we invalidate the page after the transaction is complete. This means 
> > > external pte can persist while we change the pte? Possibly even dirty the 
> > > page?
> > 
> > Yes, and the only reason this can be safe is for the reason explained
> > at the top of the email, if the other cpu wants to serialize to be
> > sure to write in the "new" page, it has to serialize with the
> > page-fault but to serialize it has to wait the page fault to return
> > (example: we're not going to call futex code until the page fault
> > returns).
> 
> Serialize how? mmap_sem?

No, that's a different angle.

But now I think there may be an issue with a third thread that may
show that removing invalidate_page from ptep_clear_flush is unsafe.

A third thread writing to a page through the linux-pte and the guest
VM writing to the same page through the sptes, will be writing on the
same physical page concurrently and using an userspace spinlock w/o
ever entering the kernel. With your patch that invalidate_range after
dropping the PT lock, the third thread may start writing on the new
page, when the guest is still writing to the old page through the
sptes. While this couldn't happen with my patch.

So really, in the light of the third thread, it seems your approach is
smp racey and ptep_clear_flush should call invalidate_page as the last
thing before returning. My patch enforced that ptep_clear_flush would
stop the third thread in a linux page fault, and drop the spte, before
the new mapping could be instantiated in both the linux pte and in the
sptes. The PT lock provided the needed serialization. This ensured the
third thread and the guest VM would always write to the same physical
page, even if the first thread runs a flood of remap_file_pages on that
same page moving it around the pagecache. So it seems I found an
unfixable smp race in pretending to invalidate in a sleeping place.

Perhaps you want to change the PT lock to a mutex instead of a
spinlock; that may be your only chance to sleep while maintaining 100%
memory coherency with threads.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 22:02           ` Andrea Arcangeli
@ 2008-01-29 22:39             ` Christoph Lameter
  2008-01-30  0:00               ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 22:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> hmm, "there" where? When I said it was taken in readonly mode I meant
> for the quoted code (it would be at the top if it wasn't cut), so I
> quote below again:
> 
> > > +   mmu_notifier(invalidate_range, mm, address,
> > > +                           address + PAGE_SIZE - 1, 0);
> > >     page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> > >     if (likely(pte_same(*page_table, orig_pte))) {
> > >             if (old_page) {
> 
> The "there" for me was do_wp_page.

Maybe we better focus on one call at a time?

> Even for the code you quoted in fremap.c, the has_write_lock is set
> to 1 _only_ for the very first time you call sys_remap_file_pages on a
> VMA. Only the transition of the VMA from linear to nonlinear
> requires the mmap_sem in write mode. So you can be sure all fremap code
> 99% of the time is populating (overwriting) already present ptes with
> only the mmap_sem in readonly mode, like do_wp_page. It would be
> unnecessary to populate the nonlinear range with the mmap_sem in write
> mode. Only the "vma" mangling requires the mmap_sem in write mode; the
> pte modifications only require the PT_lock + mmap_sem in read mode.
> 
> Effectively the first invocation of populate_range runs with the
> mmap_sem in write mode; I wonder why, there seems to be no good reason
> for that. I guess it's a bit that should be optimized, by calling
> downgrade_write before calling populate_range even for the first time
> the vma switches from linear to nonlinear (after the vma has been
> fully updated to the new status). But for sure all later invocations
> run populate_range with the semaphore readonly, like the rest of the
> VM does when instantiating ptes in the page faults.

If it does not run in write mode then concurrent faults are permissible 
while we remap pages. Weird. Maybe we better handle this like individual
page operations? Put the invalidate_page back into zap_pte. But then there 
would be no callback w/o lock as required by Robin. Doing the 
invalidate_range after populate allows access to memory for which ptes 
were zapped and the refcount was released.

> All pins are gone by the time invalidate_page/range returns. But there
> is no critical section between invalidate_page and the _later_
> ptep_clear_flush. So get_user_pages is free to run and take the PT
> lock before the ptep_clear_flush, find the linux pte still
> instantiated, and to create a new spte, before ptep_clear_flush runs.

Hmmm... Right. Did not consider get_user_pages. A write to the page that 
is not marked dirty would typically require a fault that will serialize.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 22:35             ` Andrea Arcangeli
@ 2008-01-29 22:55               ` Christoph Lameter
  2008-01-29 23:43                 ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-29 22:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Andrea Arcangeli wrote:

> But now I think there may be an issue with a third thread that may
> show that removing invalidate_page from ptep_clear_flush is unsafe.
> 
> A third thread writing to a page through the linux-pte and the guest
> VM writing to the same page through the sptes, will be writing on the
> same physical page concurrently and using an userspace spinlock w/o
> ever entering the kernel. With your patch that invalidate_range after
> dropping the PT lock, the third thread may start writing on the new
> page, when the guest is still writing to the old page through the
> sptes. While this couldn't happen with my patch.

A user space spinlock plays into this??? That is irrelevant to the kernel. 
And we are discussing "your" placement of the invalidate_range not mine.

This is the scenario that I described before. You just need two threads.
One thread is in do_wp_page and the other is writing through the spte. 
We are in do_wp_page. Meaning the page is not writable. The writer will 
have to take a fault, which will properly serialize access. It is a bug
if the spte would allow a write.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 22:55               ` Christoph Lameter
@ 2008-01-29 23:43                 ` Andrea Arcangeli
  2008-01-30  0:34                   ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-29 23:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 02:55:56PM -0800, Christoph Lameter wrote:
> On Tue, 29 Jan 2008, Andrea Arcangeli wrote:
> 
> > But now I think there may be an issue with a third thread that may
> > show that removing invalidate_page from ptep_clear_flush is unsafe.
> > 
> > A third thread writing to a page through the linux-pte and the guest
> > VM writing to the same page through the sptes, will be writing on the
> > same physical page concurrently and using an userspace spinlock w/o
> > ever entering the kernel. With your patch that invalidate_range after
> > dropping the PT lock, the third thread may start writing on the new
> > page, when the guest is still writing to the old page through the
> > sptes. While this couldn't happen with my patch.
> 
> A user space spinlock plays into this??? That is irrelevant to the kernel. 
> And we are discussing "your" placement of the invalidate_range not mine.

With "my" code, invalidate_range wasn't placed there at all, my
modification to ptep_clear_flush already covered it in a automatic
way, grep from the word fremap in my latest patch you won't find it,
like you won't find any change to do_wp_page. Not sure why you keep
thinking I added those invalidate_range when infact you did.

The user space spinlock also plays into declaring rdtscp unworkable
for providing a monotonic vgettimeofday w/o kernel locking.

My patch by calling invalidate_page inside ptep_clear_flush guaranteed
that both the thread writing through sptes and the thread writing
through linux ptes, couldn't possibly simultaneously write to two
different physical pages.

Your patch allows the thread writing through the linux pte to write to
a newly populated page while the old thread writing through sptes still
writes to the old page. Is that safe? I don't know for sure. The fact
that the physical page backing the virtual address could change back
and forth perhaps invalidates the theory that somebody could possibly
do some useful locking out of it, relying on all threads seeing the
same physical page at the same time.

Anyway as long as invalidate_page/range happens after ptep_clear_flush
things are mostly ok.

> This is the scenario that I described before. You just need two threads.
> One thread is in do_wp_page and the other is writing through the spte. 
> We are in do_wp_page. Meaning the page is not writable. The writer will 

Actually above I was describing remap_file_pages not do_wp_page.

> have to take a fault, which will properly serialize access. It is a bug
> if the spte would allow a write.

In that scenario, because write is forbidden (unlike with
remap_file_pages), things should be ok like you said. The spte reader
will eventually see the updates happening in the new page, as long as
the spte invalidate happens after ptep_clear_flush (i.e. with my
incremental fix applied to your code, or with my latest patch).

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 22:39             ` Christoph Lameter
@ 2008-01-30  0:00               ` Andrea Arcangeli
  2008-01-30  0:05                 ` Andrea Arcangeli
                                   ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30  0:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 02:39:00PM -0800, Christoph Lameter wrote:
> If it does not run in write mode then concurrent faults are permissible 
> while we remap pages. Weird. Maybe we better handle this like individual
> page operations? Put the invalidate_page back into zap_pte. But then there 
> would be no callback w/o lock as required by Robin. Doing the 

Robin's requirements and the need to schedule are indeed the source of
the complications.

I posted all the KVM patches using mmu notifiers; today I reposted the
ones reworked for your V2 (which crashes my host, unlike my last
simpler mmu notifier patch, but I also changed a few other variables
besides your mmu notifier changes, so I can't yet be sure it's a bug
in your V2, and the SMP regressions I fixed so far sure can't explain
the crashes because my KVM setup could never run into do_wp_page or
remap_file_pages, so it's something else I need to find ASAP).

Robin, if you don't mind, could you please post or upload somewhere
your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
is it top secret? I wouldn't mind to have a look so I can better
understand what's the exact reason you're sleeping besides attempting
GFP_KERNEL allocations. Thanks!

> invalidate_range after populate allows access to memory for which ptes 
> were zapped and the refcount was released.

The last refcount is released by the invalidate_range itself.
 
> > All pins are gone by the time invalidate_page/range returns. But there
> > is no critical section between invalidate_page and the _later_
> > ptep_clear_flush. So get_user_pages is free to run and take the PT
> > lock before the ptep_clear_flush, find the linux pte still
> > instantiated, and to create a new spte, before ptep_clear_flush runs.
> 
> Hmmm... Right. Did not consider get_user_pages. A write to the page that 
> is not marked dirty would typically require a fault that will serialize.

The pte is already marked dirty (and this is the case only for
get_user_pages, regular linux writes don't fault unless it's
explicitly writeprotect, which is mandatory in a few archs, x86 not).

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:00               ` Andrea Arcangeli
@ 2008-01-30  0:05                 ` Andrea Arcangeli
  2008-01-30  0:22                   ` Christoph Lameter
  2008-01-30  0:20                 ` Christoph Lameter
  2008-01-30 16:11                 ` Robin Holt
  2 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30  0:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote:
> get_user_pages, regular linux writes don't fault unless it's
> explicitly writeprotect, which is mandatory in a few archs, x86 not).

actually get_user_pages doesn't fault either but it calls into
set_page_dirty, however get_user_pages (unlike a userland-write) at
least requires mmap_sem in read mode and the PT lock as serialization,
userland writes don't, they just go ahead and mark the pte in hardware
w/o faults. Anyway, anonymous memory these days is always mapped with
the dirty bit set regardless, even for read-faults, after Nick finally
rightfully cleaned up the zero-page trick.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:00               ` Andrea Arcangeli
  2008-01-30  0:05                 ` Andrea Arcangeli
@ 2008-01-30  0:20                 ` Christoph Lameter
  2008-01-30  0:28                   ` Jack Steiner
  2008-01-30 16:11                 ` Robin Holt
  2 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30  0:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> > invalidate_range after populate allows access to memory for which ptes 
> > were zapped and the refcount was released.
> 
> The last refcount is released by the invalidate_range itself.

That is true for your implementation and to address Robin's issues. Jack: 
Is that true for the GRU?


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:05                 ` Andrea Arcangeli
@ 2008-01-30  0:22                   ` Christoph Lameter
  2008-01-30  0:59                     ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30  0:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 01:00:39AM +0100, Andrea Arcangeli wrote:
> > get_user_pages, regular linux writes don't fault unless it's
> > explicitly writeprotect, which is mandatory in a few archs, x86 not).
> 
> actually get_user_pages doesn't fault either but it calls into
> set_page_dirty, however get_user_pages (unlike a userland-write) at
> least requires mmap_sem in read mode and the PT lock as serialization,
> userland writes don't, they just go ahead and mark the pte in hardware
> w/o faults. Anyway, anonymous memory these days is always mapped with
> the dirty bit set regardless, even for read-faults, after Nick finally
> rightfully cleaned up the zero-page trick.

That is only partially true. ptes are created write-protected in order to track 
dirty state these days. The first write will lead to a fault that switches 
the pte to writable. When the page undergoes writeback the page again 
becomes write protected. Thus our need to effectively deal with 
page_mkclean.
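
To summarize the lifecycle for a shared file-backed pte (as I
understand the current dirty tracking):

	/*
	 * fault        -> pte installed write-protected
	 * first write  -> do_wp_page/page_mkwrite: pte made writable,
	 *                 page marked dirty
	 * writeback    -> page_mkclean() write-protects the pte again
	 * next write   -> faults again and re-dirties the page
	 */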


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:20                 ` Christoph Lameter
@ 2008-01-30  0:28                   ` Jack Steiner
  2008-01-30  0:35                     ` Christoph Lameter
  2008-01-30 13:37                     ` Andrea Arcangeli
  0 siblings, 2 replies; 97+ messages in thread
From: Jack Steiner @ 2008-01-30  0:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> 
> > > invalidate_range after populate allows access to memory for which ptes 
> > > were zapped and the refcount was released.
> > 
> > The last refcount is released by the invalidate_range itself.
> 
> That is true for your implementation and to address Robin's issues. Jack: 
> Is that true for the GRU?

I'm not sure I understand the question. The GRU never (currently) takes
a reference on a page. It has no mechanism for tracking pages that
were exported to the external TLBs.

--- jack

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-29 23:43                 ` Andrea Arcangeli
@ 2008-01-30  0:34                   ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30  0:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> > A user space spinlock plays into this??? That is irrelevant to the kernel. 
> > And we are discussing "your" placement of the invalidate_range not mine.
> 
> With "my" code, invalidate_range wasn't placed there at all, my
> modification to ptep_clear_flush already covered it in a automatic
> way, grep from the word fremap in my latest patch you won't find it,
> like you won't find any change to do_wp_page. Not sure why you keep
> thinking I added those invalidate_range when infact you did.

Well, you moved the code at minimum. Hmmm... according to
http://marc.info/?l=linux-kernel&m=120114755620891&w=2 it was Robin.

> The user space spinlock also plays into declaring rdtscp unworkable
> for providing a monotonic vgettimeofday w/o kernel locking.

No idea what you are talking about.

> My patch by calling invalidate_page inside ptep_clear_flush guaranteed
> that both the thread writing through sptes and the thread writing
> through linux ptes, couldn't possibly simultaneously write to two
> different physical pages.

But then the ptep_clear_flush will issue invalidate_page() for ranges 
that were already covered by invalidate_range(). There are multiple calls 
to clear the same spte.
>
> Your patch allows the thread writing through the linux pte to write to
> a newly populated page while the old thread writing through sptes still
> writes to the old page. Is that safe? I don't know for sure. The fact
> that the physical page backing the virtual address could change back
> and forth perhaps invalidates the theory that somebody could possibly
> do some useful locking out of it, relying on all threads seeing the
> same physical page at the same time.

This is referring to the remap issue, not do_wp_page, right?

> Actually above I was describing remap_file_pages not do_wp_page.

Ok.

The serialization of remap_file_pages does not seem that critical since we 
only take a read lock on mmap_sem here. There may already be concurrent 
access to pages from other processors while the ptes are remapped. So 
there is already some overlap.

We could take mmap_sem there writably and keep it writable for the case 
where we have an mmu notifier in the mm.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:28                   ` Jack Steiner
@ 2008-01-30  0:35                     ` Christoph Lameter
  2008-01-30 13:37                     ` Andrea Arcangeli
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30  0:35 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Tue, 29 Jan 2008, Jack Steiner wrote:

> > That is true for your implementation and to address Robin's issues. Jack: 
> > Is that true for the GRU?
> 
> I'm not sure I understand the question. The GRU never (currently) takes
> a reference on a page. It has no mechanism for tracking pages that
> were exported to the external TLBs.

That's what I was looking for. Thanks. KVM takes a refcount and so does 
XPmem.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:22                   ` Christoph Lameter
@ 2008-01-30  0:59                     ` Andrea Arcangeli
  2008-01-30  8:26                       ` Peter Zijlstra
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30  0:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote:
> That is only partially true. ptes are created write-protected in order to track 
> dirty state these days. The first write will lead to a fault that switches 
> the pte to writable. When the page undergoes writeback the page again 
> becomes write protected. Thus our need to effectively deal with 
> page_mkclean.

Well I was talking about anonymous memory.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:59                     ` Andrea Arcangeli
@ 2008-01-30  8:26                       ` Peter Zijlstra
  0 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2008-01-30  8:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins


On Wed, 2008-01-30 at 01:59 +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 04:22:46PM -0800, Christoph Lameter wrote:
> > That is only partially true. ptes are created write-protected in order to track 
> > dirty state these days. The first write will lead to a fault that switches 
> > the pte to writable. When the page undergoes writeback the page again 
> > becomes write protected. Thus our need to effectively deal with 
> > page_mkclean.
> 
> Well I was talking about anonymous memory.

Just to be absolutely clear on this (I lost track of what exactly we are
talking about here), nonlinear mappings do not do the dirty accounting,
and are not allowed on a backing store that would require dirty
accounting.




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:28                   ` Jack Steiner
  2008-01-30  0:35                     ` Christoph Lameter
@ 2008-01-30 13:37                     ` Andrea Arcangeli
  2008-01-30 14:43                       ` Jack Steiner
  1 sibling, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 13:37 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
> On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> > On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> > 
> > > > invalidate_range after populate allows access to memory for which ptes 
> > > > were zapped and the refcount was released.
> > > 
> > > The last refcount is released by the invalidate_range itself.
> > 
> > That is true for your implementation and to address Robin's issues. Jack: 
> > Is that true for the GRU?
> 
> I'm not sure I understand the question. The GRU never (currently) takes
> a reference on a page. It has no mechanism for tracking pages that
> were exported to the external TLBs.

If you don't have a pin, then things like invalidate_range in
remap_file_pages can't be safe as writes through the external TLBs can
keep going on pages in the freelist. For you to be safe w/o a
page-pin, you need to return in the direction of invalidate_page
inside ptep_clear_flush (or anyway before
page_cache_release/__free_page/put_page...). You're generally not safe
with any invalidate_range that may run after the page pointed by the
pte has been freed (or can be freed by the VM anytime because of being
unpinned cache).
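
Schematically, the window that makes a non-pinning secondary TLB unsafe
with a late invalidate_range (my reading of the GRU case):

	/*
	 * zap/unmap of the pte     -> linux pte cleared, page freed
	 *                             (no pin held by the GRU)
	 * external TLB still live  -> GRU keeps reading/writing a page
	 *                             that is already on the freelist
	 * invalidate_range()       -> runs too late to help
	 */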

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 13:37                     ` Andrea Arcangeli
@ 2008-01-30 14:43                       ` Jack Steiner
  2008-01-30 19:41                         ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Jack Steiner @ 2008-01-30 14:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 02:37:20PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:28:05PM -0600, Jack Steiner wrote:
> > On Tue, Jan 29, 2008 at 04:20:50PM -0800, Christoph Lameter wrote:
> > > On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> > > 
> > > > > invalidate_range after populate allows access to memory for which ptes 
> > > > > were zapped and the refcount was released.
> > > > 
> > > > The last refcount is released by the invalidate_range itself.
> > > 
> > > That is true for your implementation and to address Robin's issues. Jack: 
> > > Is that true for the GRU?
> > 
> > I'm not sure I understand the question. The GRU never (currently) takes
> > a reference on a page. It has no mechanism for tracking pages that
> > were exported to the external TLBs.
> 
> If you don't have a pin, then things like invalidate_range in
> remap_file_pages can't be safe as writes through the external TLBs can
> keep going on pages in the freelist. For you to be safe w/o a
> page-pin, you need to return in the direction of invalidate_page
> inside ptep_clear_flush (or anyway before
> page_cache_release/__free_page/put_page...). You're generally not safe
> with any invalidate_range that may run after the page pointed by the
> pte has been freed (or can be freed by the VM anytime because of being
> unpinned cache).

Yuck....

I see what you mean. I need to review the mail to see why this changed,
but in the original discussions with Christoph the invalidate_range
callouts were supposed to be made BEFORE the pages were put on the freelist.


--- jack

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30  0:00               ` Andrea Arcangeli
  2008-01-30  0:05                 ` Andrea Arcangeli
  2008-01-30  0:20                 ` Christoph Lameter
@ 2008-01-30 16:11                 ` Robin Holt
  2008-01-30 17:04                   ` Andrea Arcangeli
  2008-01-30 19:35                   ` Christoph Lameter
  2 siblings, 2 replies; 97+ messages in thread
From: Robin Holt @ 2008-01-30 16:11 UTC (permalink / raw)
  To: Andrea Arcangeli, Christoph Lameter
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

> Robin, if you don't mind, could you please post or upload somewhere
> your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
> is it top secret? I wouldn't mind to have a look so I can better
> understand what's the exact reason you're sleeping besides attempting
> GFP_KERNEL allocations. Thanks!

Dean is still actively working on updating the xpmem patch posted
here a few months ago, reworked for the mmu_notifiers.  I am sure
we can give you an early look, but it is in a really rough state.

http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t

The need to sleep comes from the fact that these PFNs are sent to other
hosts on the same NUMA fabric which have direct access to the pages
and then placed into remote process's page tables and then filled into
their TLBs.  Our only means of communicating the recall is async.

I think I need to straighten this discussion out in my head a little bit.
Am I correct in assuming Andrea's original patch set did not have any SMP
race conditions for KVM?  If so, then we need to start looking at how to
implement Christoph's and my changes in a safe fashion.  Andrea, I agree
completely that our introduction of the range callouts has introduced
SMP races.

The three issues we need to simultaneously solve are revoking the remote
page table/tlb information while still in a sleepable context and not
having the remote faulters become out of sync with the granting process.
Currently, I don't see a way to do that cleanly with a single callout.

Could we consider doing a range-based recall and lock callout before
clearing the process's page tables/TLBs, then use the _page or _range
callouts from Andrea's patch to clear the mappings, and finally make a
range-based unlock callout?  The mmu_notifier user would usually use ops
for either the recall+lock/unlock family of callouts or the _page/_range
family of callouts.
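
In code form, that proposed sequence might look roughly like the sketch
below (the two callout names are hypothetical; only zap_page_range is an
existing kernel call):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Hypothetical ordering, not an existing API: recall and block remote
 * faults, let the kernel tear down its own ptes/TLBs, then unblock. */
static void xpmem_style_teardown(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;

	/* 1. Recall remote copies and block further remote faults (sleeps). */
	mmu_notifier(invalidate_and_lock_range, mm, start, end);

	/* 2. Normal kernel teardown; only the local page tables are in play. */
	zap_page_range(vma, start, end - start, NULL);

	/* 3. Re-enable remote faulting on the range (sleeps). */
	mmu_notifier(unlock_range, mm, start, end);
}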

Thanks,
Robin

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 16:11                 ` Robin Holt
@ 2008-01-30 17:04                   ` Andrea Arcangeli
  2008-01-30 17:30                     ` Robin Holt
  2008-01-30 19:35                   ` Christoph Lameter
  1 sibling, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 17:04 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Avi Kivity, Izik Eidus, Nick Piggin,
	kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:
> > Robin, if you don't mind, could you please post or upload somewhere
> > your GPLv2 code that registers itself in Christoph's V2 notifiers? Or
> > is it top secret? I wouldn't mind to have a look so I can better
> > understand what's the exact reason you're sleeping besides attempting
> > GFP_KERNEL allocations. Thanks!
> 
> Dean is still actively working on updating the xpmem patch posted
> here a few months ago reworked for the mmu_notifiers.  I am sure
> we can give you a early look, but it is in a really rough state.
> 
> http://marc.info/?l=linux-mm&w=2&r=1&s=xpmem&q=t
> 
> The need to sleep comes from the fact that these PFNs are sent to other
> hosts on the same NUMA fabric which have direct access to the pages
> and then placed into remote process's page tables and then filled into
> their TLBs.  Our only means of communicating the recall is async.
> 
> I think I need to straighten this discussion out in my head a little bit.
> Am I correct in assuming Andrea's original patch set did not have any SMP
> race conditions for KVM?  If so, then we need to start looking at how to

Yes my last patch was SMP safe, stable and feature complete for KVM. I
tested it for 1 week on my smp workstation with real desktop load and
everything loaded, with 3G non-linux guest running on 2G of ram.

Now for whatever reason I adapted the KVM side to Christoph's V2/V3
and it hangs the moment it hits swap. However in the meantime I
changed test hardware, upgraded host to 2.6.24-hg, and upgraded kvm
kernel and userland. All patches applied cleanly (with a minor nit in
a .h include in V2 on top of current git). Swapping of regular tasks
on the test system is 100% solid or I wouldn't even be wasting time
mentioning this. By code inspection I didn't expect a stability
regression or I wouldn't have changed all variables at the same time
(taking the opportunity to move everything to bleeding edge while
moving to V2 turned out to be a bad idea). I already audited the mmu
notifiers a few times; in fact I already went back to calling
invalidate_page and age_page inside ptep_clear_flush/young in case the
page-pin wasn't enough to prevent the page from changing under the
sptes, as I thought yesterday.

Christoph's V3 notably still misses the needed range flushes in mremap
for example, but that's not my problem.  (Jack instead will certainly
see a kernel crash due to the missing invalidate_page after ptep_clear_flush
in mremap; such an invalidate_page wasn't missing from my last patch.)

I'm now going to run the same binaries that still are stable on my
workstation on the test system too, to rule out timings and hardware
differences.

> implement Christoph's and my changes in a safe fashion.  Andrea, I agree
> complete that our introduction of the range callouts have introduced
> SMP races.

I think for KVM basic swapping both V2 and V3 should be safe. V2 had
race conditions that would later break KSM, yes; I fixed those and V3
should already be ok, and I'm not testing KSM. This is all thanks to the
pin of the page in get_user_page that KVM does for every page mapped
in any spte.

> The three issues we need to simultaneously solve is revoking the remote
> page table/tlb information while still in a sleepable context and not
> having the remote faulters become out of sync with the granting process.
> Currently, I don't see a way to do that cleanly with a single callout.

Agreed.

> Could we consider doing a range-based recall and lock callout before
> clearing the processes page tables/TLBs, then use the _page or _range
> callouts from Andrea's patch to clear the mappings,  finally make a
> range-based unlock callout.  The mmu_notifier user would usually use ops
> for either the recall+lock/unlock family of callouts or the _page/_range
> family of callouts.

invalidate_page/age_page can go back inside ptep_clear_flush/young and
Jack will need that too. In fact Jack will need an invalidate_page also
inside ptep_get_and_clear. And the range callout will always be done
in a sleeping context and it'll rely on the page-pin to be safe (when
details->i_mmap_lock != NULL, invalidate_range shouldn't be called
inside zap_page_range but before returning from
unmap_mapping_range_vma, before the cond_resched). This will make
everything a bit simpler and less prone to breakage IMHO, plus it'll
have a chance to work for Jack w/o a page-pin and without additional
cluttering of mm/*.c.
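
As a sketch of what "inside ptep_clear_flush/young" means in practice
(the wrapper and the mmu_notifier_age_page() helper name are assumptions,
not code from these patches):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Notifier-aware variant of ptep_clear_flush_young(): the external TLB
 * contributes its "referenced" information right where the young bit is
 * harvested, so no separate range pass is needed for aging. */
static inline int mn_ptep_clear_flush_young(struct vm_area_struct *vma,
					    unsigned long address, pte_t *ptep)
{
	int young;

	young = ptep_clear_flush_young(vma, address, ptep);
	young |= mmu_notifier_age_page(vma->vm_mm, address);	/* assumed helper */
	return young;
}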

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 17:04                   ` Andrea Arcangeli
@ 2008-01-30 17:30                     ` Robin Holt
  2008-01-30 18:25                       ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Robin Holt @ 2008-01-30 17:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Christoph Lameter, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 06:04:52PM +0100, Andrea Arcangeli wrote:
> On Wed, Jan 30, 2008 at 10:11:24AM -0600, Robin Holt wrote:
...
> > The three issues we need to simultaneously solve is revoking the remote
> > page table/tlb information while still in a sleepable context and not
> > having the remote faulters become out of sync with the granting process.
...
> > Could we consider doing a range-based recall and lock callout before
> > clearing the processes page tables/TLBs, then use the _page or _range
> > callouts from Andrea's patch to clear the mappings,  finally make a
> > range-based unlock callout.  The mmu_notifier user would usually use ops
> > for either the recall+lock/unlock family of callouts or the _page/_range
> > family of callouts.
> 
> invalidate_page/age_page can return inside ptep_clear_flush/young and
> Jack will need that too. Infact Jack will need an invalidate_page also
> inside ptep_get_and_clear. And the range callout will be done always
> in a sleeping context and it'll relay on the page-pin to be safe (when
> details->i_mmap_lock != NULL invalidate_range it shouldn't be called
> inside zap_page_range but before returning from
> unmap_mapping_range_vma before cond_resched). This will make
> everything a bit simpler and less prone to breakage IMHO, plus it'll
> have a chance to work for Jack w/o page-pin without additional
> cluttering of mm/*.c.

I don't think I saw the answer to my original question.  I assume your
original patch, extended in a way similar to what Christoph has done,
can be made to work to cover both the KVM and GRU (Jack's) case.

XPMEM, however, does not look to be solvable due to the three simultaneous
issues above.  To address that, I think I am coming to the conclusion
that we need an accompanying but separate pair of callouts.  The first
will ensure the remote page tables and TLBs are cleared and all page
information is returned to the process that is granting access to
its address space.  That will include an implicit block on the address
range so no further faults will be satisfied by the remote accessor
(forgot the KVM name for this, sorry).  Any faults will be held off
and only the process's page tables/TLBs are in play.  Once the normal
processing of the kernel is complete, an unlock callout would be made
for the range and then faulting may occur on behalf of the process again.

Currently, this is the only direct solution that I can see as a
possibility.  My question is twofold.  Does this seem like a reasonable
means to solve the three simultaneous issues above, and if so, does it
seem like the most reasonable means?

Thanks,
Robin

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 17:30                     ` Robin Holt
@ 2008-01-30 18:25                       ` Andrea Arcangeli
  2008-01-30 19:50                         ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 18:25 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Avi Kivity, Izik Eidus, Nick Piggin,
	kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 11:30:09AM -0600, Robin Holt wrote:
> I don't think I saw the answer to my original question.  I assume your
> original patch, extended in a way similar to what Christoph has done,
> can be made to work to cover both the KVM and GRU (Jack's) case.

Yes, I think so.

> XPMEM, however, does not look to be solvable due to the three simultaneous
> issues above.  To address that, I think I am coming to the conclusion
> that we need an accompanying but seperate pair of callouts.  The first

The mmu_rmap_notifiers are already one separate pair of callouts and
we can add more of them of course.

> will ensure the remote page tables and TLBs are cleared and all page
> information is returned back to the process that is granting access to
> its address space.  That will include an implicit block on the address
> range so no further faults will be satisfied by the remote accessor
> (forgot the KVM name for this, sorry).  Any faults will be held off
> and only the processes page tables/TLBs are in play.  Once the normal

Good, this "block" is how you close the race condition, and you need
the second callout to "unblock" (this is why it could hardly work well
before with a single invalidate_range).

> processing of the kernel is complete, an unlock callout would be made
> for the range and then faulting may occur on behalf of the process again.

This sounds good.

> Currently, this is the only direct solution that I can see as a
> possibility.  My question is two fold.  Does this seem like a reasonable
> means to solve the three simultaneous issues above and if so, does it
> seem like the most reasonable means?

Yes.

KVM can deal with both invalidate_page (atomic) and invalidate_range (sleepy)

GRU can only deal with invalidate_page (atomic)

XPMEM requires invalidate_range (sleepy) +
before_invalidate_range (sleepy). invalidate_all should also be called
before_release (both sleepy).

It sounds like we need full overlap of the information provided by
invalidate_page and invalidate_range to fit all three models (the
opposite of the zero-overlap objective that current V3 is taking). And
swap will be handled only by invalidate_page, either through the linux
rmap or the external rmap (the latter can sleep so it's ok for you,
the former not). GRU can safely use either the linux rmap notifier
or the external rmap notifier equally well, because when try_to_unmap
is called the page is locked and obviously pinned by the VM itself.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 16:11                 ` Robin Holt
  2008-01-30 17:04                   ` Andrea Arcangeli
@ 2008-01-30 19:35                   ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:35 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Robin Holt wrote:

> I think I need to straighten this discussion out in my head a little bit.
> Am I correct in assuming Andrea's original patch set did not have any SMP
> race conditions for KVM?  If so, then we need to start looking at how to
> implement Christoph's and my changes in a safe fashion.  Andrea, I agree
> complete that our introduction of the range callouts have introduced
> SMP races.

The original patch drew the clearing of the sptes into ptep_clear_flush().
So the invalidate_page was called for each page regardless of whether we
had been doing an invalidate range before or not. It seems that the
invalidate_range() was just there as an optimization.
 
> The three issues we need to simultaneously solve is revoking the remote
> page table/tlb information while still in a sleepable context and not
> having the remote faulters become out of sync with the granting process.
> Currently, I don't see a way to do that cleanly with a single callout.

You could use the invalidate_page callouts to set a flag that no 
additional rmap entries may be added until the invalidate_range has 
occurred? We could add back all the original invalidate_pages() and pass
a flag that specifies that an invalidate range will follow. The notifier 
can then decide what to do with that information. If it's okay to defer 
then do nothing and wait for the range_invalidate. XPmem could stop 
allowing external references to be established until the invalidate_range 
was successful.
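
A loose driver-side sketch of that deferral (all names are hypothetical;
the callback signatures follow the V2 ops as posted in this thread, which
may still change):

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/mmu_notifier.h>

/* invalidate_page() only blocks new remote references; the real teardown
 * and the unblocking happen when the range callout arrives. */
struct xpmem_seg {
	struct mmu_notifier	mn;
	spinlock_t		lock;
	int			blocked;	/* no new remote refs while set */
};

static void xpmem_invalidate_page(struct mmu_notifier *mn,
				  struct mm_struct *mm, unsigned long address)
{
	struct xpmem_seg *seg = container_of(mn, struct xpmem_seg, mn);

	spin_lock(&seg->lock);
	seg->blocked = 1;	/* remote fault handlers check this and back off */
	spin_unlock(&seg->lock);
}

static void xpmem_invalidate_range(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long start, unsigned long end,
				   int lock)
{
	struct xpmem_seg *seg = container_of(mn, struct xpmem_seg, mn);

	/* ... recall the remote ptes/TLBs for [start, end) here ... */

	spin_lock(&seg->lock);
	seg->blocked = 0;	/* remote references may be established again */
	spin_unlock(&seg->lock);
}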

Jack had a concern that multiple callouts for the same pte could cause 
problems.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 14:43                       ` Jack Steiner
@ 2008-01-30 19:41                         ` Christoph Lameter
  2008-01-30 20:29                           ` Jack Steiner
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:41 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Jack Steiner wrote:

> I see what you mean. I need to review to mail to see why this changed
> but in the original discussions with Christoph, the invalidate_range
> callouts were suppose to be made BEFORE the pages were put on the freelist.

Seems that we cannot rely on the invalidate_ranges for correctness at all?
We need to have invalidate_page() always. invalidate_range() is only an 
optimization.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 18:25                       ` Andrea Arcangeli
@ 2008-01-30 19:50                         ` Christoph Lameter
  2008-01-30 22:18                           ` Robin Holt
  2008-01-30 23:52                           ` Andrea Arcangeli
  0 siblings, 2 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Andrea Arcangeli wrote:

> XPMEM requires with invalidate_range (sleepy) +
> before_invalidate_range (sleepy). invalidate_all should also be called
> before_release (both sleepy).
> 
> It sounds we need full overlap of information provided by
> invalidate_page and invalidate_range to fit all three models (the
> opposite of the zero objective that current V3 is taking). And the
> swap will be handled only by invalidate_page either through linux rmap
> or external rmap (with the latter that can sleep so it's ok for you,
> the former not). GRU can safely use the either the linux rmap notifier
> or the external rmap notifier equally well, because when try_to_unmap
> is called the page is locked and obviously pinned by the VM itself.

So put the invalidate_page() callbacks in everywhere.

Then we have 

invalidate_range_start(mm)

and

invalidate_range_finish(mm, start, end)

in addition to the invalidate rmap_notifier?

---
 include/linux/mmu_notifier.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 11:49:02.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 11:49:57.000000000 -0800
@@ -69,10 +69,13 @@ struct mmu_notifier_ops {
 	/*
 	 * lock indicates that the function is called under spinlock.
 	 */
-	void (*invalidate_range)(struct mmu_notifier *mn,
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 				 struct mm_struct *mm,
-				 unsigned long start, unsigned long end,
 				 int lock);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
 };
 
 struct mmu_rmap_notifier_ops;

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 19:41                         ` Christoph Lameter
@ 2008-01-30 20:29                           ` Jack Steiner
  2008-01-30 20:55                             ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Jack Steiner @ 2008-01-30 20:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 11:41:29AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Jack Steiner wrote:
> 
> > I see what you mean. I need to review to mail to see why this changed
> > but in the original discussions with Christoph, the invalidate_range
> > callouts were suppose to be made BEFORE the pages were put on the freelist.
> 
> Seems that we cannot rely on the invalidate_ranges for correctness at all?
> We need to have invalidate_page() always. invalidate_range() is only an 
> optimization.
> 

I don't understand your point "an optimization". How would invalidate_range
as currently defined be correctly used?

It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
the page & drops it when invalidate_range is called. That may work (not sure)
for xpmem but not for the GRU.

--- jack


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 20:29                           ` Jack Steiner
@ 2008-01-30 20:55                             ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 20:55 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Jack Steiner wrote:

> > Seems that we cannot rely on the invalidate_ranges for correctness at all?
> > We need to have invalidate_page() always. invalidate_range() is only an 
> > optimization.
> > 
> 
> I don't understand your point "an optimization". How would invalidate_range
> as currently defined be correctly used?

We are changing definitions. The original patch by Andrea calls 
invalidate_page for each pte that is cleared. So strictly you would not 
need an invalidate_range.

> It _looks_ like it would work only if xpmem/gru/etc takes a refcnt on
> the page & drops it when invalidate_range is called. That may work (not sure)
> for xpmem but not for the GRU.

The refcount is not necessary if we adopt Andrea's approach of a callback 
on the clearing of each pte. At that point the page is still guaranteed to 
exist. If we do the range_invalidate later (as in V3) then the page may 
have been released (see sys_remap_file_pages(), for example) before we zap the GRU 
ptes. So there will be a time when the GRU may write to a page that has 
been freed and used for another purpose.

Taking a refcount on the page defers the free until the range_invalidate 
runs.

I would prefer a solution that does not require taking refcounts (pins) 
for establishing an external pte and for release (like what the GRU does).

If we could effectively determine that there are no external ptes in a 
range then the invalidate_page() call may return immediately. Maybe it is 
then effective to do these gazillions of invalidate_page() calls when a 
process terminates or a remap is performed.
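
A sketch of that early-exit idea (struct layout, helper and locking are
all made up for illustration):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct gru_ctx {
	struct mmu_notifier	mn;
	unsigned long		exported_start;	/* range handed to the GRU */
	unsigned long		exported_end;
};

void gru_flush_external_tlb(struct gru_ctx *ctx,
			    unsigned long start, unsigned long end);

static void gru_invalidate_page(struct mmu_notifier *mn,
				struct mm_struct *mm, unsigned long address)
{
	struct gru_ctx *ctx = container_of(mn, struct gru_ctx, mn);

	/* Cheap early exit: nothing was ever exported at this address,
	 * so the gazillions of per-page callouts cost almost nothing. */
	if (address < ctx->exported_start || address >= ctx->exported_end)
		return;

	gru_flush_external_tlb(ctx, address, address + PAGE_SIZE);
}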


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 19:50                         ` Christoph Lameter
@ 2008-01-30 22:18                           ` Robin Holt
  2008-01-30 23:52                           ` Andrea Arcangeli
  1 sibling, 0 replies; 97+ messages in thread
From: Robin Holt @ 2008-01-30 22:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Andrea Arcangeli wrote:
> 
> > XPMEM requires with invalidate_range (sleepy) +
> > before_invalidate_range (sleepy). invalidate_all should also be called
> > before_release (both sleepy).
> > 
> > It sounds we need full overlap of information provided by
> > invalidate_page and invalidate_range to fit all three models (the
> > opposite of the zero objective that current V3 is taking). And the
> > swap will be handled only by invalidate_page either through linux rmap
> > or external rmap (with the latter that can sleep so it's ok for you,
> > the former not). GRU can safely use the either the linux rmap notifier
> > or the external rmap notifier equally well, because when try_to_unmap
> > is called the page is locked and obviously pinned by the VM itself.
> 
> So put the invalidate_page() callbacks in everywhere.

The way I am envisioning it, we essentially drop back to Andrea's original
patch.  We then introduce an invalidate_range_begin (I was really thinking
of it as invalidate_and_lock_range()) and an invalidate_range_end (again
I was thinking of unlock_range).

Thanks,
Robin

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 19:50                         ` Christoph Lameter
  2008-01-30 22:18                           ` Robin Holt
@ 2008-01-30 23:52                           ` Andrea Arcangeli
  2008-01-31  0:01                             ` Christoph Lameter
  1 sibling, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 23:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 11:50:26AM -0800, Christoph Lameter wrote:
> Then we have 
> 
> invalidate_range_start(mm)
> 
> and
> 
> invalidate_range_finish(mm, start, end)
> 
> in addition to the invalidate rmap_notifier?
> 
> ---
>  include/linux/mmu_notifier.h |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 11:49:02.000000000 -0800
> +++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 11:49:57.000000000 -0800
> @@ -69,10 +69,13 @@ struct mmu_notifier_ops {
>  	/*
>  	 * lock indicates that the function is called under spinlock.
>  	 */
> -	void (*invalidate_range)(struct mmu_notifier *mn,
> +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
>  				 struct mm_struct *mm,
> -				 unsigned long start, unsigned long end,
>  				 int lock);
> +
> +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end);
>  };

start/finish/begin/end/before/after? ;)

I'd drop the 'int lock': you should skip the before/after if
i_mmap_lock isn't null and offload it to the caller before taking the
lock. At least for the "after" call that looks like a few-liner change;
I didn't figure out the "before" yet.

Given the amount of changes that are going on in design terms to cover
both XPMEM and GRU, can we split out the minimal invalidate_page that
provides an obviously safe and feature complete mmu notifier code for
KVM, and merge that first patch that will cover KVM 100%, it will
cover GRU 90%, and then we add invalidate_range_before/after in a
separate patch and we close the remaining 10% for GRU covering
ptep_get_and_clear or whatever else ptep_*?  The mmu notifiers are
made so that they are extensible in a backwards compatible way. I think
invalidate_page inside ptep_clear_flush is the first fundamental block
of the mmu notifiers. Then once the fundamentals are in and obviously
safe and feature complete for KVM, the rest can be added very easily
with incremental patches as far as I can tell. That would be my
preferred route ;)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-30 23:52                           ` Andrea Arcangeli
@ 2008-01-31  0:01                             ` Christoph Lameter
  2008-01-31  0:34                               ` [kvm-devel] " Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-31  0:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > -	void (*invalidate_range)(struct mmu_notifier *mn,
> > +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
> >  				 struct mm_struct *mm,
> > -				 unsigned long start, unsigned long end,
> >  				 int lock);
> > +
> > +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> > +				 struct mm_struct *mm,
> > +				 unsigned long start, unsigned long end);
> >  };
> 
> start/finish/begin/end/before/after? ;)

Well, let's pick one and then stick to it.

> I'd drop the 'int lock', you should skip the before/after if
> i_mmap_lock isn't null and offload it to the caller before taking the
> lock. At least for the "after" call that looks a few liner change,
> didn't figure out the "before" yet.

How would we offload that? Before the scan of the rmaps we do not have the 
mm_struct. So we'd need another notifier_rmap_callback.

> Given the amount of changes that are going on in design terms to cover
> both XPMEM and GRE, can we split the minimal invalidate_page that
> provides an obviously safe and feature complete mmu notifier code for
> KVM, and merge that first patch that will cover KVM 100%, it will

The obvious solution does not scale. You will have a callback for every 
page and there may be a million of those if you have a 4GB process.

> made so that are extendible in backwards compatible way. I think
> invalidate_page inside ptep_clear_flush is the first fundamental block
> of the mmu notifiers. Then once the fundamental is in and obviously
> safe and feature complete for KVM, the rest can be added very easily
> with incremental patches as far as I can tell. That would be my
> preferred route ;)

We need to have a coherent notifier solution that works for multiple 
scenarios. I think a working invalidate_range would also be required for 
KVM. KVM and GRU are very similar so they should be able to use the same 
mechanisms and we need to properly document how that mechanism is safe. 
Either both take a page refcount or none.




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  0:01                             ` Christoph Lameter
@ 2008-01-31  0:34                               ` Andrea Arcangeli
  2008-01-31  1:46                                 ` Christoph Lameter
  2008-01-31  2:08                                 ` Christoph Lameter
  0 siblings, 2 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-31  0:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> How we offload that? Before the scan of the rmaps we do not have the 
> mmstruct. So we'd need another notifier_rmap_callback.

My assumption is that that "int lock" exists just because
unmap_mapping_range_vma exists. If I'm right then my suggestion was to
move the invalidate_range after dropping the i_mmap_lock and not to
invoke it inside zap_page_range.

> The obvious solution does not scale. You will have a callback for every 

Scale is the wrong word. The PT lock will prevent any other cpu from
thrashing on the mmu_lock, so it's a fixed cost for each pte_clear with no
scalability risk, nor any complexity issue. Certainly we could average
certain fixed costs over more than one pte_clear to boost performance,
and that's a good idea. Not really a short term concern, we need to swap
reliably first ;).

> page and there may be a million of those if you have a 4GB process.

That can be optimized by adding a __ptep_clear_flush and an
invalidate_pages (let's call it pages to better show it's a
'clustered' version of invalidate_page, to avoid the confusion with
_range_before/after which do an entirely different thing). Also for
_range I tend to like before/after, as a means to say before the
pte_clear and after the pte_clear, but any other meaning is ok with me.

We add invalidate_page and invalidate_pages
immediately. invalidate_pages may never be called initially by the
linux VM, we can start calling it later as we replace ptep_clear_flush
with __ptep_clear_flush (or local_ptep_clear_flush).
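
As a sketch of what that clustering could look like (both __ptep_clear_flush
and the invalidate_pages callout are hypothetical here, mirroring the
proposal above; the loop assumes the run stays within one page table):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Clear a run of ptes without a notifier call per pte, then notify the
 * external TLB once for the whole cluster. */
static void zap_pte_cluster(struct vm_area_struct *vma, pte_t *ptep,
			    unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++)
		__ptep_clear_flush(vma, addr, ptep);	/* no notifier here */

	/* One external flush amortized over the whole run. */
	mmu_notifier(invalidate_pages, vma->vm_mm, start, end);
}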

I don't see any problem with this approach and it looks quite clean to
me and it leaves you full room for experimenting in practice with
range_before/after while knowing those range_before/after won't
require many changes.

And for things like the age_page it will never happen that you could
call the respective ptep_clear_flush_young w/o mmu notifier age_page
after it, so you won't ever risk having to add an age_pages or a
__ptep_clear_flush_young.

> We need to have a coherent notifier solution that works for multiple 
> scenarios. I think a working invalidate_range would also be required for 
> KVM. KVM and GRUB are very similar so they should be able to use the same 
> mechanisms and we need to properly document how that mechanism is safe. 
> Either both take a page refcount or none.

There's no reason why KVM should take any risk of corrupting memory
due to a single missing mmu notifier by not taking the
refcount. get_user_pages will take it for us, so we have to pay the
atomic-op anyway. It is surely worth doing the atomic_dec inside the mmu
notifier, and not immediately like this:

	  get_user_pages(pages)
	  __free_page(pages[0])

The idea is that what works for GRU, works for KVM too. So we do a
single invalidate_page and clustered invalidate_pages, we add that,
and then we make sure all places are covered so GRU will not
kernel-crash, and KVM won't risk running oom or generating _userland_
corruption.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  0:34                               ` [kvm-devel] " Andrea Arcangeli
@ 2008-01-31  1:46                                 ` Christoph Lameter
  2008-01-31  2:34                                   ` Robin Holt
  2008-01-31 10:52                                   ` [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Andrea Arcangeli
  2008-01-31  2:08                                 ` Christoph Lameter
  1 sibling, 2 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-31  1:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 04:01:31PM -0800, Christoph Lameter wrote:
> > How we offload that? Before the scan of the rmaps we do not have the 
> > mmstruct. So we'd need another notifier_rmap_callback.
> 
> My assumption is that that "int lock" exists just because
> unmap_mapping_range_vma exists. If I'm right then my suggestion was to
> move the invalidate_range after dropping the i_mmap_lock and not to
> invoke it inside zap_page_range.

There is still no pointer to the mm_struct available there because pages 
of a mapping may belong to multiple processes. So we need to add another 
rmap method?

The same issue also occurs for unmap_hugepages().
 
> There's no reason why KVM should take any risk of corrupting memory
> due to a single missing mmu notifier, with not taking the
> refcount. get_user_pages will take it for us, so we have to pay the
> atomic-op anyway. It sure worth doing the atomic_dec inside the mmu
> notifier, and not immediately like this:

Well the GRU uses follow_page() instead of get_user_pages. Performance is 
a major issue for the GRU. 


> 	  get_user_pages(pages)
> 	  __free_page(pages[0])
> 
> The idea is that what works for GRU, works for KVM too. So we do a
> single invalidate_page and clustered invalidate_pages, we add that,
> and then we make sure all places are covered so GRU will not
> kernel-crash, and KVM won't risk to run oom or to generate _userland_
> corruption.

Hmmmm.. Could we go to a scheme where we do not have to increase the page 
count? Modifications of the page struct require dirtying a cache line and 
it seems that we do not need an increased page count if we have an
invalidate_range_start() that clears all the external references 
and stops the establishment of new ones and invalidate_range_end() that 
reenables new external references?

Then we do not need the frequent invalidate_page() calls.

The typical case would anyway be that invalidate_all() is called 
before anything else on exit. Invalidate_all() would remove all pages 
and disable creation of new references to the memory in the mm_struct.




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  0:34                               ` [kvm-devel] " Andrea Arcangeli
  2008-01-31  1:46                                 ` Christoph Lameter
@ 2008-01-31  2:08                                 ` Christoph Lameter
  2008-01-31  2:42                                   ` Andrea Arcangeli
  1 sibling, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins


Patch to


1. Remove sync on notifier_release. Must be called when only a 
   single process remains.

2. Add invalidate_range_start/end. This should allow safe removal
   of ranges of external ptes without having to resort to a callback
   for every individual page.

This must be able to nest so the driver needs to keep a refcount of range 
invalidates and wait if the refcount != 0.
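
One possible driver-side shape for that nesting rule (names are
hypothetical; the callback signatures match the patch below): keep a count
of in-flight range invalidates and let the external fault path wait for it
to drop back to zero.

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>
#include <asm/atomic.h>

static atomic_t invalidate_count = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(invalidate_wait);

static void drv_invalidate_range_begin(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end,
				       int lock)
{
	atomic_inc(&invalidate_count);
	/* ... zap external ptes for [start, end) and stop new ones ... */
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	if (atomic_dec_and_test(&invalidate_count))
		wake_up(&invalidate_wait);
}

/* External fault path: refuse to establish a new reference while any
 * range invalidate is in flight (coarse, but honours the rule above). */
static void drv_wait_for_invalidates(void)
{
	wait_event(invalidate_wait, atomic_read(&invalidate_count) == 0);
}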


---
 include/linux/mmu_notifier.h |   11 +++++++++--
 mm/fremap.c                  |    3 ++-
 mm/hugetlb.c                 |    3 ++-
 mm/memory.c                  |   16 ++++++++++------
 mm/mmu_notifier.c            |    9 ++++-----
 5 files changed, 27 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 18:00:26.000000000 -0800
@@ -13,23 +13,22 @@
 #include <linux/mm.h>
 #include <linux/mmu_notifier.h>
 
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
 void mmu_notifier_release(struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
 	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
-		down_write(&mm->mmap_sem);
-		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
 			hlist_del_rcu(&mn->hlist);
 			if (mn->ops->release)
 				mn->ops->release(mn, mm);
 		}
-		rcu_read_unlock();
-		up_write(&mm->mmap_sem);
-		synchronize_rcu();
 	}
 }
 
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 18:00:26.000000000 -0800
@@ -67,15 +67,22 @@ struct mmu_notifier_ops {
 				int dummy);
 
 	/*
+	 * invalidate_range_begin() and invalidate_range_end() are paired.
+	 *
+	 * invalidate_range_begin must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
 	 * lock indicates that the function is called under spinlock.
 	 */
 	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
 				 int lock);
 
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
-				 struct mm_struct *mm,
-				 unsigned long start, unsigned long end);
+				 struct mm_struct *mm);
 };
 
 struct mmu_rmap_notifier_ops;
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/fremap.c	2008-01-30 18:00:26.000000000 -0800
@@ -212,8 +212,9 @@ asmlinkage long sys_remap_file_pages(uns
 		spin_unlock(&mapping->i_mmap_lock);
 	}
 
+	mmu_notifier(invalidate_range_start, mm, start, start + size, 0);
 	err = populate_range(mm, vma, start, size, pgoff);
-	mmu_notifier(invalidate_range, mm, start, start + size, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/hugetlb.c	2008-01-30 18:00:26.000000000 -0800
@@ -744,6 +744,7 @@ void __unmap_hugepage_range(struct vm_ar
 	BUG_ON(start & ~HPAGE_MASK);
 	BUG_ON(end & ~HPAGE_MASK);
 
+	mmu_notifier(invalidate_range_start, mm, start, end, 1);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -764,7 +765,7 @@ void __unmap_hugepage_range(struct vm_ar
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
-	mmu_notifier(invalidate_range, mm, start, end, 1);
+	mmu_notifier(invalidate_range_end, mm);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-01-30 17:58:48.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-01-30 18:00:51.000000000 -0800
@@ -888,11 +888,12 @@ unsigned long zap_page_range(struct vm_a
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
+	mmu_notifier(invalidate_range_start, mm, address, end,
+		(details ? (details->i_mmap_lock != NULL)  : 0));
 	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
-	mmu_notifier(invalidate_range, mm, address, end,
-		(details ? (details->i_mmap_lock != NULL)  : 0));
+	mmu_notifier(invalidate_range_end, mm);
 	return end;
 }
 
@@ -1355,6 +1356,7 @@ int remap_pfn_range(struct vm_area_struc
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
+	mmu_notifier(invalidate_range_start, mm, start, end, 0);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = remap_pud_range(mm, pgd, addr, next,
@@ -1362,7 +1364,7 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
-	mmu_notifier(invalidate_range, mm, start, end, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
@@ -1450,6 +1452,7 @@ int apply_to_page_range(struct mm_struct
 	int err;
 
 	BUG_ON(addr >= end);
+	mmu_notifier(invalidate_range_start, mm, start, end, 0);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -1457,7 +1460,7 @@ int apply_to_page_range(struct mm_struct
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
-	mmu_notifier(invalidate_range, mm, start, end, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
@@ -1635,6 +1638,8 @@ gotten:
 		goto oom;
 	cow_user_page(new_page, old_page, address, vma);
 
+	mmu_notifier(invalidate_range_start, mm, address,
+				address + PAGE_SIZE - 1, 0);
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1673,8 +1678,7 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	mmu_notifier(invalidate_range, mm, address,
-				address + PAGE_SIZE - 1, 0);
+	mmu_notifier(invalidate_range_end, mm);
 	if (dirty_page) {
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  1:46                                 ` Christoph Lameter
@ 2008-01-31  2:34                                   ` Robin Holt
  2008-01-31  2:37                                     ` Christoph Lameter
  2008-01-31  2:56                                     ` [kvm-devel] mmu_notifier: invalidate_range_start with lock=1 Christoph Lameter
  2008-01-31 10:52                                   ` [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Andrea Arcangeli
  1 sibling, 2 replies; 97+ messages in thread
From: Robin Holt @ 2008-01-31  2:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Nick Piggin, Peter Zijlstra, linux-mm,
	Benjamin Herrenschmidt, steiner, linux-kernel, Avi Kivity,
	kvm-devel, daniel.blueman, Robin Holt, Hugh Dickins

> Well the GRU uses follow_page() instead of get_user_pages. Performance is 
> a major issue for the GRU. 

Worse, the GRU takes its TLB faults from within an interrupt so we
use follow_page to prevent going to sleep.  That said, I think we
could probably use follow_page() with FOLL_GET set to accomplish the
requirements of the mmu_notifier invalidate_range call.  Doesn't look too
promising for hugetlb pages.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  2:34                                   ` Robin Holt
@ 2008-01-31  2:37                                     ` Christoph Lameter
  2008-01-31  2:56                                     ` [kvm-devel] mmu_notifier: invalidate_range_start with lock=1 Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:37 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Nick Piggin, Peter Zijlstra, linux-mm,
	Benjamin Herrenschmidt, steiner, linux-kernel, Avi Kivity,
	kvm-devel, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Robin Holt wrote:

> > Well the GRU uses follow_page() instead of get_user_pages. Performance is 
> > a major issue for the GRU. 
> 
> Worse, the GRU takes its TLB faults from within an interrupt so we
> use follow_page to prevent going to sleep.  That said, I think we
> could probably use follow_page() with FOLL_GET set to accomplish the
> requirements of mmu_notifier invalidate_range call.  Doesn't look too
> promising for hugetlb pages.

There may be no need to do so with the range_start/end scheme. The driver can 
have its own lock to make follow_page safe. The lock needs to serialize 
the follow_page handler and the range_start/end calls as well as the 
invalidate_page callouts. I think that avoids the need for 
get_user_pages().
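
Sketched out, that serialization could look like this (lock, flag and flush
helper are all illustrative; the begin/end signatures follow the split
proposed earlier in the thread):

#include <linux/spinlock.h>
#include <linux/mmu_notifier.h>

static DEFINE_SPINLOCK(gru_tlb_lock);
static int gru_range_active;	/* nonzero between range begin and end */

void gru_flush_external_tlb_range(struct mm_struct *mm,
				  unsigned long start, unsigned long end);

static void gru_invalidate_range_begin(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end,
				       int lock)
{
	spin_lock(&gru_tlb_lock);
	gru_range_active++;
	gru_flush_external_tlb_range(mm, start, end);
	spin_unlock(&gru_tlb_lock);
}

static void gru_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	spin_lock(&gru_tlb_lock);
	gru_range_active--;
	spin_unlock(&gru_tlb_lock);
}

/* Interrupt-time fault path: only drop a translation into the external
 * TLB while no range invalidate is in flight, under the same lock. */
static int gru_fault_insert(struct mm_struct *mm, unsigned long address,
			    unsigned long pfn)
{
	int inserted = 0;

	spin_lock(&gru_tlb_lock);
	if (!gru_range_active) {
		/* ... write pfn into the external TLB for (mm, address) ... */
		inserted = 1;
	}
	spin_unlock(&gru_tlb_lock);
	return inserted;
}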



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  2:08                                 ` Christoph Lameter
@ 2008-01-31  2:42                                   ` Andrea Arcangeli
  2008-01-31  2:51                                     ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-31  2:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
>  		hlist_for_each_entry_safe_rcu(mn, n, t,
				         ^^^^

>  					  &mm->mmu_notifier.head, hlist) {
>  			hlist_del_rcu(&mn->hlist);
				 ^^^^

_rcu can go away from both, if hlist_del_rcu can be called w/o locks.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  2:42                                   ` Andrea Arcangeli
@ 2008-01-31  2:51                                     ` Christoph Lameter
  2008-01-31 13:39                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> On Wed, Jan 30, 2008 at 06:08:14PM -0800, Christoph Lameter wrote:
> >  		hlist_for_each_entry_safe_rcu(mn, n, t,
> 				         ^^^^
> 
> >  					  &mm->mmu_notifier.head, hlist) {
> >  			hlist_del_rcu(&mn->hlist);
> 				 ^^^^
> 
> _rcu can go away from both, if hlist_del_rcu can be called w/o locks.

True. Is hlist_del_init ok? That would allow the driver to check that the 
mmu_notifier is already linked in using !hlist_unhashed(). The driver then 
needs to properly initialize the mmu_notifier list node with INIT_HLIST_NODE().
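
For illustration, the driver side of that scheme would be something like
this (assuming the mmu_notifier keeps its hlist member from the core patch):

#include <linux/list.h>
#include <linux/mmu_notifier.h>

static struct mmu_notifier my_mn;	/* .ops assignment omitted here */

static void my_notifier_init(void)
{
	INIT_HLIST_NODE(&my_mn.hlist);	/* starts out "unhashed" */
}

static int my_notifier_is_registered(void)
{
	/* True once mmu_notifier_register() has linked it into the mm. */
	return !hlist_unhashed(&my_mn.hlist);
}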



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] mmu_notifier: invalidate_range_start with lock=1
  2008-01-31  2:34                                   ` Robin Holt
  2008-01-31  2:37                                     ` Christoph Lameter
@ 2008-01-31  2:56                                     ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-31  2:56 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Nick Piggin, Peter Zijlstra, linux-mm,
	Benjamin Herrenschmidt, steiner, linux-kernel, Avi Kivity,
	kvm-devel, daniel.blueman, Hugh Dickins

One possible way that XPmem could deal with a call of 
invalidate_range_start with the lock flag set:

Scan through the rmaps you have for ptes. If you find one then elevate the 
refcount of the corresponding page and mark in the maps that you have done 
so. Also make them readonly. The increased refcount will prevent the 
freeing of the page. The page will be unmapped from the process and XPmem 
will retain the only reference.

Then some shepherding process that you have anyway with XPmem can 
sometime later zap the remote ptes and free the pages. That would leave stale 
data visible on the remote side for a while. Would that be okay?

This would only be used for truncate that uses the unmap_mapping_range 
call. So we are not in reclaim or other distress.




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  1:46                                 ` Christoph Lameter
  2008-01-31  2:34                                   ` Robin Holt
@ 2008-01-31 10:52                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 10:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 05:46:21PM -0800, Christoph Lameter wrote:
> Well the GRU uses follow_page() instead of get_user_pages. Performance is 
> a major issue for the GRU. 

GRU is an external TLB; we have to allocate RAM instead, but we do it
through the regular userland paging mechanism. Performance is a major
issue for kvm too, but the result of get_user_pages is used to fill a
spte, so the cpu will then use the spte in hardware to fill its
tlb; we won't have to keep calling follow_page in software to fill the
tlb like the GRU has to do, so you can imagine the difference in cpu
utilization spent in those paths (plus our requirement to allocate
memory).

> Hmmmm.. Could we go to a scheme where we do not have to increase the page 
> count? Modifications of the page struct require dirtying a cache line and 

I doubt the atomic_inc is measurable given the rest of overhead like
building the rmap for each new spte.

There's no technical reason for not wanting proper reference counting
other than microoptimization. What will work for GRU will work for KVM
too regardless of whatever reference counting. Each mmu-notifier user
should be free to do what it thinks is better/safer or more
convenient (and for anybody calling get_user_pages having the
refcounting on external references is natural and zero additional
cost).

> it seems that we do not need an increased page count if we have an
> invalidate_range_start() that clears all the external references 
> and stops the establishment of new ones and invalidate_range_end() that 
> reenables new external references?
> 
> Then we do not need the frequent invalidate_page() calls.

The increased page count is _mandatory_ to safely use range_start/end
called outside the locks with _end called after releasing the old
page. sptes will build themselves the whole time until the pte_clear is
called on the main linux pte. We don't want to clutter the VM fast
paths with additional locks to stop the kvm pagefault while the VM is
in the _range_start/end critical section like xpmem has to do to be
safe. So you're contradicting yourself by suggesting not to use
invalidate_page and not to use an increased page count at the same
time. And I need invalidate_page anyway for rmap.c which can't be
provided as an invalidate_range and it can't sleep either.
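
A compact sketch of the pinning discipline described here (the zap helper
is assumed, and the get_user_pages() call uses the 2.6.24-era signature;
this is illustrative, not the real KVM code):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>

/* Assumed helper: removes the spte for address and returns the page it
 * pointed at, or NULL if there was none. */
struct page *kvm_style_zap_spte(struct mm_struct *mm, unsigned long address);

/* Mapping side: the extra reference taken by get_user_pages() when the
 * spte is created keeps the page alive no matter what the VM does. */
static struct page *kvm_style_map_one(struct mm_struct *mm,
				      unsigned long address)
{
	struct page *page;

	if (get_user_pages(current, mm, address, 1, 1, 0, &page, NULL) != 1)
		return NULL;
	/* ... install an spte pointing at page_to_pfn(page) ... */
	return page;
}

/* Teardown side: the pin is only dropped from the notifier, after the
 * spte is gone, so a stale spte can never point at a recycled page. */
static void kvm_style_invalidate_page(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long address)
{
	struct page *page = kvm_style_zap_spte(mm, address);

	if (page)
		put_page(page);
}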

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges
  2008-01-31  2:51                                     ` Christoph Lameter
@ 2008-01-31 13:39                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-31 13:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Peter Zijlstra, linux-mm, Benjamin Herrenschmidt,
	steiner, linux-kernel, Avi Kivity, kvm-devel, daniel.blueman,
	Robin Holt, Hugh Dickins

On Wed, Jan 30, 2008 at 06:51:26PM -0800, Christoph Lameter wrote:
> True. hlist_del_init ok? That would allow the driver to check that the
> mmu_notifier is already linked in using !hlist_unhashed(). The driver then
> needs to properly initialize the mmu_notifier list with INIT_HLIST_NODE().

A driver couldn't possibly care about the mmu notifier anymore at that
point; we just agreed a moment ago that the list can't change under
mmu_notifier_release, and in turn no driver could possibly call
mmu_notifier_unregister/register at that point anymore, regardless of
the outcome of hlist_unhashed. External serialization must let the
driver know it's done with the notifiers.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
                     ` (3 preceding siblings ...)
  2008-01-29 16:07   ` Robin Holt
@ 2008-02-05 18:05   ` Andy Whitcroft
  2008-02-05 18:17     ` Peter Zijlstra
  2008-02-05 18:19     ` Christoph Lameter
  4 siblings, 2 replies; 97+ messages in thread
From: Andy Whitcroft @ 2008-02-05 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Mon, Jan 28, 2008 at 12:28:41PM -0800, Christoph Lameter wrote:
> Core code for mmu notifiers.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
> 
> ---
>  include/linux/list.h         |   14 ++
>  include/linux/mm_types.h     |    6 +
>  include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/page-flags.h   |   10 ++
>  kernel/fork.c                |    2 
>  mm/Kconfig                   |    4 
>  mm/Makefile                  |    1 
>  mm/mmap.c                    |    2 
>  mm/mmu_notifier.c            |  101 ++++++++++++++++++++
>  9 files changed, 350 insertions(+)
> 
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/mm_types.h	2008-01-28 11:35:22.000000000 -0800
> @@ -153,6 +153,10 @@ struct vm_area_struct {
>  #endif
>  };
>  
> +struct mmu_notifier_head {
> +	struct hlist_head head;
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -219,6 +223,8 @@ struct mm_struct {
>  	/* aio bits */
>  	rwlock_t		ioctx_list_lock;
>  	struct kioctx		*ioctx_list;
> +
> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/include/linux/mmu_notifier.h	2008-01-28 11:43:03.000000000 -0800
> @@ -0,0 +1,210 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +/*
> + * MMU notifier
> + *
> + * Notifier functions for hardware and software that establish external
> + * references to pages of a Linux system. The notifier calls ensure that
> + * the external mappings are removed when the Linux VM removes memory ranges
> + * or individual pages from a process.
> + *
> + * These fall into two classes
> + *
> + * 1. mmu_notifier
> + *
> + * 	These are callbacks registered with an mm_struct. If mappings are
> + * 	removed from an address space then callbacks are performed.
> + * 	Spinlocks must be held in order to walk the reverse maps and the
> + * 	notifications are performed while the spinlock is held.
> + *
> + *
> + * 2. mmu_rmap_notifier
> + *
> + *	Callbacks for subsystems that provide their own rmaps. These
> + *	need to walk their own rmaps for a page. The invalidate_page
> + *	callback is outside of locks so that we are not in a strictly
> + *	atomic context (but we may be in a PF_MEMALLOC context if the
> + *	notifier is called from reclaim code) and are able to sleep.
> + *	Rmap notifiers need an extra page bit and are only available
> + *	on 64 bit platforms. It is up to the subsystem to mark pages
> + *	as PageExternalRmap as needed to trigger the callbacks. Pages
> + *	must be marked dirty if dirty bits are set in the external
> + *	pte.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rcupdate.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier_ops;
> +
> +struct mmu_notifier {
> +	struct hlist_node hlist;
> +	const struct mmu_notifier_ops *ops;
> +};
> +
> +struct mmu_notifier_ops {
> +	/*
> +	 * Note: The mmu_notifier structure must be released with
> +	 * call_rcu() since other processors are only guaranteed to
> +	 * see the changes after a quiescent period.
> +	 */
> +	void (*release)(struct mmu_notifier *mn,
> +			struct mm_struct *mm);
> +
> +	int (*age_page)(struct mmu_notifier *mn,
> +			struct mm_struct *mm,
> +			unsigned long address);
> +
> +	void (*invalidate_page)(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long address);
> +
> +	/*
> +	 * lock indicates that the function is called under spinlock.
> +	 */
> +	void (*invalidate_range)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int lock);
> +};
> +
> +struct mmu_rmap_notifier_ops;
> +
> +struct mmu_rmap_notifier {
> +	struct hlist_node hlist;
> +	const struct mmu_rmap_notifier_ops *ops;
> +};
> +
> +struct mmu_rmap_notifier_ops {
> +	/*
> +	 * Called with the page lock held after ptes are modified or removed
> +	 * so that a subsystem with its own rmap's can remove remote ptes
> +	 * mapping a page.
> +	 */
> +	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
> +						struct page *page);
> +};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * Must hold the mmap_sem for write.
> + *
> + * RCU is used to traverse the list. A quiescent period needs to pass
> + * before the notifier is guaranteed to be visible to all threads
> + */
> +extern void __mmu_notifier_register(struct mmu_notifier *mn,
> +				  struct mm_struct *mm);
> +/* Will acquire mmap_sem for write */
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> +				  struct mm_struct *mm);
> +/*
> + * Will acquire mmap_sem for write.
> + *
> + * A quiescent period needs to pass before the mmu_notifier structure
> + * can be released. mmu_notifier_release() will wait for a quiescent period
> + * after calling the ->release callback. So it is safe to call
> + * mmu_notifier_unregister from the ->release function.
> + */
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> +				    struct mm_struct *mm);
> +
> +
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +extern int mmu_notifier_age_page(struct mm_struct *mm,
> +				 unsigned long address);
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> +	INIT_HLIST_HEAD(&mnh->head);
> +}
> +
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		struct mmu_notifier *__mn;				\
> +		struct hlist_node *__n;					\
> +									\
> +		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> +			rcu_read_lock();				\
> +			hlist_for_each_entry_rcu(__mn, __n,		\
> +					     &(mm)->mmu_notifier.head,	\
> +					     hlist)			\
> +				if (__mn->ops->function)		\
> +					__mn->ops->function(__mn,	\
> +							    mm,		\
> +							    args);	\
> +			rcu_read_unlock();				\
> +		}							\
> +	} while (0)
> +
> +extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
> +extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
> +
> +extern struct hlist_head mmu_rmap_notifier_list;
> +
> +#define mmu_rmap_notifier(function, args...)				\
> +	do {								\
> +		struct mmu_rmap_notifier *__mrn;			\
> +		struct hlist_node *__n;					\
> +									\
> +		rcu_read_lock();					\
> +		hlist_for_each_entry_rcu(__mrn, __n,			\
> +				&mmu_rmap_notifier_list, 		\
> +						hlist)			\
> +			if (__mrn->ops->function)			\
> +				__mrn->ops->function(__mrn, args);	\
> +		rcu_read_unlock();					\
> +	} while (0);
> +
> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		if (0) {						\
> +			struct mmu_notifier *__mn;			\
> +									\
> +			__mn = (struct mmu_notifier *)(0x00ff);		\
> +			__mn->ops->function(__mn, mm, args);		\
> +		};							\
> +	} while (0)
> +
> +#define mmu_rmap_notifier(function, args...)				\
> +	do {								\
> +		if (0) {						\
> +			struct mmu_rmap_notifier *__mrn;		\
> +									\
> +			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
> +			__mrn->ops->function(__mrn, args);		\
> +		}							\
> +	} while (0);
> +
> +static inline void mmu_notifier_register(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_release(struct mm_struct *mm) {}
> +static inline int mmu_notifier_age_page(struct mm_struct *mm,
> +				unsigned long address)
> +{
> +	return 0;
> +}
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
> +
> +static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +									{}
> +static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +									{}
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/page-flags.h	2008-01-28 11:35:22.000000000 -0800
> @@ -105,6 +105,7 @@
>   * 64 bit  |           FIELDS             | ??????         FLAGS         |
>   *         63                            32                              0
>   */
> +#define PG_external_rmap	30	/* Page has external rmap */
>  #define PG_uncached		31	/* Page has been mapped as uncached */
>  #endif
>  
> @@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
>  #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
>  #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
>  
> +#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
> +#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
> +#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
> +#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
> +							&(page)->flags)
> +#else
> +#define PageExternalRmap(page)	0
> +#endif
> +
>  struct page;	/* forward declaration */
>  
>  extern void cancel_dirty_page(struct page *page, unsigned int account_size);
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/Kconfig	2008-01-28 11:35:22.000000000 -0800
> @@ -193,3 +193,7 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> +	def_bool y
> +	bool "MMU notifier, for paging KVM/RDMA"
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/Makefile	2008-01-28 11:35:22.000000000 -0800
> @@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>  
> Index: linux-2.6/mm/mmu_notifier.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/mmu_notifier.c	2008-01-28 11:35:22.000000000 -0800
> @@ -0,0 +1,101 @@
> +/*
> + *  linux/mm/mmu_notifier.c
> + *
> + *  Copyright (C) 2008  Qumranet, Inc.
> + *  Copyright (C) 2008  SGI
> + *  		Christoph Lameter <clameter@sgi.com>
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n, *t;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);

Does this ->release actually release the 'mn' and its associated hlist?
I see in this thread that this ordering is deemed "use after free" which
implies so.

If it does that seems wrong.  This is an RCU hlist, therefore the list
integrity must be maintained through the next grace period in case there
are parallell readers using the element, in particular its forward
pointer for traversal.
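
If a driver really does need to free the structure from its ->release(),
it would have to defer the actual kfree past the grace period, roughly as
in the sketch below (the container structure and names are invented here,
not taken from the patch):

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/mmu_notifier.h>

struct my_notifier {
	struct mmu_notifier mn;
	struct rcu_head rcu;
	/* ... driver private state ... */
};

static void my_free_notifier(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct my_notifier, rcu));
}

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_notifier *m = container_of(mn, struct my_notifier, mn);

	/* Tear down the driver's external mappings here; only free the
	 * notifier after a quiescent period has elapsed, so parallel
	 * RCU readers never see freed memory. */
	call_rcu(&m->rcu, my_free_notifier);
}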

> +			hlist_del(&mn->hlist);

For this to be updating the list, you must have some form of "write-side"
exclusion as these primitives are not "parallel write safe".  It would
be helpful for this routine to state what that write side exclusion is.

> +		}
> +		rcu_read_unlock();
> +		synchronize_rcu();
> +	}
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->age_page can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int young = 0;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_rcu(mn, n,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->age_page)
> +				young |= mn->ops->age_page(mn, mm, address);
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	return young;
> +}
> +
> +/*
> + * Note that all notifiers use RCU. The updates are only guaranteed to be
> + * visible to other processes after a RCU quiescent period!
> + */
> +void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	down_write(&mm->mmap_sem);
> +	__mmu_notifier_register(mn, mm);
> +	up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	down_write(&mm->mmap_sem);
> +	hlist_del_rcu(&mn->hlist);
> +	up_write(&mm->mmap_sem);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
> +
> +static DEFINE_SPINLOCK(mmu_notifier_list_lock);
> +HLIST_HEAD(mmu_rmap_notifier_list);
> +
> +void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_register);
> +
> +void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
> +{
> +	spin_lock(&mmu_notifier_list_lock);
> +	hlist_del_rcu(&mrn->hlist);
> +	spin_unlock(&mmu_notifier_list_lock);
> +}
> +EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
> +
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/kernel/fork.c	2008-01-28 11:35:22.000000000 -0800
> @@ -51,6 +51,7 @@
>  #include <linux/random.h>
>  #include <linux/tty.h>
>  #include <linux/proc_fs.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> @@ -359,6 +360,7 @@ static struct mm_struct * mm_init(struct
>  
>  	if (likely(!mm_alloc_pgd(mm))) {
>  		mm->def_flags = 0;
> +		mmu_notifier_head_init(&mm->mmu_notifier);
>  		return mm;
>  	}
>  	free_mm(mm);
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/mm/mmap.c	2008-01-28 11:37:53.000000000 -0800
> @@ -26,6 +26,7 @@
>  #include <linux/mount.h>
>  #include <linux/mempolicy.h>
>  #include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
>  	vm_unacct_memory(nr_accounted);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
>  	tlb_finish_mmu(tlb, 0, end);
> +	mmu_notifier_release(mm);
>  
>  	/*
>  	 * Walk the list again, actually closing and freeing it,
> Index: linux-2.6/include/linux/list.h
> ===================================================================
> --- linux-2.6.orig/include/linux/list.h	2008-01-28 11:35:20.000000000 -0800
> +++ linux-2.6/include/linux/list.h	2008-01-28 11:35:22.000000000 -0800
> @@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
>  		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
>  	     pos = pos->next)
>  
> +/**
> + * hlist_for_each_entry_safe_rcu	- iterate over list of given type
> + * @tpos:	the type * to use as a loop cursor.
> + * @pos:	the &struct hlist_node to use as a loop cursor.
> + * @n:		temporary pointer
> + * @head:	the head for your list.
> + * @member:	the name of the hlist_node within the struct.
> + */
> +#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
> +	for (pos = (head)->first;					 \
> +	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
> +		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
> +	     pos = n)
> +
>  #else
>  #warning "don't include kernel headers in userspace"
>  #endif /* __KERNEL__ */

I am not sure it makes sense to add a _safe_rcu variant.  As I understand
things an _safe variant is used where we are going to remove the current
list element in the middle of a list walk.  However the key feature of an
RCU data structure is that it will always be in a "safe" state until any
parallel readers have completed.  For an hlist this means that the removed
entry and its forward link must remain valid for as long as there may be
a parallel reader traversing this list, ie. until the next grace period.
If this link is valid for the parallel reader, then it must be valid for
us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
to cope in the face of entries being unlinked as we traverse the list.

-apw

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-05 18:05   ` Andy Whitcroft
@ 2008-02-05 18:17     ` Peter Zijlstra
  2008-02-05 18:19     ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2008-02-05 18:17 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins


On Tue, 2008-02-05 at 18:05 +0000, Andy Whitcroft wrote:

> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> 
> Does this ->release actually release the 'mn' and its associated hlist?
> I see in this thread that this ordering is deemed "use after free" which
> implies so.
> 
> If it does that seems wrong.  This is an RCU hlist, therefore the list
> integrity must be maintained through the next grace period in case there
> are parallell readers using the element, in particular its forward
> pointer for traversal.

That is not quite so, list elements must be preserved, not the list
order.

> 
> > +			hlist_del(&mn->hlist);
> 
> For this to be updating the list, you must have some form of "write-side"
> exclusion as these primitives are not "parallel write safe".  It would
> be helpful for this routine to state what that write side exclusion is.

Yeah, has been noticed, read on in the thread :-)

> I am not sure it makes sense to add a _safe_rcu variant.  As I understand
> things an _safe variant is used where we are going to remove the current
> list element in the middle of a list walk.  However the key feature of an
> RCU data structure is that it will always be in a "safe" state until any
> parallel readers have completed.  For an hlist this means that the removed
> entry and its forward link must remain valid for as long as there may be
> a parallel reader traversing this list, ie. until the next grace period.
> If this link is valid for the parallel reader, then it must be valid for
> us, and if so it feels that hlist_for_each_entry_rcu should be sufficient
> to cope in the face of entries being unlinked as we traverse the list.

It does make sense, hlist_del_rcu() maintains the fwd reference, but it
does unlink it from the list proper. As long as there is a write side
exclusion around the actual removal as you noted.

rcu_read_lock();
hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member) {

	if (foo) {
		spin_lock(write_lock);
		hlist_del_rcu(pos);
		spin_unlock(write_lock);
	}
}
rcu_read_unlock();

is a safe construct in that the list itself stays a proper list, and
even readers that might be caught on to-be-deleted entries will have
a fwd way out.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-05 18:05   ` Andy Whitcroft
  2008-02-05 18:17     ` Peter Zijlstra
@ 2008-02-05 18:19     ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-02-05 18:19 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Tue, 5 Feb 2008, Andy Whitcroft wrote:

> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> 
> Does this ->release actually release the 'mn' and its associated hlist?
> I see in this thread that this ordering is deemed "use after free" which
> implies so.

Right that was fixed in a later release and discussed extensively later. 
See V5.

> I am not sure it makes sense to add a _safe_rcu variant.  As I understand
> things an _safe variant is used where we are going to remove the current

It was dropped in V5.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-15  6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-02-16  3:37   ` Andrew Morton
@ 2008-02-18 22:33   ` Roland Dreier
  1 sibling, 0 replies; 97+ messages in thread
From: Roland Dreier @ 2008-02-18 22:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

It seems that we've come up with two reasonable cases where it makes
sense to use these notifiers for InfiniBand/RDMA:

First, the ability to safely DMA to/from userspace memory with the
memory regions mlock()ed but the pages not pinned.  In this case the
notifiers here would seem to suit us well:

 > +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
 > +				 struct mm_struct *mm,
 > +				 unsigned long start, unsigned long end,
 > +				 int atomic);
 > +
 > +	void (*invalidate_range_end)(struct mmu_notifier *mn,
 > +				 struct mm_struct *mm,
 > +				 unsigned long start, unsigned long end,
 > +				 int atomic);

If I understand correctly, the IB stack would have to get the hardware
driver to shoot down translation entries and suspend access to the
region when an invalidate_range_begin notifier is called, and wait for
the invalidate_range_end notifier to repopulate the adapter
translation tables.  This will probably work OK as long as the
interval between the invalidate_range_begin and invalidate_range_end
calls is not "too long."
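
In sketch form that would be something like the following (the hca_*
helpers are invented placeholders here, not any existing IB driver API):

static void rdma_invalidate_range_begin(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end, int atomic)
{
	/* Quiesce DMA into [start, end) and shoot down the adapter's
	 * translation entries for that range. */
	hca_block_access(mn, start, end);
	hca_flush_translations(mn, start, end);
}

static void rdma_invalidate_range_end(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end, int atomic)
{
	/* Let translations be re-established (e.g. on the next adapter
	 * page fault) now that the range is stable again. */
	hca_allow_access(mn, start, end);
}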

Also, using this effectively requires us to figure out how we want to
mlock() regions that are going to be used for RDMA.  We could require
userspace to do it, but it's not clear to me that we're safe in the
case where userspace decides not to... what happens if some pages get
swapped out after the invalidate_range_begin notifier?

The second case where some form of notifiers are useful is for
userspace to know when a memory registration is still valid, ie Pete
Wyckoff's work:

    http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf
    http://www.osc.edu/~pw/dreg/

however these MMU notifiers seem orthogonal to that: the registration
cache is concerned with address spaces, not page mapping, and hence
the existing vma operations seem to be a better fit.

 - R.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-17  3:01       ` Andrea Arcangeli
@ 2008-02-17 12:24         ` Robin Holt
  0 siblings, 0 replies; 97+ messages in thread
From: Robin Holt @ 2008-02-17 12:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Andrew Morton, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman

On Sun, Feb 17, 2008 at 04:01:20AM +0100, Andrea Arcangeli wrote:
> On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> > On Fri, 15 Feb 2008, Andrew Morton wrote:
> > 
> > > What is the status of getting infiniband to use this facility?
> > 
> > Well we are talking about this it seems.
> 
> It seems the IB folks think allowing RDMA over virtual memory is not
> interesting, their argument seems to be that RDMA is only interesting
> on RAM (and they seem not interested in allowing RDMA over a ram+swap
> backed _virtual_ memory allocation). They just have to decide if
> ram+swap allocation for RDMA is useful or not.

I don't think that is a completely fair characterization.  It would be
more fair to say that the changes required to their library/user api
would be too significant to allow an adaptation to any scheme which
allowed removal of physical memory below a virtual mapping.

I agree with the IB folks when they say it is impossible with their
current scheme.  The fact that any consumer of their endpoint identifier
can use any identifier without notifying the kernel prior to its use
certainly makes any implementation under any scheme impossible.

I guess we could possibly make things work for IB if we did some heavy
work.  Let's assume, instead of passing around the physical endpoint
identifiers, they passed around a handle.  In order for any IB endpoint
to communicate, it would need to request the kernel translate a handle
into an endpoint identifier.  In order for the kernel to put a TLB
entry into the processes address space allowing the process access to
the _CARD_, it would need to ensure all the current endpoint identifiers
for this process were "active" meaning we have verified with the other
endpoint that all pages are faulted and TLB/PFN information is in the
owning card's TLB/PFN tables.  Once all of a process's endpoints are
"active" we would drop the PFN for the adapter into the page tables.
Any time pages are being revoked from under an active handle, we would
shoot-down the IB adapter card TLB entries for all the remote users of
this handle and quiesce the cards state to ensure transfers are either
complete or terminated.  When there are no active transfers, we would
respond back to the owner and they could complete the source process
page table cleaning.  Any time all of the pages for a handle cannot be
mapped from virtual to physical, the remote process would be SIGBUS'd
instead of having its IB adapter TLB entry installed.

This is essentially how XPMEM does it except we have the benefit of
working on individual pages.

Again, not knowing what I am talking about, but under the assumption that
MPI IB use is contained to a library, I would hope the changes could be
contained under the MPI-to-IB library interface and would not need any
changes at the MPI-user library interface.

We do keep track of the virtual address ranges within a handle that
are being used.  I assume the IB folks will find that helpful as well.
Otherwise, I think they could make things operate this way.  XPMEM has
the advantage of not needing to have virtual-to-physical at all times,
but otherwise it is essentially the same.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code 
  2008-02-16  3:37   ` Andrew Morton
                       ` (2 preceding siblings ...)
  2008-02-16 19:21     ` Christoph Lameter
@ 2008-02-17  5:04     ` Doug Maxey
  3 siblings, 0 replies; 97+ messages in thread
From: Doug Maxey @ 2008-02-17  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, kvm-devel, Peter Zijlstra, general, Steve Wise,
	Roland Dreier, Kanoj Sarcar, steiner, linux-kernel, linux-mm,
	daniel.blueman, Ben Herrenschmidt, Jan-Bernd Themann


On Fri, 15 Feb 2008 19:37:19 PST, Andrew Morton wrote:
> Which other potential clients have been identified and how important is it
> to those?

The powerpc ehea utilizes its own mmu.  Not sure about the importance 
to the driver. (But will investigate :)

++doug


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16 19:21     ` Christoph Lameter
@ 2008-02-17  3:01       ` Andrea Arcangeli
  2008-02-17 12:24         ` Robin Holt
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-02-17  3:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Sat, Feb 16, 2008 at 11:21:07AM -0800, Christoph Lameter wrote:
> On Fri, 15 Feb 2008, Andrew Morton wrote:
> 
> > What is the status of getting infiniband to use this facility?
> 
> Well we are talking about this it seems.

It seems the IB folks think allowing RDMA over virtual memory is not
interesting, their argument seems to be that RDMA is only interesting
on RAM (and they seem not interested in allowing RDMA over a ram+swap
backed _virtual_ memory allocation). They just have to decide if
ram+swap allocation for RDMA is useful or not.

> > How important is this feature to KVM?
> 
> Andrea can answer this.

I think I already did in separate email.

> > That sucks big time.  What do we need to do to make get the callback
> > functions called in non-atomic context?

I sure agree, given I also asked to drop the lock param and enforce the
invalidate_range_* to always be called in non-atomic context.

> We would have to drop the inode_mmap_lock. Could be done with some minor 
> work.

The invalidate may be deferred until after releasing the lock; the lock
may not have to be dropped to clean up the API (and make xpmem's life
easier).

> That is one implementation (XPmem does that). The other is to simply stop 
> all references when any invalidate_range is in progress (KVM and GRU do 
> that).

KVM doesn't stop new references. It doesn't need to because it holds a
reference on the page (GRU doesn't). KVM can invalidate the spte and
flush the tlb only after the linux pte has been cleared and after the
page has been released by the VM (because the page doesn't go in the
freelist and it remains pinned for a little while, until the spte is
dropped too inside invalidate_range_end). GRU has to invalidate
_before_ the linux pte is cleared so it has to stop new references
from being established in the invalidate_range_start/end critical
section.
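
Sketching the GRU-style side of it (invented names, and simplified: a
real driver needs to serialize the fault path against a concurrent
_start, this bare counter alone isn't enough):

static atomic_t invalidate_count = ATOMIC_INIT(0);

static void gru_style_range_start(struct mmu_notifier *mn,
				  struct mm_struct *mm,
				  unsigned long start, unsigned long end,
				  int atomic)
{
	/* Block establishment of new external references, then drop
	 * the existing ones before the linux ptes are cleared. */
	atomic_inc(&invalidate_count);
	my_flush_external_tlb(mm, start, end);	/* invented placeholder */
}

static void gru_style_range_end(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long start, unsigned long end,
				int atomic)
{
	atomic_dec(&invalidate_count);
}

static int gru_style_fault(struct mm_struct *mm, unsigned long address)
{
	/* No new external reference may be established while an
	 * invalidate is in flight; make the fault retry later. */
	if (atomic_read(&invalidate_count))
		return -EAGAIN;
	return my_load_external_tlb(mm, address);	/* uses follow_page() */
}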

> Andrea put this in to check the reference status of a page. It functions 
> like the accessed bit.

In short each pte can have some spte associated to it. So whenever we
do a ptep_clear_flush protected by the PT lock, we also have to run
invalidate_page that will internally invoke a sort-of
sptep_clear_flush protected by a kvm->mmu_lock (equivalent of
page_table_lock/PT-lock). sptes just like ptes maps virtual addresses
to physical addresses, so you can read/write to RAM either through a
pte or through a spte.

Just like it would be insane to have any requirement that
ptep_clear_flush has to run in not-atomic context (forcing a
conversion of the PT lock to a mutex), it's also weird to require
invalidate_page/age_page to run in not-atomic context.

All troubles start with the xpmem requirements of having to schedule
in its equivalent of the sptep_clear_flush because it's not a
gigahertz-in-cpu thing but a gigabit thing where the network stack is
involved with its own software linux driven skb memory allocations,
schedules waiting for network I/O, etc... Imagine ptes allocated in a
remote node; no surprise it brings a new set of problems (assuming it
can work reliably during oom given its memory requirements in the
try_to_unmap path, no page can ever be freed until the skbs have been
allocated and sent and allocated again to receive the ack).

Furthermore xpmem doesn't associate any pte to a spte, it associates a
page_t to certain remote references, or it would be in trouble with
invalidate_page that corresponds to ptep_clear_flush on a virtual
address that exists thanks to the anon_vma/i_mmap lock held (and not
thanks to the mmap_sem like in all invalidate_range calls).

Christoph's patch is a mix of two entirely separated features. KVM can
live with V7 just fine, but it's a lot more than what is needed by KVM.

I don't think that invalidate_page/age_page must be allowed to sleep
just because invalidate_range also can sleep. You just have to ask
yourself whether the VM locks shall remain spinlocks, for the VM's own
good (not for the mmu notifiers' good). It'd be bad to make the VM
underperform with mutexes protecting tiny critical sections to please
some mmu notifier user. But if they're spinlocks, then clearly
invalidate_page/age_page based on virtual addresses can't sleep or the
virtual address wouldn't make sense anymore by the time the spinlock
is released.

> > This function looks like it was tossed in at the last minute.  It's
> > mysterious, undocumented, poorly commented, poorly named.  A better name
> > would be one which has some correlation with the return value.
> > 
> > Because anyone who looks at some code which does
> > 
> > 	if (mmu_notifier_age_page(mm, address))
> > 		...
> > 
> > has to go and reverse-engineer the implementation of
> > mmu_notifier_age_page() to work out under which circumstances the "..."
> > will be executed.  But this should be apparent just from reading the callee
> > implementation.
> > 
> > This function *really* does need some documentation.  What does it *mean*
> > when the ->age_page() from some of the notifiers returned "1" and the
> > ->age_page() from some other notifiers returned zero?  Dunno.
> 
> Andrea: Could you provide some more detail here?

age_page is simply the ptep_clear_flush_young equivalent for
sptes. It's meant to provide aging to the pages mapped by secondary
mmus. Its return value is the same as that of ptep_clear_flush_young,
but it represents the sptes associated with the pte;
ptep_clear_flush_young instead only takes care of the pte itself.
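
A secondary MMU's ->age_page() would then look roughly like this (a
sketch; the my_* names are invented):

static DEFINE_SPINLOCK(my_secondary_mmu_lock);

static int my_age_page(struct mmu_notifier *mn, struct mm_struct *mm,
		       unsigned long address)
{
	int young;

	/* Test and clear the accessed state of the sptes mapping this
	 * address, mirroring what ptep_clear_flush_young does for the
	 * linux pte. */
	spin_lock(&my_secondary_mmu_lock);
	young = my_spte_test_and_clear_young(mm, address);	/* invented */
	spin_unlock(&my_secondary_mmu_lock);

	/* Non-zero means the page was recently used through the
	 * secondary MMU, so the VM should not consider it idle. */
	return young;
}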

For KVM the below would be all that is needed, the fact
invalidate_range can sleep and invalidate_page/age_page can't, is
because their users are very different. With my approach the mmu
notifiers callback are always protected by the PT lock (just like
ptep_clear_flush and the other pte+tlb manglings) and they're called
after the pte is cleared and before the VM reference on the page has
been dropped. That makes it safe for GRU too, so for my initial
approach _none_ of the callbacks was allowed to sleep, and that was a
feature that allows GRU not to block its tlb miss interrupt with any
further locking (the PT-lock taken by follow_page automatically
serialized the GRU interrupt against the MMU notifiers and the linux
page fault). For KVM the invalidate_pages of my patch is converted to
invalidate_range_end because it doesn't matter for KVM if it's called
after the PT lock has been dropped. In the try_to_unmap case
invalidate_page is called in atomic context in Christoph's patch too,
because a virtual address and in turn a pte and in turn certain sptes,
can only exist thanks to the spinlocks taken by the VM. Changing the
VM to make mmu notifiers sleepable in the try_to_unmap path sounds bad
to me, especially given not even xpmem needs this.

You can see how everything looks simpler and more symmetric by
assuming the secondary mmu-references are established and dropped like
ptes, like in the KVM case where in fact sptes are a pure cpu thing
exactly like the ptes. XPMEM adds the requirement that sptes are in
fact remote entities that are mangled by a message passing protocol
over the network; it's the same as ptep_clear_flush being required to
schedule and send skbs to be successful and allowing try_to_unmap to
do its work. Same problem. No wonder the patch gets more complicated then.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -46,6 +46,7 @@
 	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
 	if (__young)							\
 		flush_tlb_page(__vma, __address);			\
+	__young |= mmu_notifier_age_page((__vma)->vm_mm, __address);	\
 	__young;							\
 })
 #endif
@@ -86,6 +87,7 @@ do {									\
 	pte_t __pte;							\
 	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
 	flush_tlb_page(__vma, __address);				\
+	mmu_notifier(invalidate_page, (__vma)->vm_mm, __address);	\
 	__pte;								\
 })
 #endif
diff --git a/include/asm-s390/pgtable.h b/include/asm-s390/pgtable.h
--- a/include/asm-s390/pgtable.h
+++ b/include/asm-s390/pgtable.h
@@ -712,6 +712,7 @@ static inline pte_t ptep_clear_flush(str
 {
 	pte_t pte = *ptep;
 	ptep_invalidate(address, ptep);
+	mmu_notifier(invalidate_page, vma->vm_mm, address);
 	return pte;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/mmu_notifier.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -219,6 +220,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,132 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct mmu_notifier;
+
+struct mmu_notifier_ops {
+	/*
+	 * Called when nobody can register any more notifier in the mm
+	 * and after the "mn" notifier has been disarmed already.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * invalidate_page[s] is called in atomic context
+	 * after any pte has been updated and before
+	 * dropping the PT lock required to update any Linux pte.
+	 * Once the PT lock will be released the pte will have its
+	 * final value to export through the secondary MMU.
+	 * Before this is invoked any secondary MMU is still ok
+	 * to read/write to the page previously pointed by the
+	 * Linux pte because the old page hasn't been freed yet.
+	 * If required set_page_dirty has to be called internally
+	 * to this method.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+	void (*invalidate_pages)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
+
+	/*
+	 * Age page is called in atomic context inside the PT lock
+	 * right after the VM is test-and-clearing the young/accessed
+	 * bitflag in the pte. This way the VM will provide proper aging
+	 * to the accesses to the page through the secondary MMUs
+	 * and not only to the ones through the Linux pte.
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+};
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_head {
+	struct hlist_head head;
+	spinlock_t lock;
+};
+
+#include <linux/mm_types.h>
+
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads.
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the "struct mmu_notifier" can be freed. Alternatively it
+ * can be synchronously freed inside ->release when the list can't
+ * change anymore and nobody could possibly walk it.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+	spin_lock_init(&mnh->lock);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+						 &(mm)->mmu_notifier.head, \
+						 hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+struct mmu_notifier_head {};
+
+#define mmu_notifier_register(mn, mm) do {} while(0)
+#define mmu_notifier_unregister(mn, mm) do {} while (0)
+#define mmu_notifier_release(mm) do {} while (0)
+#define mmu_notifier_age_page(mm, address) ({ 0; })
+#define mmu_notifier_head_init(mmh) do {} while (0)
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)			       \
+	do {							       \
+		if (0) {					       \
+			struct mmu_notifier *__mn;		       \
+								       \
+			__mn = (struct mmu_notifier *)(0x00ff);	       \
+			__mn->ops->function(__mn, mm, args);	       \
+		};						       \
+	} while (0)
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -360,6 +360,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,3 +193,7 @@ config VIRT_TO_BUS
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -756,6 +756,7 @@ void __unmap_hugepage_range(struct vm_ar
 		if (pte_none(pte))
 			continue;
 
+		mmu_notifier(invalidate_page, mm, address);
 		page = pte_page(pte);
 		if (pte_dirty(pte))
 			set_page_dirty(page);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -494,6 +494,7 @@ static int copy_pte_range(struct mm_stru
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	unsigned long start;
 
 again:
 	rss[1] = rss[0] = 0;
@@ -505,6 +506,7 @@ again:
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 	arch_enter_lazy_mmu_mode();
 
+	start = addr;
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -525,6 +527,8 @@ again:
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier(invalidate_pages, vma->vm_mm, start, addr);
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
@@ -660,6 +664,7 @@ static unsigned long zap_pte_range(struc
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			mmu_notifier(invalidate_page, mm, addr);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1248,6 +1253,7 @@ static int remap_pte_range(struct mm_str
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
@@ -1259,6 +1265,7 @@ static int remap_pte_range(struct mm_str
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	mmu_notifier(invalidate_pages, mm, start, addr);
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2044,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
new file mode 100644
--- /dev/null
+++ b/mm/mmu_notifier.c
@@ -0,0 +1,73 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *             Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+#include <linux/rcupdate.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *tmp;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, tmp,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					 &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mm->mmu_notifier.lock);
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+	spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	spin_lock(&mm->mmu_notifier.lock);
+	hlist_del_rcu(&mn->hlist);
+	spin_unlock(&mm->mmu_notifier.lock);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@ static void change_pte_range(struct mm_s
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -71,6 +72,7 @@ static void change_pte_range(struct mm_s
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
+	mmu_notifier(invalidate_pages, mm, start, addr);
 	pte_unmap_unlock(pte - 1, ptl);
 }
 


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16 10:58       ` Andrew Morton
@ 2008-02-16 19:31         ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Brice Goglin, Andrea Arcangeli, linux-kernel, linux-mm

On Sat, 16 Feb 2008, Andrew Morton wrote:

> "looks good" maybe.  But it's in the details where I fear this will come
> unstuck.  The likelihood that some callbacks really will want to be able to
> block in places where this interface doesn't permit that - either to wait
> for IO to complete or to wait for other threads to clear critical regions.

We can get the invalidate_range to always be called without spinlocks if
we deal with the case of the inode_mmap_lock being held in the truncate case.

If you always want to be able to sleep then we could drop the
invalidate_page() that is called while pte locks are held and require the
use of a device driver rmap?

> From that POV it doesn't look like a sufficiently general and useful
> design.  Looks like it was grafted onto the current VM implementation in a
> way which just about suits two particular clients if they try hard enough.

You missed KVM. We did the best we could while being as minimally invasive
as possible.

> Which is all perfectly understandable - it would be hard to rework core MM
> to be able to make this interface more general.  But I do think it's
> half-baked and there is a decent risk that future (or present) code which
> _could_ use something like this won't be able to use this one, and will
> continue to futz with mlock, page-pinning, etc.
> 
> Not that I know what the fix to that is..

You do not see a chance of this being okay if we adopt the two measures 
that I mentioned above?
 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  3:37   ` Andrew Morton
  2008-02-16  8:45     ` Avi Kivity
  2008-02-16 10:41     ` Brice Goglin
@ 2008-02-16 19:21     ` Christoph Lameter
  2008-02-17  3:01       ` Andrea Arcangeli
  2008-02-17  5:04     ` Doug Maxey
  3 siblings, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-02-16 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Fri, 15 Feb 2008, Andrew Morton wrote:

> What is the status of getting infiniband to use this facility?

Well we are talking about this it seems.
> 
> How important is this feature to KVM?

Andrea can answer this.

> To xpmem?

Without this feature we are stuck with page pinning by increasing
refcounts, which leads to endless lru scanning and other misbehavior.
Also, applications that use XPmem will not be able to swap or use
things like remap.
 
> Which other potential clients have been identified and how important is it
> to those?

It is likely important to various DMA engines, framebuffer devices, etc.
Seems to be a generally useful feature.


> > +The notifier chains provide two callback mechanisms. The
> > +first one is required for any device that establishes external mappings.
> > +The second (rmap) mechanism is required if a device needs to be
> > +able to sleep when invalidating references. Sleeping may be necessary
> > +if we are mapping across a network or to different Linux instances
> > +in the same address space.
> 
> I'd have thought that a major reason for sleeping would be to wait for IO
> to complete.  Worth mentioning here?

Right.

> Why is that "easy"?  I'd have thought that it would only be easy if the
> driver happened to be using those same locks for its own purposes. 
> Otherwise it is "awkward"?

It's relatively easy because it is tied directly to a process and can use
external tlb shootdown / external page table clearing directly. The other
method requires an rmap in the device driver where it can look up the
processes that are mapping the page.
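
Roughly like the sketch below; the per-page index and the my_* helpers
are invented here, the real structure is up to the driver:

/*
 * Sketch of the rmap-based method: the driver keeps its own
 * page -> external-mapping index and walks it on invalidate.
 */
struct my_external_mapping {
	struct list_head list;
	/* ... remote node / remote pte information ... */
};

static void my_rmap_invalidate_page(struct mmu_rmap_notifier *mrn,
				    struct page *page)
{
	struct my_external_mapping *m;

	/* my_rmap_lookup() returns the driver-private list of external
	 * mappings of this page (the driver's own rmap). */
	list_for_each_entry(m, my_rmap_lookup(page), list) {
		my_shoot_down_remote_tlb(m);
		if (my_remote_pte_dirty(m))
			set_page_dirty(page);
	}
}

static const struct mmu_rmap_notifier_ops my_rmap_ops = {
	.invalidate_page = my_rmap_invalidate_page,
};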
 
> > +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> > +called most of the time without any locks held. It is only called with
> > +locks held for file backed mappings that are truncated. A flag indicates
> > +in which mode we are. A driver can use that mechanism to f.e.
> > +delay the freeing of the pages during truncate until no locks are held.
> 
> That sucks big time.  What do we need to do to make get the callback
> functions called in non-atomic context?

We would have to drop the inode_mmap_lock. Could be done with some minor 
work.

> > +Pages must be marked dirty if dirty bits are found to be set in
> > +the external ptes during unmap.
> 
> That sentence is too vague.  Define "marked dirty"?

Call set_page_dirty().

> > +The *release* method is called when a Linux process exits. It is run before
> 
> We'd conventionally use a notation such as "->release()" here, rather than
> the asterisks.

Ok.

> 
> > +the pages and mappings of a process are torn down and gives the device driver
> > +a chance to zap all the external mappings in one go.
> 
> I assume what you mean here is that ->release() is called during exit()
> when the final reference to an mm is being dropped.

Right.

> > +An example for a code that can be used to build a notifier mechanism into
> > +a device driver can be found in the file
> > +Documentation/mmu_notifier/skeleton.c
> 
> Should that be in samples/?

Oh. We have that?

> > +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> > +*before* the Linux rmaps are walked. At that point only the page lock is
> > +held. The invalidate_page() function must walk the driver rmaps and evict
> > +all the references to the page.
> 
> What happens if it cannot do so?

The page is not reclaimed if we were called from try_to_unmap(). From 
page_mkclean() we must always evict the page to switch off the write 
protect bit.

> > +There is no process information available before the rmaps are consulted.
> 
> Not sure what that sentence means.  I guess "available to the core VM"?

At that point we only have the page. We do not know which processes map 
the page. In order to find out we need to take a spinlock.


> > +The notifier mechanism can therefore not be attached to an mm_struct. Instead
> > +it is a global callback list. Having to perform a callback for each and every
> > +page that is reclaimed would be inefficient. Therefore we add an additional
> > +page flag: PageRmapExternal().
> 
> How many page flags are left?

30 or so. It's only available on 64-bit.

> Is this feature important enough to justfy consumption of another one?
> 
> > Only pages that are marked with this bit can
> > +be exported and the rmap callbacks will only be performed for pages marked
> > +that way.
> 
> "exported": new term, unclear what it means.

Something external to the kernel references the page.

> > +The required additional Page flag is only availabe in 64 bit mode and
> > +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
> 
> whoa.  Is that good?  You just made your feature unavailable on the great
> majority of Linux systems.

rmaps are usually used by complex drivers that are typically used in large 
systems.

> > + * Notifier functions for hardware and software that establishes external
> > + * references to pages of a Linux system. The notifier calls ensure that
> > + * external mappings are removed when the Linux VM removes memory ranges
> > + * or individual pages from a process.
> 
> So the callee cannot fail.  hm.  If it can't block, it's likely screwed in
> that case.  In other cases it might be screwed anyway.  I suspect we'll
> need to be able to handle callee failure.

Probably.

> 
> > + * These fall into two classes:
> > + *
> > + * 1. mmu_notifier
> > + *
> > + * 	These are callbacks registered with an mm_struct. If pages are
> > + * 	removed from an address space then callbacks are performed.
> 
> "to be removed", I guess.  It's called before the page is actually removed?

It's called after the pte has been cleared, while the pte lock is still held.
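
So a call site in the VM ends up following roughly this pattern (sketch of
the pattern only, not a quote of patch 2/6):

	/* pte lock is held by the caller */
	entry = ptep_clear_flush(vma, address, pte);
	mmu_notifier(invalidate_page, mm, address);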

> > + * 	The invalidate_range_start/end callbacks can be performed in contexts
> > + * 	where sleeping is allowed or in atomic contexts. A flag is passed
> > + * 	to indicate an atomic context.
> 
> We generally would prefer separate callbacks, rather than a unified
> callback with a mode flag.

We could drop the inode_mmap_lock when doing truncate. That would make 
this work, but it's a rather invasive change for the VM.

> > +struct mmu_notifier_ops {
> > +	/*
> > +	 * The release notifier is called when no other execution threads
> > +	 * are left. Synchronization is not necessary.
> 
> "and the mm is about to be destroyed"?

Right.

> > +	/*
> > +	 * invalidate_range_begin() and invalidate_range_end() must be paired.
> > +	 *
> > +	 * Multiple invalidate_range_begin/ends may be nested or called
> > +	 * concurrently.
> 
> Under what circumstances would they be nested?

Hmmm... Right, they cannot be nested. However, multiple processors can have 
invalidate_range calls in progress concurrently.

> > That is legit. However, no new external references
> 
> references to what?

To the ranges that are in the process of being invalidated.

> > +	 * invalidate_range_begin() must clear all references in the range
> > +	 * and stop the establishment of new references.
> 
> and stop the establishment of new references within the range, I assume?

Right.
 
> If so, that's putting a heck of a lot of complexity into the driver, isn't
> it?  It needs to temporarily remember an arbitrarily large number of
> regions in this mm against which references may not be taken?

That is one implementation (XPmem does that). The other is to simply block 
the establishment of any new references while any invalidate_range is in 
progress (KVM and GRU do that).
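
The simpler scheme can be sketched as a per-driver count of invalidations
in progress; the driver's mapping path only establishes new external
references while that count is zero (illustrative only: the ext_* helpers
are made-up names, and the locking policy is the driver's, not the
patchset's):

static DEFINE_SPINLOCK(ext_lock);
static int ext_invalidates_in_progress;

static void my_invalidate_range_begin(struct mmu_notifier *mn,
		struct mm_struct *mm,
		unsigned long start, unsigned long end, int atomic)
{
	spin_lock(&ext_lock);
	ext_invalidates_in_progress++;
	ext_clear_references(mm, start, end);	/* hypothetical */
	spin_unlock(&ext_lock);
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
		struct mm_struct *mm,
		unsigned long start, unsigned long end, int atomic)
{
	spin_lock(&ext_lock);
	ext_invalidates_in_progress--;
	spin_unlock(&ext_lock);
}

/* Fault path: refuse to map anything while an invalidate is running. */
static int my_establish_reference(struct mm_struct *mm, unsigned long addr)
{
	int ret = 0;

	spin_lock(&ext_lock);
	if (ext_invalidates_in_progress)
		ret = -EAGAIN;			/* caller retries later */
	else
		ext_map_page(mm, addr);		/* hypothetical */
	spin_unlock(&ext_lock);
	return ret;
}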


> > +	 * invalidate_range_end() reenables the establishment of references.
> 
> within the range?

Right.

> > +extern void mmu_notifier_release(struct mm_struct *mm);
> > +extern int mmu_notifier_age_page(struct mm_struct *mm,
> > +				 unsigned long address);
> 
> There's the mysterious age_page again.

Andrea put this in to check the reference status of a page. It functions 
like the accessed bit.
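
In other words, a driver that keeps its own accessed bits could implement
it as a test-and-clear (sketch; ext_pte_lookup() and
ext_test_and_clear_young() are hypothetical driver helpers):

static int my_age_page(struct mmu_notifier *mn, struct mm_struct *mm,
		       unsigned long address)
{
	ext_pte_t *epte = ext_pte_lookup(mm, address);

	if (!epte)
		return 0;
	/*
	 * Return 1 if the device referenced the page since the last
	 * check, mirroring what pte_young() reports for Linux ptes.
	 */
	return ext_test_and_clear_young(epte);
}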

> > +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> > +{
> > +	INIT_HLIST_HEAD(&mnh->head);
> > +}
> > +
> > +#define mmu_notifier(function, mm, args...)				\
> > +	do {								\
> > +		struct mmu_notifier *__mn;				\
> > +		struct hlist_node *__n;					\
> > +									\
> > +		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> > +			rcu_read_lock();				\
> > +			hlist_for_each_entry_rcu(__mn, __n,		\
> > +					     &(mm)->mmu_notifier.head,	\
> > +					     hlist)			\
> > +				if (__mn->ops->function)		\
> > +					__mn->ops->function(__mn,	\
> > +							    mm,		\
> > +							    args);	\
> > +			rcu_read_unlock();				\
> > +		}							\
> > +	} while (0)
> 
> The macro references its args more than once.  Anyone who does
> 
> 	mmu_notifier(function, some_function_which_has_side_effects())
> 
> will get a surprise.  Use temporaries.

Ok.
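
Beyond caching mm in a temporary, the variadic args make a pure macro fix
awkward; one possible direction is a thin static inline per event so every
argument is evaluated exactly once. Sketched here for invalidate_page only,
not part of the posted patch:

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
						unsigned long address)
{
	struct mmu_notifier *mn;
	struct hlist_node *n;

	rcu_read_lock();
	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier.head, hlist)
		if (mn->ops->invalidate_page)
			mn->ops->invalidate_page(mn, mm, address);
	rcu_read_unlock();
}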

> > +#define mmu_notifier(function, mm, args...)				\
> > +	do {								\
> > +		if (0) {						\
> > +			struct mmu_notifier *__mn;			\
> > +									\
> > +			__mn = (struct mmu_notifier *)(0x00ff);		\
> > +			__mn->ops->function(__mn, mm, args);		\
> > +		};							\
> > +	} while (0)
> 
> That's a bit weird.  Can't we do the old
> 
> 	(void)function;
> 	(void)mm;
> 
> trick?  Or make it a static inline function?

A static inline won't allow checking of the parameters, since the callback 
is named by an ops member rather than passed as a value.

(void) may be a good thing here.
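
For reference, the (void) variant of the !CONFIG_MMU_NOTIFIER stub might
look like the sketch below; it keeps the callers warning-free but, unlike
the if (0) construct, gives up the compile-time checking of the callback
arguments:

#define mmu_notifier(function, mm, args...)				\
	do {								\
		(void)(mm);						\
	} while (0)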

> > +config MMU_NOTIFIER
> > +	def_bool y
> > +	bool "MMU notifier, for paging KVM/RDMA"
> 
> Why is this not selectable?  The help seems a bit brief.
> 
> Does this cause 32-bit systems to drag in a bunch of code they're not
> allowed to ever use?

I have selected it a number of times. We could make the help text a bit 
longer, right.


> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		hlist_for_each_entry_safe(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del_init(&mn->hlist);
> > +			if (mn->ops->release)
> > +				mn->ops->release(mn, mm);
> 
> We do this a lot, but back in the old days people didn't like optional
> callbacks which can be NULL.  If we expect that mmu_notifier_ops.release is
> usually implemented, then just unconditionally call it and require that all
> clients implement it.  Perhaps provide an exported-to-modules stub in core
> kernel for clients which didn't want to implement ->release().

Ok.
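
That might end up looking like the following sketch
(mmu_notifier_release_nop is an invented name for the exported stub that
clients without per-exit work would plug into ->release):

/* In mm/mmu_notifier.c: do-nothing ->release for simple clients. */
void mmu_notifier_release_nop(struct mmu_notifier *mn, struct mm_struct *mm)
{
}
EXPORT_SYMBOL_GPL(mmu_notifier_release_nop);

/* mmu_notifier_release() can then call ->release unconditionally: */
hlist_for_each_entry_safe(mn, n, t, &mm->mmu_notifier.head, hlist) {
	hlist_del_init(&mn->hlist);
	mn->ops->release(mn, mm);
}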

> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n;
> > +	int young = 0;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_rcu(mn, n,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			if (mn->ops->age_page)
> > +				young |= mn->ops->age_page(mn, mm, address);
> > +		}
> > +		rcu_read_unlock();
> > +	}
> > +
> > +	return young;
> > +}
> 
> should the rcu_read_lock() cover the hlist_empty() test?
> 
> This function looks like it was tossed in at the last minute.  It's
> mysterious, undocumented, poorly commented, poorly named.  A better name
> would be one which has some correlation with the return value.
> 
> Because anyone who looks at some code which does
> 
> 	if (mmu_notifier_age_page(mm, address))
> 		...
> 
> has to go and reverse-engineer the implementation of
> mmu_notifier_age_page() to work out under which circumstances the "..."
> will be executed.  But this should be apparent just from reading the callee
> implementation.
> 
> This function *really* does need some documentation.  What does it *mean*
> when the ->age_page() from some of the notifiers returned "1" and the
> ->age_page() from some other notifiers returned zero?  Dunno.

Andrea: Could you provide some more detail here?


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16 10:41     ` Brice Goglin
@ 2008-02-16 10:58       ` Andrew Morton
  2008-02-16 19:31         ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrew Morton @ 2008-02-16 10:58 UTC (permalink / raw)
  To: Brice Goglin; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm

On Sat, 16 Feb 2008 11:41:35 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote:

> Andrew Morton wrote:
> > What is the status of getting infiniband to use this facility?
> >
> > How important is this feature to KVM?
> >
> > To xpmem?
> >
> > Which other potential clients have been identified and how important is it
> > to those?
> >   
> 
> As I said when Andrea posted the first patch series, I used something
> very similar for non-RDMA-based HPC about 4 years ago. I haven't had
> time yet to look in depth and try the latest proposed API but my feeling
> is that it looks good.
> 

"looks good" maybe.  But it's in the details where I fear this will come
unstuck.  The likelihood is that some callbacks really will want to be able to
block in places where this interface doesn't permit that - either to wait
for IO to complete or to wait for other threads to clear critical regions.

From that POV it doesn't look like a sufficiently general and useful
design.  Looks like it was grafted onto the current VM implementation in a
way which just about suits two particular clients if they try hard enough.

Which is all perfectly understandable - it would be hard to rework core MM
to be able to make this interface more general.  But I do think it's
half-baked and there is a decent risk that future (or present) code which
_could_ use something like this won't be able to use this one, and will
continue to futz with mlock, page-pinning, etc.

Not that I know what the fix to that is..

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  3:37   ` Andrew Morton
  2008-02-16  8:45     ` Avi Kivity
@ 2008-02-16 10:41     ` Brice Goglin
  2008-02-16 10:58       ` Andrew Morton
  2008-02-16 19:21     ` Christoph Lameter
  2008-02-17  5:04     ` Doug Maxey
  3 siblings, 1 reply; 97+ messages in thread
From: Brice Goglin @ 2008-02-16 10:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, Andrea Arcangeli, linux-kernel, linux-mm

Andrew Morton wrote:
> What is the status of getting infiniband to use this facility?
>
> How important is this feature to KVM?
>
> To xpmem?
>
> Which other potential clients have been identified and how important is it
> to those?
>   

As I said when Andrea posted the first patch series, I used something
very similar for non-RDMA-based HPC about 4 years ago. I haven't had
time yet to look in depth and try the latest proposed API but my feeling
is that it looks good.

Brice


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  8:56       ` Andrew Morton
@ 2008-02-16  9:21         ` Avi Kivity
  0 siblings, 0 replies; 97+ messages in thread
From: Avi Kivity @ 2008-02-16  9:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

Andrew Morton wrote:

  

>> Very.  kvm pins pages that are referenced by the guest;
>>     
>
> hm.  Why does it do that?
>
>   

It was deemed best not to allow the guest to write to a page that has 
been swapped out and assigned to an unrelated host process.

One way to view the kvm shadow page tables is as hardware dma 
descriptors. kvm pins pages for the same reason that drivers pin pages 
that are being dma'ed. It's also the reason why mmu notifiers are useful 
for such a wide range of dma capable hardware.

>> a 64-bit guest 
>> will easily pin its entire memory with the kernel map.
>>     
>
>   
>>  So this is 
>> critical for guest swapping to actually work.
>>     
>
> Curious.  If KVM can release guest pages at the request of this notifier so
> that they can be swapped out, why can't it release them by default, and
> allow swapping to proceed?
>
>   

If kvm releases a page, it must also zap any shadow ptes pointing at the 
page and flush the tlb. If you do that for all of memory you can't 
reference any of it.
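
Purely as an illustration of that flow (the helper names are invented, not
the actual KVM code, and this assumes the notifier is embedded in the
per-VM struct), an invalidate_page callback for such a client would do
something like:

static void my_kvm_invalidate_page(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long address)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	/* Drop any shadow ptes that map this host address... */
	my_kvm_zap_sptes(kvm, address);		/* hypothetical */
	/* ...and make sure no vcpu keeps a stale translation around. */
	my_kvm_flush_guest_tlbs(kvm);		/* hypothetical */
}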

Releasing a page has costs, both at the time of the release and when the 
guest eventually refers to the page again.

>> Other nice features like page migration are also enabled by this patch.
>>
>>     
>
> We already have page migration.  Do you mean page-migration-when-using-kvm?
>   

Yes, I'm obviously writing from a kvm-centric point of view. This is an 
important feature, as the virtualization future seems to be NUMA hosts 
(2- or 4-way, 4 cores per socket) running moderately sized guests. The 
ability to load-balance guests among the NUMA nodes is important for 
performance.

(btw, I'm also looking forward to memory defragmentation. large pages 
are important for virtualization workloads and mmu notifiers are again 
critical to getting it to work while running kvm).

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  8:45     ` Avi Kivity
@ 2008-02-16  8:56       ` Andrew Morton
  2008-02-16  9:21         ` Avi Kivity
  0 siblings, 1 reply; 97+ messages in thread
From: Andrew Morton @ 2008-02-16  8:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

On Sat, 16 Feb 2008 10:45:50 +0200 Avi Kivity <avi@qumranet.com> wrote:

> Andrew Morton wrote:
> > How important is this feature to KVM?
> >   
> 
> Very.  kvm pins pages that are referenced by the guest;

hm.  Why does it do that?

> a 64-bit guest 
> will easily pin its entire memory with the kernel map.

>  So this is 
> critical for guest swapping to actually work.

Curious.  If KVM can release guest pages at the request of this notifier so
that they can be swapped out, why can't it release them by default, and
allow swapping to proceed?

> 
> Other nice features like page migration are also enabled by this patch.
> 

We already have page migration.  Do you mean page-migration-when-using-kvm?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-16  3:37   ` Andrew Morton
@ 2008-02-16  8:45     ` Avi Kivity
  2008-02-16  8:56       ` Andrew Morton
  2008-02-16 10:41     ` Brice Goglin
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 97+ messages in thread
From: Avi Kivity @ 2008-02-16  8:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Andrea Arcangeli, Robin Holt, Izik Eidus,
	kvm-devel, Peter Zijlstra, general, Steve Wise, Roland Dreier,
	Kanoj Sarcar, steiner, linux-kernel, linux-mm, daniel.blueman

Andrew Morton wrote:
> How important is this feature to KVM?
>   

Very.  kvm pins pages that are referenced by the guest; a 64-bit guest 
will easily pin its entire memory with the kernel map.  So this is 
critical for guest swapping to actually work.

Other nice features like page migration are also enabled by this patch.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-02-15  6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-02-16  3:37   ` Andrew Morton
  2008-02-16  8:45     ` Avi Kivity
                       ` (3 more replies)
  2008-02-18 22:33   ` Roland Dreier
  1 sibling, 4 replies; 97+ messages in thread
From: Andrew Morton @ 2008-02-16  3:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

On Thu, 14 Feb 2008 22:49:00 -0800 Christoph Lameter <clameter@sgi.com> wrote:

> MMU notifiers are used for hardware and software that establishes
> external references to pages managed by the Linux kernel. These are
> page table entries or tlb entries or something else that allows
> hardware (such as DMA engines, scatter gather devices, networking,
> sharing of address spaces across operating system boundaries) and
> software (Virtualization solutions such as KVM, Xen etc) to
> access memory managed by the Linux kernel.
> 
> The MMU notifier will notify the device driver that subscribes to such
> a notifier that the VM is going to do something with the memory
> mapped by that device. The device must then drop references for the
> indicated memory area. The references may be reestablished later.
> 
> The notification scheme is much better than the current schemes of
> avoiding the danger of the VM removing pages that are externally
> mapped. We currently either mlock pages used for RDMA, XPmem etc
> in memory or increase the refcount to pin the pages. Increasing
> the refcount makes it impossible for the VM to reclaim the page.
> 
> Mlock causes problems with reclaim and may lead to OOM if too many
> pages are pinned in memory. It is also incorrect in terms what the POSIX
> specificies for what role mlock should play. Mlock does *not* pin pages in
> memory. Mlock just means do not allow the page to be moved to swap.
> 
> Linux can move pages in memory (for example through the page migration
> mechanism). These pages can be moved even if they are mlocked(!!!!).
> The current approach of page pinning in use by RDMA etc is conceptually
> broken but there are currently no other easy solutions.
> 
> The alternate of increasing the page count to pin pages is also not
> that enticing since there will be continual attempts to reclaim
> or migrate these pages.
> 
> The solution here allows us to finally fix this issue by requiring
> such devices to subscribe to a notification chain that will allow
> them to work without pinning. The VM gains control of its memory again
> and the memory that has external references can be managed like regular
> memory.
> 
> This patch: Core portion
> 

What is the status of getting infiniband to use this facility?

How important is this feature to KVM?

To xpmem?

Which other potential clients have been identified and how important is it
to those?


> Index: linux-2.6/Documentation/mmu_notifier/README
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/mmu_notifier/README	2008-02-14 22:27:19.000000000 -0800
> @@ -0,0 +1,105 @@
> +Linux MMU Notifiers
> +-------------------
> +
> +MMU notifiers are used for hardware and software that establishes
> +external references to pages managed by the Linux kernel. These are
> +page table entries or tlb entries or something else that allows
> +hardware (such as DMA engines, scatter gather devices, networking,
> +sharing of address spaces across operating system boundaries) and
> +software (Virtualization solutions such as KVM, Xen etc) to
> +access memory managed by the Linux kernel.
> +
> +The MMU notifier will notify the device driver that subscribes to such
> +a notifier that the VM is going to do something with the memory
> +mapped by that device. The device must then drop references for the
> +indicated memory area. The references may be reestablished later.
> +
> +The notification scheme is much better than the current schemes of
> +dealing with the danger of the VM removing pages.
> +We currently mlock pages used for RDMA, XPmem etc in memory or
> +increase the refcount of the pages.
> +
> +Both cause problems with reclaim and may lead to OOM if too many
> +pages are pinned in memory. Mlock is also incorrect in terms of the POSIX
> +specification of the role of mlock. Mlock does *not* pin pages in
> +memory. It just does not allow the page to be moved to swap.
> +The page refcount is used to track current users of a page struct.
> +Artificially inflating the refcount means that the VM cannot track
> +down all references to a page. It will not be able to reclaim or
> +move a page. However, the core code will try again and again because
> +the assumption is that an elevated refcount is a temporary situation.
> +
> +Linux can move pages in memory (for example through the page migration
> +mechanism). These pages can be moved even if they are mlocked(!!!!).
> +So the current approach in use by RDMA etc etc is conceptually broken
> +but there are currently no other easy solutions.
> +
> +The solution here allows us to finally fix this issue by requiring
> +such devices to subscribe to a notification chain that will allow
> +them to work without pinning.
> +
> +The notifier chains provide two callback mechanisms. The
> +first one is required for any device that establishes external mappings.
> +The second (rmap) mechanism is required if a device needs to be
> +able to sleep when invalidating references. Sleeping may be necessary
> +if we are mapping across a network or to different Linux instances
> +in the same address space.

I'd have thought that a major reason for sleeping would be to wait for IO
to complete.  Worth mentioning here?

> +mmu_notifier mechanism (for KVM/GRU etc)
> +----------------------------------------
> +Callbacks are registered with an mm_struct from a device driver using
> +mmu_notifier_register(). When the VM removes pages (or changes
> +permissions on pages etc) then callbacks are triggered.
> +
> +The invalidation function for a single page (*invalidate_page)

We already have an invalidatepage.  Ho hum.

> +is called with spinlocks (in particular the pte lock) held. This allow
> +for an easy implementation of external ptes that are on the local system.
>

Why is that "easy"?  I'd have thought that it would only be easy if the
driver happened to be using those same locks for its own purposes. 
Otherwise it is "awkward"?

> +The invalidation mechanism for a range (*invalidate_range_begin/end*) is
> +called most of the time without any locks held. It is only called with
> +locks held for file backed mappings that are truncated. A flag indicates
> +in which mode we are. A driver can use that mechanism to f.e.
> +delay the freeing of the pages during truncate until no locks are held.

That sucks big time.  What do we need to do to get the callback
functions called in non-atomic context?

> +Pages must be marked dirty if dirty bits are found to be set in
> +the external ptes during unmap.

That sentence is too vague.  Define "marked dirty"?

> +The *release* method is called when a Linux process exits. It is run before

We'd conventionally use a notation such as "->release()" here, rather than
the asterisks.

> +the pages and mappings of a process are torn down and gives the device driver
> +a chance to zap all the external mappings in one go.

I assume what you mean here is that ->release() is called during exit()
when the final reference to an mm is being dropped.

> +An example of code that can be used to build a notifier mechanism into
> +a device driver can be found in the file
> +Documentation/mmu_notifier/skeleton.c

Should that be in samples/?

> +mmu_rmap_notifier mechanism (XPMEM etc)
> +---------------------------------------
> +The mmu_rmap_notifier allows the device driver to implement their own rmap

s/their/its/

> +and allows the device driver to sleep during page eviction. This is necessary
> +for complex drivers that f.e. allow the sharing of memory between processes
> +running on different Linux instances (typically over a network or in a
> +partitioned NUMA system).
> +
> +The mmu_rmap_notifier adds another invalidate_page() callout that is called
> +*before* the Linux rmaps are walked. At that point only the page lock is
> +held. The invalidate_page() function must walk the driver rmaps and evict
> +all the references to the page.

What happens if it cannot do so?

> +There is no process information available before the rmaps are consulted.

Not sure what that sentence means.  I guess "available to the core VM"?

> +The notifier mechanism can therefore not be attached to an mm_struct. Instead
> +it is a global callback list. Having to perform a callback for each and every
> +page that is reclaimed would be inefficient. Therefore we add an additional
> +page flag: PageRmapExternal().

How many page flags are left?

Is this feature important enough to justify consumption of another one?

> Only pages that are marked with this bit can
> +be exported and the rmap callbacks will only be performed for pages marked
> +that way.

"exported": new term, unclear what it means.

> +The required additional Page flag is only available in 64 bit mode and
> +therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.

whoa.  Is that good?  You just made your feature unavailable on the great
majority of Linux systems.

> +An example of code to build a mmu_notifier mechanism with rmap capability
> +can be found in Documentation/mmu_notifier/skeleton_rmap.c
> +
> +February 9, 2008,
> +	Christoph Lameter <clameter@sgi.com>
> +
> +Index: linux-2.6/include/linux/mm_types.h
> Index: linux-2.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm_types.h	2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/include/linux/mm_types.h	2008-02-14 21:17:51.000000000 -0800
> @@ -159,6 +159,12 @@ struct vm_area_struct {
>  #endif
>  };
>  
> +struct mmu_notifier_head {
> +#ifdef CONFIG_MMU_NOTIFIER
> +	struct hlist_head head;
> +#endif
> +};
> +
>  struct mm_struct {
>  	struct vm_area_struct * mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> @@ -228,6 +234,7 @@ struct mm_struct {
>  #ifdef CONFIG_CGROUP_MEM_CONT
>  	struct mem_cgroup *mem_cgroup;
>  #endif
> +	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
>  };
>  
>  #endif /* _LINUX_MM_TYPES_H */
> Index: linux-2.6/include/linux/mmu_notifier.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/include/linux/mmu_notifier.h	2008-02-14 22:42:28.000000000 -0800
> @@ -0,0 +1,180 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +/*
> + * MMU motifier

typo

> + * Notifier functions for hardware and software that establishes external
> + * references to pages of a Linux system. The notifier calls ensure that
> + * external mappings are removed when the Linux VM removes memory ranges
> + * or individual pages from a process.

So the callee cannot fail.  hm.  If it can't block, it's likely screwed in
that case.  In other cases it might be screwed anyway.  I suspect we'll
need to be able to handle callee failure.

> + * These fall into two classes:
> + *
> + * 1. mmu_notifier
> + *
> + * 	These are callbacks registered with an mm_struct. If pages are
> + * 	removed from an address space then callbacks are performed.

"to be removed", I guess.  It's called before the page is actually removed?

> + * 	Spinlocks must be held in order to walk reverse maps. The
> + * 	invalidate_page() callbacks are performed with spinlocks held.

hm, yes, problem.   Permitting callee failure might be good enough.

> + * 	The invalidate_range_start/end callbacks can be performed in contexts
> + * 	where sleeping is allowed or in atomic contexts. A flag is passed
> + * 	to indicate an atomic context.

We generally would prefer separate callbacks, rather than a unified
callback with a mode flag.


> + *	Pages must be marked dirty if dirty bits are found to be set in
> + *	the external ptes.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/rcupdate.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier_ops;
> +
> +struct mmu_notifier {
> +	struct hlist_node hlist;
> +	const struct mmu_notifier_ops *ops;
> +};
> +
> +struct mmu_notifier_ops {
> +	/*
> +	 * The release notifier is called when no other execution threads
> +	 * are left. Synchronization is not necessary.

"and the mm is about to be destroyed"?

> +	 */
> +	void (*release)(struct mmu_notifier *mn,
> +			struct mm_struct *mm);
> +
> +	/*
> +	 * age_page is called from contexts where the pte_lock is held
> +	 */
> +	int (*age_page)(struct mmu_notifier *mn,
> +			struct mm_struct *mm,
> +			unsigned long address);

This wasn't documented.

> +	/*
> +	 * invalidate_page is called from contexts where the pte_lock is held.
> +	 */
> +	void (*invalidate_page)(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long address);
> +
> +	/*
> +	 * invalidate_range_begin() and invalidate_range_end() must be paired.
> +	 *
> +	 * Multiple invalidate_range_begin/ends may be nested or called
> +	 * concurrently.

Under what circumstances would they be nested?

> That is legit. However, no new external references

references to what?

> +	 * may be established as long as any invalidate_xxx is running or
> +	 * any invalidate_range_begin() and has not been completed through a

stray "and".

> +	 * corresponding call to invalidate_range_end().
> +	 *
> +	 * Locking within the notifier needs to serialize events correspondingly.
> +	 *
> +	 * invalidate_range_begin() must clear all references in the range
> +	 * and stop the establishment of new references.

and stop the establishment of new references within the range, I assume?

If so, that's putting a heck of a lot of complexity into the driver, isn't
it?  It needs to temporarily remember an arbitrarily large number of
regions in this mm against which references may not be taken?

> +	 * invalidate_range_end() reenables the establishment of references.

within the range?

> +	 * atomic indicates that the function is called in an atomic context.
> +	 * We can sleep if atomic == 0.
> +	 *
> +	 * invalidate_range_begin() must remove all external references.
> +	 * There will be no retries as with invalidate_page().
> +	 */
> +	void (*invalidate_range_begin)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int atomic);
> +
> +	void (*invalidate_range_end)(struct mmu_notifier *mn,
> +				 struct mm_struct *mm,
> +				 unsigned long start, unsigned long end,
> +				 int atomic);
> +};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * Must hold the mmap_sem for write.
> + *
> + * RCU is used to traverse the list. A quiescent period needs to pass
> + * before the notifier is guaranteed to be visible to all threads
> + */
> +extern void mmu_notifier_register(struct mmu_notifier *mn,
> +				  struct mm_struct *mm);
> +
> +/*
> + * Must hold mmap_sem for write.
> + *
> + * A quiescent period needs to pass before the mmu_notifier structure
> + * can be released. mmu_notifier_release() will wait for a quiescent period
> + * after calling the ->release callback. So it is safe to call
> + * mmu_notifier_unregister from the ->release function.
> + */
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> +				    struct mm_struct *mm);
> +
> +
> +extern void mmu_notifier_release(struct mm_struct *mm);
> +extern int mmu_notifier_age_page(struct mm_struct *mm,
> +				 unsigned long address);

There's the mysterious age_page again.

> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
> +{
> +	INIT_HLIST_HEAD(&mnh->head);
> +}
> +
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		struct mmu_notifier *__mn;				\
> +		struct hlist_node *__n;					\
> +									\
> +		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
> +			rcu_read_lock();				\
> +			hlist_for_each_entry_rcu(__mn, __n,		\
> +					     &(mm)->mmu_notifier.head,	\
> +					     hlist)			\
> +				if (__mn->ops->function)		\
> +					__mn->ops->function(__mn,	\
> +							    mm,		\
> +							    args);	\
> +			rcu_read_unlock();				\
> +		}							\
> +	} while (0)

The macro references its args more than once.  Anyone who does

	mmu_notifier(function, some_function_which_has_side_effects())

will get a surprise.  Use temporaries.

> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +/*
> + * Notifiers that use the parameters that they were passed so that the
> + * compiler does not complain about unused variables but does proper
> + * parameter checks even if !CONFIG_MMU_NOTIFIER.
> + * Macros generate no code.
> + */
> +#define mmu_notifier(function, mm, args...)				\
> +	do {								\
> +		if (0) {						\
> +			struct mmu_notifier *__mn;			\
> +									\
> +			__mn = (struct mmu_notifier *)(0x00ff);		\
> +			__mn->ops->function(__mn, mm, args);		\
> +		};							\
> +	} while (0)

That's a bit weird.  Can't we do the old

	(void)function;
	(void)mm;

trick?  Or make it a static inline function?

> +static inline void mmu_notifier_register(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
> +						struct mm_struct *mm) {}
> +static inline void mmu_notifier_release(struct mm_struct *mm) {}
> +static inline int mmu_notifier_age_page(struct mm_struct *mm,
> +				unsigned long address)
> +{
> +	return 0;
> +}
> +
> +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/mm/Kconfig	2008-02-14 21:17:51.000000000 -0800
> @@ -193,3 +193,7 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> +	def_bool y
> +	bool "MMU notifier, for paging KVM/RDMA"

Why is this not selectable?  The help seems a bit brief.

Does this cause 32-bit systems to drag in a bunch of code they're not
allowed to ever use?

> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-02-14 20:59:01.000000000 -0800
> +++ linux-2.6/mm/Makefile	2008-02-14 21:17:51.000000000 -0800
> @@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_SMP) += allocpercpu.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>  
> Index: linux-2.6/mm/mmu_notifier.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/mmu_notifier.c	2008-02-14 22:41:55.000000000 -0800
> @@ -0,0 +1,76 @@
> +/*
> + *  linux/mm/mmu_notifier.c
> + *
> + *  Copyright (C) 2008  Qumranet, Inc.
> + *  Copyright (C) 2008  SGI
> + *  		Christoph Lameter <clameter@sgi.com>
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +
> +/*
> + * No synchronization. This function can only be called when only a single
> + * process remains that performs teardown.
> + */
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n, *t;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		hlist_for_each_entry_safe(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			hlist_del_init(&mn->hlist);
> +			if (mn->ops->release)
> +				mn->ops->release(mn, mm);

We do this a lot, but back in the old days people didn't like optional
callbacks which can be NULL.  If we expect that mmu_notifier_ops.release is
usually implemented, then just unconditionally call it and require that all
clients implement it.  Perhaps provide an exported-to-modules stub in core
kernel for clients which didn't want to implement ->release().

> +		}
> +	}
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->age_page can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n;
> +	int young = 0;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_rcu(mn, n,
> +					  &mm->mmu_notifier.head, hlist) {
> +			if (mn->ops->age_page)
> +				young |= mn->ops->age_page(mn, mm, address);
> +		}
> +		rcu_read_unlock();
> +	}
> +
> +	return young;
> +}

should the rcu_read_lock() cover the hlist_empty() test?

This function looks like it was tossed in at the last minute.  It's
mysterious, undocumented, poorly commented, poorly named.  A better name
would be one which has some correlation with the return value.

Because anyone who looks at some code which does

	if (mmu_notifier_age_page(mm, address))
		...

has to go and reverse-engineer the implementation of
mmu_notifier_age_page() to work out under which circumstances the "..."
will be executed.  But this should be apparent just from reading the callee
implementation.

This function *really* does need some documentation.  What does it *mean*
when the ->age_page() from some of the notifiers returned "1" and the
->age_page() from some other notifiers returned zero?  Dunno.
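
For what it's worth, one possible shape for addressing both points (sketch
only): take rcu_read_lock() before looking at the list at all, spell out
that the per-notifier results are simply OR-ed together, and give the
function a name that correlates with its return value
(mmu_notifier_test_and_clear_young is just one suggestion):

/*
 * Returns non-zero if at least one registered notifier saw the page at
 * @address as recently referenced (and clears that state), zero
 * otherwise.  The individual ->age_page() results are OR-ed together.
 */
int mmu_notifier_test_and_clear_young(struct mm_struct *mm,
				      unsigned long address)
{
	struct mmu_notifier *mn;
	struct hlist_node *n;
	int young = 0;

	rcu_read_lock();
	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier.head, hlist)
		if (mn->ops->age_page)
			young |= mn->ops->age_page(mn, mm, address);
	rcu_read_unlock();

	return young;
}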


^ permalink raw reply	[flat|nested] 97+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-02-15  6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
@ 2008-02-15  6:49 ` Christoph Lameter
  2008-02-16  3:37   ` Andrew Morton
  2008-02-18 22:33   ` Roland Dreier
  0 siblings, 2 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-02-15  6:49 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, general, Steve Wise, Roland Dreier, Kanoj Sarcar,
	steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 19064 bytes --]

MMU notifiers are used for hardware and software that establishes
external references to pages managed by the Linux kernel. These are
page table entries or tlb entries or something else that allows
hardware (such as DMA engines, scatter gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (Virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.

The MMU notifier will notify the device driver that subscribes to such
a notifier that the VM is going to do something with the memory
mapped by that device. The device must then drop references for the
indicated memory area. The references may be reestablished later.

The notification scheme is much better than the current schemes of
avoiding the danger of the VM removing pages that are externally
mapped. We currently either mlock pages used for RDMA, XPmem etc
in memory or increase the refcount to pin the pages. Increasing
the refcount makes it impossible for the VM to reclaim the page.

Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of what POSIX
specifies for the role mlock should play. Mlock does *not* pin pages in
memory. Mlock just means do not allow the page to be moved to swap.

Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).
The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.

The alternate of increasing the page count to pin pages is also not
that enticing since there will be continual attempts to reclaim
or migrate these pages.

The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning. The VM gains control of its memory again
and the memory that has external references can be managed like regular
memory.

This patch: Core portion

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 Documentation/mmu_notifier/README |  105 ++++++++++++++++++++++
 include/linux/mm_types.h          |    7 +
 include/linux/mmu_notifier.h      |  180 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                     |    2 
 mm/Kconfig                        |    4 
 mm/Makefile                       |    1 
 mm/mmap.c                         |    2 
 mm/mmu_notifier.c                 |   76 ++++++++++++++++
 8 files changed, 377 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README	2008-02-14 22:27:19.000000000 -0800
@@ -0,0 +1,105 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establishes
+external references to pages managed by the Linux kernel. These are
+page table entries or tlb entries or something else that allows
+hardware (such as DMA engines, scatter gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (Virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier will notify the device driver that subscribes to such
+a notifier that the VM is going to do something with the memory
+mapped by that device. The device must then drop references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current schemes of
+dealing with the danger of the VM removing pages.
+We currently mlock pages used for RDMA, XPmem etc in memory or
+increase the refcount of the pages.
+
+Both cause problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. Mlock is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+The page refcount is used to track current users of a page struct.
+Artificially inflating the refcount means that the VM cannot track
+down all references to a page. It will not be able to reclaim or
+move a page. However, the core code will try again and again because
+the assumption is that an elevated refcount is a temporary situation.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allow
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example of code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows the device driver to implement their own rmap
+and allows the device driver to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageRmapExternal(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional Page flag is only available in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build a mmu_notifier mechanism with rmap capability
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+	Christoph Lameter <clameter@sgi.com>
+
+Index: linux-2.6/include/linux/mm_types.h
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-02-14 21:17:51.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-02-14 22:42:28.000000000 -0800
@@ -0,0 +1,180 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU motifier
+ *
+ * Notifier functions for hardware and software that establishes external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If pages are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * 	The invalidate_range_start/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/*
+	 * invalidate_page is called from contexts where the pte_lock is held.
+	 */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() and has not been completed through a
+	 * corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 *
+	 * invalidate_range_begin() must remove all external references.
+	 * There will be no retries as with invalidate_page().
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Notifiers that use the parameters that they were passed so that the
+ * compiler does not complain about unused variables but does proper
+ * parameter checks even if !CONFIG_MMU_NOTIFIER.
+ * Macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-02-14 21:17:51.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-02-14 21:17:51.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-02-14 22:41:55.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processes after a RCU quiescent period!
+ *
+ * Must hold mmap_sem writably when calling registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-02-14 21:17:51.000000000 -0800
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-14 20:59:01.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-14 22:42:02.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
@ 2008-02-08 22:06 ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-02-08 22:06 UTC (permalink / raw)
  To: akpm
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus, kvm-devel,
	Peter Zijlstra, steiner, linux-kernel, linux-mm, daniel.blueman

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 18134 bytes --]

MMU notifiers are used for hardware and software that establishes
external references to pages managed by the Linux kernel. These are
page table entries or tlb entries or something else that allows
hardware (such as DMA engines, scatter gather devices, networking,
sharing of address spaces across operating system boundaries) and
software (Virtualization solutions such as KVM, Xen etc) to
access memory managed by the Linux kernel.

The MMU notifier will notify the device driver that subscribes to such
a notifier that the VM is going to do something with the memory
mapped by that device. The device must then drop references for the
indicated memory area. The references may be reestablished later.

The notification scheme is much better than the current scheme of
avoiding the danger of the VM removing externally mapped pages:
we currently mlock pages used for RDMA, XPmem etc in memory.

Mlock causes problems with reclaim and may lead to OOM if too many
pages are pinned in memory. It is also incorrect in terms of what POSIX
specifies for the role mlock should play. Mlock does *not* pin pages in
memory. Mlock just means do not allow the page to be moved to swap.

Linux can move pages in memory (for example through the page migration
mechanism). These pages can be moved even if they are mlocked(!!!!).
The current approach of page pinning in use by RDMA etc is conceptually
broken but there are currently no other easy solutions.

The solution here allows us to finally fix this issue by requiring
such devices to subscribe to a notification chain that will allow
them to work without pinning.
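
For illustration only, here is a minimal sketch of what a subscriber to
these callbacks could look like with the ops and registration rules
defined in this patch. The my_dev_* names and the driver state are
hypothetical and error handling is omitted:

struct my_dev_mirror {			/* hypothetical driver state */
	struct mmu_notifier mn;
	/* ... external pte / tlb state ... */
};

static void my_dev_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* mm is being torn down: zap all external mappings in one go */
}

static void my_dev_invalidate_range_begin(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start, unsigned long end,
		int atomic)
{
	/*
	 * Drop external references to [start, end) and block the
	 * establishment of new ones; may sleep only if atomic == 0.
	 */
}

static void my_dev_invalidate_range_end(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start, unsigned long end,
		int atomic)
{
	/* allow new external references to be established again */
}

static const struct mmu_notifier_ops my_dev_ops = {
	.release		= my_dev_release,
	.invalidate_range_begin	= my_dev_invalidate_range_begin,
	.invalidate_range_end	= my_dev_invalidate_range_end,
};

static void my_dev_mirror_mm(struct my_dev_mirror *m, struct mm_struct *mm)
{
	m->mn.ops = &my_dev_ops;
	/* registration requires mmap_sem held for write */
	down_write(&mm->mmap_sem);
	mmu_notifier_register(&m->mn, mm);
	up_write(&mm->mmap_sem);
}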

This patch: Core portion

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 Documentation/mmu_notifier/README |   99 +++++++++++++++++++++
 include/linux/mm_types.h          |    7 +
 include/linux/mmu_notifier.h      |  175 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                     |    2 
 mm/Kconfig                        |    4 
 mm/Makefile                       |    1 
 mm/mmap.c                         |    2 
 mm/mmu_notifier.c                 |   76 ++++++++++++++++
 8 files changed, 366 insertions(+)

Index: linux-2.6/Documentation/mmu_notifier/README
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/mmu_notifier/README	2008-02-08 12:30:47.000000000 -0800
@@ -0,0 +1,99 @@
+Linux MMU Notifiers
+-------------------
+
+MMU notifiers are used for hardware and software that establish
+external references to pages managed by the Linux kernel. These are
+page table entries, TLB entries or something else that allows
+hardware (such as DMA engines, scatter gather devices, networking,
+sharing of address spaces across operating system boundaries) and
+software (virtualization solutions such as KVM, Xen etc) to
+access memory managed by the Linux kernel.
+
+The MMU notifier will notify the device driver that subscribes to such
+a notifier that the VM is going to do something with the memory
+mapped by that device. The device must then drop references for the
+indicated memory area. The references may be reestablished later.
+
+The notification scheme is much better than the current scheme of
+dealing with the danger of the VM removing pages.
+We currently mlock pages used for RDMA, XPmem etc in memory.
+
+Mlock causes problems with reclaim and may lead to OOM if too many
+pages are pinned in memory. It is also incorrect in terms of the POSIX
+specification of the role of mlock. Mlock does *not* pin pages in
+memory. It just does not allow the page to be moved to swap.
+
+Linux can move pages in memory (for example through the page migration
+mechanism). These pages can be moved even if they are mlocked(!!!!).
+So the current approach in use by RDMA etc etc is conceptually broken
+but there are currently no other easy solutions.
+
+The solution here allows us to finally fix this issue by requiring
+such devices to subscribe to a notification chain that will allow
+them to work without pinning.
+
+The notifier chains provide two callback mechanisms. The
+first one is required for any device that establishes external mappings.
+The second (rmap) mechanism is required if a device needs to be
+able to sleep when invalidating references. Sleeping may be necessary
+if we are mapping across a network or to different Linux instances
+in the same address space.
+
+mmu_notifier mechanism (for KVM/GRU etc)
+----------------------------------------
+Callbacks are registered with an mm_struct from a device driver using
+mmu_notifier_register(). When the VM removes pages (or changes
+permissions on pages etc) then callbacks are triggered.
+
+The invalidation function for a single page (*invalidate_page)
+is called with spinlocks (in particular the pte lock) held. This allows
+for an easy implementation of external ptes that are on the local system.
+
+The invalidation mechanism for a range (*invalidate_range_begin/end*) is
+called most of the time without any locks held. It is only called with
+locks held for file backed mappings that are truncated. A flag indicates
+in which mode we are. A driver can use that mechanism to f.e.
+delay the freeing of the pages during truncate until no locks are held.
+
+Pages must be marked dirty if dirty bits are found to be set in
+the external ptes during unmap.
+
+The *release* method is called when a Linux process exits. It is run before
+the pages and mappings of a process are torn down and gives the device driver
+a chance to zap all the external mappings in one go.
+
+An example of code that can be used to build a notifier mechanism into
+a device driver can be found in the file
+Documentation/mmu_notifier/skeleton.c
+
+mmu_rmap_notifier mechanism (XPMEM etc)
+---------------------------------------
+The mmu_rmap_notifier allows a device driver to implement its own rmap
+and to sleep during page eviction. This is necessary
+for complex drivers that f.e. allow the sharing of memory between processes
+running on different Linux instances (typically over a network or in a
+partitioned NUMA system).
+
+The mmu_rmap_notifier adds another invalidate_page() callout that is called
+*before* the Linux rmaps are walked. At that point only the page lock is
+held. The invalidate_page() function must walk the driver rmaps and evict
+all the references to the page.
+
+There is no process information available before the rmaps are consulted.
+The notifier mechanism can therefore not be attached to an mm_struct. Instead
+it is a global callback list. Having to perform a callback for each and every
+page that is reclaimed would be inefficient. Therefore we add an additional
+page flag: PageRmapExternal(). Only pages that are marked with this bit can
+be exported and the rmap callbacks will only be performed for pages marked
+that way.
+
+The required additional page flag is only available in 64 bit mode and
+therefore the mmu_rmap_notifier portion is not available on 32 bit platforms.
+
+An example of code to build a mmu_notifier mechanism with rmap capability
+can be found in Documentation/mmu_notifier/skeleton_rmap.c
+
+February 9, 2008,
+	Christoph Lameter <clameter@sgi.com>
+
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-02-08 12:30:47.000000000 -0800
@@ -159,6 +159,12 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+#ifdef CONFIG_MMU_NOTIFIER
+	struct hlist_head head;
+#endif
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -228,6 +234,7 @@ struct mm_struct {
 #ifdef CONFIG_CGROUP_MEM_CONT
 	struct mem_cgroup *mem_cgroup;
 #endif
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-02-08 12:35:14.000000000 -0800
@@ -0,0 +1,175 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes:
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If pages are
+ * 	removed from an address space then callbacks are performed.
+ *
+ * 	Spinlocks must be held in order to walk reverse maps. The
+ * 	invalidate_page() callbacks are performed with spinlocks held.
+ *
+ * 	The invalidate_range_start/end callbacks can be performed in contexts
+ * 	where sleeping is allowed or in atomic contexts. A flag is passed
+ * 	to indicate an atomic context.
+ *
+ *	Pages must be marked dirty if dirty bits are found to be set in
+ *	the external ptes.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * The release notifier is called when no other execution threads
+	 * are left. Synchronization is not necessary.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	/*
+	 * age_page is called from contexts where the pte_lock is held
+	 */
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	/* invalidate_page is called from contexts where the pte_lock is held */
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * invalidate_range_begin() and invalidate_range_end() must be paired.
+	 *
+	 * Multiple invalidate_range_begin/ends may be nested or called
+	 * concurrently. That is legit. However, no new external references
+	 * may be established as long as any invalidate_xxx is running or
+	 * any invalidate_range_begin() has been started and not yet completed
+	 * through a corresponding call to invalidate_range_end().
+	 *
+	 * Locking within the notifier needs to serialize events correspondingly.
+	 *
+	 * invalidate_range_begin() must clear all references in the range
+	 * and stop the establishment of new references.
+	 *
+	 * invalidate_range_end() reenables the establishment of references.
+	 *
+	 * atomic indicates that the function is called in an atomic context.
+	 * We can sleep if atomic == 0.
+	 */
+	void (*invalidate_range_begin)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+
+	void (*invalidate_range_end)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int atomic);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+
+/*
+ * Must hold mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Stub versions that use the parameters they were passed so that the
+ * compiler does not complain about unused variables but still does proper
+ * parameter type checks even if !CONFIG_MMU_NOTIFIER.
+ * The macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-02-08 12:30:47.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-02-08 12:30:47.000000000 -0800
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_CONT) += memcontrol.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-02-08 12:44:24.000000000 -0800
@@ -0,0 +1,76 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+
+/*
+ * No synchronization. This function can only be called when only a single
+ * process remains that performs teardown.
+ */
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		hlist_for_each_entry_safe(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_init(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processors after an RCU quiescent period!
+ *
+ * Must hold mmap_sem for write when calling the registration functions.
+ */
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_del_rcu(&mn->hlist);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-02-08 12:30:47.000000000 -0800
@@ -53,6 +53,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -362,6 +363,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-02-08 12:28:06.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-02-08 12:43:59.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2037,6 +2038,7 @@ void exit_mmap(struct mm_struct *mm)
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
+	mmu_notifier_release(mm);
 	arch_exit_mmap(mm);
 
 	lru_add_drain();

-- 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 23:38           ` Andrea Arcangeli
@ 2008-01-30 23:55             ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 23:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Jack Steiner, Avi Kivity, Izik Eidus, Nick Piggin,
	kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Thu, 31 Jan 2008, Andrea Arcangeli wrote:

> > I think Andrea's original concept of the lock in the mmu_notifier_head
> > structure was the best.  I agree with him that it should be a spinlock
> > instead of the rw_lock.
> 
> BTW, I don't see the scalability concern with huge number of tasks:
> the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction;
> up_write(mm->mmap_sem) is always going to scale worse than
> spin_lock(mm->somethingelse); oneinstruction;
> spin_unlock(mm->somethingelse).

If we put it elsewhere in the mm then we increase the size of the memory 
used in the mm_struct.

> Furthermore if we go this route and we don't rely on implicit
> serialization of all the mmu notifier users against exit_mmap
> (i.e. the mmu notifier user must agree to stop calling
> mmu_notifier_register on a mm after the last mmput) the autodisarming
> feature will likely have to be removed or it can't possibly be safe to
> run mmu_notifier_unregister while mmu_notifier_release runs. With the
> auto-disarming feature, there is no way to safely know if
> mmu_notifier_unregister has to be called or not. I'm ok with removing
> the auto-disarming feature and to have as self-contained-as-possible
> locking. Then mmu_notifier_release can just become the
> invalidate_all_after and invalidate_all, invalidate_all_before.

Hmmmm.. exit_mmap is only called when the last reference is removed 
against the mm right? So no tasks are running anymore. No pages are left. 
Do we need to serialize at all for mmu_notifier_release?

 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 22:20         ` Robin Holt
@ 2008-01-30 23:38           ` Andrea Arcangeli
  2008-01-30 23:55             ` Christoph Lameter
  0 siblings, 1 reply; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 23:38 UTC (permalink / raw)
  To: Robin Holt
  Cc: Christoph Lameter, Jack Steiner, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 04:20:35PM -0600, Robin Holt wrote:
> On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote:
> > On Wed, 30 Jan 2008, Jack Steiner wrote:
> > 
> > > Moving to a different lock solves the problem.
> > 
> > Well it gets us back to the issue why we removed the lock. As Robin said 
> before: If it's global then we can have a huge number of tasks contending
> > for the lock on startup of a process with a large number of ranks. The 
> > reason to go to mmap_sem was that it was placed in the mm_struct and so we 
> > would just have a couple of contentions per mm_struct.
> > 
> > I'll be looking for some other way to do this.
> 
> I think Andrea's original concept of the lock in the mmu_notifier_head
> structure was the best.  I agree with him that it should be a spinlock
> instead of the rw_lock.

BTW, I don't see the scalability concern with huge number of tasks:
the lock is still in the mm, down_write(mm->mmap_sem); oneinstruction;
up_write(mm->mmap_sem) is always going to scale worse than
spin_lock(mm->somethingelse); oneinstruction;
spin_unlock(mm->somethingelse).

Furthermore if we go this route and we don't rely on implicit
serialization of all the mmu notifier users against exit_mmap
(i.e. the mmu notifier user must agree to stop calling
mmu_notifier_register on a mm after the last mmput) the autodisarming
feature will likely have to be removed or it can't possibly be safe to
run mmu_notifier_unregister while mmu_notifier_release runs. With the
auto-disarming feature, there is no way to safely know if
mmu_notifier_unregister has to be called or not. I'm ok with removing
the auto-disarming feature and to have as self-contained-as-possible
locking. Then mmu_notifier_release can just become the
invalidate_all_after and invalidate_all, invalidate_all_before.
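
Roughly, the per-head spinlock variant being argued for here could look
like the sketch below. The lock field and these function bodies are an
assumed illustration, not code from the posted patches; the head init
would also need a spin_lock_init() for the new lock:

struct mmu_notifier_head {
	struct hlist_head head;
	spinlock_t lock;		/* assumed: protects head */
};

void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
	spin_lock(&mm->mmu_notifier.lock);
	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
	spin_unlock(&mm->mmu_notifier.lock);
}

void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
	spin_lock(&mm->mmu_notifier.lock);
	hlist_del_rcu(&mn->hlist);
	spin_unlock(&mm->mmu_notifier.lock);
}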

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 19:19       ` Christoph Lameter
@ 2008-01-30 22:20         ` Robin Holt
  2008-01-30 23:38           ` Andrea Arcangeli
  0 siblings, 1 reply; 97+ messages in thread
From: Robin Holt @ 2008-01-30 22:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jack Steiner, Andrea Arcangeli, Robin Holt, Avi Kivity,
	Izik Eidus, Nick Piggin, kvm-devel, Benjamin Herrenschmidt,
	Peter Zijlstra, linux-kernel, linux-mm, daniel.blueman,
	Hugh Dickins

On Wed, Jan 30, 2008 at 11:19:28AM -0800, Christoph Lameter wrote:
> On Wed, 30 Jan 2008, Jack Steiner wrote:
> 
> > Moving to a different lock solves the problem.
> 
> Well it gets us back to the issue why we removed the lock. As Robin said 
> before: If it's global then we can have a huge number of tasks contending
> for the lock on startup of a process with a large number of ranks. The 
> reason to go to mmap_sem was that it was placed in the mm_struct and so we 
> would just have a couple of contentions per mm_struct.
> 
> I'll be looking for some other way to do this.

I think Andrea's original concept of the lock in the mmu_notifier_head
structure was the best.  I agree with him that it should be a spinlock
instead of the rw_lock.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 17:10     ` Peter Zijlstra
@ 2008-01-30 19:28       ` Christoph Lameter
  0 siblings, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

How about just taking the mmap_sem writelock in release? We have only a 
single caller of mmu_notifier_release() in mm/mmap.c and we know that we 
are not holding mmap_sem at that point. So just acquire it when needed?

Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 11:21:57.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 11:24:59.000000000 -0800
@@ -18,6 +19,7 @@ void mmu_notifier_release(struct mm_stru
 	struct hlist_node *n, *t;
 
 	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		down_write(&mm->mmap_sem);
 		rcu_read_lock();
 		hlist_for_each_entry_safe_rcu(mn, n, t,
 					  &mm->mmu_notifier.head, hlist) {
@@ -26,6 +28,7 @@ void mmu_notifier_release(struct mm_stru
 				mn->ops->release(mn, mm);
 		}
 		rcu_read_unlock();
+		up_write(&mm->mmap_sem);
 		synchronize_rcu();
 	}
 }

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:53     ` Jack Steiner
  2008-01-30 16:38       ` Andrea Arcangeli
@ 2008-01-30 19:19       ` Christoph Lameter
  2008-01-30 22:20         ` Robin Holt
  1 sibling, 1 reply; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:19 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Jack Steiner wrote:

> Moving to a different lock solves the problem.

Well it gets us back to the issue why we removed the lock. As Robin said 
before: If it's global then we can have a huge number of tasks contending
for the lock on startup of a process with a large number of ranks. The 
reason to go to mmap_sem was that it was placed in the mm_struct and so we 
would just have a couple of contentions per mm_struct.

I'll be looking for some other way to do this.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 18:02   ` Robin Holt
  2008-01-30 19:08     ` Christoph Lameter
@ 2008-01-30 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:14 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

Ok. So I added the following patch:

---
 include/linux/mmu_notifier.h |    1 +
 mm/mmu_notifier.c            |   12 ++++++++++++
 2 files changed, 13 insertions(+)

Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- linux-2.6.orig/include/linux/mmu_notifier.h	2008-01-30 11:09:06.000000000 -0800
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-30 11:10:38.000000000 -0800
@@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini
 
 extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
 extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_export_page(struct page *page);
 
 extern struct hlist_head mmu_rmap_notifier_list;
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- linux-2.6.orig/mm/mmu_notifier.c	2008-01-30 11:09:01.000000000 -0800
+++ linux-2.6/mm/mmu_notifier.c	2008-01-30 11:12:10.000000000 -0800
@@ -99,3 +99,15 @@ void mmu_rmap_notifier_unregister(struct
 }
 EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
 
+/*
+ * Export a page.
+ *
+ * Pagelock must be held.
+ * Must be called before a page is put on an external rmap.
+ */
+void mmu_rmap_export_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);
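
A hypothetical caller in an external-rmap subsystem would then look
something like this sketch; struct my_rmap and my_rmap_insert() are
assumed driver internals, not part of the patch:

struct my_rmap;					/* hypothetical subsystem rmap */
static void my_rmap_insert(struct my_rmap *rmap, struct page *page);

/* Sketch: establish an external pte for @page in the subsystem's own rmap. */
static void my_subsystem_export(struct my_rmap *rmap, struct page *page)
{
	lock_page(page);		/* mmu_rmap_export_page requires the page lock */
	mmu_rmap_export_page(page);	/* sets PageExternalRmap */
	my_rmap_insert(rmap, page);	/* assumed driver-internal rmap insertion */
	unlock_page(page);
}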


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 18:02   ` Robin Holt
@ 2008-01-30 19:08     ` Christoph Lameter
  2008-01-30 19:14     ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30 19:08 UTC (permalink / raw)
  To: Robin Holt
  Cc: Andrea Arcangeli, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Wed, 30 Jan 2008, Robin Holt wrote:

> Index: git-linus/mm/mmu_notifier.c
> ===================================================================
> --- git-linus.orig/mm/mmu_notifier.c	2008-01-30 11:43:45.000000000 -0600
> +++ git-linus/mm/mmu_notifier.c	2008-01-30 11:56:08.000000000 -0600
> @@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct
>  }
>  EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
>  
> +void mmu_rmap_export_page(struct page *page)
> +{
> +	SetPageExternalRmap(page);
> +}
> +EXPORT_SYMBOL(mmu_rmap_export_page);

Then mmu_rmap_export_page would have to be called before the subsystem 
establishes the rmap entry for the page. Could we do all PageExternalRmap 
modifications under Pagelock?



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30  2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
  2008-01-30 15:37   ` Andrea Arcangeli
@ 2008-01-30 18:02   ` Robin Holt
  2008-01-30 19:08     ` Christoph Lameter
  2008-01-30 19:14     ` Christoph Lameter
  1 sibling, 2 replies; 97+ messages in thread
From: Robin Holt @ 2008-01-30 18:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrea Arcangeli, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	steiner, linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

Back to one of Andrea's points from a couple days ago, I think we still
have a problem with the PageExternalRmap page flag.

If I had two drivers with external rmap implementations, there is no way
I can think of for a simple flag to coordinate a single page being
exported and maintained by the two.

Since the intended use seems to point in the direction that the external
rmap must be kept consistent with all pages the driver has
exported, and the driver will already need to handle cases where a page
does not appear in its rmap, I would propose that the setting and clearing
be handled in the mmu_notifier code.

This is the first of two patches.  This one is intended as an addition
to patch 1/6.  I will post the other shortly under the patch 3/6 thread.


Index: git-linus/include/linux/mmu_notifier.h
===================================================================
--- git-linus.orig/include/linux/mmu_notifier.h	2008-01-30 11:43:45.000000000 -0600
+++ git-linus/include/linux/mmu_notifier.h	2008-01-30 11:44:35.000000000 -0600
@@ -146,6 +146,7 @@ static inline void mmu_notifier_head_ini
 
 extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
 extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_export_page(struct page *page);
 
 extern struct hlist_head mmu_rmap_notifier_list;
 
Index: git-linus/mm/mmu_notifier.c
===================================================================
--- git-linus.orig/mm/mmu_notifier.c	2008-01-30 11:43:45.000000000 -0600
+++ git-linus/mm/mmu_notifier.c	2008-01-30 11:56:08.000000000 -0600
@@ -99,3 +99,8 @@ void mmu_rmap_notifier_unregister(struct
 }
 EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
 
+void mmu_rmap_export_page(struct page *page)
+{
+	SetPageExternalRmap(page);
+}
+EXPORT_SYMBOL(mmu_rmap_export_page);

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:37   ` Andrea Arcangeli
  2008-01-30 15:53     ` Jack Steiner
@ 2008-01-30 17:10     ` Peter Zijlstra
  2008-01-30 19:28       ` Christoph Lameter
  1 sibling, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2008-01-30 17:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, steiner,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins


On Wed, 2008-01-30 at 16:37 +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n, *t;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del_rcu(&mn->hlist);
> 
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.

Agreed, sorry for this oversight.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:53     ` Jack Steiner
@ 2008-01-30 16:38       ` Andrea Arcangeli
  2008-01-30 19:19       ` Christoph Lameter
  1 sibling, 0 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 16:38 UTC (permalink / raw)
  To: Jack Steiner
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 09:53:06AM -0600, Jack Steiner wrote:
> That will also resolve the problem we discussed yesterday. 
> I want to unregister my mmu_notifier when a GRU segment is
> unmapped. This would not necessarily be at task termination.

My proof that there is something wrong in the smp locking of the
current code is very simple: it can't be right to use
hlist_for_each_entry_safe_rcu and rcu_read_lock inside
mmu_notifier_release, and then to call hlist_del_rcu without any
spinlock or semaphore. If we walk the list with
hlist_for_each_entry_safe_rcu (and not with
hlist_for_each_entry_safe), it means the list _can_ change from under
us, and in turn the hlist_del_rcu must be surrounded by a spinlock or
semaphore too!

If by design the list _can't_ change from under us and calling
hlist_del_rcu was safe w/o locks, then hlist_for_each_entry_safe is
_sure_ enough for mmu_notifier_release, and rcu_read_lock most
certainly can be removed too.

To construct a usage case where the race could trigger, I was thinking of
somebody bumping the mm_count (not mm_users) and registering a
notifier while mmu_notifier_release runs, relying on ->release to
know if it has to run mmu_notifier_unregister. However I now started
wondering how it can rely on ->release to know that, given ->release is
called after hlist_del_rcu, because with the latest changes ->release
will also allow the mn to release itself ;). It's unsafe to call
list_del_rcu twice (the second will crash on a poisoned entry).

This starts to make me think we should remove the auto-disarming
feature and require the notifier-user to have the ->release call
mmu_notifier_unregister first and free the "mn" inside ->release
too if needed. Or alternatively the notifier-user can bump mm_count
and call mmu_notifier_unregister before calling mmdrop (like kvm
could do).
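
A rough sketch of that second alternative, with hypothetical notifier-user
state (the exact locking around register/unregister depends on which
revision of the patch is applied):

struct my_user {			/* hypothetical notifier user */
	struct mmu_notifier mn;
	struct mm_struct *mm;
};

static void my_user_attach(struct my_user *u, struct mm_struct *mm)
{
	atomic_inc(&mm->mm_count);	/* pin the mm_struct itself */
	u->mm = mm;
	mmu_notifier_register(&u->mn, mm);
}

static void my_user_detach(struct my_user *u)
{
	/* called before the final mmdrop, never from ->release */
	mmu_notifier_unregister(&u->mn, u->mm);
	mmdrop(u->mm);
}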

Another approach is to simply define mmu_notifier_release as
implicitly serialized by other code design, with a real lock (not rcu)
against the whole register/unregister operations. So to guarantee the
notifier list can't change from under us while mmu_notifier_release
runs. If we go this route, yes, the auto-disarming hlist_del can be
kept, the current code would have been safe, but to avoid confusion
the mmu_notifier_release shall become this:

void mmu_notifier_release(struct mm_struct *mm)
{
	struct mmu_notifier *mn;
	struct hlist_node *n, *t;

	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
		hlist_for_each_entry_safe(mn, n, t,
					  &mm->mmu_notifier.head, hlist) {
			hlist_del(&mn->hlist);
			if (mn->ops->release)
				mn->ops->release(mn, mm);
		}
	}
}

> However, the mmap_sem is already held for write by the core
> VM at the point I would call the unregister function.
> Currently, there is no __mmu_notifier_unregister() defined.
> 
> Moving to a different lock solves the problem.

Unless mmu_notifier_release becomes like the above and we rely on the
user of the mmu notifiers to implement a highlevel external lock that
definitely forbids bumping the mm_count of the mm and calling
register/unregister while mmu_notifier_release could run, 1) moving to a
different lock and 2) removing the auto-disarming hlist_del_rcu from
mmu_notifier_release sound like the only possible SMP-safe way.

As far as KVM is concerned mmu_notifier_release could be changed to
the version I wrote above and everything should be ok. For KVM the
mm_count bump is done by the task that also holds a mm_user, so when
exit_mmap runs I don't think the list could possibly change anymore.

Anyway those are details that can be perfected after mainline merging,
so this isn't something to worry about too much right now. My idea is
to keep working to perfect it while I hope progress is being made by
Christoph to merge the mmu notifiers V3 patchset in mainline ;).

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30 15:37   ` Andrea Arcangeli
@ 2008-01-30 15:53     ` Jack Steiner
  2008-01-30 16:38       ` Andrea Arcangeli
  2008-01-30 19:19       ` Christoph Lameter
  2008-01-30 17:10     ` Peter Zijlstra
  1 sibling, 2 replies; 97+ messages in thread
From: Jack Steiner @ 2008-01-30 15:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Robin Holt, Avi Kivity, Izik Eidus,
	Nick Piggin, kvm-devel, Benjamin Herrenschmidt, Peter Zijlstra,
	linux-kernel, linux-mm, daniel.blueman, Hugh Dickins

On Wed, Jan 30, 2008 at 04:37:49PM +0100, Andrea Arcangeli wrote:
> On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> > +void mmu_notifier_release(struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier *mn;
> > +	struct hlist_node *n, *t;
> > +
> > +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> > +		rcu_read_lock();
> > +		hlist_for_each_entry_safe_rcu(mn, n, t,
> > +					  &mm->mmu_notifier.head, hlist) {
> > +			hlist_del_rcu(&mn->hlist);
> 
> This will race and kernel crash against mmu_notifier_register in
> SMP. You should resurrect the per-mmu_notifier_head lock in my last
> patch (except it can be converted from a rwlock_t to a regular
> spinlock_t) and drop the mmap_sem from
> mmu_notifier_register/unregister.

Agree.

That will also resolve the problem we discussed yesterday. 
I want to unregister my mmu_notifier when a GRU segment is
unmapped. This would not necessarily be at task termination.

However, the mmap_sem is already held for write by the core
VM at the point I would call the unregister function.
Currently, there is no __mmu_notifier_unregister() defined.

Moving to a different lock solves the problem.


-- jack

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [patch 1/6] mmu_notifier: Core code
  2008-01-30  2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
@ 2008-01-30 15:37   ` Andrea Arcangeli
  2008-01-30 15:53     ` Jack Steiner
  2008-01-30 17:10     ` Peter Zijlstra
  2008-01-30 18:02   ` Robin Holt
  1 sibling, 2 replies; 97+ messages in thread
From: Andrea Arcangeli @ 2008-01-30 15:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

On Tue, Jan 29, 2008 at 06:29:10PM -0800, Christoph Lameter wrote:
> +void mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier *mn;
> +	struct hlist_node *n, *t;
> +
> +	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
> +		rcu_read_lock();
> +		hlist_for_each_entry_safe_rcu(mn, n, t,
> +					  &mm->mmu_notifier.head, hlist) {
> +			hlist_del_rcu(&mn->hlist);

This will race and kernel crash against mmu_notifier_register in
SMP. You should resurrect the per-mmu_notifier_head lock in my last
patch (except it can be converted from a rwlock_t to a regular
spinlock_t) and drop the mmap_sem from
mmu_notifier_register/unregister.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* [patch 1/6] mmu_notifier: Core code
  2008-01-30  2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
@ 2008-01-30  2:29 ` Christoph Lameter
  2008-01-30 15:37   ` Andrea Arcangeli
  2008-01-30 18:02   ` Robin Holt
  0 siblings, 2 replies; 97+ messages in thread
From: Christoph Lameter @ 2008-01-30  2:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Robin Holt, Avi Kivity, Izik Eidus, Nick Piggin, kvm-devel,
	Benjamin Herrenschmidt, Peter Zijlstra, steiner, linux-kernel,
	linux-mm, daniel.blueman, Hugh Dickins

[-- Attachment #1: mmu_core --]
[-- Type: text/plain, Size: 15337 bytes --]

Core code for mmu notifiers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>

---
 include/linux/list.h         |   14 ++
 include/linux/mm_types.h     |    6 +
 include/linux/mmu_notifier.h |  210 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/page-flags.h   |   10 ++
 kernel/fork.c                |    2 
 mm/Kconfig                   |    4 
 mm/Makefile                  |    1 
 mm/mmap.c                    |    2 
 mm/mmu_notifier.c            |  101 ++++++++++++++++++++
 9 files changed, 350 insertions(+)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/mm_types.h	2008-01-29 16:56:36.000000000 -0800
@@ -153,6 +153,10 @@ struct vm_area_struct {
 #endif
 };
 
+struct mmu_notifier_head {
+	struct hlist_head head;
+};
+
 struct mm_struct {
 	struct vm_area_struct * mmap;		/* list of VMAs */
 	struct rb_root mm_rb;
@@ -219,6 +223,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+
+	struct mmu_notifier_head mmu_notifier; /* MMU notifier list */
 };
 
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/mmu_notifier.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/mmu_notifier.h	2008-01-29 16:56:36.000000000 -0800
@@ -0,0 +1,210 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+/*
+ * MMU notifier
+ *
+ * Notifier functions for hardware and software that establish external
+ * references to pages of a Linux system. The notifier calls ensure that
+ * the external mappings are removed when the Linux VM removes memory ranges
+ * or individual pages from a process.
+ *
+ * These fall into two classes
+ *
+ * 1. mmu_notifier
+ *
+ * 	These are callbacks registered with an mm_struct. If mappings are
+ * 	removed from an address space then callbacks are performed.
+ * 	Spinlocks must be held in order to walk the reverse maps and the
+ * 	notifications are performed while the spinlock is held.
+ *
+ *
+ * 2. mmu_rmap_notifier
+ *
+ *	Callbacks for subsystems that provide their own rmaps. These
+ *	need to walk their own rmaps for a page. The invalidate_page
+ *	callback is outside of locks so that we are not in a strictly
+ *	atomic context (but we may be in a PF_MEMALLOC context if the
+ *	notifier is called from reclaim code) and are able to sleep.
+ *	Rmap notifiers need an extra page bit and are only available
+ *	on 64 bit platforms. It is up to the subsystem to mark pages
+ *	as PageExternalRmap as needed to trigger the callbacks. Pages
+ *	must be marked dirty if dirty bits are set in the external
+ *	pte.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier_ops;
+
+struct mmu_notifier {
+	struct hlist_node hlist;
+	const struct mmu_notifier_ops *ops;
+};
+
+struct mmu_notifier_ops {
+	/*
+	 * Note: The mmu_notifier structure must be released with
+	 * call_rcu() since other processors are only guaranteed to
+	 * see the changes after a quiescent period.
+	 */
+	void (*release)(struct mmu_notifier *mn,
+			struct mm_struct *mm);
+
+	int (*age_page)(struct mmu_notifier *mn,
+			struct mm_struct *mm,
+			unsigned long address);
+
+	void (*invalidate_page)(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long address);
+
+	/*
+	 * lock indicates that the function is called under spinlock.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn,
+				 struct mm_struct *mm,
+				 unsigned long start, unsigned long end,
+				 int lock);
+};
+
+struct mmu_rmap_notifier_ops;
+
+struct mmu_rmap_notifier {
+	struct hlist_node hlist;
+	const struct mmu_rmap_notifier_ops *ops;
+};
+
+struct mmu_rmap_notifier_ops {
+	/*
+	 * Called with the page lock held after ptes are modified or removed
+	 * so that a subsystem with its own rmap's can remove remote ptes
+	 * mapping a page.
+	 */
+	void (*invalidate_page)(struct mmu_rmap_notifier *mrn,
+						struct page *page);
+};
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+/*
+ * Must hold the mmap_sem for write.
+ *
+ * RCU is used to traverse the list. A quiescent period needs to pass
+ * before the notifier is guaranteed to be visible to all threads
+ */
+extern void __mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/* Will acquire mmap_sem for write */
+extern void mmu_notifier_register(struct mmu_notifier *mn,
+				  struct mm_struct *mm);
+/*
+ * Will acquire mmap_sem for write.
+ *
+ * A quiescent period needs to pass before the mmu_notifier structure
+ * can be released. mmu_notifier_release() will wait for a quiescent period
+ * after calling the ->release callback. So it is safe to call
+ * mmu_notifier_unregister from the ->release function.
+ */
+extern void mmu_notifier_unregister(struct mmu_notifier *mn,
+				    struct mm_struct *mm);
+
+
+extern void mmu_notifier_release(struct mm_struct *mm);
+extern int mmu_notifier_age_page(struct mm_struct *mm,
+				 unsigned long address);
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh)
+{
+	INIT_HLIST_HEAD(&mnh->head);
+}
+
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		struct mmu_notifier *__mn;				\
+		struct hlist_node *__n;					\
+									\
+		if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \
+			rcu_read_lock();				\
+			hlist_for_each_entry_rcu(__mn, __n,		\
+					     &(mm)->mmu_notifier.head,	\
+					     hlist)			\
+				if (__mn->ops->function)		\
+					__mn->ops->function(__mn,	\
+							    mm,		\
+							    args);	\
+			rcu_read_unlock();				\
+		}							\
+	} while (0)
+
+extern void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn);
+extern void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn);
+
+extern struct hlist_head mmu_rmap_notifier_list;
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		struct mmu_rmap_notifier *__mrn;			\
+		struct hlist_node *__n;					\
+									\
+		rcu_read_lock();					\
+		hlist_for_each_entry_rcu(__mrn, __n,			\
+				&mmu_rmap_notifier_list, 		\
+						hlist)			\
+			if (__mrn->ops->function)			\
+				__mrn->ops->function(__mrn, args);	\
+		rcu_read_unlock();					\
+	} while (0);
+
+#else /* CONFIG_MMU_NOTIFIER */
+
+/*
+ * Stub versions that use the parameters they were passed so that the
+ * compiler does not complain about unused variables but still does proper
+ * parameter type checks even if !CONFIG_MMU_NOTIFIER.
+ * The macros generate no code.
+ */
+#define mmu_notifier(function, mm, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_notifier *__mn;			\
+									\
+			__mn = (struct mmu_notifier *)(0x00ff);		\
+			__mn->ops->function(__mn, mm, args);		\
+		};							\
+	} while (0)
+
+#define mmu_rmap_notifier(function, args...)				\
+	do {								\
+		if (0) {						\
+			struct mmu_rmap_notifier *__mrn;		\
+									\
+			__mrn = (struct mmu_rmap_notifier *)(0x00ff);	\
+			__mrn->ops->function(__mrn, args);		\
+		}							\
+	} while (0);
+
+static inline void mmu_notifier_register(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_unregister(struct mmu_notifier *mn,
+						struct mm_struct *mm) {}
+static inline void mmu_notifier_release(struct mm_struct *mm) {}
+static inline int mmu_notifier_age_page(struct mm_struct *mm,
+				unsigned long address)
+{
+	return 0;
+}
+
+static inline void mmu_notifier_head_init(struct mmu_notifier_head *mmh) {}
+
+static inline void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+									{}
+static inline void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+									{}
+
+#endif /* CONFIG_MMU_NOTIFIER */
+
+#endif /* _LINUX_MMU_NOTIFIER_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/page-flags.h	2008-01-29 16:56:36.000000000 -0800
@@ -105,6 +105,7 @@
  * 64 bit  |           FIELDS             | ??????         FLAGS         |
  *         63                            32                              0
  */
+#define PG_external_rmap	30	/* Page has external rmap */
 #define PG_uncached		31	/* Page has been mapped as uncached */
 #endif
 
@@ -260,6 +261,15 @@ static inline void __ClearPageTail(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#if defined(CONFIG_MMU_NOTIFIER) && defined(CONFIG_64BIT)
+#define PageExternalRmap(page)	test_bit(PG_external_rmap, &(page)->flags)
+#define SetPageExternalRmap(page) set_bit(PG_external_rmap, &(page)->flags)
+#define ClearPageExternalRmap(page) clear_bit(PG_external_rmap, \
+							&(page)->flags)
+#else
+#define PageExternalRmap(page)	0
+#endif
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/Kconfig	2008-01-29 16:56:36.000000000 -0800
@@ -193,3 +193,7 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	def_bool y
+	bool "MMU notifier, for paging KVM/RDMA"
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/Makefile	2008-01-29 16:56:36.000000000 -0800
@@ -30,4 +30,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 
Index: linux-2.6/mm/mmu_notifier.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/mmu_notifier.c	2008-01-29 16:57:26.000000000 -0800
@@ -0,0 +1,101 @@
+/*
+ *  linux/mm/mmu_notifier.c
+ *
+ *  Copyright (C) 2008  Qumranet, Inc.
+ *  Copyright (C) 2008  SGI
+ *  		Christoph Lameter <clameter@sgi.com>
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/module.h>
+
+void mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n, *t;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_safe_rcu(mn, n, t,
+					  &mm->mmu_notifier.head, hlist) {
+			hlist_del_rcu(&mn->hlist);
+			if (mn->ops->release)
+				mn->ops->release(mn, mm);
+		}
+		rcu_read_unlock();
+		synchronize_rcu();
+	}
+}
+
+/*
+ * If no young bitflag is supported by the hardware, ->age_page can
+ * unmap the address and return 1 or 0 depending if the mapping previously
+ * existed or not.
+ */
+int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
+		rcu_read_lock();
+		hlist_for_each_entry_rcu(mn, n,
+					  &mm->mmu_notifier.head, hlist) {
+			if (mn->ops->age_page)
+				young |= mn->ops->age_page(mn, mm, address);
+		}
+		rcu_read_unlock();
+	}
+
+	return young;
+}
+
+/*
+ * Note that all notifiers use RCU. The updates are only guaranteed to be
+ * visible to other processors after an RCU quiescent period!
+ */
+void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier.head);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+
+void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	__mmu_notifier_register(mn, mm);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+	hlist_del_rcu(&mn->hlist);
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
+
+static DEFINE_SPINLOCK(mmu_notifier_list_lock);
+HLIST_HEAD(mmu_rmap_notifier_list);
+
+void mmu_rmap_notifier_register(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_add_head_rcu(&mrn->hlist, &mmu_rmap_notifier_list);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_register);
+
+void mmu_rmap_notifier_unregister(struct mmu_rmap_notifier *mrn)
+{
+	spin_lock(&mmu_notifier_list_lock);
+	hlist_del_rcu(&mrn->hlist);
+	spin_unlock(&mmu_notifier_list_lock);
+}
+EXPORT_SYMBOL(mmu_rmap_notifier_unregister);
+
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/kernel/fork.c	2008-01-29 16:56:36.000000000 -0800
@@ -52,6 +52,7 @@
 #include <linux/tty.h>
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -360,6 +361,7 @@ static struct mm_struct * mm_init(struct
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_head_init(&mm->mmu_notifier);
 		return mm;
 	}
 	free_mm(mm);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/mm/mmap.c	2008-01-29 16:56:36.000000000 -0800
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm)
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	mmu_notifier_release(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
Index: linux-2.6/include/linux/list.h
===================================================================
--- linux-2.6.orig/include/linux/list.h	2008-01-29 16:56:33.000000000 -0800
+++ linux-2.6/include/linux/list.h	2008-01-29 16:56:36.000000000 -0800
@@ -991,6 +991,20 @@ static inline void hlist_add_after_rcu(s
 		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
+/**
+ * hlist_for_each_entry_safe_rcu	- iterate over list of given type
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct hlist_node to use as a loop cursor.
+ * @n:		temporary pointer
+ * @head:	the head for your list.
+ * @member:	the name of the hlist_node within the struct.
+ */
+#define hlist_for_each_entry_safe_rcu(tpos, pos, n, head, member)	 \
+	for (pos = (head)->first;					 \
+	     rcu_dereference(pos) && ({ n = pos->next; 1;}) &&		 \
+		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
+	     pos = n)
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */

-- 

^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2008-02-18 22:33 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-28 20:28 [patch 0/6] [RFC] MMU Notifiers V2 Christoph Lameter
2008-01-28 20:28 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-28 22:06   ` Christoph Lameter
2008-01-29  0:05   ` Robin Holt
2008-01-29  1:19     ` Christoph Lameter
2008-01-29 13:59   ` Andrea Arcangeli
2008-01-29 14:34     ` Andrea Arcangeli
2008-01-29 19:49     ` Christoph Lameter
2008-01-29 20:41       ` Avi Kivity
2008-01-29 16:07   ` Robin Holt
2008-02-05 18:05   ` Andy Whitcroft
2008-02-05 18:17     ` Peter Zijlstra
2008-02-05 18:19     ` Christoph Lameter
2008-01-28 20:28 ` [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Christoph Lameter
2008-01-29 16:20   ` Andrea Arcangeli
2008-01-29 18:28     ` Andrea Arcangeli
2008-01-29 20:30       ` Christoph Lameter
2008-01-29 21:36         ` Andrea Arcangeli
2008-01-29 21:53           ` Christoph Lameter
2008-01-29 22:35             ` Andrea Arcangeli
2008-01-29 22:55               ` Christoph Lameter
2008-01-29 23:43                 ` Andrea Arcangeli
2008-01-30  0:34                   ` Christoph Lameter
2008-01-29 19:55     ` Christoph Lameter
2008-01-29 21:17       ` Andrea Arcangeli
2008-01-29 21:35         ` Christoph Lameter
2008-01-29 22:02           ` Andrea Arcangeli
2008-01-29 22:39             ` Christoph Lameter
2008-01-30  0:00               ` Andrea Arcangeli
2008-01-30  0:05                 ` Andrea Arcangeli
2008-01-30  0:22                   ` Christoph Lameter
2008-01-30  0:59                     ` Andrea Arcangeli
2008-01-30  8:26                       ` Peter Zijlstra
2008-01-30  0:20                 ` Christoph Lameter
2008-01-30  0:28                   ` Jack Steiner
2008-01-30  0:35                     ` Christoph Lameter
2008-01-30 13:37                     ` Andrea Arcangeli
2008-01-30 14:43                       ` Jack Steiner
2008-01-30 19:41                         ` Christoph Lameter
2008-01-30 20:29                           ` Jack Steiner
2008-01-30 20:55                             ` Christoph Lameter
2008-01-30 16:11                 ` Robin Holt
2008-01-30 17:04                   ` Andrea Arcangeli
2008-01-30 17:30                     ` Robin Holt
2008-01-30 18:25                       ` Andrea Arcangeli
2008-01-30 19:50                         ` Christoph Lameter
2008-01-30 22:18                           ` Robin Holt
2008-01-30 23:52                           ` Andrea Arcangeli
2008-01-31  0:01                             ` Christoph Lameter
2008-01-31  0:34                               ` [kvm-devel] " Andrea Arcangeli
2008-01-31  1:46                                 ` Christoph Lameter
2008-01-31  2:34                                   ` Robin Holt
2008-01-31  2:37                                     ` Christoph Lameter
2008-01-31  2:56                                     ` [kvm-devel] mmu_notifier: invalidate_range_start with lock=1 Christoph Lameter
2008-01-31 10:52                                   ` [kvm-devel] [patch 2/6] mmu_notifier: Callbacks to invalidate address ranges Andrea Arcangeli
2008-01-31  2:08                                 ` Christoph Lameter
2008-01-31  2:42                                   ` Andrea Arcangeli
2008-01-31  2:51                                     ` Christoph Lameter
2008-01-31 13:39                                       ` Andrea Arcangeli
2008-01-30 19:35                   ` Christoph Lameter
2008-01-28 20:28 ` [patch 3/6] mmu_notifier: invalidate_page callbacks for subsystems with rmap Christoph Lameter
2008-01-29 16:28   ` Robin Holt
2008-01-28 20:28 ` [patch 4/6] MMU notifier: invalidate_page callbacks using Linux rmaps Christoph Lameter
2008-01-29 14:03   ` Andrea Arcangeli
2008-01-29 14:24     ` Andrea Arcangeli
2008-01-29 19:51       ` Christoph Lameter
2008-01-28 20:28 ` [patch 5/6] mmu_notifier: Callbacks for xip_filemap.c Christoph Lameter
2008-01-28 20:28 ` [patch 6/6] mmu_notifier: Add invalidate_all() Christoph Lameter
2008-01-29 16:31   ` Robin Holt
2008-01-29 20:02     ` Christoph Lameter
2008-01-30  2:29 [patch 0/6] [RFC] MMU Notifiers V3 Christoph Lameter
2008-01-30  2:29 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-01-30 15:37   ` Andrea Arcangeli
2008-01-30 15:53     ` Jack Steiner
2008-01-30 16:38       ` Andrea Arcangeli
2008-01-30 19:19       ` Christoph Lameter
2008-01-30 22:20         ` Robin Holt
2008-01-30 23:38           ` Andrea Arcangeli
2008-01-30 23:55             ` Christoph Lameter
2008-01-30 17:10     ` Peter Zijlstra
2008-01-30 19:28       ` Christoph Lameter
2008-01-30 18:02   ` Robin Holt
2008-01-30 19:08     ` Christoph Lameter
2008-01-30 19:14     ` Christoph Lameter
2008-02-08 22:06 [patch 0/6] MMU Notifiers V6 Christoph Lameter
2008-02-08 22:06 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-15  6:48 [patch 0/6] MMU Notifiers V7 Christoph Lameter
2008-02-15  6:49 ` [patch 1/6] mmu_notifier: Core code Christoph Lameter
2008-02-16  3:37   ` Andrew Morton
2008-02-16  8:45     ` Avi Kivity
2008-02-16  8:56       ` Andrew Morton
2008-02-16  9:21         ` Avi Kivity
2008-02-16 10:41     ` Brice Goglin
2008-02-16 10:58       ` Andrew Morton
2008-02-16 19:31         ` Christoph Lameter
2008-02-16 19:21     ` Christoph Lameter
2008-02-17  3:01       ` Andrea Arcangeli
2008-02-17 12:24         ` Robin Holt
2008-02-17  5:04     ` Doug Maxey
2008-02-18 22:33   ` Roland Dreier
