LKML Archive on lore.kernel.org
* [PATCH] rmap 34 vm_flags page_table_lock
From: Hugh Dickins @ 2004-05-18 22:06 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

First of a batch of seven rmap patches, based on 2.6.6-mm3.  Probably
the final batch: any remaining outstanding issues can be addressed by
isolated patches.
The first half of the batch is good for anonmm or anon_vma, the second
half of the batch replaces my anonmm rmap by Andrea's anon_vma rmap.

Judge for yourselves which you prefer.  I do think I was wrong to call
anon_vma more complex than anonmm (its lists are easier to understand
than my refcounting), and I'm happy with its vma merging after the last
patch.  It just comes down to whether we can spare the extra 24 bytes
(maximum, on 32-bit) per vma for its advantages in swapout and mremap.

rmap 34 vm_flags page_table_lock

Why do we guard vm_flags mods with page_table_lock when it's already
down_write guarded by mmap_sem?  There's probably a historical reason,
but no sign of any need for it now.  Andrea added a comment and removed
the instance from mprotect.c, Hugh plagiarized his comment and removed
the instances from madvise.c and mlock.c.  A huge leap in scalability
is not expected, but this should stop people asking why those
spinlocks are there.
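
For readers puzzling over the rule being documented, here is a minimal
sketch (not part of the patch; the function name is invented) of the
pattern the new comments describe.  The syscall paths in madvise, mlock
and mprotect all take mmap_sem for writing before reaching this point,
and that is what protects vm_flags:

#include <linux/sched.h>
#include <linux/mm.h>

/* Illustrative only: updating vm_flags under the documented rule. */
static void example_set_vm_flags(struct vm_area_struct *vma,
				 unsigned long newflags)
{
	struct mm_struct *mm = vma->vm_mm;

	down_write(&mm->mmap_sem);
	/*
	 * vm_flags is protected by the mmap_sem held in write mode:
	 * no need to take mm->page_table_lock around the store.
	 */
	vma->vm_flags = newflags;
	up_write(&mm->mmap_sem);
}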

 mm/madvise.c  |    5 +++--
 mm/mlock.c    |    9 ++++++---
 mm/mprotect.c |    6 ++++--
 3 files changed, 13 insertions(+), 7 deletions(-)

--- rmap33/mm/madvise.c	2004-05-16 11:39:26.000000000 +0100
+++ rmap34/mm/madvise.c	2004-05-18 20:50:32.613664464 +0100
@@ -31,7 +31,9 @@ static long madvise_behavior(struct vm_a
 			return -EAGAIN;
 	}
 
-	spin_lock(&mm->page_table_lock);
+	/*
+	 * vm_flags is protected by the mmap_sem held in write mode.
+	 */
 	VM_ClearReadHint(vma);
 
 	switch (behavior) {
@@ -44,7 +46,6 @@ static long madvise_behavior(struct vm_a
 	default:
 		break;
 	}
-	spin_unlock(&mm->page_table_lock);
 
 	return 0;
 }
--- rmap33/mm/mlock.c	2004-05-16 11:39:25.000000000 +0100
+++ rmap34/mm/mlock.c	2004-05-18 20:50:32.613664464 +0100
@@ -32,10 +32,13 @@ static int mlock_fixup(struct vm_area_st
 			goto out;
 		}
 	}
-	
-	spin_lock(&mm->page_table_lock);
+
+	/*
+	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * It's okay if try_to_unmap_one unmaps a page just after we
+	 * set VM_LOCKED, make_pages_present below will bring it back.
+	 */
 	vma->vm_flags = newflags;
-	spin_unlock(&mm->page_table_lock);
 
 	/*
 	 * Keep track of amount of locked VM.
--- rmap33/mm/mprotect.c	2004-05-16 11:39:24.000000000 +0100
+++ rmap34/mm/mprotect.c	2004-05-18 20:50:32.614664312 +0100
@@ -212,10 +212,12 @@ mprotect_fixup(struct vm_area_struct *vm
 			goto fail;
 	}
 
-	spin_lock(&mm->page_table_lock);
+	/*
+	 * vm_flags and vm_page_prot are protected by the mmap_sem
+	 * held in write mode.
+	 */
 	vma->vm_flags = newflags;
 	vma->vm_page_prot = newprot;
-	spin_unlock(&mm->page_table_lock);
 success:
 	change_protection(vma, start, end, newprot);
 	return 0;



* [PATCH] rmap 35 mmap.c cleanups
From: Hugh Dickins @ 2004-05-18 22:07 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Before some real vma_merge work in mmap.c in the next patch,
a patch of miscellaneous cleanups to cut down the noise:

- remove rb_parent arg from vma_merge: mm->mmap can do that case
- scatter pgoff_t around to ingratiate myself with the boss
- reorder is_mergeable_vma tests, vm_ops->close is least likely
- can_vma_merge_before takes a combined pgoff+pglen arg (from Andrea)
- rearrange do_mmap_pgoff's ever-confusing anonymous flags switch
- comment do_mmap_pgoff's mysterious (vm_flags & VM_SHARED) test
- fix ISO C90 warning on browse_rb if building with DEBUG_MM_RB
- stop that long MNT_NOEXEC line wrapping

Yes, buried amidst these is indeed one pgoff replaced by
"next->vm_pgoff - pglen" (reverting a mod of mine which took pgoff
supplied by user too seriously in the anon case), and another pgoff
replaced by 0 (reverting anon_vma mod which crept in with NUMA API):
neither of them really matters, except perhaps in /proc/pid/maps.
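
As an aside for reviewers, a small sketch (not part of the patch; the
helper name is invented) of the offset arithmetic behind the
can_vma_merge_after/before tests in their new pgoff+pglen form: for
file-backed vmas, merging is only possible where the file offsets run
on contiguously around [addr,end):

#include <linux/mm.h>

/* Illustrative only: offset contiguity needed to merge around [addr,end). */
static int example_offsets_contiguous(struct vm_area_struct *prev,
				      struct vm_area_struct *next,
				      unsigned long addr, unsigned long end,
				      pgoff_t pgoff)
{
	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
	pgoff_t prev_pglen = (prev->vm_end - prev->vm_start) >> PAGE_SHIFT;

	return	prev->vm_pgoff + prev_pglen == pgoff &&	/* merge after prev */
		next->vm_pgoff == pgoff + pglen;	/* merge before next */
}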

 include/linux/mm.h |    2 -
 mm/mmap.c          |   90 +++++++++++++++++++++++++++--------------------------
 2 files changed, 47 insertions(+), 45 deletions(-)

--- rmap34/include/linux/mm.h	2004-05-16 11:39:23.000000000 +0100
+++ rmap35/include/linux/mm.h	2004-05-18 20:50:45.788661560 +0100
@@ -612,7 +612,7 @@ extern void insert_vm_struct(struct mm_s
 extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
 	struct rb_node **, struct rb_node *);
 extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
-	unsigned long addr, unsigned long len, unsigned long pgoff);
+	unsigned long addr, unsigned long len, pgoff_t pgoff);
 extern void exit_mmap(struct mm_struct *);
 
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
--- rmap34/mm/mmap.c	2004-05-16 11:39:25.000000000 +0100
+++ rmap35/mm/mmap.c	2004-05-18 20:50:45.816657304 +0100
@@ -153,10 +153,10 @@ out:
 }
 
 #ifdef DEBUG_MM_RB
-static int browse_rb(struct rb_root *root) {
-	int i, j;
+static int browse_rb(struct rb_root *root)
+{
+	int i = 0, j;
 	struct rb_node *nd, *pn = NULL;
-	i = 0;
 	unsigned long prev = 0, pend = 0;
 
 	for (nd = rb_first(root); nd; nd = rb_next(nd)) {
@@ -180,10 +180,11 @@ static int browse_rb(struct rb_root *roo
 	return i;
 }
 
-void validate_mm(struct mm_struct * mm) {
+void validate_mm(struct mm_struct *mm)
+{
 	int bug = 0;
 	int i = 0;
-	struct vm_area_struct * tmp = mm->mmap;
+	struct vm_area_struct *tmp = mm->mmap;
 	while (tmp) {
 		tmp = tmp->vm_next;
 		i++;
@@ -406,17 +407,17 @@ void vma_adjust(struct vm_area_struct *v
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 			struct file *file, unsigned long vm_flags)
 {
-	if (vma->vm_ops && vma->vm_ops->close)
+	if (vma->vm_flags != vm_flags)
 		return 0;
 	if (vma->vm_file != file)
 		return 0;
-	if (vma->vm_flags != vm_flags)
+	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
 	return 1;
 }
 
 /*
- * Return true if we can merge this (vm_flags,file,vm_pgoff,size)
+ * Return true if we can merge this (vm_flags,file,vm_pgoff)
  * in front of (at a lower virtual address and file offset than) the vma.
  *
  * We don't check here for the merged mmap wrapping around the end of pagecache
@@ -425,12 +426,12 @@ static inline int is_mergeable_vma(struc
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct file *file, unsigned long vm_pgoff, unsigned long size)
+	struct file *file, pgoff_t vm_pgoff)
 {
 	if (is_mergeable_vma(vma, file, vm_flags)) {
 		if (!file)
 			return 1;	/* anon mapping */
-		if (vma->vm_pgoff == vm_pgoff + size)
+		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
 	}
 	return 0;
@@ -442,16 +443,16 @@ can_vma_merge_before(struct vm_area_stru
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct file *file, unsigned long vm_pgoff)
+	struct file *file, pgoff_t vm_pgoff)
 {
 	if (is_mergeable_vma(vma, file, vm_flags)) {
-		unsigned long vma_size;
+		pgoff_t vm_pglen;
 
 		if (!file)
 			return 1;	/* anon mapping */
 
-		vma_size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
-		if (vma->vm_pgoff + vma_size == vm_pgoff)
+		vm_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+		if (vma->vm_pgoff + vm_pglen == vm_pgoff)
 			return 1;
 	}
 	return 0;
@@ -463,12 +464,12 @@ can_vma_merge_after(struct vm_area_struc
  * both (it neatly fills a hole).
  */
 static struct vm_area_struct *vma_merge(struct mm_struct *mm,
-			struct vm_area_struct *prev,
-			struct rb_node *rb_parent, unsigned long addr, 
+			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
-		     	struct file *file, unsigned long pgoff,
+		     	struct file *file, pgoff_t pgoff,
 		        struct mempolicy *policy)
 {
+	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *next;
 
 	/*
@@ -479,7 +480,7 @@ static struct vm_area_struct *vma_merge(
 		return NULL;
 
 	if (!prev) {
-		next = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
+		next = mm->mmap;
 		goto merge_next;
 	}
 	next = prev->vm_next;
@@ -496,7 +497,7 @@ static struct vm_area_struct *vma_merge(
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags, file,
-					pgoff, (end - addr) >> PAGE_SHIFT)) {
+							pgoff+pglen)) {
 			vma_adjust(prev, prev->vm_start,
 				next->vm_end, prev->vm_pgoff, next);
 			if (file)
@@ -510,17 +511,18 @@ static struct vm_area_struct *vma_merge(
 		return prev;
 	}
 
+merge_next:
+
 	/*
 	 * Can this new request be merged in front of next?
 	 */
 	if (next) {
- merge_next:
 		if (end == next->vm_start &&
  				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags, file,
-					pgoff, (end - addr) >> PAGE_SHIFT)) {
-			vma_adjust(next, addr,
-				next->vm_end, pgoff, NULL);
+							pgoff+pglen)) {
+			vma_adjust(next, addr, next->vm_end,
+				next->vm_pgoff - pglen, NULL);
 			return next;
 		}
 	}
@@ -553,7 +555,8 @@ unsigned long do_mmap_pgoff(struct file 
 		if (!file->f_op || !file->f_op->mmap)
 			return -ENODEV;
 
-		if ((prot & PROT_EXEC) && (file->f_vfsmnt->mnt_flags & MNT_NOEXEC))
+		if ((prot & PROT_EXEC) &&
+		    (file->f_vfsmnt->mnt_flags & MNT_NOEXEC))
 			return -EPERM;
 	}
 
@@ -635,15 +638,14 @@ unsigned long do_mmap_pgoff(struct file 
 			return -EINVAL;
 		}
 	} else {
-		vm_flags |= VM_SHARED | VM_MAYSHARE;
 		switch (flags & MAP_TYPE) {
-		default:
-			return -EINVAL;
-		case MAP_PRIVATE:
-			vm_flags &= ~(VM_SHARED | VM_MAYSHARE);
-			/* fall through */
 		case MAP_SHARED:
+			vm_flags |= VM_SHARED | VM_MAYSHARE;
+			break;
+		case MAP_PRIVATE:
 			break;
+		default:
+			return -EINVAL;
 		}
 	}
 
@@ -682,11 +684,14 @@ munmap_back:
 		}
 	}
 
-	/* Can we just expand an old anonymous mapping? */
-	if (!file && !(vm_flags & VM_SHARED) && rb_parent)
-		if (vma_merge(mm, prev, rb_parent, addr, addr + len,
-					vm_flags, NULL, pgoff, NULL))
-			goto out;
+	/*
+	 * Can we just expand an old private anonymous mapping?
+	 * The VM_SHARED test is necessary because shmem_zero_setup
+	 * will create the file object for a shared anonymous map below.
+	 */
+	if (!file && !(vm_flags & VM_SHARED) &&
+	    vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, 0, NULL))
+		goto out;
 
 	/*
 	 * Determine the object being mapped and call the appropriate
@@ -743,10 +748,8 @@ munmap_back:
 	 */
 	addr = vma->vm_start;
 
-	if (!file || !rb_parent || !vma_merge(mm, prev, rb_parent, addr,
-					      vma->vm_end,
-					      vma->vm_flags, file, pgoff,
-					      vma_policy(vma))) {
+	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
+			vma->vm_flags, file, pgoff, vma_policy(vma))) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		if (correct_wcount)
 			atomic_inc(&inode->i_writecount);
@@ -1429,9 +1432,8 @@ unsigned long do_brk(unsigned long addr,
 
 	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
 
-	/* Can we just expand an old anonymous mapping? */
-	if (rb_parent && vma_merge(mm, prev, rb_parent, addr, addr + len,
-					flags, NULL, 0, NULL))
+	/* Can we just expand an old private anonymous mapping? */
+	if (vma_merge(mm, prev, addr, addr + len, flags, NULL, 0, NULL))
 		goto out;
 
 	/*
@@ -1524,7 +1526,7 @@ void insert_vm_struct(struct mm_struct *
  * prior to moving page table entries, to effect an mremap move.
  */
 struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
-	unsigned long addr, unsigned long len, unsigned long pgoff)
+	unsigned long addr, unsigned long len, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma = *vmap;
 	unsigned long vma_start = vma->vm_start;
@@ -1534,7 +1536,7 @@ struct vm_area_struct *copy_vma(struct v
 	struct mempolicy *pol;
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
-	new_vma = vma_merge(mm, prev, rb_parent, addr, addr + len,
+	new_vma = vma_merge(mm, prev, addr, addr + len,
 			vma->vm_flags, vma->vm_file, pgoff, vma_policy(vma));
 	if (new_vma) {
 		/*



* [PATCH] rmap 36 mprotect use vma_merge
From: Hugh Dickins @ 2004-05-18 22:07 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Earlier on, in 2.6.6, we took the vma merging code out of mremap.c
and let it rely on vma_merge instead (via copy_vma).  Now take the
vma merging code out of mprotect.c and let it rely on vma_merge too:
so vma_merge becomes the sole vma merging engine.  The fruit of this
consolidation is that mprotect now merges file-backed vmas naturally.
Make this change now because anon_vma will complicate the vma merging
rules; let's keep them all in one place.

vma_merge remains where the decisions are made, whether to merge with
prev and/or next; but now [addr,end) may be the latter part of prev, or
first part or whole of next, whereas before it was always a new area.

vma_adjust carries out vma_merge's decision, but when sliding the
boundary between vma and next, must temporarily remove next from the
prio_tree too.  And it turned out (by oops) to have a surer idea of
whether next needs to be removed than vma_merge, so the fput and
freeing moves into vma_adjust.

Too much decipherment of what's going on at the start of vma_adjust?
Yes, and there's a delicate assumption that you may use vma_adjust in
sliding a boundary, or splitting in two, or growing a vma (mremap uses
it in that way), but not for simply shrinking a vma.  Which is so, and
must be so (how could pages mapped in the part to go be zapped without
first splitting?), but it would feel better with some protection.

__vma_unlink can then be moved from mm.h to mmap.c, and mm.h's
can_vma_merge, which was more misleading than helpful, is deleted.
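
To make the new calling convention concrete, a sketch (not from the
patch; the helper name is invented) of the pgoff which mprotect_fixup
now hands to vma_merge: since [start,end) is already mapped, the
relevant offset is start's offset within the existing vma, so the usual
contiguity checks in can_vma_merge_after/before apply unchanged:

#include <linux/mm.h>

/* Illustrative only: file offset of start within an existing vma. */
static pgoff_t example_pgoff_of(struct vm_area_struct *vma,
				unsigned long start)
{
	return vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
}

For anonymous vmas the offset is ignored anyway: can_vma_merge_after
and can_vma_merge_before succeed for any !file vma with matching flags.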

 include/linux/mm.h |   29 +-------
 mm/mmap.c          |  172 +++++++++++++++++++++++++++++++++++++++--------------
 mm/mprotect.c      |  110 ++++++++-------------------------
 3 files changed, 159 insertions(+), 152 deletions(-)

--- rmap35/include/linux/mm.h	2004-05-18 20:50:45.788661560 +0100
+++ rmap36/include/linux/mm.h	2004-05-18 20:50:58.875672032 +0100
@@ -607,7 +607,12 @@ struct vm_area_struct *vma_prio_tree_nex
 
 /* mmap.c */
 extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *next);
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
+extern struct vm_area_struct *vma_merge(struct mm_struct *,
+	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+	unsigned long vm_flags, struct file *, pgoff_t, struct mempolicy *);
+extern int split_vma(struct mm_struct *,
+	struct vm_area_struct *, unsigned long addr, int new_below);
 extern void insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
 extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
 	struct rb_node **, struct rb_node *);
@@ -638,26 +643,6 @@ extern int do_munmap(struct mm_struct *,
 
 extern unsigned long do_brk(unsigned long, unsigned long);
 
-static inline void
-__vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
-		struct vm_area_struct *prev)
-{
-	prev->vm_next = vma->vm_next;
-	rb_erase(&vma->vm_rb, &mm->mm_rb);
-	if (mm->mmap_cache == vma)
-		mm->mmap_cache = prev;
-}
-
-static inline int
-can_vma_merge(struct vm_area_struct *vma, unsigned long vm_flags)
-{
-#ifdef CONFIG_MMU
-	if (!vma->vm_file && vma->vm_flags == vm_flags)
-		return 1;
-#endif
-	return 0;
-}
-
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
@@ -691,8 +676,6 @@ extern int expand_stack(struct vm_area_s
 extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
 					     struct vm_area_struct **pprev);
-extern int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
-		     unsigned long addr, int new_below);
 
 /* Look up the first VMA which intersects the interval start_addr..end_addr-1,
    NULL if none.  Assume start_addr < end_addr. */
--- rmap35/mm/mmap.c	2004-05-18 20:50:45.816657304 +0100
+++ rmap36/mm/mmap.c	2004-05-18 20:50:58.878671576 +0100
@@ -338,20 +338,51 @@ __insert_vm_struct(struct mm_struct * mm
 	validate_mm(mm);
 }
 
+static inline void
+__vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
+		struct vm_area_struct *prev)
+{
+	prev->vm_next = vma->vm_next;
+	rb_erase(&vma->vm_rb, &mm->mm_rb);
+	if (mm->mmap_cache == vma)
+		mm->mmap_cache = prev;
+}
+
 /*
  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
  * is already present in an i_mmap tree without adjusting the tree.
  * The following helper function should be used when such adjustments
- * are necessary.  The "next" vma (if any) is to be removed or inserted
+ * are necessary.  The "insert" vma (if any) is to be inserted
  * before we drop the necessary locks.
  */
 void vma_adjust(struct vm_area_struct *vma, unsigned long start,
-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *next)
+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *next = vma->vm_next;
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
+	long adjust_next = 0;
+	int remove_next = 0;
+
+	if (next && !insert) {
+		if (end >= next->vm_end) {
+again:			remove_next = 1 + (end > next->vm_end);
+			end = next->vm_end;
+		} else if (end < vma->vm_end || end > next->vm_start) {
+			/*
+			 * vma shrinks, and !insert tells it's not
+			 * split_vma inserting another: so it must
+			 * be mprotect shifting the boundary down.
+			 *   Or:
+			 * vma expands, overlapping part of the next:
+			 * must be mprotect shifting the boundary up.
+			 */
+			BUG_ON(vma->vm_end != next->vm_start);
+			adjust_next = end - next->vm_start;
+		}
+	}
 
 	if (file) {
 		mapping = file->f_mapping;
@@ -364,38 +395,67 @@ void vma_adjust(struct vm_area_struct *v
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
 		vma_prio_tree_remove(vma, root);
+		if (adjust_next)
+			vma_prio_tree_remove(next, root);
 	}
+
 	vma->vm_start = start;
 	vma->vm_end = end;
 	vma->vm_pgoff = pgoff;
+	if (adjust_next) {
+		next->vm_start += adjust_next;
+		next->vm_pgoff += adjust_next >> PAGE_SHIFT;
+	}
+
 	if (root) {
+		if (adjust_next) {
+			vma_prio_tree_init(next);
+			vma_prio_tree_insert(next, root);
+		}
 		vma_prio_tree_init(vma);
 		vma_prio_tree_insert(vma, root);
 		flush_dcache_mmap_unlock(mapping);
 	}
 
-	if (next) {
-		if (next == vma->vm_next) {
-			/*
-			 * vma_merge has merged next into vma, and needs
-			 * us to remove next before dropping the locks.
-			 */
-			__vma_unlink(mm, next, vma);
-			if (file)
-				__remove_shared_vm_struct(next, file, mapping);
-		} else {
-			/*
-			 * split_vma has split next from vma, and needs
-			 * us to insert next before dropping the locks
-			 * (next may either follow vma or precede it).
-			 */
-			__insert_vm_struct(mm, next);
-		}
+	if (remove_next) {
+		/*
+		 * vma_merge has merged next into vma, and needs
+		 * us to remove next before dropping the locks.
+		 */
+		__vma_unlink(mm, next, vma);
+		if (file)
+			__remove_shared_vm_struct(next, file, mapping);
+	} else if (insert) {
+		/*
+		 * split_vma has split insert from vma, and needs
+		 * us to insert it before dropping the locks
+		 * (it may either follow vma or precede it).
+		 */
+		__insert_vm_struct(mm, insert);
 	}
 
 	spin_unlock(&mm->page_table_lock);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
+
+	if (remove_next) {
+		if (file)
+			fput(file);
+		mm->map_count--;
+		mpol_free(vma_policy(next));
+		kmem_cache_free(vm_area_cachep, next);
+		/*
+		 * In mprotect's case 6 (see comments on vma_merge),
+		 * we must remove another next too. It would clutter
+		 * up the code too much to do both in one go.
+		 */
+		if (remove_next == 2) {
+			next = vma->vm_next;
+			goto again;
+		}
+	}
+
+	validate_mm(mm);
 }
 
 /*
@@ -459,18 +519,42 @@ can_vma_merge_after(struct vm_area_struc
 }
 
 /*
- * Given a new mapping request (addr,end,vm_flags,file,pgoff), figure out
- * whether that can be merged with its predecessor or its successor.  Or
- * both (it neatly fills a hole).
+ * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
+ * whether that can be merged with its predecessor or its successor.
+ * Or both (it neatly fills a hole).
+ *
+ * In most cases - when called for mmap, brk or mremap - [addr,end) is
+ * certain not to be mapped by the time vma_merge is called; but when
+ * called for mprotect, it is certain to be already mapped (either at
+ * an offset within prev, or at the start of next), and the flags of
+ * this area are about to be changed to vm_flags - and the no-change
+ * case has already been eliminated.
+ *
+ * The following mprotect cases have to be considered, where AAAA is
+ * the area passed down from mprotect_fixup, never extending beyond one
+ * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after:
+ *
+ *     AAAA             AAAA                AAAA          AAAA
+ *    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPPPNNNNNN    PPPPNNNNXXXX
+ *    cannot merge    might become    might become    might become
+ *                    PPNNNNNNNNNN    PPPPPPPPPPNN    PPPPPPPPPPPP 6 or
+ *    mmap, brk or    case 4 below    case 5 below    PPPPPPPPXXXX 7 or
+ *    mremap move:                                    PPPPNNNNNNNN 8
+ *        AAAA
+ *    PPPP    NNNN    PPPPPPPPPPPP    PPPPPPPPNNNN    PPPPNNNNNNNN
+ *    might become    case 1 below    case 2 below    case 3 below
+ *
+ * Odd one out? Case 8, because it extends NNNN but needs flags of XXXX:
+ * mprotect_fixup updates vm_flags & vm_page_prot on successful return.
  */
-static struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 		     	struct file *file, pgoff_t pgoff,
 		        struct mempolicy *policy)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
-	struct vm_area_struct *next;
+	struct vm_area_struct *area, *next;
 
 	/*
 	 * We later require that vma->vm_flags == vm_flags,
@@ -479,16 +563,18 @@ static struct vm_area_struct *vma_merge(
 	if (vm_flags & VM_SPECIAL)
 		return NULL;
 
-	if (!prev) {
+	if (prev)
+		next = prev->vm_next;
+	else
 		next = mm->mmap;
-		goto merge_next;
-	}
-	next = prev->vm_next;
+	area = next;
+	if (next && next->vm_end == end)		/* cases 6, 7, 8 */
+		next = next->vm_next;
 
 	/*
 	 * Can it merge with the predecessor?
 	 */
-	if (prev->vm_end == addr &&
+	if (prev && prev->vm_end == addr &&
   			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags, file, pgoff)) {
 		/*
@@ -498,33 +584,29 @@ static struct vm_area_struct *vma_merge(
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags, file,
 							pgoff+pglen)) {
+							/* cases 1, 6 */
 			vma_adjust(prev, prev->vm_start,
-				next->vm_end, prev->vm_pgoff, next);
-			if (file)
-				fput(file);
-			mm->map_count--;
-			mpol_free(vma_policy(next));
-			kmem_cache_free(vm_area_cachep, next);
-		} else
+				next->vm_end, prev->vm_pgoff, NULL);
+		} else					/* cases 2, 5, 7 */
 			vma_adjust(prev, prev->vm_start,
 				end, prev->vm_pgoff, NULL);
 		return prev;
 	}
 
-merge_next:
-
 	/*
 	 * Can this new request be merged in front of next?
 	 */
-	if (next) {
-		if (end == next->vm_start &&
- 				mpol_equal(policy, vma_policy(next)) &&
-				can_vma_merge_before(next, vm_flags, file,
+	if (next && end == next->vm_start &&
+ 			mpol_equal(policy, vma_policy(next)) &&
+			can_vma_merge_before(next, vm_flags, file,
 							pgoff+pglen)) {
-			vma_adjust(next, addr, next->vm_end,
+		if (prev && addr < prev->vm_end)	/* case 4 */
+			vma_adjust(prev, prev->vm_start,
+				addr, prev->vm_pgoff, NULL);
+		else					/* cases 3, 8 */
+			vma_adjust(area, addr, next->vm_end,
 				next->vm_pgoff - pglen, NULL);
-			return next;
-		}
+		return area;
 	}
 
 	return NULL;
--- rmap35/mm/mprotect.c	2004-05-18 20:50:32.614664312 +0100
+++ rmap36/mm/mprotect.c	2004-05-18 20:50:58.879671424 +0100
@@ -106,53 +106,6 @@ change_protection(struct vm_area_struct 
 	spin_unlock(&current->mm->page_table_lock);
 	return;
 }
-/*
- * Try to merge a vma with the previous flag, return 1 if successful or 0 if it
- * was impossible.
- */
-static int
-mprotect_attempt_merge(struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long end, int newflags)
-{
-	struct mm_struct * mm;
-
-	if (!prev || !vma)
-		return 0;
-	mm = vma->vm_mm;
-	if (prev->vm_end != vma->vm_start)
-		return 0;
-	if (!can_vma_merge(prev, newflags))
-		return 0;
-	if (vma->vm_file || (vma->vm_flags & VM_SHARED))
-		return 0;
-	if (!vma_mpol_equal(vma, prev))
-		return 0;
-
-	/*
-	 * If the whole area changes to the protection of the previous one
-	 * we can just get rid of it.
-	 */
-	if (end == vma->vm_end) {
-		spin_lock(&mm->page_table_lock);
-		prev->vm_end = end;
-		__vma_unlink(mm, vma, prev);
-		spin_unlock(&mm->page_table_lock);
-
-		mpol_free(vma_policy(vma));
-		kmem_cache_free(vm_area_cachep, vma);
-		mm->map_count--;
-		return 1;
-	} 
-
-	/*
-	 * Otherwise extend it.
-	 */
-	spin_lock(&mm->page_table_lock);
-	prev->vm_end = end;
-	vma->vm_start = end;
-	spin_unlock(&mm->page_table_lock);
-	return 1;
-}
 
 static int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
@@ -161,6 +114,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	struct mm_struct * mm = vma->vm_mm;
 	unsigned long charged = 0;
 	pgprot_t newprot;
+	pgoff_t pgoff;
 	int error;
 
 	if (newflags == vma->vm_flags) {
@@ -187,15 +141,18 @@ mprotect_fixup(struct vm_area_struct *vm
 
 	newprot = protection_map[newflags & 0xf];
 
-	if (start == vma->vm_start) {
-		/*
-		 * Try to merge with the previous vma.
-		 */
-		if (mprotect_attempt_merge(vma, *pprev, end, newflags)) {
-			vma = *pprev;
-			goto success;
-		}
-	} else {
+	/*
+	 * First try to merge with previous and/or next vma.
+	 */
+	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+	*pprev = vma_merge(mm, *pprev, start, end, newflags,
+				vma->vm_file, pgoff, vma_policy(vma));
+	if (*pprev) {
+		vma = *pprev;
+		goto success;
+	}
+
+	if (start != vma->vm_start) {
 		error = split_vma(mm, vma, start, 1);
 		if (error)
 			goto fail;
@@ -212,13 +169,13 @@ mprotect_fixup(struct vm_area_struct *vm
 			goto fail;
 	}
 
+success:
 	/*
 	 * vm_flags and vm_page_prot are protected by the mmap_sem
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
 	vma->vm_page_prot = newprot;
-success:
 	change_protection(vma, start, end, newprot);
 	return 0;
 
@@ -231,7 +188,7 @@ asmlinkage long
 sys_mprotect(unsigned long start, size_t len, unsigned long prot)
 {
 	unsigned long vm_flags, nstart, end, tmp;
-	struct vm_area_struct * vma, * next, * prev;
+	struct vm_area_struct *vma, *prev;
 	int error = -EINVAL;
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
 	prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
@@ -275,10 +232,11 @@ sys_mprotect(unsigned long start, size_t
 				goto out;
 		}
 	}
+	if (start > vma->vm_start)
+		prev = vma;
 
 	for (nstart = start ; ; ) {
 		unsigned int newflags;
-		int last = 0;
 
 		/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
 
@@ -298,41 +256,25 @@ sys_mprotect(unsigned long start, size_t
 		if (error)
 			goto out;
 
-		if (vma->vm_end > end) {
-			error = mprotect_fixup(vma, &prev, nstart, end, newflags);
-			goto out;
-		}
-		if (vma->vm_end == end)
-			last = 1;
-
 		tmp = vma->vm_end;
-		next = vma->vm_next;
+		if (tmp > end)
+			tmp = end;
 		error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
 		if (error)
 			goto out;
-		if (last)
-			break;
 		nstart = tmp;
-		vma = next;
+
+		if (nstart < prev->vm_end)
+			nstart = prev->vm_end;
+		if (nstart >= end)
+			goto out;
+
+		vma = prev->vm_next;
 		if (!vma || vma->vm_start != nstart) {
 			error = -ENOMEM;
 			goto out;
 		}
 	}
-
-	if (next && prev->vm_end == next->vm_start &&
-			can_vma_merge(next, prev->vm_flags) &&
-	    	vma_mpol_equal(prev, next) &&
-			!prev->vm_file && !(prev->vm_flags & VM_SHARED)) {
-		spin_lock(&prev->vm_mm->page_table_lock);
-		prev->vm_end = next->vm_end;
-		__vma_unlink(prev->vm_mm, next, prev);
-		spin_unlock(&prev->vm_mm->page_table_lock);
-
-		mpol_free(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
-		prev->vm_mm->map_count--;
-	}
 out:
 	up_write(&current->mm->mmap_sem);
 	return error;



* [PATCH] rmap 37 page_add_anon_rmap vma
From: Hugh Dickins @ 2004-05-18 22:08 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg
to vma arg like anon_vma rmap, to smooth the transition between them.

 fs/exec.c            |    2 +-
 include/linux/rmap.h |    5 ++++-
 mm/memory.c          |    8 ++++----
 mm/rmap.c            |    6 +++---
 mm/swapfile.c        |    2 +-
 5 files changed, 13 insertions(+), 10 deletions(-)

--- rmap36/fs/exec.c	2004-05-16 11:39:24.000000000 +0100
+++ rmap37/fs/exec.c	2004-05-18 20:51:12.074665480 +0100
@@ -320,7 +320,7 @@ void install_arg_page(struct vm_area_str
 	lru_cache_add_active(page);
 	set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
-	page_add_anon_rmap(page, mm, address);
+	page_add_anon_rmap(page, vma, address);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);
 
--- rmap36/include/linux/rmap.h	2004-05-16 11:39:23.000000000 +0100
+++ rmap37/include/linux/rmap.h	2004-05-18 20:51:12.099661680 +0100
@@ -14,7 +14,10 @@
 
 #ifdef CONFIG_MMU
 
-void page_add_anon_rmap(struct page *, struct mm_struct *, unsigned long);
+/*
+ * rmap interfaces called when adding or removing pte of page
+ */
+void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
 void page_add_file_rmap(struct page *);
 void page_remove_rmap(struct page *);
 
--- rmap36/mm/memory.c	2004-05-16 11:39:23.000000000 +0100
+++ rmap37/mm/memory.c	2004-05-18 20:51:12.101661376 +0100
@@ -1099,7 +1099,7 @@ static int do_wp_page(struct mm_struct *
 			page_remove_rmap(old_page);
 		break_cow(vma, new_page, address, page_table);
 		lru_cache_add_active(new_page);
-		page_add_anon_rmap(new_page, mm, address);
+		page_add_anon_rmap(new_page, vma, address);
 
 		/* Free the old page.. */
 		new_page = old_page;
@@ -1377,7 +1377,7 @@ static int do_swap_page(struct mm_struct
 
 	flush_icache_page(vma, page);
 	set_pte(page_table, pte);
-	page_add_anon_rmap(page, mm, address);
+	page_add_anon_rmap(page, vma, address);
 
 	if (write_access || mremap_moved_anon_rmap(page, address)) {
 		if (do_wp_page(mm, vma, address,
@@ -1436,7 +1436,7 @@ do_anonymous_page(struct mm_struct *mm, 
 				      vma);
 		lru_cache_add_active(page);
 		mark_page_accessed(page);
-		page_add_anon_rmap(page, mm, addr);
+		page_add_anon_rmap(page, vma, addr);
 	}
 
 	set_pte(page_table, entry);
@@ -1543,7 +1543,7 @@ retry:
 		set_pte(page_table, entry);
 		if (anon) {
 			lru_cache_add_active(new_page);
-			page_add_anon_rmap(new_page, mm, address);
+			page_add_anon_rmap(new_page, vma, address);
 		} else
 			page_add_file_rmap(new_page);
 		pte_unmap(page_table);
--- rmap36/mm/rmap.c	2004-05-16 11:39:23.000000000 +0100
+++ rmap37/mm/rmap.c	2004-05-18 20:51:12.103661072 +0100
@@ -365,15 +365,15 @@ int page_referenced(struct page *page)
 /**
  * page_add_anon_rmap - add pte mapping to an anonymous page
  * @page:	the page to add the mapping to
- * @mm:		the mm in which the mapping is added
+ * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
  *
  * The caller needs to hold the mm->page_table_lock.
  */
 void page_add_anon_rmap(struct page *page,
-	struct mm_struct *mm, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anonmm *anonmm = mm->anonmm;
+	struct anonmm *anonmm = vma->vm_mm->anonmm;
 
 	BUG_ON(PageReserved(page));
 
--- rmap36/mm/swapfile.c	2004-05-16 11:39:22.000000000 +0100
+++ rmap37/mm/swapfile.c	2004-05-18 20:51:12.104660920 +0100
@@ -433,7 +433,7 @@ unuse_pte(struct vm_area_struct *vma, un
 	vma->vm_mm->rss++;
 	get_page(page);
 	set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	page_add_anon_rmap(page, vma->vm_mm, address);
+	page_add_anon_rmap(page, vma, address);
 	swap_free(entry);
 }
 



* [PATCH] rmap 38 remove anonmm rmap
From: Hugh Dickins @ 2004-05-18 22:10 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Before moving on to anon_vma rmap, remove now what's peculiar to anonmm
rmap: the anonmm handling and the mremap move COWs.  Temporarily reduce
page_referenced_anon and try_to_unmap_anon to stubs, so a kernel built
with this patch will not swap anonymous memory at all.

 include/linux/page-flags.h |    2 
 include/linux/rmap.h       |   53 ---------
 include/linux/sched.h      |    1 
 kernel/fork.c              |   13 --
 mm/memory.c                |    2 
 mm/mremap.c                |   51 +--------
 mm/rmap.c                  |  250 ---------------------------------------------
 mm/swapfile.c              |    9 -
 8 files changed, 18 insertions(+), 363 deletions(-)

--- rmap37/include/linux/page-flags.h	2004-05-16 11:39:21.000000000 +0100
+++ rmap38/include/linux/page-flags.h	2004-05-18 20:51:27.663295648 +0100
@@ -76,7 +76,7 @@
 #define PG_reclaim		18	/* To be reclaimed asap */
 #define PG_compound		19	/* Part of a compound page */
 
-#define PG_anon			20	/* Anonymous page: anonmm in mapping */
+#define PG_anon			20	/* Anonymous: anon_vma in mapping */
 
 
 /*
--- rmap37/include/linux/rmap.h	2004-05-18 20:51:12.099661680 +0100
+++ rmap38/include/linux/rmap.h	2004-05-18 20:51:27.664295496 +0100
@@ -35,54 +35,6 @@ static inline void page_dup_rmap(struct 
 	page_map_unlock(page);
 }
 
-int mremap_move_anon_rmap(struct page *page, unsigned long addr);
-
-/**
- * mremap_moved_anon_rmap - does new address clash with that noted?
- * @page:	the page just brought back in from swap
- * @addr:	the user virtual address at which it is mapped
- *
- * Returns boolean, true if addr clashes with address already in page.
- *
- * For do_swap_page and unuse_pte: anonmm rmap cannot find the page if
- * it's at different addresses in different mms, so caller must take a
- * copy of the page to avoid that: not very clever, but too rare a case
- * to merit cleverness.
- */
-static inline int mremap_moved_anon_rmap(struct page *page, unsigned long addr)
-{
-	return page->index != (addr & PAGE_MASK);
-}
-
-/**
- * make_page_exclusive - try to make page exclusive to one mm
- * @vma		the vm_area_struct covering this address
- * @addr	the user virtual address of the page in question
- *
- * Assumes that the page at this address is anonymous (COWable),
- * and that the caller holds mmap_sem for reading or for writing.
- *
- * For mremap's move_page_tables and for swapoff's unuse_process:
- * not a general purpose routine, and in general may not succeed.
- * But move_page_tables loops until it succeeds, and unuse_process
- * holds the original page locked, which protects against races.
- */
-static inline int make_page_exclusive(struct vm_area_struct *vma,
-					unsigned long addr)
-{
-	if (handle_mm_fault(vma->vm_mm, vma, addr, 1) != VM_FAULT_OOM)
-		return 0;
-	return -ENOMEM;
-}
-
-/*
- * Called from kernel/fork.c to manage anonymous memory
- */
-void init_rmap(void);
-int exec_rmap(struct mm_struct *);
-int dup_rmap(struct mm_struct *, struct mm_struct *oldmm);
-void exit_rmap(struct mm_struct *);
-
 /*
  * Called from mm/vmscan.c to handle paging out
  */
@@ -91,11 +43,6 @@ int try_to_unmap(struct page *);
 
 #else	/* !CONFIG_MMU */
 
-#define init_rmap()		do {} while (0)
-#define exec_rmap(mm)		(0)
-#define dup_rmap(mm, oldmm)	(0)
-#define exit_rmap(mm)		do {} while (0)
-
 #define page_referenced(page)	TestClearPageReferenced(page)
 #define try_to_unmap(page)	SWAP_FAIL
 
--- rmap37/include/linux/sched.h	2004-05-16 11:39:26.000000000 +0100
+++ rmap38/include/linux/sched.h	2004-05-18 20:51:27.665295344 +0100
@@ -215,7 +215,6 @@ struct mm_struct {
 						 * together off init_mm.mmlist, and are protected
 						 * by mmlist_lock
 						 */
-	struct anonmm *anonmm;			/* For rmap to track anon mem */
 
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
--- rmap37/kernel/fork.c	2004-05-16 11:39:22.000000000 +0100
+++ rmap38/kernel/fork.c	2004-05-18 20:51:27.667295040 +0100
@@ -432,11 +432,6 @@ struct mm_struct * mm_alloc(void)
 	if (mm) {
 		memset(mm, 0, sizeof(*mm));
 		mm = mm_init(mm);
-		if (mm && exec_rmap(mm)) {
-			mm_free_pgd(mm);
-			free_mm(mm);
-			mm = NULL;
-		}
 	}
 	return mm;
 }
@@ -465,7 +460,6 @@ void mmput(struct mm_struct *mm)
 		spin_unlock(&mmlist_lock);
 		exit_aio(mm);
 		exit_mmap(mm);
-		exit_rmap(mm);
 		mmdrop(mm);
 	}
 }
@@ -569,12 +563,6 @@ static int copy_mm(unsigned long clone_f
 	if (!mm_init(mm))
 		goto fail_nomem;
 
-	if (dup_rmap(mm, oldmm)) {
-		mm_free_pgd(mm);
-		free_mm(mm);
-		goto fail_nomem;
-	}
-
 	if (init_new_context(tsk,mm))
 		goto fail_nocontext;
 
@@ -1298,5 +1286,4 @@ void __init proc_caches_init(void)
 	mm_cachep = kmem_cache_create("mm_struct",
 			sizeof(struct mm_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
-	init_rmap();
 }
--- rmap37/mm/memory.c	2004-05-18 20:51:12.101661376 +0100
+++ rmap38/mm/memory.c	2004-05-18 20:51:27.703289568 +0100
@@ -1379,7 +1379,7 @@ static int do_swap_page(struct mm_struct
 	set_pte(page_table, pte);
 	page_add_anon_rmap(page, vma, address);
 
-	if (write_access || mremap_moved_anon_rmap(page, address)) {
+	if (write_access) {
 		if (do_wp_page(mm, vma, address,
 				page_table, pmd, pte) == VM_FAULT_OOM)
 			ret = VM_FAULT_OOM;
--- rmap37/mm/mremap.c	2004-05-16 11:39:24.000000000 +0100
+++ rmap38/mm/mremap.c	2004-05-18 20:51:27.704289416 +0100
@@ -15,7 +15,6 @@
 #include <linux/swap.h>
 #include <linux/fs.h>
 #include <linux/highmem.h>
-#include <linux/rmap.h>
 #include <linux/security.h>
 
 #include <asm/uaccess.h>
@@ -80,21 +79,6 @@ static inline pte_t *alloc_one_pte_map(s
 	return pte;
 }
 
-static inline int
-can_move_one_pte(pte_t *src, unsigned long new_addr)
-{
-	int move = 1;
-	if (pte_present(*src)) {
-		unsigned long pfn = pte_pfn(*src);
-		if (pfn_valid(pfn)) {
-			struct page *page = pfn_to_page(pfn);
-			if (PageAnon(page))
-				move = mremap_move_anon_rmap(page, new_addr);
-		}
-	}
-	return move;
-}
-
 static int
 move_one_page(struct vm_area_struct *vma, unsigned long old_addr,
 		unsigned long new_addr)
@@ -141,15 +125,12 @@ move_one_page(struct vm_area_struct *vma
 		 * page_table_lock, we should re-check the src entry...
 		 */
 		if (src) {
-			if (!dst)
-				error = -ENOMEM;
-			else if (!can_move_one_pte(src, new_addr))
-				error = -EAGAIN;
-			else {
+			if (dst) {
 				pte_t pte;
 				pte = ptep_clear_flush(vma, old_addr, src);
 				set_pte(dst, pte);
-			}
+			} else
+				error = -ENOMEM;
 			pte_unmap_nested(src);
 		}
 		if (dst)
@@ -163,7 +144,7 @@ move_one_page(struct vm_area_struct *vma
 
 static unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long new_addr, unsigned long old_addr,
-		unsigned long len, int *cows)
+		unsigned long len)
 {
 	unsigned long offset;
 
@@ -175,21 +156,7 @@ static unsigned long move_page_tables(st
 	 * only a few pages.. This also makes error recovery easier.
 	 */
 	for (offset = 0; offset < len; offset += PAGE_SIZE) {
-		int ret = move_one_page(vma, old_addr+offset, new_addr+offset);
-		/*
-		 * The anonmm objrmap can only track anon page movements
-		 * if the page is exclusive to one mm.  In the rare case
-		 * when mremap move is applied to a shared page, break
-		 * COW (take a copy of the page) to make it exclusive.
-		 * If shared while on swap, page will be copied when
-		 * brought back in (if it's still shared by then).
-		 */
-		if (ret == -EAGAIN) {
-			ret = make_page_exclusive(vma, old_addr+offset);
-			offset -= PAGE_SIZE;
-			(*cows)++;
-		}
-		if (ret)
+		if (move_one_page(vma, old_addr+offset, new_addr+offset) < 0)
 			break;
 		cond_resched();
 	}
@@ -207,7 +174,6 @@ static unsigned long move_vma(struct vm_
 	unsigned long moved_len;
 	unsigned long excess = 0;
 	int split = 0;
-	int cows = 0;
 
 	/*
 	 * We'd prefer to avoid failure later on in do_munmap:
@@ -221,22 +187,19 @@ static unsigned long move_vma(struct vm_
 	if (!new_vma)
 		return -ENOMEM;
 
-	moved_len = move_page_tables(vma, new_addr, old_addr, old_len, &cows);
+	moved_len = move_page_tables(vma, new_addr, old_addr, old_len);
 	if (moved_len < old_len) {
 		/*
 		 * On error, move entries back from new area to old,
 		 * which will succeed since page tables still there,
 		 * and then proceed to unmap new area instead of old.
 		 */
-		move_page_tables(new_vma, old_addr, new_addr, moved_len, &cows);
+		move_page_tables(new_vma, old_addr, new_addr, moved_len);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
 		new_addr = -ENOMEM;
 	}
-	if (cows)	/* Downgrade or remove this message later */
-		printk(KERN_WARNING "%s: mremap moved %d cows\n",
-							current->comm, cows);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
--- rmap37/mm/rmap.c	2004-05-18 20:51:12.103661072 +0100
+++ rmap38/mm/rmap.c	2004-05-18 20:51:27.706289112 +0100
@@ -27,125 +27,11 @@
 
 #include <asm/tlbflush.h>
 
-/*
- * struct anonmm: to track a bundle of anonymous memory mappings.
- *
- * Could be embedded in mm_struct, but mm_struct is rather heavyweight,
- * and we may need the anonmm to stay around long after the mm_struct
- * and its pgd have been freed: because pages originally faulted into
- * that mm have been duped into forked mms, and still need tracking.
- */
-struct anonmm {
-	atomic_t	 count;	/* ref count, including 1 per page */
-	spinlock_t	 lock;	/* head's locks list; others unused */
-	struct mm_struct *mm;	/* assoc mm_struct, NULL when gone */
-	struct anonmm	 *head;	/* exec starts new chain from head */
-	struct list_head list;	/* chain of associated anonmms */
-};
-static kmem_cache_t *anonmm_cachep;
-
-/**
- ** Functions for creating and destroying struct anonmm.
- **/
-
-void __init init_rmap(void)
-{
-	anonmm_cachep = kmem_cache_create("anonmm",
-			sizeof(struct anonmm), 0, SLAB_PANIC, NULL, NULL);
-}
-
-int exec_rmap(struct mm_struct *mm)
-{
-	struct anonmm *anonmm;
-
-	anonmm = kmem_cache_alloc(anonmm_cachep, SLAB_KERNEL);
-	if (unlikely(!anonmm))
-		return -ENOMEM;
-
-	atomic_set(&anonmm->count, 2);		/* ref by mm and head */
-	anonmm->lock = SPIN_LOCK_UNLOCKED;	/* this lock is used */
-	anonmm->mm = mm;
-	anonmm->head = anonmm;
-	INIT_LIST_HEAD(&anonmm->list);
-	mm->anonmm = anonmm;
-	return 0;
-}
-
-int dup_rmap(struct mm_struct *mm, struct mm_struct *oldmm)
-{
-	struct anonmm *anonmm;
-	struct anonmm *anonhd = oldmm->anonmm->head;
-
-	anonmm = kmem_cache_alloc(anonmm_cachep, SLAB_KERNEL);
-	if (unlikely(!anonmm))
-		return -ENOMEM;
-
-	/*
-	 * copy_mm calls us before dup_mmap has reset the mm fields,
-	 * so reset rss ourselves before adding to anonhd's list,
-	 * to keep away from this mm until it's worth examining.
-	 */
-	mm->rss = 0;
-
-	atomic_set(&anonmm->count, 1);		/* ref by mm */
-	anonmm->lock = SPIN_LOCK_UNLOCKED;	/* this lock is not used */
-	anonmm->mm = mm;
-	anonmm->head = anonhd;
-	spin_lock(&anonhd->lock);
-	atomic_inc(&anonhd->count);		/* ref by anonmm's head */
-	list_add_tail(&anonmm->list, &anonhd->list);
-	spin_unlock(&anonhd->lock);
-	mm->anonmm = anonmm;
-	return 0;
-}
-
-void exit_rmap(struct mm_struct *mm)
-{
-	struct anonmm *anonmm = mm->anonmm;
-	struct anonmm *anonhd = anonmm->head;
-	int anonhd_count;
-
-	mm->anonmm = NULL;
-	spin_lock(&anonhd->lock);
-	anonmm->mm = NULL;
-	if (atomic_dec_and_test(&anonmm->count)) {
-		BUG_ON(anonmm == anonhd);
-		list_del(&anonmm->list);
-		kmem_cache_free(anonmm_cachep, anonmm);
-		if (atomic_dec_and_test(&anonhd->count))
-			BUG();
-	}
-	anonhd_count = atomic_read(&anonhd->count);
-	spin_unlock(&anonhd->lock);
-	if (anonhd_count == 1) {
-		BUG_ON(anonhd->mm);
-		BUG_ON(!list_empty(&anonhd->list));
-		kmem_cache_free(anonmm_cachep, anonhd);
-	}
-}
-
-static void free_anonmm(struct anonmm *anonmm)
-{
-	struct anonmm *anonhd = anonmm->head;
-
-	BUG_ON(anonmm->mm);
-	BUG_ON(anonmm == anonhd);
-	spin_lock(&anonhd->lock);
-	list_del(&anonmm->list);
-	if (atomic_dec_and_test(&anonhd->count))
-		BUG();
-	spin_unlock(&anonhd->lock);
-	kmem_cache_free(anonmm_cachep, anonmm);
-}
-
 static inline void clear_page_anon(struct page *page)
 {
-	struct anonmm *anonmm = (struct anonmm *) page->mapping;
-
+	BUG_ON(!page->mapping);
 	page->mapping = NULL;
 	ClearPageAnon(page);
-	if (atomic_dec_and_test(&anonmm->count))
-		free_anonmm(anonmm);
 }
 
 /*
@@ -213,75 +99,7 @@ out_unlock:
 
 static inline int page_referenced_anon(struct page *page)
 {
-	unsigned int mapcount = page->mapcount;
-	struct anonmm *anonmm = (struct anonmm *) page->mapping;
-	struct anonmm *anonhd = anonmm->head;
-	struct anonmm *new_anonmm = anonmm;
-	struct list_head *seek_head;
-	int referenced = 0;
-	int failed = 0;
-
-	spin_lock(&anonhd->lock);
-	/*
-	 * First try the indicated mm, it's the most likely.
-	 * Make a note to migrate the page if this mm is extinct.
-	 */
-	if (!anonmm->mm)
-		new_anonmm = NULL;
-	else if (anonmm->mm->rss) {
-		referenced += page_referenced_one(page,
-			anonmm->mm, page->index, &mapcount, &failed);
-		if (!mapcount)
-			goto out;
-	}
-
-	/*
-	 * Then down the rest of the list, from that as the head.  Stop
-	 * when we reach anonhd?  No: although a page cannot get dup'ed
-	 * into an older mm, once swapped, its indicated mm may not be
-	 * the oldest, just the first into which it was faulted back.
-	 * If original mm now extinct, note first to contain the page.
-	 */
-	seek_head = &anonmm->list;
-	list_for_each_entry(anonmm, seek_head, list) {
-		if (!anonmm->mm || !anonmm->mm->rss)
-			continue;
-		referenced += page_referenced_one(page,
-			anonmm->mm, page->index, &mapcount, &failed);
-		if (!new_anonmm && mapcount < page->mapcount)
-			new_anonmm = anonmm;
-		if (!mapcount) {
-			anonmm = (struct anonmm *) page->mapping;
-			if (new_anonmm == anonmm)
-				goto out;
-			goto migrate;
-		}
-	}
-
-	/*
-	 * The warning below may appear if page_referenced_anon catches
-	 * the page in between page_add_anon_rmap and its replacement
-	 * demanded by mremap_moved_anon_page: so remove the warning once
-	 * we're convinced that anonmm rmap really is finding its pages.
-	 */
-	WARN_ON(!failed);
-out:
-	spin_unlock(&anonhd->lock);
-	return referenced;
-
-migrate:
-	/*
-	 * Migrate pages away from an extinct mm, so that its anonmm
-	 * can be freed in due course: we could leave this to happen
-	 * through the natural attrition of try_to_unmap, but that
-	 * would miss locked pages and frequently referenced pages.
-	 */
-	spin_unlock(&anonhd->lock);
-	page->mapping = (void *) new_anonmm;
-	atomic_inc(&new_anonmm->count);
-	if (atomic_dec_and_test(&anonmm->count))
-		free_anonmm(anonmm);
-	return referenced;
+	return 1;	/* until next patch */
 }
 
 /**
@@ -373,8 +191,6 @@ int page_referenced(struct page *page)
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	struct anonmm *anonmm = vma->vm_mm->anonmm;
-
 	BUG_ON(PageReserved(page));
 
 	page_map_lock(page);
@@ -382,8 +198,7 @@ void page_add_anon_rmap(struct page *pag
 		BUG_ON(page->mapping);
 		SetPageAnon(page);
 		page->index = address & PAGE_MASK;
-		page->mapping = (void *) anonmm;
-		atomic_inc(&anonmm->count);
+		page->mapping = (void *) vma;	/* until next patch */
 		inc_page_state(nr_mapped);
 	}
 	page->mapcount++;
@@ -432,32 +247,6 @@ void page_remove_rmap(struct page *page)
 	page_map_unlock(page);
 }
 
-/**
- * mremap_move_anon_rmap - try to note new address of anonymous page
- * @page:	page about to be moved
- * @address:	user virtual address at which it is going to be mapped
- *
- * Returns boolean, true if page is not shared, so address updated.
- *
- * For mremap's can_move_one_page: to update address when vma is moved,
- * provided that anon page is not shared with a parent or child mm.
- * If it is shared, then caller must take a copy of the page instead:
- * not very clever, but too rare a case to merit cleverness.
- */
-int mremap_move_anon_rmap(struct page *page, unsigned long address)
-{
-	int move = 0;
-	if (page->mapcount == 1) {
-		page_map_lock(page);
-		if (page->mapcount == 1) {
-			page->index = address & PAGE_MASK;
-			move = 1;
-		}
-		page_map_unlock(page);
-	}
-	return move;
-}
-
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
@@ -651,38 +440,7 @@ out_unlock:
 
 static inline int try_to_unmap_anon(struct page *page)
 {
-	struct anonmm *anonmm = (struct anonmm *) page->mapping;
-	struct anonmm *anonhd = anonmm->head;
-	struct list_head *seek_head;
-	int ret = SWAP_AGAIN;
-
-	spin_lock(&anonhd->lock);
-	/*
-	 * First try the indicated mm, it's the most likely.
-	 */
-	if (anonmm->mm && anonmm->mm->rss) {
-		ret = try_to_unmap_one(page, anonmm->mm, page->index, NULL);
-		if (ret == SWAP_FAIL || !page->mapcount)
-			goto out;
-	}
-
-	/*
-	 * Then down the rest of the list, from that as the head.  Stop
-	 * when we reach anonhd?  No: although a page cannot get dup'ed
-	 * into an older mm, once swapped, its indicated mm may not be
-	 * the oldest, just the first into which it was faulted back.
-	 */
-	seek_head = &anonmm->list;
-	list_for_each_entry(anonmm, seek_head, list) {
-		if (!anonmm->mm || !anonmm->mm->rss)
-			continue;
-		ret = try_to_unmap_one(page, anonmm->mm, page->index, NULL);
-		if (ret == SWAP_FAIL || !page->mapcount)
-			goto out;
-	}
-out:
-	spin_unlock(&anonhd->lock);
-	return ret;
+	return SWAP_FAIL;	/* until next patch */
 }
 
 /**
--- rmap37/mm/swapfile.c	2004-05-18 20:51:12.104660920 +0100
+++ rmap38/mm/swapfile.c	2004-05-18 20:51:27.708288808 +0100
@@ -537,7 +537,6 @@ static int unuse_process(struct mm_struc
 {
 	struct vm_area_struct* vma;
 	unsigned long foundaddr = 0;
-	int ret = 0;
 
 	/*
 	 * Go through process' page directory.
@@ -553,10 +552,12 @@ static int unuse_process(struct mm_struc
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	if (foundaddr && mremap_moved_anon_rmap(page, foundaddr))
-		ret = make_page_exclusive(vma, foundaddr);
 	up_read(&mm->mmap_sem);
-	return ret;
+	/*
+	 * Currently unuse_process cannot fail, but leave error handling
+	 * at call sites for now, since we change it from time to time.
+	 */
+	return 0;
 }
 
 /*



* [PATCH] rmap 39 add anon_vma rmap
From: Hugh Dickins @ 2004-05-18 22:11 UTC
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Andrea Arcangeli's anon_vma object-based reverse mapping scheme for
anonymous pages.  Instead of tracking anonymous pages by pte_chains
or by mm, this tracks them by vma.  But because vmas are frequently
split and merged (particularly by mprotect), a page cannot point
directly to its vma(s), but instead to an anon_vma list of those
vmas likely to contain the page - a list on which vmas can easily
be linked and unlinked as they come and go.  The vmas on one list
are all related, either by forking or by splitting.

This has three particular advantages over anonmm: it copes effortlessly
with mremap moves; it no longer needs page_table_lock to protect an
mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma ->
vma instead of using find_vma; and it should use less cpu for swapout,
since it can locate its anonymous vmas more quickly.

It does have disadvantages too: a lot more change in mmap.c to deal
with anon_vmas, though small straightforward additions now that the
vma merging has been refactored there; more lowmem needed for each
anon_vma and vma structure; an additional restriction on the merging
of vmas (cannot be merged if already assigned different anon_vmas,
since then their pages will be pointing to different heads).

(There would be no need to enlarge the vma structure if anonymous pages
belonged only to anonymous vmas; but private file mappings accumulate
anonymous pages by copy-on-write, so need to be listed in both anon_vma
and prio_tree at the same time.  A different implementation could avoid
that by using anon_vmas only for purely anonymous vmas, and use the
existing prio_tree to locate cow pages - but that would involve a
long search for each single private copy, probably not a good idea.)

Where before the vm_pgoff of a purely anonymous (not file-backed) vma
was meaningless, now it represents the virtual start address at which
that vma is mapped - which the standard file pgoff manipulations treat
linearly as vmas are split and merged.  But if mremap moves the vma,
then it generally carries its original vm_pgoff to the new location,
so pages shared with the old location can still be found.  Magic.
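
A sketch (not from the patch; the helper name is invented, and it
assumes page->index is kept in the same page-sized units as vm_pgoff)
of the linear rule this gives the rmap code: given an anonymous page
and any vma on its anon_vma list, the candidate user virtual address
falls straight out of vm_start and vm_pgoff.

#include <linux/mm.h>

/* Illustrative only: where an anon page would sit within a related vma. */
static unsigned long example_vma_address(struct page *page,
					 struct vm_area_struct *vma)
{
	pgoff_t pgoff = page->index;

	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

Of course the real unmap path must still check that the result lies
inside [vm_start, vm_end) before touching page tables.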

Hugh has massaged it somewhat: building on the earlier rmap patches,
this patch is a fifth of the size of Andrea's original anon_vma patch.
Please note that this posting will be his first sight of this patch,
which he may or may not approve.

 fs/exec.c            |    4 
 include/linux/mm.h   |   20 +++
 include/linux/rmap.h |   63 +++++++++++-
 init/main.c          |    2 
 kernel/fork.c        |    3 
 mm/memory.c          |   10 +
 mm/mmap.c            |  159 ++++++++++++++++++++++--------
 mm/mprotect.c        |    2 
 mm/rmap.c            |  265 +++++++++++++++++++++++++++++++++++++++++++--------
 9 files changed, 439 insertions(+), 89 deletions(-)

--- rmap38/fs/exec.c	2004-05-18 20:51:12.074665480 +0100
+++ rmap39/fs/exec.c	2004-05-18 20:51:40.069409632 +0100
@@ -302,6 +302,9 @@ void install_arg_page(struct vm_area_str
 	pmd_t * pmd;
 	pte_t * pte;
 
+	if (unlikely(anon_vma_prepare(vma)))
+		goto out_sig;
+
 	flush_dcache_page(page);
 	pgd = pgd_offset(mm, address);
 
@@ -328,6 +331,7 @@ void install_arg_page(struct vm_area_str
 	return;
 out:
 	spin_unlock(&mm->page_table_lock);
+out_sig:
 	__free_page(page);
 	force_sig(SIGKILL, current);
 }
--- rmap38/include/linux/mm.h	2004-05-18 20:50:58.875672032 +0100
+++ rmap39/include/linux/mm.h	2004-05-18 20:51:40.071409328 +0100
@@ -15,6 +15,7 @@
 #include <linux/fs.h>
 
 struct mempolicy;
+struct anon_vma;
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -78,6 +79,15 @@ struct vm_area_struct {
 		struct prio_tree_node prio_tree_node;
 	} shared;
 
+	/*
+	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
+	 * list, after a COW of one of the file pages.  A MAP_SHARED vma
+	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
+	 * or brk vma (with NULL file) can only be in an anon_vma list.
+	 */
+	struct list_head anon_vma_node;	/* Serialized by anon_vma->lock */
+	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */
+
 	/* Function pointers to deal with this struct. */
 	struct vm_operations_struct * vm_ops;
 
@@ -201,7 +211,12 @@ struct page {
 					 * if PagePrivate set; used for
 					 * swp_entry_t if PageSwapCache
 					 */
-	struct address_space *mapping;	/* The inode (or ...) we belong to. */
+	struct address_space *mapping;	/* If PG_anon clear, points to
+					 * inode address_space, or NULL.
+					 * If page mapped as anonymous
+					 * memory, PG_anon is set, and
+					 * it points to anon_vma object.
+					 */
 	pgoff_t index;			/* Our offset within mapping. */
 	struct list_head lru;		/* Pageout list, eg. active_list
 					 * protected by zone->lru_lock !
@@ -610,7 +625,8 @@ extern void vma_adjust(struct vm_area_st
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
-	unsigned long vm_flags, struct file *, pgoff_t, struct mempolicy *);
+	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
+	struct mempolicy *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
 extern void insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
--- rmap38/include/linux/rmap.h	2004-05-18 20:51:27.664295496 +0100
+++ rmap39/include/linux/rmap.h	2004-05-18 20:51:40.072409176 +0100
@@ -2,18 +2,75 @@
 #define _LINUX_RMAP_H
 /*
  * Declarations for Reverse Mapping functions in mm/rmap.c
- * Its structures are declared within that file.
  */
 
 #include <linux/config.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
 
 #define page_map_lock(page) \
 	bit_spin_lock(PG_maplock, (unsigned long *)&(page)->flags)
 #define page_map_unlock(page) \
 	bit_spin_unlock(PG_maplock, (unsigned long *)&(page)->flags)
 
+/*
+ * The anon_vma heads a list of private "related" vmas, to scan if
+ * an anonymous page pointing to this anon_vma needs to be unmapped:
+ * the vmas on the list will be related by forking, or by splitting.
+ *
+ * Since vmas come and go as they are split and merged (particularly
+ * in mprotect), the mapping field of an anonymous page cannot point
+ * directly to a vma: instead it points to an anon_vma, on whose list
+ * the related vmas can be easily linked or unlinked.
+ *
+ * After unlinking the last vma on the list, we must garbage collect
+ * the anon_vma object itself: we're guaranteed no page can be
+ * pointing to this anon_vma once its vma list is empty.
+ */
+struct anon_vma {
+	spinlock_t lock;	/* Serialize access to vma list */
+	struct list_head head;	/* List of private "related" vmas */
+};
+
 #ifdef CONFIG_MMU
 
+extern kmem_cache_t *anon_vma_cachep;
+
+static inline struct anon_vma *anon_vma_alloc(void)
+{
+	return kmem_cache_alloc(anon_vma_cachep, SLAB_KERNEL);
+}
+
+static inline void anon_vma_free(struct anon_vma *anon_vma)
+{
+	kmem_cache_free(anon_vma_cachep, anon_vma);
+}
+
+static inline void anon_vma_lock(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	if (anon_vma)
+		spin_lock(&anon_vma->lock);
+}
+
+static inline void anon_vma_unlock(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	if (anon_vma)
+		spin_unlock(&anon_vma->lock);
+}
+
+/*
+ * anon_vma helper functions.
+ */
+void anon_vma_init(void);	/* create anon_vma_cachep */
+int  anon_vma_prepare(struct vm_area_struct *);
+void __anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_unlink(struct vm_area_struct *);
+void anon_vma_link(struct vm_area_struct *);
+void __anon_vma_link(struct vm_area_struct *);
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
@@ -43,6 +100,10 @@ int try_to_unmap(struct page *);
 
 #else	/* !CONFIG_MMU */
 
+#define anon_vma_init()		do {} while (0)
+#define anon_vma_prepare(vma)	(0)
+#define anon_vma_link(vma)	do {} while (0)
+
 #define page_referenced(page)	TestClearPageReferenced(page)
 #define try_to_unmap(page)	SWAP_FAIL
 
--- rmap38/init/main.c	2004-05-16 11:39:22.000000000 +0100
+++ rmap39/init/main.c	2004-05-18 20:51:40.073409024 +0100
@@ -42,6 +42,7 @@
 #include <linux/cpu.h>
 #include <linux/efi.h>
 #include <linux/unistd.h>
+#include <linux/rmap.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -466,6 +467,7 @@ asmlinkage void __init start_kernel(void
 	pidmap_init();
 	pgtable_cache_init();
 	prio_tree_init();
+	anon_vma_init();
 #ifdef CONFIG_X86
 	if (efi_enabled)
 		efi_enter_virtual_mode();
--- rmap38/kernel/fork.c	2004-05-18 20:51:27.667295040 +0100
+++ rmap39/kernel/fork.c	2004-05-18 20:51:40.074408872 +0100
@@ -322,8 +322,9 @@ static inline int dup_mmap(struct mm_str
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
-		file = tmp->vm_file;
+		anon_vma_link(tmp);
 		vma_prio_tree_init(tmp);
+		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file->f_dentry->d_inode;
 			get_file(file);
--- rmap38/mm/memory.c	2004-05-18 20:51:27.703289568 +0100
+++ rmap39/mm/memory.c	2004-05-18 20:51:40.076408568 +0100
@@ -1082,6 +1082,8 @@ static int do_wp_page(struct mm_struct *
 	page_cache_get(old_page);
 	spin_unlock(&mm->page_table_lock);
 
+	if (unlikely(anon_vma_prepare(vma)))
+		goto no_new_page;
 	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 	if (!new_page)
 		goto no_new_page;
@@ -1416,6 +1418,8 @@ do_anonymous_page(struct mm_struct *mm, 
 		pte_unmap(page_table);
 		spin_unlock(&mm->page_table_lock);
 
+		if (unlikely(anon_vma_prepare(vma)))
+			goto no_mem;
 		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
 		if (!page)
 			goto no_mem;
@@ -1498,7 +1502,11 @@ retry:
 	 * Should we do an early C-O-W break?
 	 */
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		struct page *page;
+
+		if (unlikely(anon_vma_prepare(vma)))
+			goto oom;
+		page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 		if (!page)
 			goto oom;
 		copy_user_highpage(page, new_page, address);
--- rmap38/mm/mmap.c	2004-05-18 20:50:58.878671576 +0100
+++ rmap39/mm/mmap.c	2004-05-18 20:51:40.080407960 +0100
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -99,6 +100,7 @@ static void remove_vm_struct(struct vm_a
 		vma->vm_ops->close(vma);
 	if (file)
 		fput(file);
+	anon_vma_unlink(vma);
 	mpol_free(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 }
@@ -294,6 +296,7 @@ __vma_link(struct mm_struct *mm, struct 
 	__vma_link_list(mm, vma, prev, rb_parent);
 	__vma_link_rb(mm, vma, rb_link, rb_parent);
 	__vma_link_file(vma);
+	__anon_vma_link(vma);
 }
 
 static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -307,9 +310,9 @@ static void vma_link(struct mm_struct *m
 
 	if (mapping)
 		spin_lock(&mapping->i_mmap_lock);
-	spin_lock(&mm->page_table_lock);
+	anon_vma_lock(vma);
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
-	spin_unlock(&mm->page_table_lock);
+	anon_vma_unlock(vma);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
@@ -320,7 +323,7 @@ static void vma_link(struct mm_struct *m
 
 /*
  * Insert vm structure into process list sorted by address and into the
- * inode's i_mmap tree. The caller should hold mm->page_table_lock and
+ * inode's i_mmap tree. The caller should hold mm->mmap_sem and
  * ->f_mappping->i_mmap_lock if vm_file is non-NULL.
  */
 static void
@@ -363,6 +366,7 @@ void vma_adjust(struct vm_area_struct *v
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
 	struct file *file = vma->vm_file;
+	struct anon_vma *anon_vma = NULL;
 	long adjust_next = 0;
 	int remove_next = 0;
 
@@ -370,6 +374,7 @@ void vma_adjust(struct vm_area_struct *v
 		if (end >= next->vm_end) {
 again:			remove_next = 1 + (end > next->vm_end);
 			end = next->vm_end;
+			anon_vma = next->anon_vma;
 		} else if (end < vma->vm_end || end > next->vm_start) {
 			/*
 			 * vma shrinks, and !insert tells it's not
@@ -381,6 +386,7 @@ again:			remove_next = 1 + (end > next->
 			 */
 			BUG_ON(vma->vm_end != next->vm_start);
 			adjust_next = end - next->vm_start;
+			anon_vma = next->anon_vma;
 		}
 	}
 
@@ -390,7 +396,15 @@ again:			remove_next = 1 + (end > next->
 			root = &mapping->i_mmap;
 		spin_lock(&mapping->i_mmap_lock);
 	}
-	spin_lock(&mm->page_table_lock);
+
+	/*
+	 * When changing only vma->vm_end, we don't really need
+	 * anon_vma lock: but is that case worth optimizing out?
+	 */
+	if (vma->anon_vma)
+		anon_vma = vma->anon_vma;
+	if (anon_vma)
+		spin_lock(&anon_vma->lock);
 
 	if (root) {
 		flush_dcache_mmap_lock(mapping);
@@ -425,6 +439,8 @@ again:			remove_next = 1 + (end > next->
 		__vma_unlink(mm, next, vma);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
+		if (next->anon_vma)
+			__anon_vma_merge(vma, next);
 	} else if (insert) {
 		/*
 		 * split_vma has split insert from vma, and needs
@@ -434,7 +450,8 @@ again:			remove_next = 1 + (end > next->
 		__insert_vm_struct(mm, insert);
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	if (anon_vma)
+		spin_unlock(&anon_vma->lock);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
@@ -476,21 +493,29 @@ static inline int is_mergeable_vma(struc
 	return 1;
 }
 
+static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
+					struct anon_vma *anon_vma2)
+{
+	return !anon_vma1 || !anon_vma2 || (anon_vma1 == anon_vma2);
+}
+
 /*
- * Return true if we can merge this (vm_flags,file,vm_pgoff)
+ * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff)
  * in front of (at a lower virtual address and file offset than) the vma.
  *
+ * We cannot merge two vmas if they have differently assigned (non-NULL)
+ * anon_vmas, nor if same anon_vma is assigned but offsets incompatible.
+ *
  * We don't check here for the merged mmap wrapping around the end of pagecache
  * indices (16TB on ia32) because do_mmap_pgoff() does not permit mmap's which
  * wrap, nor mmaps which cover the final page at index -1UL.
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct file *file, pgoff_t vm_pgoff)
+	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
 {
-	if (is_mergeable_vma(vma, file, vm_flags)) {
-		if (!file)
-			return 1;	/* anon mapping */
+	if (is_mergeable_vma(vma, file, vm_flags) &&
+	    is_mergeable_anon_vma(anon_vma, vma->anon_vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
 	}
@@ -498,19 +523,19 @@ can_vma_merge_before(struct vm_area_stru
 }
 
 /*
- * Return true if we can merge this (vm_flags,file,vm_pgoff)
+ * Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff)
  * beyond (at a higher virtual address and file offset than) the vma.
+ *
+ * We cannot merge two vmas if they have differently assigned (non-NULL)
+ * anon_vmas, nor if same anon_vma is assigned but offsets incompatible.
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct file *file, pgoff_t vm_pgoff)
+	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
 {
-	if (is_mergeable_vma(vma, file, vm_flags)) {
+	if (is_mergeable_vma(vma, file, vm_flags) &&
+	    is_mergeable_anon_vma(anon_vma, vma->anon_vma)) {
 		pgoff_t vm_pglen;
-
-		if (!file)
-			return 1;	/* anon mapping */
-
 		vm_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 		if (vma->vm_pgoff + vm_pglen == vm_pgoff)
 			return 1;
@@ -550,8 +575,8 @@ can_vma_merge_after(struct vm_area_struc
 struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
-		     	struct file *file, pgoff_t pgoff,
-		        struct mempolicy *policy)
+		     	struct anon_vma *anon_vma, struct file *file,
+			pgoff_t pgoff, struct mempolicy *policy)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -576,14 +601,17 @@ struct vm_area_struct *vma_merge(struct 
 	 */
 	if (prev && prev->vm_end == addr &&
   			mpol_equal(vma_policy(prev), policy) &&
-			can_vma_merge_after(prev, vm_flags, file, pgoff)) {
+			can_vma_merge_after(prev, vm_flags,
+						anon_vma, file, pgoff)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
-				can_vma_merge_before(next, vm_flags, file,
-							pgoff+pglen)) {
+				can_vma_merge_before(next, vm_flags,
+					anon_vma, file, pgoff+pglen) &&
+				is_mergeable_anon_vma(prev->anon_vma,
+						      next->anon_vma)) {
 							/* cases 1, 6 */
 			vma_adjust(prev, prev->vm_start,
 				next->vm_end, prev->vm_pgoff, NULL);
@@ -598,8 +626,8 @@ struct vm_area_struct *vma_merge(struct 
 	 */
 	if (next && end == next->vm_start &&
  			mpol_equal(policy, vma_policy(next)) &&
-			can_vma_merge_before(next, vm_flags, file,
-							pgoff+pglen)) {
+			can_vma_merge_before(next, vm_flags,
+					anon_vma, file, pgoff+pglen)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			vma_adjust(prev, prev->vm_start,
 				addr, prev->vm_pgoff, NULL);
@@ -725,6 +753,10 @@ unsigned long do_mmap_pgoff(struct file 
 			vm_flags |= VM_SHARED | VM_MAYSHARE;
 			break;
 		case MAP_PRIVATE:
+			/*
+			 * Set pgoff according to addr for anon_vma.
+			 */
+			pgoff = addr >> PAGE_SHIFT;
 			break;
 		default:
 			return -EINVAL;
@@ -772,7 +804,8 @@ munmap_back:
 	 * will create the file object for a shared anonymous map below.
 	 */
 	if (!file && !(vm_flags & VM_SHARED) &&
-	    vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, 0, NULL))
+	    vma_merge(mm, prev, addr, addr + len, vm_flags,
+					NULL, NULL, pgoff, NULL))
 		goto out;
 
 	/*
@@ -831,7 +864,7 @@ munmap_back:
 	addr = vma->vm_start;
 
 	if (!file || !vma_merge(mm, prev, addr, vma->vm_end,
-			vma->vm_flags, file, pgoff, vma_policy(vma))) {
+			vma->vm_flags, NULL, file, pgoff, vma_policy(vma))) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		if (correct_wcount)
 			atomic_inc(&inode->i_writecount);
@@ -1062,25 +1095,32 @@ int expand_stack(struct vm_area_struct *
 		return -EFAULT;
 
 	/*
+	 * We must make sure the anon_vma is allocated
+	 * so that the anon_vma locking is not a noop.
+	 */
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+	anon_vma_lock(vma);
+
+	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
-	 * is required to hold the mmap_sem in read mode. We need to get
-	 * the spinlock only before relocating the vma range ourself.
+	 * is required to hold the mmap_sem in read mode.  We need the
+	 * anon_vma lock to serialize against concurrent expand_stacks.
 	 */
 	address += 4 + PAGE_SIZE - 1;
 	address &= PAGE_MASK;
- 	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (address - vma->vm_end) >> PAGE_SHIFT;
 
 	/* Overcommit.. */
 	if (security_vm_enough_memory(grow)) {
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	
 	if (address - vma->vm_start > current->rlim[RLIMIT_STACK].rlim_cur ||
 			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
 			current->rlim[RLIMIT_AS].rlim_cur) {
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		anon_vma_unlock(vma);
 		vm_unacct_memory(grow);
 		return -ENOMEM;
 	}
@@ -1088,7 +1128,7 @@ int expand_stack(struct vm_area_struct *
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	anon_vma_unlock(vma);
 	return 0;
 }
 
@@ -1117,24 +1157,31 @@ int expand_stack(struct vm_area_struct *
 	unsigned long grow;
 
 	/*
+	 * We must make sure the anon_vma is allocated
+	 * so that the anon_vma locking is not a noop.
+	 */
+	if (unlikely(anon_vma_prepare(vma)))
+		return -ENOMEM;
+	anon_vma_lock(vma);
+
+	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
-	 * is required to hold the mmap_sem in read mode. We need to get
-	 * the spinlock only before relocating the vma range ourself.
+	 * is required to hold the mmap_sem in read mode.  We need the
+	 * anon_vma lock to serialize against concurrent expand_stacks.
 	 */
 	address &= PAGE_MASK;
- 	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (vma->vm_start - address) >> PAGE_SHIFT;
 
 	/* Overcommit.. */
 	if (security_vm_enough_memory(grow)) {
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		anon_vma_unlock(vma);
 		return -ENOMEM;
 	}
 	
 	if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
 			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
 			current->rlim[RLIMIT_AS].rlim_cur) {
-		spin_unlock(&vma->vm_mm->page_table_lock);
+		anon_vma_unlock(vma);
 		vm_unacct_memory(grow);
 		return -ENOMEM;
 	}
@@ -1143,7 +1190,7 @@ int expand_stack(struct vm_area_struct *
 	vma->vm_mm->total_vm += grow;
 	if (vma->vm_flags & VM_LOCKED)
 		vma->vm_mm->locked_vm += grow;
-	spin_unlock(&vma->vm_mm->page_table_lock);
+	anon_vma_unlock(vma);
 	return 0;
 }
 
@@ -1304,8 +1351,6 @@ static void unmap_region(struct mm_struc
 /*
  * Create a list of vma's touched by the unmap, removing them from the mm's
  * vma list as we go..
- *
- * Called with the page_table_lock held.
  */
 static void
 detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1437,8 +1482,8 @@ int do_munmap(struct mm_struct *mm, unsi
 	/*
 	 * Remove the vma's, and unmap the actual pages
 	 */
-	spin_lock(&mm->page_table_lock);
 	detach_vmas_to_be_unmapped(mm, mpnt, prev, end);
+	spin_lock(&mm->page_table_lock);
 	unmap_region(mm, mpnt, prev, start, end);
 	spin_unlock(&mm->page_table_lock);
 
@@ -1472,6 +1517,7 @@ unsigned long do_brk(unsigned long addr,
 	struct vm_area_struct * vma, * prev;
 	unsigned long flags;
 	struct rb_node ** rb_link, * rb_parent;
+	pgoff_t pgoff = addr >> PAGE_SHIFT;
 
 	len = PAGE_ALIGN(len);
 	if (!len)
@@ -1515,7 +1561,8 @@ unsigned long do_brk(unsigned long addr,
 	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
 
 	/* Can we just expand an old private anonymous mapping? */
-	if (vma_merge(mm, prev, addr, addr + len, flags, NULL, 0, NULL))
+	if (vma_merge(mm, prev, addr, addr + len, flags,
+					NULL, NULL, pgoff, NULL))
 		goto out;
 
 	/*
@@ -1531,6 +1578,7 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_mm = mm;
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
+	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
 	vma->vm_page_prot = protection_map[flags & 0x0f];
 	vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -1597,6 +1645,22 @@ void insert_vm_struct(struct mm_struct *
 	struct vm_area_struct * __vma, * prev;
 	struct rb_node ** rb_link, * rb_parent;
 
+	/*
+	 * The vm_pgoff of a purely anonymous vma should be irrelevant
+	 * until its first write fault, when page's anon_vma and index
+	 * are set.  But now set the vm_pgoff it will almost certainly
+	 * end up with (unless mremap moves it elsewhere before that
+	 * first wfault), so /proc/pid/maps tells a consistent story.
+	 *
+	 * By setting it to reflect the virtual start address of the
+	 * vma, merges and splits can happen in a seamless way, just
+	 * using the existing file pgoff checks and manipulations.
+	 * Similarly in do_mmap_pgoff and in do_brk.
+	 */
+	if (!vma->vm_file) {
+		BUG_ON(vma->anon_vma);
+		vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
+	}
 	__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
 	if (__vma && __vma->vm_start < vma->vm_end)
 		BUG();
@@ -1617,9 +1681,16 @@ struct vm_area_struct *copy_vma(struct v
 	struct rb_node **rb_link, *rb_parent;
 	struct mempolicy *pol;
 
+	/*
+	 * If anonymous vma has not yet been faulted, update new pgoff
+	 * to match new location, to increase its chance of merging.
+	 */
+	if (!vma->vm_file && !vma->anon_vma)
+		pgoff = addr >> PAGE_SHIFT;
+
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
-	new_vma = vma_merge(mm, prev, addr, addr + len,
-			vma->vm_flags, vma->vm_file, pgoff, vma_policy(vma));
+	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
--- rmap38/mm/mprotect.c	2004-05-18 20:50:58.879671424 +0100
+++ rmap39/mm/mprotect.c	2004-05-18 20:51:40.081407808 +0100
@@ -146,7 +146,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-				vma->vm_file, pgoff, vma_policy(vma));
+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
 	if (*pprev) {
 		vma = *pprev;
 		goto success;
--- rmap38/mm/rmap.c	2004-05-18 20:51:27.706289112 +0100
+++ rmap39/mm/rmap.c	2004-05-18 20:51:40.083407504 +0100
@@ -6,6 +6,15 @@
  *
  * Simple, low overhead reverse mapping scheme.
  * Please try to keep this thing as modular as possible.
+ *
+ * Provides methods for unmapping each kind of mapped page:
+ * the anon methods track anonymous pages, and
+ * the file methods track pages belonging to an inode.
+ *
+ * Original design by Rik van Riel <riel@conectiva.com.br> 2001
+ * File methods by Dave McCracken <dmccr@us.ibm.com> 2003, 2004
+ * Anonymous methods by Andrea Arcangeli <andrea@suse.de> 2004
+ * Contributions by Hugh Dickins <hugh@veritas.com> 2003, 2004
  */
 
 /*
@@ -27,6 +36,128 @@
 
 #include <asm/tlbflush.h>
 
+//#define RMAP_DEBUG /* can be enabled only for debugging */
+
+kmem_cache_t *anon_vma_cachep;
+
+static inline void validate_anon_vma(struct vm_area_struct *find_vma)
+{
+#ifdef RMAP_DEBUG
+	struct anon_vma *anon_vma = find_vma->anon_vma;
+	struct vm_area_struct *vma;
+	unsigned int mapcount = 0;
+	int found = 0;
+
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		mapcount++;
+		BUG_ON(mapcount > 100000);
+		if (vma == find_vma)
+			found = 1;
+	}
+	BUG_ON(!found);
+#endif
+}
+
+/* This must be called under the mmap_sem. */
+int anon_vma_prepare(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+
+	might_sleep();
+	if (unlikely(!anon_vma)) {
+		struct mm_struct *mm = vma->vm_mm;
+
+		anon_vma = anon_vma_alloc();
+		if (unlikely(!anon_vma))
+			return -ENOMEM;
+
+		/* page_table_lock to protect against threads */
+		spin_lock(&mm->page_table_lock);
+		if (likely(!vma->anon_vma)) {
+			vma->anon_vma = anon_vma;
+			list_add(&vma->anon_vma_node, &anon_vma->head);
+			anon_vma = NULL;
+		}
+		spin_unlock(&mm->page_table_lock);
+		if (unlikely(anon_vma))
+			anon_vma_free(anon_vma);
+	}
+	return 0;
+}
+
+void __anon_vma_merge(struct vm_area_struct *vma, struct vm_area_struct *next)
+{
+	if (!vma->anon_vma) {
+		BUG_ON(!next->anon_vma);
+		vma->anon_vma = next->anon_vma;
+		list_add(&vma->anon_vma_node, &next->anon_vma_node);
+	} else {
+		/* if they're both non-null they must be the same */
+		BUG_ON(vma->anon_vma != next->anon_vma);
+	}
+	list_del(&next->anon_vma_node);
+}
+
+void __anon_vma_link(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+
+	if (anon_vma) {
+		list_add(&vma->anon_vma_node, &anon_vma->head);
+		validate_anon_vma(vma);
+	}
+}
+
+void anon_vma_link(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+
+	if (anon_vma) {
+		spin_lock(&anon_vma->lock);
+		list_add(&vma->anon_vma_node, &anon_vma->head);
+		validate_anon_vma(vma);
+		spin_unlock(&anon_vma->lock);
+	}
+}
+
+void anon_vma_unlink(struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = vma->anon_vma;
+	int empty;
+
+	if (!anon_vma)
+		return;
+
+	spin_lock(&anon_vma->lock);
+	validate_anon_vma(vma);
+	list_del(&vma->anon_vma_node);
+
+	/* We must garbage collect the anon_vma if it's empty */
+	empty = list_empty(&anon_vma->head);
+	spin_unlock(&anon_vma->lock);
+
+	if (empty)
+		anon_vma_free(anon_vma);
+}
+
+static void anon_vma_ctor(void *data, kmem_cache_t *cachep, unsigned long flags)
+{
+	if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
+						SLAB_CTOR_CONSTRUCTOR) {
+		struct anon_vma *anon_vma = data;
+
+		spin_lock_init(&anon_vma->lock);
+		INIT_LIST_HEAD(&anon_vma->head);
+	}
+}
+
+void __init anon_vma_init(void)
+{
+	anon_vma_cachep = kmem_cache_create("anon_vma",
+		sizeof(struct anon_vma), 0, SLAB_PANIC, anon_vma_ctor, NULL);
+}
+
+/* this needs the page->flags PG_maplock held */
 static inline void clear_page_anon(struct page *page)
 {
 	BUG_ON(!page->mapping);
@@ -35,15 +166,20 @@ static inline void clear_page_anon(struc
 }
 
 /*
- * At what user virtual address is pgoff expected in file-backed vma?
+ * At what user virtual address is page expected in vma?
  */
 static inline unsigned long
-vma_address(struct vm_area_struct *vma, pgoff_t pgoff)
+vma_address(struct page *page, struct vm_area_struct *vma)
 {
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	unsigned long address;
 
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+		/* page should be within any vma from prio_tree_next */
+		BUG_ON(!PageAnon(page));
+		return -EFAULT;
+	}
 	return address;
 }
 
@@ -52,21 +188,28 @@ vma_address(struct vm_area_struct *vma, 
  * repeatedly from either page_referenced_anon or page_referenced_file.
  */
 static int page_referenced_one(struct page *page,
-	struct mm_struct *mm, unsigned long address,
-	unsigned int *mapcount, int *failed)
+	struct vm_area_struct *vma, unsigned int *mapcount, int *failed)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
 	int referenced = 0;
 
+	if (!mm->rss)
+		goto out;
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
 	if (!spin_trylock(&mm->page_table_lock)) {
 		/*
 		 * For debug we're currently warning if not all found,
 		 * but in this case that's expected: suppress warning.
 		 */
 		(*failed)++;
-		return 0;
+		goto out;
 	}
 
 	pgd = pgd_offset(mm, address);
@@ -91,15 +234,32 @@ static int page_referenced_one(struct pa
 
 out_unmap:
 	pte_unmap(pte);
-
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+out:
 	return referenced;
 }
 
 static inline int page_referenced_anon(struct page *page)
 {
-	return 1;	/* until next patch */
+	unsigned int mapcount = page->mapcount;
+	struct anon_vma *anon_vma = (struct anon_vma *) page->mapping;
+	struct vm_area_struct *vma;
+	int referenced = 0;
+	int failed = 0;
+
+	spin_lock(&anon_vma->lock);
+	BUG_ON(list_empty(&anon_vma->head));
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		referenced += page_referenced_one(page, vma,
+						  &mapcount, &failed);
+		if (!mapcount)
+			goto out;
+	}
+	WARN_ON(!failed);
+out:
+	spin_unlock(&anon_vma->lock);
+	return referenced;
 }
 
 /**
@@ -123,7 +283,6 @@ static inline int page_referenced_file(s
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma = NULL;
 	struct prio_tree_iter iter;
-	unsigned long address;
 	int referenced = 0;
 	int failed = 0;
 
@@ -137,13 +296,10 @@ static inline int page_referenced_file(s
 			referenced++;
 			goto out;
 		}
-		if (vma->vm_mm->rss) {
-			address = vma_address(vma, pgoff);
-			referenced += page_referenced_one(page,
-				vma->vm_mm, address, &mapcount, &failed);
-			if (!mapcount)
-				goto out;
-		}
+		referenced += page_referenced_one(page, vma,
+						  &mapcount, &failed);
+		if (!mapcount)
+			goto out;
 	}
 
 	if (list_empty(&mapping->i_mmap_nonlinear))
@@ -191,15 +347,39 @@ int page_referenced(struct page *page)
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
+	struct anon_vma *anon_vma = vma->anon_vma;
+	pgoff_t index;
+
 	BUG_ON(PageReserved(page));
+	BUG_ON(!anon_vma);
+
+	index = (address - vma->vm_start) >> PAGE_SHIFT;
+	index += vma->vm_pgoff;
+	index >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
 
+	/*
+	 * Setting and clearing PG_anon must always happen inside
+	 * page_map_lock to avoid races between mapping and
+	 * unmapping on different processes of the same
+	 * shared cow swapcache page. And while we take the
+	 * page_map_lock PG_anon cannot change from under us.
+	 * Actually PG_anon cannot change under fork either
+	 * since fork holds a reference on the page so it cannot
+	 * be unmapped under fork and in turn copy_page_range is
+	 * allowed to read PG_anon outside the page_map_lock.
+	 */
 	page_map_lock(page);
 	if (!page->mapcount) {
+		BUG_ON(PageAnon(page));
 		BUG_ON(page->mapping);
 		SetPageAnon(page);
-		page->index = address & PAGE_MASK;
-		page->mapping = (void *) vma;	/* until next patch */
+		page->index = index;
+		page->mapping = (struct address_space *) anon_vma;
 		inc_page_state(nr_mapped);
+	} else {
+		BUG_ON(!PageAnon(page));
+		BUG_ON(page->index != index);
+		BUG_ON(page->mapping != (struct address_space *) anon_vma);
 	}
 	page->mapcount++;
 	page_map_unlock(page);
@@ -251,15 +431,22 @@ void page_remove_rmap(struct page *page)
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
-static int try_to_unmap_one(struct page *page, struct mm_struct *mm,
-		unsigned long address, struct vm_area_struct *vma)
+static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
 	pte_t pteval;
 	int ret = SWAP_AGAIN;
 
+	if (!mm->rss)
+		goto out;
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
 	/*
 	 * We need the page_table_lock to protect us from page faults,
 	 * munmap, fork, etc...
@@ -282,13 +469,6 @@ static int try_to_unmap_one(struct page 
 	if (page_to_pfn(page) != pte_pfn(*pte))
 		goto out_unmap;
 
-	if (!vma) {
-		vma = find_vma(mm, address);
-		/* unmap_vmas drops page_table_lock with vma unlinked */
-		if (!vma)
-			goto out_unmap;
-	}
-
 	/*
 	 * If the page is mlock()d, we cannot swap it out.
 	 * If it's recently referenced (perhaps page_referenced
@@ -327,10 +507,8 @@ static int try_to_unmap_one(struct page 
 
 out_unmap:
 	pte_unmap(pte);
-
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-
 out:
 	return ret;
 }
@@ -361,9 +539,10 @@ out:
 #endif
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
-static int try_to_unmap_cluster(struct mm_struct *mm, unsigned long cursor,
+static int try_to_unmap_cluster(unsigned long cursor,
 	unsigned int *mapcount, struct vm_area_struct *vma)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -440,7 +619,19 @@ out_unlock:
 
 static inline int try_to_unmap_anon(struct page *page)
 {
-	return SWAP_FAIL;	/* until next patch */
+	struct anon_vma *anon_vma = (struct anon_vma *) page->mapping;
+	struct vm_area_struct *vma;
+	int ret = SWAP_AGAIN;
+
+	spin_lock(&anon_vma->lock);
+	BUG_ON(list_empty(&anon_vma->head));
+	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+		ret = try_to_unmap_one(page, vma);
+		if (ret == SWAP_FAIL || !page->mapcount)
+			break;
+	}
+	spin_unlock(&anon_vma->lock);
+	return ret;
 }
 
 /**
@@ -461,7 +652,6 @@ static inline int try_to_unmap_file(stru
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	struct vm_area_struct *vma = NULL;
 	struct prio_tree_iter iter;
-	unsigned long address;
 	int ret = SWAP_AGAIN;
 	unsigned long cursor;
 	unsigned long max_nl_cursor = 0;
@@ -473,12 +663,9 @@ static inline int try_to_unmap_file(stru
 
 	while ((vma = vma_prio_tree_next(vma, &mapping->i_mmap,
 					&iter, pgoff, pgoff)) != NULL) {
-		if (vma->vm_mm->rss) {
-			address = vma_address(vma, pgoff);
-			ret = try_to_unmap_one(page, vma->vm_mm, address, vma);
-			if (ret == SWAP_FAIL || !page->mapcount)
-				goto out;
-		}
+		ret = try_to_unmap_one(page, vma);
+		if (ret == SWAP_FAIL || !page->mapcount)
+			goto out;
 	}
 
 	if (list_empty(&mapping->i_mmap_nonlinear))
@@ -523,7 +710,7 @@ static inline int try_to_unmap_file(stru
 			while (vma->vm_mm->rss &&
 				cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
-				ret = try_to_unmap_cluster(vma->vm_mm,
+				ret = try_to_unmap_cluster(
 						cursor, &mapcount, vma);
 				if (ret == SWAP_FAIL)
 					break;


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] rmap 40 better anon_vma sharing
  2004-05-18 22:06 [PATCH] rmap 34 vm_flags page_table_lock Hugh Dickins
                   ` (4 preceding siblings ...)
  2004-05-18 22:11 ` [PATCH] rmap 39 add anon_vma rmap Hugh Dickins
@ 2004-05-18 22:12 ` Hugh Dickins
  5 siblings, 0 replies; 7+ messages in thread
From: Hugh Dickins @ 2004-05-18 22:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

anon_vma rmap will necessarily be more restrictive about vma merging
than before: according to the history of the vmas in an mm, they are
liable to be allocated different anon_vma heads, and from that point
on remain unmergeable.

Most of the time this doesn't matter at all; but in two cases it may
matter.  One case is that mremap refuses (-EFAULT) to span more than
a single vma: so it is conceivable that some app has relied on vma
merging prior to mremap in the past, and will now fail with anon_vma.
Conceivable but unlikely: let's cross that bridge if we come to it,
and the right answer then would be to extend mremap, which should not
be exporting the kernel's vma implementation detail to the user
interface.

The other case that matters is when a reasonable repetitive sequence of
syscalls and faults ends up with a large number of separate unmergeable
vmas, instead of the single merged vma it could have.
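
A hypothetical user-space sequence of that kind (sizes and protections
made up purely for illustration; the comments describe what rmap 39
alone would do, and what changes with the sharing this patch adds):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16 * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* mprotect splits the mapping into two vmas, neither of which
	 * has an anon_vma yet */
	mprotect(p, len / 2, PROT_READ | PROT_WRITE | PROT_EXEC);

	p[0] = 1;	/* first write fault: first vma gets anon_vma A  */
	p[len / 2] = 1;	/* first write fault on the second vma: without
			 * sharing it is given its own anon_vma B       */

	/* protections equal again: with A != B vma_merge refuses, so
	 * the two vmas stay separate for good; with the sharing added
	 * below, the second fault reuses A and this mprotect remerges
	 * them into a single vma */
	mprotect(p, len / 2, PROT_READ | PROT_WRITE);
	return 0;
}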

Andrea's mprotect-vma-merging patch fixed some such instances, but left
other plausible cases unmerged.  There is no perfect solution, and the
harder you try to allow vmas to be merged, the less efficient anon_vma
becomes: in the extreme a single anon_vma spans the whole address
space, with every private vma hanging from it; but anonmm rmap is
clearly superior to that extreme.

Andrea's principle was that neighbouring vmas which could be mprotected
into mergeable vmas should be allowed to share anon_vma: good insight.
His implementation was to arrange this sharing when trying vma merge,
but that seems to be too early.  This patch sticks to the principle,
but implements it in anon_vma_prepare, when handling the first write
fault on a private vma: with better results.  The drawback is that this
first write fault needs an extra find_vma_prev (whereas prev was already
to hand when implementing anon_vma sharing at try-to-merge time).

 include/linux/mm.h |    1 
 mm/mmap.c          |   65 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c          |   17 ++++++++-----
 3 files changed, 77 insertions(+), 6 deletions(-)

--- rmap39/include/linux/mm.h	2004-05-18 20:51:40.071409328 +0100
+++ rmap40/include/linux/mm.h	2004-05-18 20:51:53.061434544 +0100
@@ -627,6 +627,7 @@ extern struct vm_area_struct *vma_merge(
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
 	struct mempolicy *);
+extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
 extern void insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
--- rmap39/mm/mmap.c	2004-05-18 20:51:40.080407960 +0100
+++ rmap40/mm/mmap.c	2004-05-18 20:51:53.064434088 +0100
@@ -641,6 +641,71 @@ struct vm_area_struct *vma_merge(struct 
 }
 
 /*
+ * find_mergeable_anon_vma is used by anon_vma_prepare, to check
+ * neighbouring vmas for a suitable anon_vma, before it goes off
+ * to allocate a new anon_vma.  It checks because a repetitive
+ * sequence of mprotects and faults may otherwise lead to distinct
+ * anon_vmas being allocated, preventing vma merge in subsequent
+ * mprotect.
+ */
+struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
+{
+	struct vm_area_struct *near;
+	unsigned long vm_flags;
+
+	near = vma->vm_next;
+	if (!near)
+		goto try_prev;
+
+	/*
+	 * Since only mprotect tries to remerge vmas, match flags
+	 * which might be mprotected into each other later on.
+	 * Neither mlock nor madvise tries to remerge at present,
+	 * so leave their flags as obstructing a merge.
+	 */
+	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
+
+	if (near->anon_vma && vma->vm_end == near->vm_start &&
+ 			mpol_equal(vma_policy(vma), vma_policy(near)) &&
+			can_vma_merge_before(near, vm_flags,
+				NULL, vma->vm_file, vma->vm_pgoff +
+				((vma->vm_end - vma->vm_start) >> PAGE_SHIFT)))
+		return near->anon_vma;
+try_prev:
+	/*
+	 * It is potentially slow to have to call find_vma_prev here.
+	 * But it's only on the first write fault on the vma, not
+	 * every time, and we could devise a way to avoid it later
+	 * (e.g. stash info in next's anon_vma_node when assigning
+	 * an anon_vma, or when trying vma_merge).  Another time.
+	 */
+	if (find_vma_prev(vma->vm_mm, vma->vm_start, &near) != vma)
+		BUG();
+	if (!near)
+		goto none;
+
+	vm_flags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+	vm_flags |= near->vm_flags & (VM_READ|VM_WRITE|VM_EXEC);
+
+	if (near->anon_vma && near->vm_end == vma->vm_start &&
+  			mpol_equal(vma_policy(near), vma_policy(vma)) &&
+			can_vma_merge_after(near, vm_flags,
+				NULL, vma->vm_file, vma->vm_pgoff))
+		return near->anon_vma;
+none:
+	/*
+	 * There's no absolute need to look only at touching neighbours:
+	 * we could search further afield for "compatible" anon_vmas.
+	 * But it would probably just be a waste of time searching,
+	 * or lead to too many vmas hanging off the same anon_vma.
+	 * We're trying to allow mprotect remerging later on,
+	 * not trying to minimize memory used for anon_vmas.
+	 */
+	return NULL;
+}
+
+/*
  * The caller must hold down_write(current->mm->mmap_sem).
  */
 
--- rmap39/mm/rmap.c	2004-05-18 20:51:40.083407504 +0100
+++ rmap40/mm/rmap.c	2004-05-18 20:51:53.065433936 +0100
@@ -66,21 +66,26 @@ int anon_vma_prepare(struct vm_area_stru
 	might_sleep();
 	if (unlikely(!anon_vma)) {
 		struct mm_struct *mm = vma->vm_mm;
+		struct anon_vma *allocated = NULL;
 
-		anon_vma = anon_vma_alloc();
-		if (unlikely(!anon_vma))
-			return -ENOMEM;
+		anon_vma = find_mergeable_anon_vma(vma);
+		if (!anon_vma) {
+			anon_vma = anon_vma_alloc();
+			if (unlikely(!anon_vma))
+				return -ENOMEM;
+			allocated = anon_vma;
+		}
 
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			list_add(&vma->anon_vma_node, &anon_vma->head);
-			anon_vma = NULL;
+			allocated = NULL;
 		}
 		spin_unlock(&mm->page_table_lock);
-		if (unlikely(anon_vma))
-			anon_vma_free(anon_vma);
+		if (unlikely(allocated))
+			anon_vma_free(allocated);
 	}
 	return 0;
 }


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread

Thread overview: 7+ messages
2004-05-18 22:06 [PATCH] rmap 34 vm_flags page_table_lock Hugh Dickins
2004-05-18 22:07 ` [PATCH] rmap 35 mmap.c cleanups Hugh Dickins
2004-05-18 22:07 ` [PATCH] rmap 36 mprotect use vma_merge Hugh Dickins
2004-05-18 22:08 ` [PATCH] rmap 37 page_add_anon_rmap vma Hugh Dickins
2004-05-18 22:10 ` [PATCH] rmap 38 remove anonmm rmap Hugh Dickins
2004-05-18 22:11 ` [PATCH] rmap 39 add anon_vma rmap Hugh Dickins
2004-05-18 22:12 ` [PATCH] rmap 40 better anon_vma sharing Hugh Dickins
