LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH -v6 0/2] Fixing the issue with memory-mapped file times
@ 2008-01-17 22:31 Anton Salikhmetov
  2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-17 22:31 UTC (permalink / raw)
  To: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	miklos, r.e.wolff, hidave.darkstar, hch

This is the sixth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes according to the latest feedback
   for the cleanup patch;

2) implementation of the following suggestion by Miklos Szeredi:

http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all
tests were performed with all recommended kernel debug options
enabled. Also note that the tests were performed on regular files
residing on both an ext3 partition and a tmpfs filesystem. I also
checked the block device case, which worked for me as well.

1. My own unit test:

http://bugzilla.kernel.org/attachment.cgi?id=14430

Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

http://lkml.org/lkml/2008/1/14/104

Result: this test produced the following output:

debian-64:~# ./miklos_test test
begin   1200598736      1200598736      1200598617
write   1200598737      1200598737      1200598617
mmap    1200598737      1200598737      1200598738
b       1200598739      1200598739      1200598738
msync b 1200598739      1200598739      1200598738
c       1200598741      1200598741      1200598738
msync c 1200598741      1200598741      1200598738
d       1200598743      1200598743      1200598738
munmap  1200598743      1200598743      1200598738
close   1200598743      1200598743      1200598738
sync    1200598743      1200598743      1200598738
debian-64:~#

3. Regression tests were performed using the following test cases from
the LTP test suite:

	msync01
	msync02
	msync03
	msync04
	msync05
	mmapstress01
	mmapstress09
	mmapstress10

Result: no regressions were found while running these test cases.

4. Performance test was done using the program available from the
following link:

http://bugzilla.kernel.org/attachment.cgi?id=14493

Result: the impact of the changes was negligible for files of a few
hundred megabytes.

I wonder if these changes can be applied now.

These patches are the result of many fruitful discussions with
Miklos Szeredi, Peter Zijlstra, Rik van Riel, Peter Staubach,
and Jacob Oestergaard. I am grateful to you all for your support
during the days I was working on this long-standing nasty bug.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH -v6 1/2] Massive code cleanup of sys_msync()
  2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov
@ 2008-01-17 22:31 ` Anton Salikhmetov
  2008-01-18  9:33   ` Miklos Szeredi
  2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov
  2008-01-18  9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi
  2 siblings, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-17 22:31 UTC (permalink / raw)
  To: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	miklos, r.e.wolff, hidave.darkstar, hch

Using the PAGE_ALIGN() macro instead of "manual" alignment and
improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
 mm/msync.c |   77 ++++++++++++++++++++++++++++--------------------------------
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- *	linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <salikhmetov@gmail.com>
  */
 
-/*
- * The msync() system call.
- */
+#include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/syscalls.h>
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
 	unsigned long end;
-	struct mm_struct *mm = current->mm;
+	int error, unmapped_error;
 	struct vm_area_struct *vma;
-	int unmapped_error = 0;
-	int error = -EINVAL;
+	struct mm_struct *mm;
 
+	error = -EINVAL;
 	if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
 		goto out;
 	if (start & ~PAGE_MASK)
 		goto out;
 	if ((flags & MS_ASYNC) && (flags & MS_SYNC))
 		goto out;
+
 	error = -ENOMEM;
-	len = (len + ~PAGE_MASK) & PAGE_MASK;
+	len = PAGE_ALIGN(len);
 	end = start + len;
 	if (end < start)
 		goto out;
-	error = 0;
+
+	error = unmapped_error = 0;
 	if (end == start)
 		goto out;
+
 	/*
 	 * If the interval [start,end) covers some unmapped address ranges,
 	 * just ignore them, but return -ENOMEM at the end.
 	 */
+	mm = current->mm;
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
-	for (;;) {
+	do {
 		struct file *file;
 
-		/* Still start < end. */
 		error = -ENOMEM;
 		if (!vma)
-			goto out_unlock;
-		/* Here start < vma->vm_end. */
+			break;
 		if (start < vma->vm_start) {
 			start = vma->vm_start;
 			if (start >= end)
-				goto out_unlock;
-			unmapped_error = -ENOMEM;
-		}
-		/* Here vma->vm_start <= start < vma->vm_end. */
-		if ((flags & MS_INVALIDATE) &&
-				(vma->vm_flags & VM_LOCKED)) {
-			error = -EBUSY;
-			goto out_unlock;
+				break;
+			unmapped_error = error;
 		}
-		file = vma->vm_file;
+
+		error = -EBUSY;
+		if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+			break;
+
+		error = 0;
 		start = vma->vm_end;
-		if ((flags & MS_SYNC) && file &&
-				(vma->vm_flags & VM_SHARED)) {
+		file = vma->vm_file;
+		if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
 			get_file(file);
 			up_read(&mm->mmap_sem);
 			error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 				goto out;
 			down_read(&mm->mmap_sem);
 			vma = find_vma(mm, start);
-		} else {
-			if (start >= end) {
-				error = 0;
-				goto out_unlock;
-			}
-			vma = vma->vm_next;
+			continue;
 		}
-	}
-out_unlock:
+
+		vma = vma->vm_next;
+	} while (start < end);
 	up_read(&mm->mmap_sem);
+
 out:
-	return error ? : unmapped_error;
+	return error ? error : unmapped_error;
 }
-- 
1.4.4.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov
  2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
@ 2008-01-17 22:31 ` Anton Salikhmetov
  2008-01-18  9:51   ` Miklos Szeredi
  2008-01-18  9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi
  2 siblings, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-17 22:31 UTC (permalink / raw)
  To: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	miklos, r.e.wolff, hidave.darkstar, hch

Updating file times at write references to memory-mapped files and
forcing file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
 mm/memory.c |    6 ++++++
 mm/msync.c  |   52 +++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
+
 		/*
 		 * Yes, Virginia, this is actually required to prevent a race
 		 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
 	if (anon)
 		page_cache_release(vmf.page);
 	else if (dirty_page) {
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
+
 		set_page_dirty_balance(dirty_page, page_mkwrite);
 		put_page(dirty_page);
 	}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include <linux/syscalls.h>
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+	unsigned long addr;
+
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+		spinlock_t *ptl;
+		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+		pud_t *pud = pud_offset(pgd, addr);
+		pmd_t *pmd = pmd_offset(pud, addr);
+		pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+		if (pte_dirty(*pte) && pte_write(*pte))
+			*pte = pte_wrprotect(*pte);
+		pte_unmap_unlock(pte, ptl);
+	}
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 		error = 0;
 		start = vma->vm_end;
 		file = vma->vm_file;
-		if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-			get_file(file);
-			up_read(&mm->mmap_sem);
-			error = do_fsync(file, 0);
-			fput(file);
-			if (error || start >= end)
-				goto out;
-			down_read(&mm->mmap_sem);
-			vma = find_vma(mm, start);
-			continue;
+		if (file && (vma->vm_flags & VM_SHARED)) {
+			if (flags & MS_ASYNC)
+				vma_wrprotect(vma);
+			if (flags & MS_SYNC) {
+				get_file(file);
+				up_read(&mm->mmap_sem);
+				error = do_fsync(file, 0);
+				fput(file);
+				if (error || start >= end)
+					goto out;
+				down_read(&mm->mmap_sem);
+				vma = find_vma(mm, start);
+				continue;
+			}
 		}
 
 		vma = vma->vm_next;
-- 
1.4.4.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()
  2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
@ 2008-01-18  9:33   ` Miklos Szeredi
  2008-01-18 10:30     ` Anton Salikhmetov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18  9:33 UTC (permalink / raw)
  To: salikhmetov
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	miklos, r.e.wolff, hidave.darkstar, hch

>  	unsigned long end;
> -	struct mm_struct *mm = current->mm;
> +	int error, unmapped_error;
>  	struct vm_area_struct *vma;
> -	int unmapped_error = 0;
> -	int error = -EINVAL;
> +	struct mm_struct *mm;
>  
> +	error = -EINVAL;

I think you may have misunderstood my last comment.  These are OK:

	struct mm_struct *mm = current->mm;
	int unmapped_error = 0;
	int error = -EINVAL;

This is not so good:

	int error, unmapped_error;

This is the worst:

	int error = -EINVAL, unmapped_error = 0;

So I think the original code is fine as it is.

Othewise patch looks OK now.

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times
  2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov
  2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
  2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov
@ 2008-01-18  9:40 ` Miklos Szeredi
  2008-01-18 10:31   ` Anton Salikhmetov
  2008-01-18 19:48   ` Anton Salikhmetov
  2 siblings, 2 replies; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18  9:40 UTC (permalink / raw)
  To: salikhmetov
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	miklos, r.e.wolff, hidave.darkstar, hch

> 4. Performance test was done using the program available from the
> following link:
> 
> http://bugzilla.kernel.org/attachment.cgi?id=14493
> 
> Result: the impact of the changes was negligible for files of a few
> hundred megabytes.

Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
nanosecond timestamps, so the time updating code would be exercised
more during the page faults.

What about performance impact on msync(MS_ASYNC)?  Could you please do
some measurment of that as well?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov
@ 2008-01-18  9:51   ` Miklos Szeredi
  2008-01-18 10:15     ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18  9:51 UTC (permalink / raw)
  To: salikhmetov
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	miklos, r.e.wolff, hidave.darkstar, hch

> Updating file times at write references to memory-mapped files and
> forcing file times update at the next write reference after
> calling the msync() system call with the MS_ASYNC flag.
> 
> Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
> ---
>  mm/memory.c |    6 ++++++
>  mm/msync.c  |   52 +++++++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4bf0b6d..13d5bbf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1668,6 +1668,9 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (dirty_page) {
> +		if (vma->vm_file)
> +			file_update_time(vma->vm_file);
> +
>  		/*
>  		 * Yes, Virginia, this is actually required to prevent a race
>  		 * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2341,6 +2344,9 @@ out_unlocked:
>  	if (anon)
>  		page_cache_release(vmf.page);
>  	else if (dirty_page) {
> +		if (vma->vm_file)
> +			file_update_time(vma->vm_file);
> +
>  		set_page_dirty_balance(dirty_page, page_mkwrite);
>  		put_page(dirty_page);
>  	}
> diff --git a/mm/msync.c b/mm/msync.c
> index a4de868..a49af28 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -13,11 +13,33 @@
>  #include <linux/syscalls.h>
>  
>  /*
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> +	unsigned long addr;
> +
> +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> +		spinlock_t *ptl;
> +		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> +		pud_t *pud = pud_offset(pgd, addr);
> +		pmd_t *pmd = pmd_offset(pud, addr);
> +		pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +		if (pte_dirty(*pte) && pte_write(*pte))
> +			*pte = pte_wrprotect(*pte);
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +}

What about ram based filesystems?  They don't start out with read-only
pte's, so I think they don't want them read-protected now either.
Unless this is essential for correct mtime/ctime accounting on these
filesystems (I don't think it really is).  But then the mapping should
start out read-only as well, otherwise the time update will only work
after an msync(MS_ASYNC).

> +
> +/*
>   * MS_SYNC syncs the entire file - including mappings.
>   *
> - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> - * Now it doesn't do anything, since dirty pages are properly tracked.
> + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> + * read-only by calling vma_wrprotect(). This is needed to catch the next
> + * write reference to the mapped region and update the file times
> + * accordingly.
>   *
>   * The application may now run fsync() to write out the dirty pages and
>   * wait on the writeout and check the result. Or the application may run
> @@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
>  		error = 0;
>  		start = vma->vm_end;
>  		file = vma->vm_file;
> -		if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> -			get_file(file);
> -			up_read(&mm->mmap_sem);
> -			error = do_fsync(file, 0);
> -			fput(file);
> -			if (error || start >= end)
> -				goto out;
> -			down_read(&mm->mmap_sem);
> -			vma = find_vma(mm, start);
> -			continue;
> +		if (file && (vma->vm_flags & VM_SHARED)) {
> +			if (flags & MS_ASYNC)
> +				vma_wrprotect(vma);
> +			if (flags & MS_SYNC) {
> +				get_file(file);
> +				up_read(&mm->mmap_sem);
> +				error = do_fsync(file, 0);
> +				fput(file);
> +				if (error || start >= end)
> +					goto out;
> +				down_read(&mm->mmap_sem);
> +				vma = find_vma(mm, start);
> +				continue;
> +			}
>  		}
>  
>  		vma = vma->vm_next;
> -- 
> 1.4.4.4
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18  9:51   ` Miklos Szeredi
@ 2008-01-18 10:15     ` Peter Zijlstra
  2008-01-18 10:25       ` Peter Zijlstra
  2008-01-18 10:38       ` Miklos Szeredi
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2008-01-18 10:15 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks,
	riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch


On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:

> > diff --git a/mm/msync.c b/mm/msync.c
> > index a4de868..a49af28 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -13,11 +13,33 @@
> >  #include <linux/syscalls.h>
> >  
> >  /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > +	unsigned long addr;
> > +
> > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > +		spinlock_t *ptl;
> > +		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > +		pud_t *pud = pud_offset(pgd, addr);
> > +		pmd_t *pmd = pmd_offset(pud, addr);
> > +		pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +
> > +		if (pte_dirty(*pte) && pte_write(*pte))
> > +			*pte = pte_wrprotect(*pte);
> > +		pte_unmap_unlock(pte, ptl);
> > +	}
> > +}
> 
> What about ram based filesystems?  They don't start out with read-only
> pte's, so I think they don't want them read-protected now either.
> Unless this is essential for correct mtime/ctime accounting on these
> filesystems (I don't think it really is).  But then the mapping should
> start out read-only as well, otherwise the time update will only work
> after an msync(MS_ASYNC).

page_mkclean() has all the needed logic for this, it also walks the rmap
and cleans out all other users, which I think is needed too for
consistencies sake:

Process A			Process B

mmap(foo.txt)			mmap(foo.txt)

dirty page
				dirty page

msync(MS_ASYNC)

				dirty page

msync(MS_ASYNC) <--- now what?!


So what I would suggest is using the page table walkers from mm, and
walks the page range, obtain the page using vm_normal_page() and call
page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)

All in all, that sounds rather expensive..




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 10:15     ` Peter Zijlstra
@ 2008-01-18 10:25       ` Peter Zijlstra
  2008-01-18 10:39         ` Anton Salikhmetov
  2008-01-18 17:58         ` Linus Torvalds
  2008-01-18 10:38       ` Miklos Szeredi
  1 sibling, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2008-01-18 10:25 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks,
	riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch


On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> 
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..a49af28 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -13,11 +13,33 @@
> > >  #include <linux/syscalls.h>
> > >  
> > >  /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > +	unsigned long addr;
> > > +
> > > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > +		spinlock_t *ptl;
> > > +		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > +		pud_t *pud = pud_offset(pgd, addr);
> > > +		pmd_t *pmd = pmd_offset(pud, addr);
> > > +		pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > +		if (pte_dirty(*pte) && pte_write(*pte))
> > > +			*pte = pte_wrprotect(*pte);
> > > +		pte_unmap_unlock(pte, ptl);
> > > +	}
> > > +}
> > 
> > What about ram based filesystems?  They don't start out with read-only
> > pte's, so I think they don't want them read-protected now either.
> > Unless this is essential for correct mtime/ctime accounting on these
> > filesystems (I don't think it really is).  But then the mapping should
> > start out read-only as well, otherwise the time update will only work
> > after an msync(MS_ASYNC).
> 
> page_mkclean() has all the needed logic for this, it also walks the rmap
> and cleans out all other users, which I think is needed too for
> consistencies sake:
> 
> Process A			Process B
> 
> mmap(foo.txt)			mmap(foo.txt)
> 
> dirty page
> 				dirty page
> 
> msync(MS_ASYNC)
> 
> 				dirty page
> 
> msync(MS_ASYNC) <--- now what?!
> 
> 
> So what I would suggest is using the page table walkers from mm, and
> walks the page range, obtain the page using vm_normal_page() and call
> page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> 
> All in all, that sounds rather expensive..

Bah, and will break on s390... so we'd need a page_mkclean() variant
that doesn't actually clear dirty.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()
  2008-01-18  9:33   ` Miklos Szeredi
@ 2008-01-18 10:30     ` Anton Salikhmetov
  0 siblings, 0 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 10:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch

2008/1/18, Miklos Szeredi <miklos@szeredi.hu>:
> >       unsigned long end;
> > -     struct mm_struct *mm = current->mm;
> > +     int error, unmapped_error;
> >       struct vm_area_struct *vma;
> > -     int unmapped_error = 0;
> > -     int error = -EINVAL;
> > +     struct mm_struct *mm;
> >
> > +     error = -EINVAL;
>
> I think you may have misunderstood my last comment.  These are OK:
>
>         struct mm_struct *mm = current->mm;
>         int unmapped_error = 0;
>         int error = -EINVAL;
>
> This is not so good:
>
>         int error, unmapped_error;
>
> This is the worst:
>
>         int error = -EINVAL, unmapped_error = 0;
>
> So I think the original code is fine as it is.
>
> Othewise patch looks OK now.

I moved the initialization of the variables to the code where they are needed.

I don't agree that "int a; int b;" is better than "int a, b".

>
> Miklos
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times
  2008-01-18  9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi
@ 2008-01-18 10:31   ` Anton Salikhmetov
  2008-01-18 19:48   ` Anton Salikhmetov
  1 sibling, 0 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 10:31 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch

2008/1/18, Miklos Szeredi <miklos@szeredi.hu>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

I'll do the measurements for the MS_ASYNC case and for the Ext4 filesystem.

>
> Thanks,
> Miklos
>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 10:15     ` Peter Zijlstra
  2008-01-18 10:25       ` Peter Zijlstra
@ 2008-01-18 10:38       ` Miklos Szeredi
  2008-01-18 11:00         ` Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 10:38 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: miklos, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds,
	akpm, protasnb, r.e.wolff, hidave.darkstar, hch

> On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> 
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..a49af28 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -13,11 +13,33 @@
> > >  #include <linux/syscalls.h>
> > >  
> > >  /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > +	unsigned long addr;
> > > +
> > > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > +		spinlock_t *ptl;
> > > +		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > +		pud_t *pud = pud_offset(pgd, addr);
> > > +		pmd_t *pmd = pmd_offset(pud, addr);
> > > +		pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > +		if (pte_dirty(*pte) && pte_write(*pte))
> > > +			*pte = pte_wrprotect(*pte);
> > > +		pte_unmap_unlock(pte, ptl);
> > > +	}
> > > +}
> > 
> > What about ram based filesystems?  They don't start out with read-only
> > pte's, so I think they don't want them read-protected now either.
> > Unless this is essential for correct mtime/ctime accounting on these
> > filesystems (I don't think it really is).  But then the mapping should
> > start out read-only as well, otherwise the time update will only work
> > after an msync(MS_ASYNC).
> 
> page_mkclean() has all the needed logic for this, it also walks the rmap
> and cleans out all other users, which I think is needed too for
> consistencies sake:
> 
> Process A			Process B
> 
> mmap(foo.txt)			mmap(foo.txt)
> 
> dirty page
> 				dirty page
> 
> msync(MS_ASYNC)
> 
> 				dirty page
> 
> msync(MS_ASYNC) <--- now what?!

Nothing.  I think it's perfectly acceptable behavior, if msync in
process A doesn't care about any dirtying in process B.

> All in all, that sounds rather expensive..

Right.  The advantage of Anton's current approach, is that it's at
least simple, and possibly not so expensive, while providing same
quite sane semantics for MS_ASYNC vs. mtime updates.

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 10:25       ` Peter Zijlstra
@ 2008-01-18 10:39         ` Anton Salikhmetov
  2008-01-18 17:58         ` Linus Torvalds
  1 sibling, 0 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Miklos Szeredi, linux-mm, jakob, linux-kernel, valdis.kletnieks,
	riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch

2008/1/18, Peter Zijlstra <peterz@infradead.org>:
>
> On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> >
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > >  #include <linux/syscalls.h>
> > > >
> > > >  /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > + unsigned long addr;
> > > > +
> > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > +         spinlock_t *ptl;
> > > > +         pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > +         pud_t *pud = pud_offset(pgd, addr);
> > > > +         pmd_t *pmd = pmd_offset(pud, addr);
> > > > +         pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +
> > > > +         if (pte_dirty(*pte) && pte_write(*pte))
> > > > +                 *pte = pte_wrprotect(*pte);
> > > > +         pte_unmap_unlock(pte, ptl);
> > > > + }
> > > > +}
> > >
> > > What about ram based filesystems?  They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is).  But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> >
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> >
> > Process A                     Process B
> >
> > mmap(foo.txt)                 mmap(foo.txt)
> >
> > dirty page
> >                               dirty page
> >
> > msync(MS_ASYNC)
> >
> >                               dirty page
> >
> > msync(MS_ASYNC) <--- now what?!
> >
> >
> > So what I would suggest is using the page table walkers from mm, and
> > walks the page range, obtain the page using vm_normal_page() and call
> > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> >
> > All in all, that sounds rather expensive..
>
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.

>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 10:38       ` Miklos Szeredi
@ 2008-01-18 11:00         ` Peter Zijlstra
  2008-01-18 11:17           ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2008-01-18 11:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks,
	riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch


On Fri, 2008-01-18 at 11:38 +0100, Miklos Szeredi wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> > 
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > >  #include <linux/syscalls.h>
> > > >  
> > > >  /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > +	unsigned long addr;
> > > > +
> > > > +	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > +		spinlock_t *ptl;
> > > > +		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > +		pud_t *pud = pud_offset(pgd, addr);
> > > > +		pmd_t *pmd = pmd_offset(pud, addr);
> > > > +		pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +
> > > > +		if (pte_dirty(*pte) && pte_write(*pte))
> > > > +			*pte = pte_wrprotect(*pte);
> > > > +		pte_unmap_unlock(pte, ptl);
> > > > +	}
> > > > +}
> > > 
> > > What about ram based filesystems?  They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is).  But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> > 
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> > 
> > Process A			Process B
> > 
> > mmap(foo.txt)			mmap(foo.txt)
> > 
> > dirty page
> > 				dirty page
> > 
> > msync(MS_ASYNC)
> > 
> > 				dirty page
> > 
> > msync(MS_ASYNC) <--- now what?!

how about:

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a1b3fc6 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -14,6 +14,122 @@
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 
+unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm,
+		unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	do {
+		pte_t ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		if (pte_dirty(ptent) && pte_write(ptent)) {
+			flush_cache_page(vma, addr, pte_pfn(ptent));
+			ptent = ptep_clear_flush(vma, addr, pte);
+			ptent = pte_wrprotect(ptent);
+			set_pte_at(vma->vm_mnm, addr, pte, ptent);
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte - 1, ptl);
+
+	return addr;
+}
+
+unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+		unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		next = masync_pte_range(vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+
+	return addr;
+}
+
+unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+	       	unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		next = masync_pmd_range(vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+
+	return addr;
+}
+
+unsigned long masync_pgd_range()
+{
+	pgd_t *pgd;
+	unsigned long next;
+
+	pgd = pgd_offset(vma->vm_mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_of_clear_bad(pgd))
+			continue;
+		next = masync_pud_range(vma, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+
+	return addr;
+}
+
+int masync_vma_one(struct vm_area_struct *vma,
+		unsigned long start, unsigned long end)
+{
+	if (start < vma->vm_start)
+		start = vma->vm_start;
+
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	masync_pgd_range(vma, start, end);
+
+	return 0;
+}
+
+int masync_vma(struct vm_area_struct *vma, 
+		unsigned long start, unsigned long end)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma_iter;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return 0;
+
+	mapping = vma->vm_file->f_mapping;
+
+	if (!mapping_cap_account_dirty(mapping))
+		return 0;
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
+		masync_vma_one(vma_iter, start, end);
+	spin_unlock(&mapping->i_mmap_lock);
+
+	return 0;
+}
+
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *




^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 11:00         ` Peter Zijlstra
@ 2008-01-18 11:17           ` Miklos Szeredi
  2008-01-18 11:23             ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 11:17 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: miklos, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds,
	akpm, protasnb, r.e.wolff, hidave.darkstar, hch

> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..a1b3fc6 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -14,6 +14,122 @@
>  #include <linux/syscalls.h>
>  #include <linux/sched.h>
>  
> +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm,
> +		unsigned long addr, unsigned long end)
> +{
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +	do {
> +		pte_t ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		if (pte_dirty(ptent) && pte_write(ptent)) {
> +			flush_cache_page(vma, addr, pte_pfn(ptent));

Hmm, I'm not sure flush_cache_page() is needed.  Or does does dirty
data in the cache somehow interfere with the page protection?

> +			ptent = ptep_clear_flush(vma, addr, pte);
> +			ptent = pte_wrprotect(ptent);
> +			set_pte_at(vma->vm_mnm, addr, pte, ptent);
> +		}
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(pte - 1, ptl);
> +
> +	return addr;
> +}
> +
> +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> +		unsigned long addr, unsigned long end)
> +{
> +	pmd_t *pmd;
> +	unsigned long next;
> +
> +	pmd = pmd_offset(pud, addr);
> +	do {
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none_or_clear_bad(pmd))
> +			continue;
> +		next = masync_pte_range(vma, pmd, addr, next);
> +	} while (pmd++, addr = next, addr != end);
> +
> +	return addr;
> +}
> +
> +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
> +	       	unsigned long addr, unsigned long end)
> +{
> +	pud_t *pud;
> +	unsigned long next;
> +
> +	pud = pud_offset(pgd, addr);
> +	do {
> +		next = pud_addr_end(addr, end);
> +		if (pud_none_or_clear_bad(pud))
> +			continue;
> +		next = masync_pmd_range(vma, pud, addr, next);
> +	} while (pud++, addr = next, addr != end);
> +
> +	return addr;
> +}
> +
> +unsigned long masync_pgd_range()
> +{
> +	pgd_t *pgd;
> +	unsigned long next;
> +
> +	pgd = pgd_offset(vma->vm_mm, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none_of_clear_bad(pgd))
> +			continue;
> +		next = masync_pud_range(vma, pgd, addr, next);
> +	} while (pgd++, addr = next, addr != end);
> +
> +	return addr;
> +}
> +
> +int masync_vma_one(struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end)
> +{
> +	if (start < vma->vm_start)
> +		start = vma->vm_start;
> +
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;
> +
> +	masync_pgd_range(vma, start, end);
> +
> +	return 0;
> +}
> +
> +int masync_vma(struct vm_area_struct *vma, 
> +		unsigned long start, unsigned long end)
> +{
> +	struct address_space *mapping;
> +	struct vm_area_struct *vma_iter;
> +
> +	if (!(vma->vm_flags & VM_SHARED))
> +		return 0;
> +
> +	mapping = vma->vm_file->f_mapping;
> +
> +	if (!mapping_cap_account_dirty(mapping))
> +		return 0;
> +
> +	spin_lock(&mapping->i_mmap_lock);
> +	vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
> +		masync_vma_one(vma_iter, start, end);
> +	spin_unlock(&mapping->i_mmap_lock);

This is hoding i_mmap_lock for possibly quite long.  Isn't that going
to cause problems?

Miklos

> +
> +	return 0;
> +}
> +
>  /*
>   * MS_SYNC syncs the entire file - including mappings.
>   *
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 11:17           ` Miklos Szeredi
@ 2008-01-18 11:23             ` Peter Zijlstra
  2008-01-18 11:36               ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2008-01-18 11:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks,
	riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch


On Fri, 2008-01-18 at 12:17 +0100, Miklos Szeredi wrote:
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..a1b3fc6 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -14,6 +14,122 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/sched.h>
> >  
> > +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm,
> > +		unsigned long addr, unsigned long end)
> > +{
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> > +
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	arch_enter_lazy_mmu_mode();
> > +	do {
> > +		pte_t ptent = *pte;
> > +
> > +		if (pte_none(ptent))
> > +			continue;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		if (pte_dirty(ptent) && pte_write(ptent)) {
> > +			flush_cache_page(vma, addr, pte_pfn(ptent));
> 
> Hmm, I'm not sure flush_cache_page() is needed.  Or does does dirty
> data in the cache somehow interfere with the page protection?

No, just being paranoid..

> > +			ptent = ptep_clear_flush(vma, addr, pte);
> > +			ptent = pte_wrprotect(ptent);
> > +			set_pte_at(vma->vm_mnm, addr, pte, ptent);
> > +		}
> > +	} while (pte++, addr += PAGE_SIZE, addr != end);
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(pte - 1, ptl);
> > +
> > +	return addr;
> > +}
> > +
> > +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> > +		unsigned long addr, unsigned long end)
> > +{
> > +	pmd_t *pmd;
> > +	unsigned long next;
> > +
> > +	pmd = pmd_offset(pud, addr);
> > +	do {
> > +		next = pmd_addr_end(addr, end);
> > +		if (pmd_none_or_clear_bad(pmd))
> > +			continue;
> > +		next = masync_pte_range(vma, pmd, addr, next);
> > +	} while (pmd++, addr = next, addr != end);
> > +
> > +	return addr;
> > +}
> > +
> > +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
> > +	       	unsigned long addr, unsigned long end)
> > +{
> > +	pud_t *pud;
> > +	unsigned long next;
> > +
> > +	pud = pud_offset(pgd, addr);
> > +	do {
> > +		next = pud_addr_end(addr, end);
> > +		if (pud_none_or_clear_bad(pud))
> > +			continue;
> > +		next = masync_pmd_range(vma, pud, addr, next);
> > +	} while (pud++, addr = next, addr != end);
> > +
> > +	return addr;
> > +}
> > +
> > +unsigned long masync_pgd_range()
> > +{
> > +	pgd_t *pgd;
> > +	unsigned long next;
> > +
> > +	pgd = pgd_offset(vma->vm_mm, addr);
> > +	do {
> > +		next = pgd_addr_end(addr, end);
> > +		if (pgd_none_of_clear_bad(pgd))
> > +			continue;
> > +		next = masync_pud_range(vma, pgd, addr, next);
> > +	} while (pgd++, addr = next, addr != end);
> > +
> > +	return addr;
> > +}
> > +
> > +int masync_vma_one(struct vm_area_struct *vma,
> > +		unsigned long start, unsigned long end)
> > +{
> > +	if (start < vma->vm_start)
> > +		start = vma->vm_start;
> > +
> > +	if (end > vma->vm_end)
> > +		end = vma->vm_end;
> > +
> > +	masync_pgd_range(vma, start, end);
> > +
> > +	return 0;
> > +}
> > +
> > +int masync_vma(struct vm_area_struct *vma, 
> > +		unsigned long start, unsigned long end)
> > +{
> > +	struct address_space *mapping;
> > +	struct vm_area_struct *vma_iter;
> > +
> > +	if (!(vma->vm_flags & VM_SHARED))
> > +		return 0;
> > +
> > +	mapping = vma->vm_file->f_mapping;
> > +
> > +	if (!mapping_cap_account_dirty(mapping))
> > +		return 0;
> > +
> > +	spin_lock(&mapping->i_mmap_lock);
> > +	vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
> > +		masync_vma_one(vma_iter, start, end);
> > +	spin_unlock(&mapping->i_mmap_lock);
> 
> This is hoding i_mmap_lock for possibly quite long.  Isn't that going
> to cause problems?

Possibly, I didn't see a quick way to break that iteration.
>From a quick glance at prio_tree.c the iterator isn't valid anymore
after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

I also realized I forgot to copy/paste the prio_tree_iter declaration
and ought to make all these functions static.

But for a quick draft it conveys the idea pretty well, I guess :-)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 11:23             ` Peter Zijlstra
@ 2008-01-18 11:36               ` Miklos Szeredi
  0 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 11:36 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: miklos, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds,
	akpm, protasnb, r.e.wolff, hidave.darkstar, hch

> Possibly, I didn't see a quick way to break that iteration.
> >From a quick glance at prio_tree.c the iterator isn't valid anymore
> after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

Maybe i_mmap_lock isn't needed at all, since msync holds mmap_sem,
which protects the prio tree as well, no?

> I also realized I forgot to copy/paste the prio_tree_iter declaration
> and ought to make all these functions static.
> 
> But for a quick draft it conveys the idea pretty well, I guess :-)

Yes :)

There could also be nasty performance corner cases, like having a huge
file mapped thousands of times, and having only a couple of pages
dirtied between MS_ASYNC invocations.  Then most of that page table
walking would be just unnecessary overhead.

There's something to be said for walking only the dirty pages, and
doing page_mkclean on them, even if in some cases that would be
slower.

But I have a strong feeling of deja vu, and last time it ended with
Andrew not liking the whole thing...

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 10:25       ` Peter Zijlstra
  2008-01-18 10:39         ` Anton Salikhmetov
@ 2008-01-18 17:58         ` Linus Torvalds
  2008-01-18 18:11           ` Miklos Szeredi
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 17:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Miklos Szeredi, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Fri, 18 Jan 2008, Peter Zijlstra wrote:
> 
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

No, we simply want to not play all these very expensive games with dirty 
in the first place.

Guys, mmap access times aren't important enough for this. It's not 
specified closely enough, and people don't care enough.

Of the patches around so far, the best one by far seems to be the simple 
four-liner from Miklos.

And even in that four-liner, I suspect that the *last* two lines are 
actually incorrect: there's no point in updating the file time when the 
page *becomes* dirty, we should update the file time when it is marked 
clean, and "msync(MS_SYNC)" should update it as part of *that*.

So I think the file time update should be part of just the page writeout 
logic, not by msync() or page faulting itself or anything like that.

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 17:58         ` Linus Torvalds
@ 2008-01-18 18:11           ` Miklos Szeredi
  2008-01-18 18:28             ` Rik van Riel
  2008-01-18 18:43             ` Linus Torvalds
  0 siblings, 2 replies; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 18:11 UTC (permalink / raw)
  To: torvalds
  Cc: peterz, miklos, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

> And even in that four-liner, I suspect that the *last* two lines are 
> actually incorrect: there's no point in updating the file time when the 
> page *becomes* dirty,

Actually all four lines do that.  The first two for a write access on
a present, read-only pte, the other two for a write on a non-present
pte.

> we should update the file time when it is marked 
> clean, and "msync(MS_SYNC)" should update it as part of *that*.

That would need a new page flag (PG_mmap_dirty?).  Do we have one
available?

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 18:11           ` Miklos Szeredi
@ 2008-01-18 18:28             ` Rik van Riel
  2008-01-18 18:51               ` Miklos Szeredi
  2008-01-18 18:43             ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2008-01-18 18:28 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: torvalds, peterz, miklos, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

On Fri, 18 Jan 2008 19:11:47 +0100
Miklos Szeredi <miklos@szeredi.hu> wrote:

> > And even in that four-liner, I suspect that the *last* two lines are 
> > actually incorrect: there's no point in updating the file time when the 
> > page *becomes* dirty,
> 
> Actually all four lines do that.  The first two for a write access on
> a present, read-only pte, the other two for a write on a non-present
> pte.
> 
> > we should update the file time when it is marked 
> > clean, and "msync(MS_SYNC)" should update it as part of *that*.
> 
> That would need a new page flag (PG_mmap_dirty?).  Do we have one
> available?

I thought the page writing stuff looked at (and cleared) the pte
dirty bit, too?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 18:11           ` Miklos Szeredi
  2008-01-18 18:28             ` Rik van Riel
@ 2008-01-18 18:43             ` Linus Torvalds
  2008-01-18 18:57               ` Miklos Szeredi
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 18:43 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: peterz, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> 
> That would need a new page flag (PG_mmap_dirty?).  Do we have one
> available?

Yeah, that would be bad. We probably have flags free, but those page flags 
are always a pain. Scratch that.

How about just setting a per-vma dirty flag, and then instead of updating 
the mtime when taking the dirty-page fault, we just set that flag?

Then, on unmap and msync, we just do

	if (vma->dirty-flag) {
		vma->dirty_flag = 0;
		update_file_times(vma->vm_file);
	}

and be done with it? 
		

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 18:28             ` Rik van Riel
@ 2008-01-18 18:51               ` Miklos Szeredi
  0 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 18:51 UTC (permalink / raw)
  To: riel
  Cc: miklos, torvalds, peterz, miklos, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

> > > And even in that four-liner, I suspect that the *last* two lines are 
> > > actually incorrect: there's no point in updating the file time when the 
> > > page *becomes* dirty,
> > 
> > Actually all four lines do that.  The first two for a write access on
> > a present, read-only pte, the other two for a write on a non-present
> > pte.
> > 
> > > we should update the file time when it is marked 
> > > clean, and "msync(MS_SYNC)" should update it as part of *that*.
> > 
> > That would need a new page flag (PG_mmap_dirty?).  Do we have one
> > available?
> 
> I thought the page writing stuff looked at (and cleared) the pte
> dirty bit, too?

Yeah, it does.  Hmm...

What happens on munmap?  The times _could_ get updated from there as
well, but it's getting complicated.

Miklos


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 18:43             ` Linus Torvalds
@ 2008-01-18 18:57               ` Miklos Szeredi
  2008-01-18 19:08                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 18:57 UTC (permalink / raw)
  To: torvalds
  Cc: miklos, peterz, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

> > That would need a new page flag (PG_mmap_dirty?).  Do we have one
> > available?
> 
> Yeah, that would be bad. We probably have flags free, but those page flags 
> are always a pain. Scratch that.
> 
> How about just setting a per-vma dirty flag, and then instead of updating 
> the mtime when taking the dirty-page fault, we just set that flag?
> 
> Then, on unmap and msync, we just do
> 
> 	if (vma->dirty-flag) {
> 		vma->dirty_flag = 0;
> 		update_file_times(vma->vm_file);
> 	}
> 
> and be done with it? 

But then background writeout, sync(2), etc, wouldn't update the times.
Dunno.  I don't think actual _physical_ writeout matters much, so it's
not worse to be 30s early with the timestamp, than to be 30s or more
late.

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 18:57               ` Miklos Szeredi
@ 2008-01-18 19:08                 ` Linus Torvalds
  2008-01-18 19:22                   ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 19:08 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: peterz, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> 
> But then background writeout, sync(2), etc, wouldn't update the times.

Sure it would, but only when doing the final unmap.

Did you miss the "on unmap and msync" part?

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 19:08                 ` Linus Torvalds
@ 2008-01-18 19:22                   ` Miklos Szeredi
  2008-01-18 19:35                     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 19:22 UTC (permalink / raw)
  To: torvalds
  Cc: miklos, peterz, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

> > 
> > But then background writeout, sync(2), etc, wouldn't update the times.
> 
> Sure it would, but only when doing the final unmap.
> 
> Did you miss the "on unmap and msync" part?

No :)

What I'm saying is that the times could be left un-updated for a long
time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

If program has this pattern:

mmap()
write to map
msync(MS_ASYNC)
sleep(long)
write to map
msync(MS_ASYNC)
sleep(long)
...

Then we'd never see time updates (until the program exits, but that
could be years).

Maybe this doesn't matter, I'm just saying this is a disadvantage
compared to the "update on first dirtying" approach, which would
ensure, that times are updated at least once per 30s.

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 19:22                   ` Miklos Szeredi
@ 2008-01-18 19:35                     ` Linus Torvalds
  2008-01-18 19:58                       ` Anton Salikhmetov
  2008-01-18 22:32                       ` Ingo Oeser
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 19:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: peterz, salikhmetov, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> 
> What I'm saying is that the times could be left un-updated for a long
> time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

Sure.

But in those circumstances, the programmer cannot depend on the mtime 
*anyway* (because there is no synchronization), so what's the downside?

Let's face it, there's exactly three possible solutions:

 - the insane one: trap EVERY SINGLE instruction that does a write to the 
   page, and update mtime each and every time.

   This one is so obviously STUPID that it's not even worth discussing 
   further, except to say that "yes, there is an 'exact' algorithm, but 
   no, we are never EVER going to use it".

 - the non-exact solutions that don't give you mtime updates every time 
   a write to the page happens, but give *some* guarantees for things that 
   will update it.

   This is the one I think we can do, and the only things a programmer can 
   impact using it is "msync()" and "munmap()", since no other operations 
   really have any thing to do with it in a programmer-visible way (ie a 
   normal "sync" operation may happen in the background and has no 
   progam-relevant timing information)

   Other things *may* or may not update mtime (some filesystems - take
   most networked one as an example - will *always* update mtime on the 
   server on writeback, so we cannot ever guarantee that nothing but 
   msync/munmap does so), but at least we'll have a minimum set of things 
   that people can depend on.

 - the "we don't care at all solutions".

   mmap(MAP_WRITE) doesn't really update times reliably after the write 
   has happened (but might do it *before* - maybe the mmap() itself does).

Those are the three choices, I think. We currently approximate #3. We 
*can* do #2 (and there are various flavors of it). And even *aiming* for 
#1 is totally insane and stupid.

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times
  2008-01-18  9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi
  2008-01-18 10:31   ` Anton Salikhmetov
@ 2008-01-18 19:48   ` Anton Salikhmetov
  2008-01-19 10:45     ` Miklos Szeredi
  1 sibling, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 19:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm,
	staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb,
	r.e.wolff, hidave.darkstar, hch

2008/1/18, Miklos Szeredi <miklos@szeredi.hu>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

Did a quick test on an ext4 partition. This is how it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin   1200662360      1200662360      1200662353
write   1200662361      1200662361      1200662353
mmap    1200662361      1200662361      1200662362
b       1200662363      1200662363      1200662362
msync b 1200662363      1200662363      1200662362
c       1200662365      1200662365      1200662362
msync c 1200662365      1200662365      1200662362
d       1200662367      1200662367      1200662362
munmap  1200662367      1200662367      1200662362
close   1200662367      1200662367      1200662362
sync    1200662367      1200662367      1200662362
debian:~/miklos# mount | grep /mnt
/root/image.ext4 on /mnt type ext4dev (rw,loop=/dev/loop0)

> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

Following are the results of the measurements. Here are the relevant
portions of the test program:

>>>

#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
        tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
        p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(&tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(&tv_stop, NULL);

<<<

For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.

2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB),
   Memory: 15.98 GB.

3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs.
   Memory: 4G.

The tests of my solution were performed using the following platform:

A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R) Core(TM)2
Duo CPU T7300 @ 2.00GHz. Further referred to as "the first case".

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a raw
with a delay of one second between consecutive runs. On Linux
systems, the following commands were issued prior to running the
tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 30000 >/proc/sys/vm/dirty_expire_centisecs
sync

Table 1. Reference platforms.

------------------------------------------------------------
|             | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD    |
------------------------------------------------------------
| First run   | 263405 usec   | 202283 usec   | 90 SECONDS |
------------------------------------------------------------
| Second run  | 262253 usec   | 172837 usec   | 90 SECONDS |
------------------------------------------------------------
| Third run   | 238465 usec   | 238465 usec   | 90 SECONDS |
------------------------------------------------------------

It looks like FreeBSD is a clear outsider here. Note that FreeBSD
showed an almost liner depencence of the time spent in the
msync(MS_ASYNC) call on the file size.

Table 2. The Qemu system. File size is 512M.

---------------------------------------------------
|            | Before the patch | After the patch |
---------------------------------------------------
| First run  |     35 usec      |   5852 usec     |
---------------------------------------------------
| Second run |     35 usec      |   4444 usec     |
---------------------------------------------------
| Third run  |     35 usec      |   6330 usec     |
---------------------------------------------------

I think that the data above prove the viability of the solution I
presented. Also, I guess that this bug fix is most probably ready
for getting upstream.

Please apply the sixth version of my solution.

>
> Thanks,
> Miklos
>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 19:35                     ` Linus Torvalds
@ 2008-01-18 19:58                       ` Anton Salikhmetov
  2008-01-18 20:22                         ` Linus Torvalds
  2008-01-18 22:32                       ` Ingo Oeser
  1 sibling, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

2008/1/18, Linus Torvalds <torvalds@linux-foundation.org>:
>
>
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> >
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
>
> Sure.
>
> But in those circumstances, the programmer cannot depend on the mtime
> *anyway* (because there is no synchronization), so what's the downside?
>
> Let's face it, there's exactly three possible solutions:
>
>  - the insane one: trap EVERY SINGLE instruction that does a write to the
>    page, and update mtime each and every time.
>
>    This one is so obviously STUPID that it's not even worth discussing
>    further, except to say that "yes, there is an 'exact' algorithm, but
>    no, we are never EVER going to use it".
>
>  - the non-exact solutions that don't give you mtime updates every time
>    a write to the page happens, but give *some* guarantees for things that
>    will update it.
>
>    This is the one I think we can do, and the only things a programmer can
>    impact using it is "msync()" and "munmap()", since no other operations
>    really have any thing to do with it in a programmer-visible way (ie a
>    normal "sync" operation may happen in the background and has no
>    progam-relevant timing information)
>
>    Other things *may* or may not update mtime (some filesystems - take
>    most networked one as an example - will *always* update mtime on the
>    server on writeback, so we cannot ever guarantee that nothing but
>    msync/munmap does so), but at least we'll have a minimum set of things
>    that people can depend on.
>
>  - the "we don't care at all solutions".
>
>    mmap(MAP_WRITE) doesn't really update times reliably after the write
>    has happened (but might do it *before* - maybe the mmap() itself does).
>
> Those are the three choices, I think. We currently approximate #3. We
> *can* do #2 (and there are various flavors of it). And even *aiming* for
> #1 is totally insane and stupid.

The current solution doesn't hit the performance at all when compared to
the competitor POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread "-v6 0/2":

http://lkml.org/lkml/2008/1/18/447

I guess, the current solution is ready to use.

>
>                         Linus
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 19:58                       ` Anton Salikhmetov
@ 2008-01-18 20:22                         ` Linus Torvalds
  2008-01-18 21:03                           ` Anton Salikhmetov
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 20:22 UTC (permalink / raw)
  To: Anton Salikhmetov
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> 
> The current solution doesn't hit the performance at all when compared to
> the competitor POSIX-compliant systems. It is faster and does even more
> than the POSIX standard requires.

Your current patches have two problems:
 - they are simply unnecessarily invasive for a relatively simple issue
 - all versions I've looked at closer are buggy too

Example:

	+               if (pte_dirty(*pte) && pte_write(*pte))
	+                       *pte = pte_wrprotect(*pte);

Uhhuh. Looks simple enough. Except it does a non-atomic pte access while 
other CPU's may be accessing it and updating it from their hw page table 
walkers. What will happen? Who knows? I can see lost access bits at a 
minimum.

IOW, this isn't simple code. It's code that it is simple to screw up. In 
this case, you really need to use ptep_set_wrprotect(), for example.

So why not do it in many fewer lines with that simpler vma->dirty flag?

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 20:22                         ` Linus Torvalds
@ 2008-01-18 21:03                           ` Anton Salikhmetov
  2008-01-18 21:27                             ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

2008/1/18, Linus Torvalds <torvalds@linux-foundation.org>:
>
>
> On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> >
> > The current solution doesn't hit the performance at all when compared to
> > the competitor POSIX-compliant systems. It is faster and does even more
> > than the POSIX standard requires.
>
> Your current patches have two problems:
>  - they are simply unnecessarily invasive for a relatively simple issue
>  - all versions I've looked at closer are buggy too
>
> Example:
>
>         +               if (pte_dirty(*pte) && pte_write(*pte))
>         +                       *pte = pte_wrprotect(*pte);
>
> Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
> other CPU's may be accessing it and updating it from their hw page table
> walkers. What will happen? Who knows? I can see lost access bits at a
> minimum.
>
> IOW, this isn't simple code. It's code that it is simple to screw up. In
> this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.


>
> So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions solve the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating system such as HP-UX and FreeBSD
satisfy this POSIX requirement. Linux does not.

>
>                 Linus
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 21:03                           ` Anton Salikhmetov
@ 2008-01-18 21:27                             ` Linus Torvalds
  2008-01-18 22:04                               ` Anton Salikhmetov
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 21:27 UTC (permalink / raw)
  To: Anton Salikhmetov
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> 
> Before using pte_wrprotect() the vma_wrprotect() routine uses the
> pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> spinlock. Why did you say that this code was not SMP-safe? It should
> be atomic, I think.

It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

Guess how much another x86 CPU cares when it sets the accessed bit in 
hardware?

> The POSIX standard requires the ctime and mtime stamps to be updated
> not later than at the second call to msync() with the MS_ASYNC flag.

.. and that is no excuse for bad code.

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 21:27                             ` Linus Torvalds
@ 2008-01-18 22:04                               ` Anton Salikhmetov
  2008-01-18 22:21                                 ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 22:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

2008/1/19, Linus Torvalds <torvalds@linux-foundation.org>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > Before using pte_wrprotect() the vma_wrprotect() routine uses the
> > pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> > spinlock. Why did you say that this code was not SMP-safe? It should
> > be atomic, I think.
>
> It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.
>
> Guess how much another x86 CPU cares when it sets the accessed bit in
> hardware?

Thank you very much for taking part in this discussion. Personally,
it's very important to me.  But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

        pgd = pgd_offset(mm, address);
        if (!pgd_present(*pgd))
                return NULL;

        pud = pud_offset(pgd, address);
        if (!pud_present(*pud))
                return NULL;

        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
                return NULL;

        pte = pte_offset_map(pmd, address);
        /* Make a quick check before getting the lock */
        if (!pte_present(*pte)) {
                pte_unmap(pte);
                return NULL;
        }

        ptl = pte_lockptr(mm, pmd);
        spin_lock(ptl);
        if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
                *ptlp = ptl;
                return pte;
        }
        pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

        pte = page_check_address(page, mm, address, &ptl);
        if (!pte)
                goto out;

        if (pte_dirty(*pte) || pte_write(*pte)) {
                pte_t entry;

                flush_cache_page(vma, address, pte_pfn(*pte));
                entry = ptep_clear_flush(vma, address, pte);
                entry = pte_wrprotect(entry);
                entry = pte_mkclean(entry);
                set_pte_at(mm, address, pte, entry);
                ret = 1;
        }

        pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking a complete idiot now,
however I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, that's
why I basically reused the design.

>
> > The POSIX standard requires the ctime and mtime stamps to be updated
> > not later than at the second call to msync() with the MS_ASYNC flag.
>
> .. and that is no excuse for bad code.
>
>                         Linus
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:04                               ` Anton Salikhmetov
@ 2008-01-18 22:21                                 ` Linus Torvalds
  2008-01-18 22:35                                   ` Anton Salikhmetov
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 22:21 UTC (permalink / raw)
  To: Anton Salikhmetov
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch



On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> 
> The page_check_address() function is called from the
> page_mkclean_one() routine as follows:

.. and the page_mkclean_one() function is totally different.

Lookie here, this is the correct and complex sequence:

>                 entry = ptep_clear_flush(vma, address, pte);
>                 entry = pte_wrprotect(entry);
>                 entry = pte_mkclean(entry);
>                 set_pte_at(mm, address, pte, entry);

That's a rather expensive sequence, but it's done exactly because it has 
to be done that way. What it does is to

 - *atomically* load the pte entry _and_ clear the old one in memory.

   That's the

	entry = ptep_clear_flush(vma, address, pte);

   thing, and it basically means that it's doing some 
   architecture-specific magic to make sure that another CPU that accesses 
   the PTE at the same time will never actually modify the pte (because 
   it's clear and not valid)

 - it then - while the page table is actually clear and invalid - takes 
   the old value and turns it into the new one:

	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);

 - and finally, it replaces the entry with the new one:

	set_pte_at(mm, address, pte, entry);

   which takes care to write the new entry in some specific way that is 
   atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table 
   entry it writes the high word first, see the write barriers in 
   "native_set_pte()" in include/asm-x86/pgtable-3level.h

Now, compare that subtle and correct thing with what is *not* correct:

	if (pte_dirty(*pte) && pte_write(*pte))
		*pte = pte_wrprotect(*pte);

which makes no effort at all to make sure that it's safe in case another 
CPU updates the accessed bit.

Now, arguably it's unlikely to cause horrible problems at least on x86, 
because:

 - we only do this if the pte is already marked dirty, so while we can 
   lose the accessed bit, we can *not* lose the dirty bit. And the 
   accessed bit isn't such a big deal.

 - it's not doing any of the "be careful about" ordering things, but since 
   the really important bits aren't changing, ordering probably won't 
   practically matter.

But the problem is that we have something like 24 different architectures, 
it's hard to make sure that none of them have issues. 

In other words: it may well work in practice. But when these things go 
subtly wrong, they are *really* nasty to find, and the unsafe sequence is 
really not how it's supposed to be done. For example, you don't even flush 
the TLB, so even if there are no cross-CPU issues, there's probably going 
to be writable entries in the TLB that now don't match the page tables.

Will it matter? Again, probably impossible to see in practice. But ...

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 19:35                     ` Linus Torvalds
  2008-01-18 19:58                       ` Anton Salikhmetov
@ 2008-01-18 22:32                       ` Ingo Oeser
  2008-01-18 22:47                         ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Ingo Oeser @ 2008-01-18 22:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl,
	akpm, protasnb, r.e.wolff, hidave.darkstar, hch

Hi Linus,

On Friday 18 January 2008, Linus Torvalds wrote:
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> > 
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
> 
> Sure.
> 
> But in those circumstances, the programmer cannot depend on the mtime 
> *anyway* (because there is no synchronization), so what's the downside?

Can we get "if the write to the page hits the disk, the mtime has hit the disk
already no less than SOME_GRANULARITY before"? 

That is very important for computer forensics. Esp. in saving your ass!

Ok, now back again to making that fast :-)


Best Regards

Ingo Oeser

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:21                                 ` Linus Torvalds
@ 2008-01-18 22:35                                   ` Anton Salikhmetov
  0 siblings, 0 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 22:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
	valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

2008/1/19, Linus Torvalds <torvalds@linux-foundation.org>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > The page_check_address() function is called from the
> > page_mkclean_one() routine as follows:
>
> .. and the page_mkclean_one() function is totally different.
>
> Lookie here, this is the correct and complex sequence:
>
> >                 entry = ptep_clear_flush(vma, address, pte);
> >                 entry = pte_wrprotect(entry);
> >                 entry = pte_mkclean(entry);
> >                 set_pte_at(mm, address, pte, entry);
>
> That's a rather expensive sequence, but it's done exactly because it has
> to be done that way. What it does is to
>
>  - *atomically* load the pte entry _and_ clear the old one in memory.
>
>    That's the
>
>         entry = ptep_clear_flush(vma, address, pte);
>
>    thing, and it basically means that it's doing some
>    architecture-specific magic to make sure that another CPU that accesses
>    the PTE at the same time will never actually modify the pte (because
>    it's clear and not valid)
>
>  - it then - while the page table is actually clear and invalid - takes
>    the old value and turns it into the new one:
>
>         entry = pte_wrprotect(entry);
>         entry = pte_mkclean(entry);
>
>  - and finally, it replaces the entry with the new one:
>
>         set_pte_at(mm, address, pte, entry);
>
>    which takes care to write the new entry in some specific way that is
>    atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
>    entry it writes the high word first, see the write barriers in
>    "native_set_pte()" in include/asm-x86/pgtable-3level.h
>
> Now, compare that subtle and correct thing with what is *not* correct:
>
>         if (pte_dirty(*pte) && pte_write(*pte))
>                 *pte = pte_wrprotect(*pte);
>
> which makes no effort at all to make sure that it's safe in case another
> CPU updates the accessed bit.
>
> Now, arguably it's unlikely to cause horrible problems at least on x86,
> because:
>
>  - we only do this if the pte is already marked dirty, so while we can
>    lose the accessed bit, we can *not* lose the dirty bit. And the
>    accessed bit isn't such a big deal.
>
>  - it's not doing any of the "be careful about" ordering things, but since
>    the really important bits aren't changing, ordering probably won't
>    practically matter.
>
> But the problem is that we have something like 24 different architectures,
> it's hard to make sure that none of them have issues.
>
> In other words: it may well work in practice. But when these things go
> subtly wrong, they are *really* nasty to find, and the unsafe sequence is
> really not how it's supposed to be done. For example, you don't even flush
> the TLB, so even if there are no cross-CPU issues, there's probably going
> to be writable entries in the TLB that now don't match the page tables.
>
> Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton

>
>                 Linus
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:32                       ` Ingo Oeser
@ 2008-01-18 22:47                         ` Linus Torvalds
  2008-01-18 22:54                           ` Rik van Riel
  2008-01-21 14:25                           ` Peter Staubach
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 22:47 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl,
	akpm, protasnb, r.e.wolff, hidave.darkstar, hch



On Fri, 18 Jan 2008, Ingo Oeser wrote:
> 
> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> already no less than SOME_GRANULARITY before"? 
> 
> That is very important for computer forensics. Esp. in saving your ass!
> 
> Ok, now back again to making that fast :-)

I certainly don't mind it if we have some tighter guarantees, but what I'd 
want is:

 - keep it simple. Let's face it, Linux has never ever given those 
   guarantees before, and it's not is if anybody has really cared. Even 
   now, the issue seems to be more about paper standards conformance than 
   anything else.

 - I get worried about people playing around with the dirty bit in 
   particular. We have had some really rather nasty bugs here. Most of 
   which are totally impossible to trigger under normal loads (for 
   example the old random-access utorrent writable mmap issue from about 
   a year ago).

So these two issues - the big red danger signs flashing in my brain, 
coupled with the fact that no application has apparently ever really 
noticed in the last 15 years - just makes it a case where I'd like each 
step of the way to be obvious and simple and no larger than really 
absolutely necessary.

			Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:47                         ` Linus Torvalds
@ 2008-01-18 22:54                           ` Rik van Riel
  2008-01-19  0:50                             ` Matt Mackall
  2008-01-21 14:25                           ` Peter Staubach
  1 sibling, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2008-01-18 22:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Oeser, Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

On Fri, 18 Jan 2008 14:47:33 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

>  - keep it simple. Let's face it, Linux has never ever given those 
>    guarantees before, and it's not is if anybody has really cared. Even 
>    now, the issue seems to be more about paper standards conformance than 
>    anything else.

There is one issue which is way more than just standards conformance.

When a program changes file data through mmap(), at some point the
mtime needs to be update so that backup programs know to back up the
new version of the file.

Backup programs not seeing an updated mtime is a really big deal.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:54                           ` Rik van Riel
@ 2008-01-19  0:50                             ` Matt Mackall
  2008-01-19  4:25                               ` Rik van Riel
  2008-01-19 10:22                               ` Miklos Szeredi
  0 siblings, 2 replies; 43+ messages in thread
From: Matt Mackall @ 2008-01-19  0:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Ingo Oeser, Miklos Szeredi, peterz, salikhmetov,
	linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach,
	jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch


On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote:
> On Fri, 18 Jan 2008 14:47:33 -0800 (PST)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> >  - keep it simple. Let's face it, Linux has never ever given those 
> >    guarantees before, and it's not is if anybody has really cared. Even 
> >    now, the issue seems to be more about paper standards conformance than 
> >    anything else.
> 
> There is one issue which is way more than just standards conformance.
> 
> When a program changes file data through mmap(), at some point the
> mtime needs to be update so that backup programs know to back up the
> new version of the file.
> 
> Backup programs not seeing an updated mtime is a really big deal.

And that's fixed with the 4-line approach.

Reminds me, I've got a patch here for addressing that problem with loop mounts:

Writes to loop should update the mtime of the underlying file.

Signed-off-by: Matt Mackall <mpm@selenic.com>

Index: l/drivers/block/loop.c
===================================================================
--- l.orig/drivers/block/loop.c	2007-11-05 17:50:07.000000000 -0600
+++ l/drivers/block/loop.c	2007-11-05 19:03:51.000000000 -0600
@@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
 	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
 	bv_offs = bvec->bv_offset;
 	len = bvec->bv_len;
+	file_update_time(file);
 	while (len > 0) {
 		sector_t IV;
 		unsigned size;
@@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
 
 	set_fs(get_ds());
 	bw = file->f_op->write(file, buf, len, &pos);
+	file_update_time(file);
 	set_fs(old_fs);
 	if (likely(bw == len))
 		return 0;


-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-19  0:50                             ` Matt Mackall
@ 2008-01-19  4:25                               ` Rik van Riel
  2008-01-19 10:22                               ` Miklos Szeredi
  1 sibling, 0 replies; 43+ messages in thread
From: Rik van Riel @ 2008-01-19  4:25 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Linus Torvalds, Ingo Oeser, Miklos Szeredi, peterz, salikhmetov,
	linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach,
	jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch

On Fri, 18 Jan 2008 18:50:03 -0600
Matt Mackall <mpm@selenic.com> wrote:
> On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote:

> > Backup programs not seeing an updated mtime is a really big deal.
> 
> And that's fixed with the 4-line approach.
> 
> Reminds me, I've got a patch here for addressing that problem with loop mounts:
> 
> Writes to loop should update the mtime of the underlying file.
> 
> Signed-off-by: Matt Mackall <mpm@selenic.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-19  0:50                             ` Matt Mackall
  2008-01-19  4:25                               ` Rik van Riel
@ 2008-01-19 10:22                               ` Miklos Szeredi
  2008-01-19 15:49                                 ` Matt Mackall
  1 sibling, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-19 10:22 UTC (permalink / raw)
  To: mpm
  Cc: riel, torvalds, ioe-lkml, miklos, peterz, salikhmetov, linux-mm,
	jakob, linux-kernel, valdis.kletnieks, ksm, staubach,
	jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch

> Reminds me, I've got a patch here for addressing that problem with loop mounts:
> 
> Writes to loop should update the mtime of the underlying file.
> 
> Signed-off-by: Matt Mackall <mpm@selenic.com>
> 
> Index: l/drivers/block/loop.c
> ===================================================================
> --- l.orig/drivers/block/loop.c	2007-11-05 17:50:07.000000000 -0600
> +++ l/drivers/block/loop.c	2007-11-05 19:03:51.000000000 -0600
> @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
>  	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
>  	bv_offs = bvec->bv_offset;
>  	len = bvec->bv_len;
> +	file_update_time(file);
>  	while (len > 0) {
>  		sector_t IV;
>  		unsigned size;
> @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
>  
>  	set_fs(get_ds());
>  	bw = file->f_op->write(file, buf, len, &pos);
> +	file_update_time(file);

->write should have already updated the times, no?

>  	set_fs(old_fs);
>  	if (likely(bw == len))
>  		return 0;
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times
  2008-01-18 19:48   ` Anton Salikhmetov
@ 2008-01-19 10:45     ` Miklos Szeredi
  0 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-19 10:45 UTC (permalink / raw)
  To: salikhmetov
  Cc: miklos, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel,
	ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

> 2008/1/18, Miklos Szeredi <miklos@szeredi.hu>:
> > > 4. Performance test was done using the program available from the
> > > following link:
> > >
> > > http://bugzilla.kernel.org/attachment.cgi?id=14493
> > >
> > > Result: the impact of the changes was negligible for files of a few
> > > hundred megabytes.
> >
> > Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> > nanosecond timestamps, so the time updating code would be exercised
> > more during the page faults.
> >
> > What about performance impact on msync(MS_ASYNC)?  Could you please do
> > some measurment of that as well?
> 
> Did a quick test on an ext4 partition. This is how it looks like:

Thanks for running these tests.

I was more interested in the slowdown on ext4 (checked with the above
mentioned program).  Can you do such a test as well, and post
resulting times with and without the patch?

> Table 1. Reference platforms.
> 
> ------------------------------------------------------------
> |             | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD    |
> ------------------------------------------------------------
> | First run   | 263405 usec   | 202283 usec   | 90 SECONDS |
> ------------------------------------------------------------
> | Second run  | 262253 usec   | 172837 usec   | 90 SECONDS |
> ------------------------------------------------------------
> | Third run   | 238465 usec   | 238465 usec   | 90 SECONDS |
> ------------------------------------------------------------
> 
> It looks like FreeBSD is a clear outsider here. Note that FreeBSD
> showed an almost liner depencence of the time spent in the
> msync(MS_ASYNC) call on the file size.
> 
> Table 2. The Qemu system. File size is 512M.
> 
> ---------------------------------------------------
> |            | Before the patch | After the patch |
> ---------------------------------------------------
> | First run  |     35 usec      |   5852 usec     |
> ---------------------------------------------------
> | Second run |     35 usec      |   4444 usec     |
> ---------------------------------------------------
> | Third run  |     35 usec      |   6330 usec     |
> ---------------------------------------------------

Interesting.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-19 10:22                               ` Miklos Szeredi
@ 2008-01-19 15:49                                 ` Matt Mackall
  0 siblings, 0 replies; 43+ messages in thread
From: Matt Mackall @ 2008-01-19 15:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: riel, torvalds, ioe-lkml, peterz, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch


On Sat, 2008-01-19 at 11:22 +0100, Miklos Szeredi wrote:
> > Reminds me, I've got a patch here for addressing that problem with loop mounts:
> > 
> > Writes to loop should update the mtime of the underlying file.
> > 
> > Signed-off-by: Matt Mackall <mpm@selenic.com>
> > 
> > Index: l/drivers/block/loop.c
> > ===================================================================
> > --- l.orig/drivers/block/loop.c	2007-11-05 17:50:07.000000000 -0600
> > +++ l/drivers/block/loop.c	2007-11-05 19:03:51.000000000 -0600
> > @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
> >  	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
> >  	bv_offs = bvec->bv_offset;
> >  	len = bvec->bv_len;
> > +	file_update_time(file);
> >  	while (len > 0) {
> >  		sector_t IV;
> >  		unsigned size;
> > @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
> >  
> >  	set_fs(get_ds());
> >  	bw = file->f_op->write(file, buf, len, &pos);
> > +	file_update_time(file);
> 
> ->write should have already updated the times, no?

Yes, this second case is redundant. Still needed in the first case.

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:47                         ` Linus Torvalds
  2008-01-18 22:54                           ` Rik van Riel
@ 2008-01-21 14:25                           ` Peter Staubach
  2008-01-21 14:36                             ` Anton Salikhmetov
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Staubach @ 2008-01-21 14:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Oeser, Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob,
	linux-kernel, valdis.kletnieks, riel, ksm, jesper.juhl, akpm,
	protasnb, r.e.wolff, hidave.darkstar, hch

Linus Torvalds wrote:
> On Fri, 18 Jan 2008, Ingo Oeser wrote:
>   
>> Can we get "if the write to the page hits the disk, the mtime has hit the disk
>> already no less than SOME_GRANULARITY before"? 
>>
>> That is very important for computer forensics. Esp. in saving your ass!
>>
>> Ok, now back again to making that fast :-)
>>     
>
> I certainly don't mind it if we have some tighter guarantees, but what I'd 
> want is:
>
>  - keep it simple. Let's face it, Linux has never ever given those 
>    guarantees before, and it's not is if anybody has really cared. Even 
>    now, the issue seems to be more about paper standards conformance than 
>    anything else.
>
>   

I have been working on getting something supported here for
because I have some very large Wall Street customers who do
care about getting the mtime updated because their backups
are getting corrupted.  They are incomplete because although
their applications update files, they don't get backed up
because the mtime never changes.

>  - I get worried about people playing around with the dirty bit in 
>    particular. We have had some really rather nasty bugs here. Most of 
>    which are totally impossible to trigger under normal loads (for 
>    example the old random-access utorrent writable mmap issue from about 
>    a year ago).
>
> So these two issues - the big red danger signs flashing in my brain, 
> coupled with the fact that no application has apparently ever really 
> noticed in the last 15 years - just makes it a case where I'd like each 
> step of the way to be obvious and simple and no larger than really 
> absolutely necessary.

Simple is good.  However, too simple is not good.  I would suggest
that we implement file time updates which make sense and if they
happen to follow POSIX, then nifty, otherwise, oh well.

    Thanx...

       ps

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-21 14:25                           ` Peter Staubach
@ 2008-01-21 14:36                             ` Anton Salikhmetov
  0 siblings, 0 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-21 14:36 UTC (permalink / raw)
  To: Peter Staubach
  Cc: Linus Torvalds, Ingo Oeser, Miklos Szeredi, peterz, linux-mm,
	jakob, linux-kernel, valdis.kletnieks, riel, ksm, jesper.juhl,
	akpm, protasnb, r.e.wolff, hidave.darkstar, hch

2008/1/21, Peter Staubach <staubach@redhat.com>:
> Linus Torvalds wrote:
> > On Fri, 18 Jan 2008, Ingo Oeser wrote:
> >
> >> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> >> already no less than SOME_GRANULARITY before"?
> >>
> >> That is very important for computer forensics. Esp. in saving your ass!
> >>
> >> Ok, now back again to making that fast :-)
> >>
> >
> > I certainly don't mind it if we have some tighter guarantees, but what I'd
> > want is:
> >
> >  - keep it simple. Let's face it, Linux has never ever given those
> >    guarantees before, and it's not is if anybody has really cared. Even
> >    now, the issue seems to be more about paper standards conformance than
> >    anything else.
> >
> >
>
> I have been working on getting something supported here for
> because I have some very large Wall Street customers who do
> care about getting the mtime updated because their backups
> are getting corrupted.  They are incomplete because although
> their applications update files, they don't get backed up
> because the mtime never changes.
>
> >  - I get worried about people playing around with the dirty bit in
> >    particular. We have had some really rather nasty bugs here. Most of
> >    which are totally impossible to trigger under normal loads (for
> >    example the old random-access utorrent writable mmap issue from about
> >    a year ago).
> >
> > So these two issues - the big red danger signs flashing in my brain,
> > coupled with the fact that no application has apparently ever really
> > noticed in the last 15 years - just makes it a case where I'd like each
> > step of the way to be obvious and simple and no larger than really
> > absolutely necessary.
>
> Simple is good.  However, too simple is not good.  I would suggest
> that we implement file time updates which make sense and if they
> happen to follow POSIX, then nifty, otherwise, oh well.

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
series, and the performance tests results soon.

>
>     Thanx...
>
>        ps
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2008-01-21 14:36 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov
2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
2008-01-18  9:33   ` Miklos Szeredi
2008-01-18 10:30     ` Anton Salikhmetov
2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov
2008-01-18  9:51   ` Miklos Szeredi
2008-01-18 10:15     ` Peter Zijlstra
2008-01-18 10:25       ` Peter Zijlstra
2008-01-18 10:39         ` Anton Salikhmetov
2008-01-18 17:58         ` Linus Torvalds
2008-01-18 18:11           ` Miklos Szeredi
2008-01-18 18:28             ` Rik van Riel
2008-01-18 18:51               ` Miklos Szeredi
2008-01-18 18:43             ` Linus Torvalds
2008-01-18 18:57               ` Miklos Szeredi
2008-01-18 19:08                 ` Linus Torvalds
2008-01-18 19:22                   ` Miklos Szeredi
2008-01-18 19:35                     ` Linus Torvalds
2008-01-18 19:58                       ` Anton Salikhmetov
2008-01-18 20:22                         ` Linus Torvalds
2008-01-18 21:03                           ` Anton Salikhmetov
2008-01-18 21:27                             ` Linus Torvalds
2008-01-18 22:04                               ` Anton Salikhmetov
2008-01-18 22:21                                 ` Linus Torvalds
2008-01-18 22:35                                   ` Anton Salikhmetov
2008-01-18 22:32                       ` Ingo Oeser
2008-01-18 22:47                         ` Linus Torvalds
2008-01-18 22:54                           ` Rik van Riel
2008-01-19  0:50                             ` Matt Mackall
2008-01-19  4:25                               ` Rik van Riel
2008-01-19 10:22                               ` Miklos Szeredi
2008-01-19 15:49                                 ` Matt Mackall
2008-01-21 14:25                           ` Peter Staubach
2008-01-21 14:36                             ` Anton Salikhmetov
2008-01-18 10:38       ` Miklos Szeredi
2008-01-18 11:00         ` Peter Zijlstra
2008-01-18 11:17           ` Miklos Szeredi
2008-01-18 11:23             ` Peter Zijlstra
2008-01-18 11:36               ` Miklos Szeredi
2008-01-18  9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi
2008-01-18 10:31   ` Anton Salikhmetov
2008-01-18 19:48   ` Anton Salikhmetov
2008-01-19 10:45     ` Miklos Szeredi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).