LKML Archive on lore.kernel.org
* [PATCH -v6 0/2] Fixing the issue with memory-mapped file times
@ 2008-01-17 22:31 Anton Salikhmetov
  2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
  ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-17 22:31 UTC (permalink / raw)
  To: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach,
      jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, miklos,
      r.e.wolff, hidave.darkstar, hch

This is the sixth version of my solution for bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes according to the latest feedback for the
   cleanup patch;

2) implementation of the following suggestion by Miklos Szeredi:

   http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all tests
were performed with all recommended kernel debug options enabled. Also
note that the tests were performed on regular files residing on both an
ext3 partition and a tmpfs filesystem. I also checked the block device
case, which worked for me as well.

1. My own unit test:

   http://bugzilla.kernel.org/attachment.cgi?id=14430

   Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

   http://lkml.org/lkml/2008/1/14/104

   Result: this test produced the following output:

   debian-64:~# ./miklos_test test
   begin   1200598736 1200598736 1200598617
   write   1200598737 1200598737 1200598617
   mmap    1200598737 1200598737 1200598738
   b       1200598739 1200598739 1200598738
   msync b 1200598739 1200598739 1200598738
   c       1200598741 1200598741 1200598738
   msync c 1200598741 1200598741 1200598738
   d       1200598743 1200598743 1200598738
   munmap  1200598743 1200598743 1200598738
   close   1200598743 1200598743 1200598738
   sync    1200598743 1200598743 1200598738
   debian-64:~#

3. Regression tests were performed using the following test cases from
   the LTP test suite:

   msync01 msync02 msync03 msync04 msync05
   mmapstress01 mmapstress09 mmapstress10

   Result: no regressions were found while running these test cases.

4. Performance testing was done using the program available from the
   following link:

   http://bugzilla.kernel.org/attachment.cgi?id=14493

   Result: the impact of the changes was negligible for files of a few
   hundred megabytes.

I wonder if these changes can be applied now.

These patches are the result of many fruitful discussions with Miklos
Szeredi, Peter Zijlstra, Rik van Riel, Peter Staubach, and Jakob
Oestergaard. I am grateful to you all for your support during the days
I was working on this long-standing nasty bug.

^ permalink raw reply	[flat|nested] 43+ messages in thread
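[For illustration, the following is a minimal sketch of the kind of
stage-by-stage timestamp check these unit tests perform. It is not the
actual miklos_test program; the scratch file name, mapping size, and
stage labels are assumptions made for this sketch.]

/*
 * Sketch of an mmap/msync timestamp check in the spirit of the tests
 * above.  "testfile" and the 2-second delays are arbitrary choices.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static void show(const char *stage, int fd)
{
	struct stat st;

	fstat(fd, &st);
	printf("%-8s %ld %ld %ld\n", stage,
	       (long)st.st_atime, (long)st.st_mtime, (long)st.st_ctime);
}

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	char *map;

	if (fd < 0 || ftruncate(fd, 4096) != 0)
		return 1;
	show("begin", fd);

	sleep(2);
	if (write(fd, "x", 1) != 1)	/* ordinary write(2) updates times */
		return 1;
	show("write", fd);

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	show("mmap", fd);

	sleep(2);
	map[0] = 'b';			/* write through the mapping */
	show("b", fd);

	msync(map, 4096, MS_ASYNC);
	show("msync b", fd);		/* mtime/ctime should advance no
					   later than here */

	munmap(map, 4096);
	show("munmap", fd);

	close(fd);
	return 0;
}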
* [PATCH -v6 1/2] Massive code cleanup of sys_msync() 2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov @ 2008-01-17 22:31 ` Anton Salikhmetov 2008-01-18 9:33 ` Miklos Szeredi 2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov 2008-01-18 9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi 2 siblings, 1 reply; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-17 22:31 UTC (permalink / raw) To: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, miklos, r.e.wolff, hidave.darkstar, hch Using the PAGE_ALIGN() macro instead of "manual" alignment and improved readability of the loop traversing the process memory regions. Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com> --- mm/msync.c | 77 ++++++++++++++++++++++++++++-------------------------------- 1 files changed, 36 insertions(+), 41 deletions(-) diff --git a/mm/msync.c b/mm/msync.c index 144a757..a4de868 100644 --- a/mm/msync.c +++ b/mm/msync.c @@ -1,85 +1,83 @@ /* - * linux/mm/msync.c + * The msync() system call. * - * Copyright (C) 1994-1999 Linus Torvalds + * Copyright (C) 1994-1999 Linus Torvalds + * Copyright (C) 2008 Anton Salikhmetov <salikhmetov@gmail.com> */ -/* - * The msync() system call. - */ +#include <linux/file.h> #include <linux/fs.h> #include <linux/mm.h> #include <linux/mman.h> -#include <linux/file.h> -#include <linux/syscalls.h> #include <linux/sched.h> +#include <linux/syscalls.h> /* * MS_SYNC syncs the entire file - including mappings. * * MS_ASYNC does not start I/O (it used to, up to 2.5.67). - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17). + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17). * Now it doesn't do anything, since dirty pages are properly tracked. * - * The application may now run fsync() to - * write out the dirty pages and wait on the writeout and check the result. - * Or the application may run fadvise(FADV_DONTNEED) against the fd to start - * async writeout immediately. + * The application may now run fsync() to write out the dirty pages and + * wait on the writeout and check the result. Or the application may run + * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately. * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to * applications. */ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) { unsigned long end; - struct mm_struct *mm = current->mm; + int error, unmapped_error; struct vm_area_struct *vma; - int unmapped_error = 0; - int error = -EINVAL; + struct mm_struct *mm; + error = -EINVAL; if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) goto out; if (start & ~PAGE_MASK) goto out; if ((flags & MS_ASYNC) && (flags & MS_SYNC)) goto out; + error = -ENOMEM; - len = (len + ~PAGE_MASK) & PAGE_MASK; + len = PAGE_ALIGN(len); end = start + len; if (end < start) goto out; - error = 0; + + error = unmapped_error = 0; if (end == start) goto out; + /* * If the interval [start,end) covers some unmapped address ranges, * just ignore them, but return -ENOMEM at the end. */ + mm = current->mm; down_read(&mm->mmap_sem); vma = find_vma(mm, start); - for (;;) { + do { struct file *file; - /* Still start < end. */ error = -ENOMEM; if (!vma) - goto out_unlock; - /* Here start < vma->vm_end. 
*/ + break; if (start < vma->vm_start) { start = vma->vm_start; if (start >= end) - goto out_unlock; - unmapped_error = -ENOMEM; - } - /* Here vma->vm_start <= start < vma->vm_end. */ - if ((flags & MS_INVALIDATE) && - (vma->vm_flags & VM_LOCKED)) { - error = -EBUSY; - goto out_unlock; + break; + unmapped_error = error; } - file = vma->vm_file; + + error = -EBUSY; + if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) + break; + + error = 0; start = vma->vm_end; - if ((flags & MS_SYNC) && file && - (vma->vm_flags & VM_SHARED)) { + file = vma->vm_file; + if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) { get_file(file); up_read(&mm->mmap_sem); error = do_fsync(file, 0); @@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) goto out; down_read(&mm->mmap_sem); vma = find_vma(mm, start); - } else { - if (start >= end) { - error = 0; - goto out_unlock; - } - vma = vma->vm_next; + continue; } - } -out_unlock: + + vma = vma->vm_next; + } while (start < end); up_read(&mm->mmap_sem); + out: - return error ? : unmapped_error; + return error ? error : unmapped_error; } -- 1.4.4.4 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()
  2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov
@ 2008-01-18  9:33   ` Miklos Szeredi
  2008-01-18 10:30     ` Anton Salikhmetov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2008-01-18 9:33 UTC (permalink / raw)
  To: salikhmetov
  Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach,
      jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, miklos,
      r.e.wolff, hidave.darkstar, hch

>  	unsigned long end;
> -	struct mm_struct *mm = current->mm;
> +	int error, unmapped_error;
>  	struct vm_area_struct *vma;
> -	int unmapped_error = 0;
> -	int error = -EINVAL;
> +	struct mm_struct *mm;
>
> +	error = -EINVAL;

I think you may have misunderstood my last comment.  These are OK:

	struct mm_struct *mm = current->mm;
	int unmapped_error = 0;
	int error = -EINVAL;

This is not so good:

	int error, unmapped_error;

This is the worst:

	int error = -EINVAL, unmapped_error = 0;

So I think the original code is fine as it is.

Otherwise the patch looks OK now.

Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync() 2008-01-18 9:33 ` Miklos Szeredi @ 2008-01-18 10:30 ` Anton Salikhmetov 0 siblings, 0 replies; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 10:30 UTC (permalink / raw) To: Miklos Szeredi Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/18, Miklos Szeredi <miklos@szeredi.hu>: > > unsigned long end; > > - struct mm_struct *mm = current->mm; > > + int error, unmapped_error; > > struct vm_area_struct *vma; > > - int unmapped_error = 0; > > - int error = -EINVAL; > > + struct mm_struct *mm; > > > > + error = -EINVAL; > > I think you may have misunderstood my last comment. These are OK: > > struct mm_struct *mm = current->mm; > int unmapped_error = 0; > int error = -EINVAL; > > This is not so good: > > int error, unmapped_error; > > This is the worst: > > int error = -EINVAL, unmapped_error = 0; > > So I think the original code is fine as it is. > > Othewise patch looks OK now. I moved the initialization of the variables to the code where they are needed. I don't agree that "int a; int b;" is better than "int a, b". > > Miklos > ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov 2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov @ 2008-01-17 22:31 ` Anton Salikhmetov 2008-01-18 9:51 ` Miklos Szeredi 2008-01-18 9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi 2 siblings, 1 reply; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-17 22:31 UTC (permalink / raw) To: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, miklos, r.e.wolff, hidave.darkstar, hch Updating file times at write references to memory-mapped files and forcing file times update at the next write reference after calling the msync() system call with the MS_ASYNC flag. Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com> --- mm/memory.c | 6 ++++++ mm/msync.c | 52 +++++++++++++++++++++++++++++++++++++++------------- 2 files changed, 45 insertions(+), 13 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 4bf0b6d..13d5bbf 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1668,6 +1668,9 @@ gotten: unlock: pte_unmap_unlock(page_table, ptl); if (dirty_page) { + if (vma->vm_file) + file_update_time(vma->vm_file); + /* * Yes, Virginia, this is actually required to prevent a race * with clear_page_dirty_for_io() from clearing the page dirty @@ -2341,6 +2344,9 @@ out_unlocked: if (anon) page_cache_release(vmf.page); else if (dirty_page) { + if (vma->vm_file) + file_update_time(vma->vm_file); + set_page_dirty_balance(dirty_page, page_mkwrite); put_page(dirty_page); } diff --git a/mm/msync.c b/mm/msync.c index a4de868..a49af28 100644 --- a/mm/msync.c +++ b/mm/msync.c @@ -13,11 +13,33 @@ #include <linux/syscalls.h> /* + * Scan the PTEs for pages belonging to the VMA and mark them read-only. + * It will force a pagefault on the next write access. + */ +static void vma_wrprotect(struct vm_area_struct *vma) +{ + unsigned long addr; + + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { + spinlock_t *ptl; + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); + pud_t *pud = pud_offset(pgd, addr); + pmd_t *pmd = pmd_offset(pud, addr); + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + + if (pte_dirty(*pte) && pte_write(*pte)) + *pte = pte_wrprotect(*pte); + pte_unmap_unlock(pte, ptl); + } +} + +/* * MS_SYNC syncs the entire file - including mappings. * - * MS_ASYNC does not start I/O (it used to, up to 2.5.67). - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17). - * Now it doesn't do anything, since dirty pages are properly tracked. + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages + * read-only by calling vma_wrprotect(). This is needed to catch the next + * write reference to the mapped region and update the file times + * accordingly. * * The application may now run fsync() to write out the dirty pages and * wait on the writeout and check the result. 
Or the application may run @@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) error = 0; start = vma->vm_end; file = vma->vm_file; - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) { - get_file(file); - up_read(&mm->mmap_sem); - error = do_fsync(file, 0); - fput(file); - if (error || start >= end) - goto out; - down_read(&mm->mmap_sem); - vma = find_vma(mm, start); - continue; + if (file && (vma->vm_flags & VM_SHARED)) { + if (flags & MS_ASYNC) + vma_wrprotect(vma); + if (flags & MS_SYNC) { + get_file(file); + up_read(&mm->mmap_sem); + error = do_fsync(file, 0); + fput(file); + if (error || start >= end) + goto out; + down_read(&mm->mmap_sem); + vma = find_vma(mm, start); + continue; + } } vma = vma->vm_next; -- 1.4.4.4 ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov @ 2008-01-18 9:51 ` Miklos Szeredi 2008-01-18 10:15 ` Peter Zijlstra 0 siblings, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 9:51 UTC (permalink / raw) To: salikhmetov Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, miklos, r.e.wolff, hidave.darkstar, hch > Updating file times at write references to memory-mapped files and > forcing file times update at the next write reference after > calling the msync() system call with the MS_ASYNC flag. > > Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com> > --- > mm/memory.c | 6 ++++++ > mm/msync.c | 52 +++++++++++++++++++++++++++++++++++++++------------- > 2 files changed, 45 insertions(+), 13 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 4bf0b6d..13d5bbf 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1668,6 +1668,9 @@ gotten: > unlock: > pte_unmap_unlock(page_table, ptl); > if (dirty_page) { > + if (vma->vm_file) > + file_update_time(vma->vm_file); > + > /* > * Yes, Virginia, this is actually required to prevent a race > * with clear_page_dirty_for_io() from clearing the page dirty > @@ -2341,6 +2344,9 @@ out_unlocked: > if (anon) > page_cache_release(vmf.page); > else if (dirty_page) { > + if (vma->vm_file) > + file_update_time(vma->vm_file); > + > set_page_dirty_balance(dirty_page, page_mkwrite); > put_page(dirty_page); > } > diff --git a/mm/msync.c b/mm/msync.c > index a4de868..a49af28 100644 > --- a/mm/msync.c > +++ b/mm/msync.c > @@ -13,11 +13,33 @@ > #include <linux/syscalls.h> > > /* > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > + * It will force a pagefault on the next write access. > + */ > +static void vma_wrprotect(struct vm_area_struct *vma) > +{ > + unsigned long addr; > + > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > + spinlock_t *ptl; > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > + pud_t *pud = pud_offset(pgd, addr); > + pmd_t *pmd = pmd_offset(pud, addr); > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > + > + if (pte_dirty(*pte) && pte_write(*pte)) > + *pte = pte_wrprotect(*pte); > + pte_unmap_unlock(pte, ptl); > + } > +} What about ram based filesystems? They don't start out with read-only pte's, so I think they don't want them read-protected now either. Unless this is essential for correct mtime/ctime accounting on these filesystems (I don't think it really is). But then the mapping should start out read-only as well, otherwise the time update will only work after an msync(MS_ASYNC). > + > +/* > * MS_SYNC syncs the entire file - including mappings. > * > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67). > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17). > - * Now it doesn't do anything, since dirty pages are properly tracked. > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages > + * read-only by calling vma_wrprotect(). This is needed to catch the next > + * write reference to the mapped region and update the file times > + * accordingly. > * > * The application may now run fsync() to write out the dirty pages and > * wait on the writeout and check the result. 
Or the application may run > @@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) > error = 0; > start = vma->vm_end; > file = vma->vm_file; > - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) { > - get_file(file); > - up_read(&mm->mmap_sem); > - error = do_fsync(file, 0); > - fput(file); > - if (error || start >= end) > - goto out; > - down_read(&mm->mmap_sem); > - vma = find_vma(mm, start); > - continue; > + if (file && (vma->vm_flags & VM_SHARED)) { > + if (flags & MS_ASYNC) > + vma_wrprotect(vma); > + if (flags & MS_SYNC) { > + get_file(file); > + up_read(&mm->mmap_sem); > + error = do_fsync(file, 0); > + fput(file); > + if (error || start >= end) > + goto out; > + down_read(&mm->mmap_sem); > + vma = find_vma(mm, start); > + continue; > + } > } > > vma = vma->vm_next; > -- > 1.4.4.4 > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 9:51 ` Miklos Szeredi @ 2008-01-18 10:15 ` Peter Zijlstra 2008-01-18 10:25 ` Peter Zijlstra 2008-01-18 10:38 ` Miklos Szeredi 0 siblings, 2 replies; 43+ messages in thread From: Peter Zijlstra @ 2008-01-18 10:15 UTC (permalink / raw) To: Miklos Szeredi Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote: > > diff --git a/mm/msync.c b/mm/msync.c > > index a4de868..a49af28 100644 > > --- a/mm/msync.c > > +++ b/mm/msync.c > > @@ -13,11 +13,33 @@ > > #include <linux/syscalls.h> > > > > /* > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > > + * It will force a pagefault on the next write access. > > + */ > > +static void vma_wrprotect(struct vm_area_struct *vma) > > +{ > > + unsigned long addr; > > + > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > > + spinlock_t *ptl; > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > > + pud_t *pud = pud_offset(pgd, addr); > > + pmd_t *pmd = pmd_offset(pud, addr); > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + > > + if (pte_dirty(*pte) && pte_write(*pte)) > > + *pte = pte_wrprotect(*pte); > > + pte_unmap_unlock(pte, ptl); > > + } > > +} > > What about ram based filesystems? They don't start out with read-only > pte's, so I think they don't want them read-protected now either. > Unless this is essential for correct mtime/ctime accounting on these > filesystems (I don't think it really is). But then the mapping should > start out read-only as well, otherwise the time update will only work > after an msync(MS_ASYNC). page_mkclean() has all the needed logic for this, it also walks the rmap and cleans out all other users, which I think is needed too for consistencies sake: Process A Process B mmap(foo.txt) mmap(foo.txt) dirty page dirty page msync(MS_ASYNC) dirty page msync(MS_ASYNC) <--- now what?! So what I would suggest is using the page table walkers from mm, and walks the page range, obtain the page using vm_normal_page() and call page_mkclean(). (Oh, and ensure you don't nest the pte lock :-) All in all, that sounds rather expensive.. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 10:15 ` Peter Zijlstra @ 2008-01-18 10:25 ` Peter Zijlstra 2008-01-18 10:39 ` Anton Salikhmetov 2008-01-18 17:58 ` Linus Torvalds 2008-01-18 10:38 ` Miklos Szeredi 1 sibling, 2 replies; 43+ messages in thread From: Peter Zijlstra @ 2008-01-18 10:25 UTC (permalink / raw) To: Miklos Szeredi Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote: > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote: > > > > diff --git a/mm/msync.c b/mm/msync.c > > > index a4de868..a49af28 100644 > > > --- a/mm/msync.c > > > +++ b/mm/msync.c > > > @@ -13,11 +13,33 @@ > > > #include <linux/syscalls.h> > > > > > > /* > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > > > + * It will force a pagefault on the next write access. > > > + */ > > > +static void vma_wrprotect(struct vm_area_struct *vma) > > > +{ > > > + unsigned long addr; > > > + > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > > > + spinlock_t *ptl; > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > > > + pud_t *pud = pud_offset(pgd, addr); > > > + pmd_t *pmd = pmd_offset(pud, addr); > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > > + > > > + if (pte_dirty(*pte) && pte_write(*pte)) > > > + *pte = pte_wrprotect(*pte); > > > + pte_unmap_unlock(pte, ptl); > > > + } > > > +} > > > > What about ram based filesystems? They don't start out with read-only > > pte's, so I think they don't want them read-protected now either. > > Unless this is essential for correct mtime/ctime accounting on these > > filesystems (I don't think it really is). But then the mapping should > > start out read-only as well, otherwise the time update will only work > > after an msync(MS_ASYNC). > > page_mkclean() has all the needed logic for this, it also walks the rmap > and cleans out all other users, which I think is needed too for > consistencies sake: > > Process A Process B > > mmap(foo.txt) mmap(foo.txt) > > dirty page > dirty page > > msync(MS_ASYNC) > > dirty page > > msync(MS_ASYNC) <--- now what?! > > > So what I would suggest is using the page table walkers from mm, and > walks the page range, obtain the page using vm_normal_page() and call > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-) > > All in all, that sounds rather expensive.. Bah, and will break on s390... so we'd need a page_mkclean() variant that doesn't actually clear dirty. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 10:25 ` Peter Zijlstra @ 2008-01-18 10:39 ` Anton Salikhmetov 2008-01-18 17:58 ` Linus Torvalds 1 sibling, 0 replies; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 10:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Miklos Szeredi, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/18, Peter Zijlstra <peterz@infradead.org>: > > On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote: > > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote: > > > > > > diff --git a/mm/msync.c b/mm/msync.c > > > > index a4de868..a49af28 100644 > > > > --- a/mm/msync.c > > > > +++ b/mm/msync.c > > > > @@ -13,11 +13,33 @@ > > > > #include <linux/syscalls.h> > > > > > > > > /* > > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > > > > + * It will force a pagefault on the next write access. > > > > + */ > > > > +static void vma_wrprotect(struct vm_area_struct *vma) > > > > +{ > > > > + unsigned long addr; > > > > + > > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > > > > + spinlock_t *ptl; > > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > > > > + pud_t *pud = pud_offset(pgd, addr); > > > > + pmd_t *pmd = pmd_offset(pud, addr); > > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > > > + > > > > + if (pte_dirty(*pte) && pte_write(*pte)) > > > > + *pte = pte_wrprotect(*pte); > > > > + pte_unmap_unlock(pte, ptl); > > > > + } > > > > +} > > > > > > What about ram based filesystems? They don't start out with read-only > > > pte's, so I think they don't want them read-protected now either. > > > Unless this is essential for correct mtime/ctime accounting on these > > > filesystems (I don't think it really is). But then the mapping should > > > start out read-only as well, otherwise the time update will only work > > > after an msync(MS_ASYNC). > > > > page_mkclean() has all the needed logic for this, it also walks the rmap > > and cleans out all other users, which I think is needed too for > > consistencies sake: > > > > Process A Process B > > > > mmap(foo.txt) mmap(foo.txt) > > > > dirty page > > dirty page > > > > msync(MS_ASYNC) > > > > dirty page > > > > msync(MS_ASYNC) <--- now what?! > > > > > > So what I would suggest is using the page table walkers from mm, and > > walks the page range, obtain the page using vm_normal_page() and call > > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-) > > > > All in all, that sounds rather expensive.. > > Bah, and will break on s390... so we'd need a page_mkclean() variant > that doesn't actually clear dirty. So the current version of the functional changes patch takes this into account. > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 10:25 ` Peter Zijlstra 2008-01-18 10:39 ` Anton Salikhmetov @ 2008-01-18 17:58 ` Linus Torvalds 2008-01-18 18:11 ` Miklos Szeredi 1 sibling, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2008-01-18 17:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Miklos Szeredi, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008, Peter Zijlstra wrote: > > Bah, and will break on s390... so we'd need a page_mkclean() variant > that doesn't actually clear dirty. No, we simply want to not play all these very expensive games with dirty in the first place. Guys, mmap access times aren't important enough for this. It's not specified closely enough, and people don't care enough. Of the patches around so far, the best one by far seems to be the simple four-liner from Miklos. And even in that four-liner, I suspect that the *last* two lines are actually incorrect: there's no point in updating the file time when the page *becomes* dirty, we should update the file time when it is marked clean, and "msync(MS_SYNC)" should update it as part of *that*. So I think the file time update should be part of just the page writeout logic, not by msync() or page faulting itself or anything like that. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 17:58 ` Linus Torvalds @ 2008-01-18 18:11 ` Miklos Szeredi 2008-01-18 18:28 ` Rik van Riel 2008-01-18 18:43 ` Linus Torvalds 0 siblings, 2 replies; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 18:11 UTC (permalink / raw) To: torvalds Cc: peterz, miklos, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > And even in that four-liner, I suspect that the *last* two lines are > actually incorrect: there's no point in updating the file time when the > page *becomes* dirty, Actually all four lines do that. The first two for a write access on a present, read-only pte, the other two for a write on a non-present pte. > we should update the file time when it is marked > clean, and "msync(MS_SYNC)" should update it as part of *that*. That would need a new page flag (PG_mmap_dirty?). Do we have one available? Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 18:11 ` Miklos Szeredi @ 2008-01-18 18:28 ` Rik van Riel 2008-01-18 18:51 ` Miklos Szeredi 2008-01-18 18:43 ` Linus Torvalds 1 sibling, 1 reply; 43+ messages in thread From: Rik van Riel @ 2008-01-18 18:28 UTC (permalink / raw) To: Miklos Szeredi Cc: torvalds, peterz, miklos, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008 19:11:47 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote: > > And even in that four-liner, I suspect that the *last* two lines are > > actually incorrect: there's no point in updating the file time when the > > page *becomes* dirty, > > Actually all four lines do that. The first two for a write access on > a present, read-only pte, the other two for a write on a non-present > pte. > > > we should update the file time when it is marked > > clean, and "msync(MS_SYNC)" should update it as part of *that*. > > That would need a new page flag (PG_mmap_dirty?). Do we have one > available? I thought the page writing stuff looked at (and cleared) the pte dirty bit, too? -- All rights reversed. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 18:28 ` Rik van Riel @ 2008-01-18 18:51 ` Miklos Szeredi 0 siblings, 0 replies; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 18:51 UTC (permalink / raw) To: riel Cc: miklos, torvalds, peterz, miklos, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > > > And even in that four-liner, I suspect that the *last* two lines are > > > actually incorrect: there's no point in updating the file time when the > > > page *becomes* dirty, > > > > Actually all four lines do that. The first two for a write access on > > a present, read-only pte, the other two for a write on a non-present > > pte. > > > > > we should update the file time when it is marked > > > clean, and "msync(MS_SYNC)" should update it as part of *that*. > > > > That would need a new page flag (PG_mmap_dirty?). Do we have one > > available? > > I thought the page writing stuff looked at (and cleared) the pte > dirty bit, too? Yeah, it does. Hmm... What happens on munmap? The times _could_ get updated from there as well, but it's getting complicated. Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 18:11 ` Miklos Szeredi 2008-01-18 18:28 ` Rik van Riel @ 2008-01-18 18:43 ` Linus Torvalds 2008-01-18 18:57 ` Miklos Szeredi 1 sibling, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2008-01-18 18:43 UTC (permalink / raw) To: Miklos Szeredi Cc: peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008, Miklos Szeredi wrote: > > That would need a new page flag (PG_mmap_dirty?). Do we have one > available? Yeah, that would be bad. We probably have flags free, but those page flags are always a pain. Scratch that. How about just setting a per-vma dirty flag, and then instead of updating the mtime when taking the dirty-page fault, we just set that flag? Then, on unmap and msync, we just do if (vma->dirty-flag) { vma->dirty_flag = 0; update_file_times(vma->vm_file); } and be done with it? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 18:43 ` Linus Torvalds @ 2008-01-18 18:57 ` Miklos Szeredi 2008-01-18 19:08 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 18:57 UTC (permalink / raw) To: torvalds Cc: miklos, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > > That would need a new page flag (PG_mmap_dirty?). Do we have one > > available? > > Yeah, that would be bad. We probably have flags free, but those page flags > are always a pain. Scratch that. > > How about just setting a per-vma dirty flag, and then instead of updating > the mtime when taking the dirty-page fault, we just set that flag? > > Then, on unmap and msync, we just do > > if (vma->dirty-flag) { > vma->dirty_flag = 0; > update_file_times(vma->vm_file); > } > > and be done with it? But then background writeout, sync(2), etc, wouldn't update the times. Dunno. I don't think actual _physical_ writeout matters much, so it's not worse to be 30s early with the timestamp, than to be 30s or more late. Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 18:57 ` Miklos Szeredi @ 2008-01-18 19:08 ` Linus Torvalds 2008-01-18 19:22 ` Miklos Szeredi 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2008-01-18 19:08 UTC (permalink / raw) To: Miklos Szeredi Cc: peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008, Miklos Szeredi wrote: > > But then background writeout, sync(2), etc, wouldn't update the times. Sure it would, but only when doing the final unmap. Did you miss the "on unmap and msync" part? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 19:08 ` Linus Torvalds @ 2008-01-18 19:22 ` Miklos Szeredi 2008-01-18 19:35 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 19:22 UTC (permalink / raw) To: torvalds Cc: miklos, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > > > > But then background writeout, sync(2), etc, wouldn't update the times. > > Sure it would, but only when doing the final unmap. > > Did you miss the "on unmap and msync" part? No :) What I'm saying is that the times could be left un-updated for a long time if program doesn't do munmap() or msync(MS_SYNC) for a long time. If program has this pattern: mmap() write to map msync(MS_ASYNC) sleep(long) write to map msync(MS_ASYNC) sleep(long) ... Then we'd never see time updates (until the program exits, but that could be years). Maybe this doesn't matter, I'm just saying this is a disadvantage compared to the "update on first dirtying" approach, which would ensure, that times are updated at least once per 30s. Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 19:22 ` Miklos Szeredi
@ 2008-01-18 19:35   ` Linus Torvalds
  2008-01-18 19:58     ` Anton Salikhmetov
  2008-01-18 22:32     ` Ingo Oeser
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 19:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks,
      riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff,
      hidave.darkstar, hch

On Fri, 18 Jan 2008, Miklos Szeredi wrote:
>
> What I'm saying is that the times could be left un-updated for a long
> time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

Sure.

But in those circumstances, the programmer cannot depend on the mtime
*anyway* (because there is no synchronization), so what's the downside?

Let's face it, there's exactly three possible solutions:

 - the insane one: trap EVERY SINGLE instruction that does a write to the
   page, and update mtime each and every time.

   This one is so obviously STUPID that it's not even worth discussing
   further, except to say that "yes, there is an 'exact' algorithm, but
   no, we are never EVER going to use it".

 - the non-exact solutions that don't give you mtime updates every time
   a write to the page happens, but give *some* guarantees for things
   that will update it.

   This is the one I think we can do, and the only things a programmer
   can impact using it is "msync()" and "munmap()", since no other
   operations really have anything to do with it in a programmer-visible
   way (ie a normal "sync" operation may happen in the background and has
   no program-relevant timing information).

   Other things *may* or may not update mtime (some filesystems - take
   most networked one as an example - will *always* update mtime on the
   server on writeback, so we cannot ever guarantee that nothing but
   msync/munmap does so), but at least we'll have a minimum set of things
   that people can depend on.

 - the "we don't care at all" solutions.

   mmap(MAP_WRITE) doesn't really update times reliably after the write
   has happened (but might do it *before* - maybe the mmap() itself does).

Those are the three choices, I think. We currently approximate #3. We
*can* do #2 (and there are various flavors of it). And even *aiming* for
#1 is totally insane and stupid.

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 19:35 ` Linus Torvalds @ 2008-01-18 19:58 ` Anton Salikhmetov 2008-01-18 20:22 ` Linus Torvalds 2008-01-18 22:32 ` Ingo Oeser 1 sibling, 1 reply; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 19:58 UTC (permalink / raw) To: Linus Torvalds Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/18, Linus Torvalds <torvalds@linux-foundation.org>: > > > On Fri, 18 Jan 2008, Miklos Szeredi wrote: > > > > What I'm saying is that the times could be left un-updated for a long > > time if program doesn't do munmap() or msync(MS_SYNC) for a long time. > > Sure. > > But in those circumstances, the programmer cannot depend on the mtime > *anyway* (because there is no synchronization), so what's the downside? > > Let's face it, there's exactly three possible solutions: > > - the insane one: trap EVERY SINGLE instruction that does a write to the > page, and update mtime each and every time. > > This one is so obviously STUPID that it's not even worth discussing > further, except to say that "yes, there is an 'exact' algorithm, but > no, we are never EVER going to use it". > > - the non-exact solutions that don't give you mtime updates every time > a write to the page happens, but give *some* guarantees for things that > will update it. > > This is the one I think we can do, and the only things a programmer can > impact using it is "msync()" and "munmap()", since no other operations > really have any thing to do with it in a programmer-visible way (ie a > normal "sync" operation may happen in the background and has no > progam-relevant timing information) > > Other things *may* or may not update mtime (some filesystems - take > most networked one as an example - will *always* update mtime on the > server on writeback, so we cannot ever guarantee that nothing but > msync/munmap does so), but at least we'll have a minimum set of things > that people can depend on. > > - the "we don't care at all solutions". > > mmap(MAP_WRITE) doesn't really update times reliably after the write > has happened (but might do it *before* - maybe the mmap() itself does). > > Those are the three choices, I think. We currently approximate #3. We > *can* do #2 (and there are various flavors of it). And even *aiming* for > #1 is totally insane and stupid. The current solution doesn't hit the performance at all when compared to the competitor POSIX-compliant systems. It is faster and does even more than the POSIX standard requires. Please see the test results I've sent into the thread "-v6 0/2": http://lkml.org/lkml/2008/1/18/447 I guess, the current solution is ready to use. > > Linus > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 19:58 ` Anton Salikhmetov @ 2008-01-18 20:22 ` Linus Torvalds 2008-01-18 21:03 ` Anton Salikhmetov 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2008-01-18 20:22 UTC (permalink / raw) To: Anton Salikhmetov Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008, Anton Salikhmetov wrote: > > The current solution doesn't hit the performance at all when compared to > the competitor POSIX-compliant systems. It is faster and does even more > than the POSIX standard requires. Your current patches have two problems: - they are simply unnecessarily invasive for a relatively simple issue - all versions I've looked at closer are buggy too Example: + if (pte_dirty(*pte) && pte_write(*pte)) + *pte = pte_wrprotect(*pte); Uhhuh. Looks simple enough. Except it does a non-atomic pte access while other CPU's may be accessing it and updating it from their hw page table walkers. What will happen? Who knows? I can see lost access bits at a minimum. IOW, this isn't simple code. It's code that it is simple to screw up. In this case, you really need to use ptep_set_wrprotect(), for example. So why not do it in many fewer lines with that simpler vma->dirty flag? Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 20:22 ` Linus Torvalds
@ 2008-01-18 21:03   ` Anton Salikhmetov
  2008-01-18 21:27     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Anton Salikhmetov @ 2008-01-18 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel,
      valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb,
      r.e.wolff, hidave.darkstar, hch

2008/1/18, Linus Torvalds <torvalds@linux-foundation.org>:
>
> On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> >
> > The current solution doesn't hit the performance at all when compared to
> > the competitor POSIX-compliant systems. It is faster and does even more
> > than the POSIX standard requires.
>
> Your current patches have two problems:
>  - they are simply unnecessarily invasive for a relatively simple issue
>  - all versions I've looked at closer are buggy too
>
> Example:
>
> +		if (pte_dirty(*pte) && pte_write(*pte))
> +			*pte = pte_wrprotect(*pte);
>
> Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
> other CPU's may be accessing it and updating it from their hw page table
> walkers. What will happen? Who knows? I can see lost access bits at a
> minimum.
>
> IOW, this isn't simple code. It's code that it is simple to screw up. In
> this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.

>
> So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions, solves the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.
Some other POSIX-compliant operating systems, such as HP-UX and FreeBSD,
satisfy this POSIX requirement. Linux does not.

>
> Linus
>

^ permalink raw reply	[flat|nested] 43+ messages in thread
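[The failing sequence above can be illustrated with a minimal sketch; it
is not from the original thread, and the file name and the delay are
arbitrary. The second msync(MS_ASYNC) is the point by which POSIX
requires st_mtime to have advanced.]

/*
 * Sketch of the mmap / write / msync(MS_ASYNC) / write / msync(MS_ASYNC)
 * sequence discussed above.  "mtime-check" and the 2-second delay are
 * arbitrary choices for illustration.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
	struct stat before, after;
	int fd = open("mtime-check", O_RDWR | O_CREAT | O_TRUNC, 0644);
	char *map;

	if (fd < 0 || ftruncate(fd, 4096) != 0)
		return 1;

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	map[0] = 'a';			/* first write reference */
	msync(map, 4096, MS_ASYNC);
	fstat(fd, &before);

	sleep(2);			/* make a timestamp change observable */
	map[0] = 'b';			/* second write reference */
	msync(map, 4096, MS_ASYNC);
	fstat(fd, &after);

	/* POSIX: mtime must have been updated by the second MS_ASYNC call. */
	printf("mtime %s\n", after.st_mtime > before.st_mtime ?
	       "advanced (as POSIX requires)" : "NOT advanced");

	munmap(map, 4096);
	close(fd);
	return 0;
}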
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 21:03 ` Anton Salikhmetov @ 2008-01-18 21:27 ` Linus Torvalds 2008-01-18 22:04 ` Anton Salikhmetov 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2008-01-18 21:27 UTC (permalink / raw) To: Anton Salikhmetov Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Sat, 19 Jan 2008, Anton Salikhmetov wrote: > > Before using pte_wrprotect() the vma_wrprotect() routine uses the > pte_offset_map_lock() macro to get the PTE and to acquire the ptl > spinlock. Why did you say that this code was not SMP-safe? It should > be atomic, I think. It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK. Guess how much another x86 CPU cares when it sets the accessed bit in hardware? > The POSIX standard requires the ctime and mtime stamps to be updated > not later than at the second call to msync() with the MS_ASYNC flag. .. and that is no excuse for bad code. Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 21:27 ` Linus Torvalds @ 2008-01-18 22:04 ` Anton Salikhmetov 2008-01-18 22:21 ` Linus Torvalds 0 siblings, 1 reply; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 22:04 UTC (permalink / raw) To: Linus Torvalds Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/19, Linus Torvalds <torvalds@linux-foundation.org>: > > > On Sat, 19 Jan 2008, Anton Salikhmetov wrote: > > > > Before using pte_wrprotect() the vma_wrprotect() routine uses the > > pte_offset_map_lock() macro to get the PTE and to acquire the ptl > > spinlock. Why did you say that this code was not SMP-safe? It should > > be atomic, I think. > > It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK. > > Guess how much another x86 CPU cares when it sets the accessed bit in > hardware? Thank you very much for taking part in this discussion. Personally, it's very important to me. But I'm not sure that I understand which bit can be lost. Please let me explain. The logic for my vma_wrprotect() routine was taken from the page_check_address() function in mm/rmap.c. Here is a code snippet of the latter function: pgd = pgd_offset(mm, address); if (!pgd_present(*pgd)) return NULL; pud = pud_offset(pgd, address); if (!pud_present(*pud)) return NULL; pmd = pmd_offset(pud, address); if (!pmd_present(*pmd)) return NULL; pte = pte_offset_map(pmd, address); /* Make a quick check before getting the lock */ if (!pte_present(*pte)) { pte_unmap(pte); return NULL; } ptl = pte_lockptr(mm, pmd); spin_lock(ptl); if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) { *ptlp = ptl; return pte; } pte_unmap_unlock(pte, ptl); The page_check_address() function is called from the page_mkclean_one() routine as follows: pte = page_check_address(page, mm, address, &ptl); if (!pte) goto out; if (pte_dirty(*pte) || pte_write(*pte)) { pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); ret = 1; } pte_unmap_unlock(pte, ptl); The write-protection of the PTE is done using the pte_wrprotect() entity. I intended to do the same during msync() with MS_ASYNC. I understand that I'm taking a risk of looking a complete idiot now, however I don't see any difference between the two situations. I presumed that the code in mm/rmap.c was absolutely correct, that's why I basically reused the design. > > > The POSIX standard requires the ctime and mtime stamps to be updated > > not later than at the second call to msync() with the MS_ASYNC flag. > > .. and that is no excuse for bad code. > > Linus > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 22:04 ` Anton Salikhmetov @ 2008-01-18 22:21 ` Linus Torvalds 2008-01-18 22:35 ` Anton Salikhmetov 0 siblings, 1 reply; 43+ messages in thread From: Linus Torvalds @ 2008-01-18 22:21 UTC (permalink / raw) To: Anton Salikhmetov Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Sat, 19 Jan 2008, Anton Salikhmetov wrote: > > The page_check_address() function is called from the > page_mkclean_one() routine as follows: .. and the page_mkclean_one() function is totally different. Lookie here, this is the correct and complex sequence: > entry = ptep_clear_flush(vma, address, pte); > entry = pte_wrprotect(entry); > entry = pte_mkclean(entry); > set_pte_at(mm, address, pte, entry); That's a rather expensive sequence, but it's done exactly because it has to be done that way. What it does is to - *atomically* load the pte entry _and_ clear the old one in memory. That's the entry = ptep_clear_flush(vma, address, pte); thing, and it basically means that it's doing some architecture-specific magic to make sure that another CPU that accesses the PTE at the same time will never actually modify the pte (because it's clear and not valid) - it then - while the page table is actually clear and invalid - takes the old value and turns it into the new one: entry = pte_wrprotect(entry); entry = pte_mkclean(entry); - and finally, it replaces the entry with the new one: set_pte_at(mm, address, pte, entry); which takes care to write the new entry in some specific way that is atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table entry it writes the high word first, see the write barriers in "native_set_pte()" in include/asm-x86/pgtable-3level.h Now, compare that subtle and correct thing with what is *not* correct: if (pte_dirty(*pte) && pte_write(*pte)) *pte = pte_wrprotect(*pte); which makes no effort at all to make sure that it's safe in case another CPU updates the accessed bit. Now, arguably it's unlikely to cause horrible problems at least on x86, because: - we only do this if the pte is already marked dirty, so while we can lose the accessed bit, we can *not* lose the dirty bit. And the accessed bit isn't such a big deal. - it's not doing any of the "be careful about" ordering things, but since the really important bits aren't changing, ordering probably won't practically matter. But the problem is that we have something like 24 different architectures, it's hard to make sure that none of them have issues. In other words: it may well work in practice. But when these things go subtly wrong, they are *really* nasty to find, and the unsafe sequence is really not how it's supposed to be done. For example, you don't even flush the TLB, so even if there are no cross-CPU issues, there's probably going to be writable entries in the TLB that now don't match the page tables. Will it matter? Again, probably impossible to see in practice. But ... Linus ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 22:21 ` Linus Torvalds @ 2008-01-18 22:35 ` Anton Salikhmetov 0 siblings, 0 replies; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 22:35 UTC (permalink / raw) To: Linus Torvalds Cc: Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/19, Linus Torvalds <torvalds@linux-foundation.org>: > > > On Sat, 19 Jan 2008, Anton Salikhmetov wrote: > > > > The page_check_address() function is called from the > > page_mkclean_one() routine as follows: > > .. and the page_mkclean_one() function is totally different. > > Lookie here, this is the correct and complex sequence: > > > entry = ptep_clear_flush(vma, address, pte); > > entry = pte_wrprotect(entry); > > entry = pte_mkclean(entry); > > set_pte_at(mm, address, pte, entry); > > That's a rather expensive sequence, but it's done exactly because it has > to be done that way. What it does is to > > - *atomically* load the pte entry _and_ clear the old one in memory. > > That's the > > entry = ptep_clear_flush(vma, address, pte); > > thing, and it basically means that it's doing some > architecture-specific magic to make sure that another CPU that accesses > the PTE at the same time will never actually modify the pte (because > it's clear and not valid) > > - it then - while the page table is actually clear and invalid - takes > the old value and turns it into the new one: > > entry = pte_wrprotect(entry); > entry = pte_mkclean(entry); > > - and finally, it replaces the entry with the new one: > > set_pte_at(mm, address, pte, entry); > > which takes care to write the new entry in some specific way that is > atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table > entry it writes the high word first, see the write barriers in > "native_set_pte()" in include/asm-x86/pgtable-3level.h > > Now, compare that subtle and correct thing with what is *not* correct: > > if (pte_dirty(*pte) && pte_write(*pte)) > *pte = pte_wrprotect(*pte); > > which makes no effort at all to make sure that it's safe in case another > CPU updates the accessed bit. > > Now, arguably it's unlikely to cause horrible problems at least on x86, > because: > > - we only do this if the pte is already marked dirty, so while we can > lose the accessed bit, we can *not* lose the dirty bit. And the > accessed bit isn't such a big deal. > > - it's not doing any of the "be careful about" ordering things, but since > the really important bits aren't changing, ordering probably won't > practically matter. > > But the problem is that we have something like 24 different architectures, > it's hard to make sure that none of them have issues. > > In other words: it may well work in practice. But when these things go > subtly wrong, they are *really* nasty to find, and the unsafe sequence is > really not how it's supposed to be done. For example, you don't even flush > the TLB, so even if there are no cross-CPU issues, there's probably going > to be writable entries in the TLB that now don't match the page tables. > > Will it matter? Again, probably impossible to see in practice. But ... Linus, I am very grateful to you for your extremely clear explanation of the issue I have overlooked! Back to the msync() issue, I'm going to come back with a new design for the bug fix. Thank you once again. Anton > > Linus > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 19:35 ` Linus Torvalds 2008-01-18 19:58 ` Anton Salikhmetov @ 2008-01-18 22:32 ` Ingo Oeser 2008-01-18 22:47 ` Linus Torvalds 1 sibling, 1 reply; 43+ messages in thread From: Ingo Oeser @ 2008-01-18 22:32 UTC (permalink / raw) To: Linus Torvalds Cc: Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch Hi Linus, On Friday 18 January 2008, Linus Torvalds wrote: > On Fri, 18 Jan 2008, Miklos Szeredi wrote: > > > > What I'm saying is that the times could be left un-updated for a long > > time if program doesn't do munmap() or msync(MS_SYNC) for a long time. > > Sure. > > But in those circumstances, the programmer cannot depend on the mtime > *anyway* (because there is no synchronization), so what's the downside? Can we get "if the write to the page hits the disk, the mtime has hit the disk already no less than SOME_GRANULARITY before"? That is very important for computer forensics. Esp. in saving your ass! Ok, now back again to making that fast :-) Best Regards Ingo Oeser ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
  2008-01-18 22:32 ` Ingo Oeser
@ 2008-01-18 22:47   ` Linus Torvalds
  2008-01-18 22:54     ` Rik van Riel
  2008-01-21 14:25     ` Peter Staubach
  0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2008-01-18 22:47 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob, linux-kernel,
      valdis.kletnieks, riel, ksm, staubach, jesper.juhl, akpm, protasnb,
      r.e.wolff, hidave.darkstar, hch

On Fri, 18 Jan 2008, Ingo Oeser wrote:
>
> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> already no less than SOME_GRANULARITY before"?
>
> That is very important for computer forensics. Esp. in saving your ass!
>
> Ok, now back again to making that fast :-)

I certainly don't mind it if we have some tighter guarantees, but what
I'd want is:

 - keep it simple. Let's face it, Linux has never ever given those
   guarantees before, and it's not as if anybody has really cared. Even
   now, the issue seems to be more about paper standards conformance than
   anything else.

 - I get worried about people playing around with the dirty bit in
   particular. We have had some really rather nasty bugs here. Most of
   which are totally impossible to trigger under normal loads (for
   example the old random-access utorrent writable mmap issue from about
   a year ago).

So these two issues - the big red danger signs flashing in my brain,
coupled with the fact that no application has apparently ever really
noticed in the last 15 years - just make it a case where I'd like each
step of the way to be obvious and simple and no larger than really
absolutely necessary.

		Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 22:47 ` Linus Torvalds @ 2008-01-18 22:54 ` Rik van Riel 2008-01-19 0:50 ` Matt Mackall 2008-01-21 14:25 ` Peter Staubach 1 sibling, 1 reply; 43+ messages in thread From: Rik van Riel @ 2008-01-18 22:54 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Oeser, Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008 14:47:33 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote: > - keep it simple. Let's face it, Linux has never ever given those > guarantees before, and it's not is if anybody has really cared. Even > now, the issue seems to be more about paper standards conformance than > anything else. There is one issue which is way more than just standards conformance. When a program changes file data through mmap(), at some point the mtime needs to be update so that backup programs know to back up the new version of the file. Backup programs not seeing an updated mtime is a really big deal. -- All rights reversed. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 22:54 ` Rik van Riel @ 2008-01-19 0:50 ` Matt Mackall 2008-01-19 4:25 ` Rik van Riel 2008-01-19 10:22 ` Miklos Szeredi 0 siblings, 2 replies; 43+ messages in thread From: Matt Mackall @ 2008-01-19 0:50 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, Ingo Oeser, Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote: > On Fri, 18 Jan 2008 14:47:33 -0800 (PST) > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > - keep it simple. Let's face it, Linux has never ever given those > > guarantees before, and it's not is if anybody has really cared. Even > > now, the issue seems to be more about paper standards conformance than > > anything else. > > There is one issue which is way more than just standards conformance. > > When a program changes file data through mmap(), at some point the > mtime needs to be update so that backup programs know to back up the > new version of the file. > > Backup programs not seeing an updated mtime is a really big deal. And that's fixed with the 4-line approach. Reminds me, I've got a patch here for addressing that problem with loop mounts: Writes to loop should update the mtime of the underlying file. Signed-off-by: Matt Mackall <mpm@selenic.com> Index: l/drivers/block/loop.c =================================================================== --- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600 +++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600 @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1); bv_offs = bvec->bv_offset; len = bvec->bv_len; + file_update_time(file); while (len > 0) { sector_t IV; unsigned size; @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil set_fs(get_ds()); bw = file->f_op->write(file, buf, len, &pos); + file_update_time(file); set_fs(old_fs); if (likely(bw == len)) return 0; -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-19 0:50 ` Matt Mackall @ 2008-01-19 4:25 ` Rik van Riel 2008-01-19 10:22 ` Miklos Szeredi 1 sibling, 0 replies; 43+ messages in thread From: Rik van Riel @ 2008-01-19 4:25 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Ingo Oeser, Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 18 Jan 2008 18:50:03 -0600 Matt Mackall <mpm@selenic.com> wrote: > On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote: > > Backup programs not seeing an updated mtime is a really big deal. > > And that's fixed with the 4-line approach. > > Reminds me, I've got a patch here for addressing that problem with loop mounts: > > Writes to loop should update the mtime of the underlying file. > > Signed-off-by: Matt Mackall <mpm@selenic.com> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-19 0:50 ` Matt Mackall 2008-01-19 4:25 ` Rik van Riel @ 2008-01-19 10:22 ` Miklos Szeredi 2008-01-19 15:49 ` Matt Mackall 1 sibling, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2008-01-19 10:22 UTC (permalink / raw) To: mpm Cc: riel, torvalds, ioe-lkml, miklos, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > Reminds me, I've got a patch here for addressing that problem with loop mounts: > > Writes to loop should update the mtime of the underlying file. > > Signed-off-by: Matt Mackall <mpm@selenic.com> > > Index: l/drivers/block/loop.c > =================================================================== > --- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600 > +++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600 > @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d > offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1); > bv_offs = bvec->bv_offset; > len = bvec->bv_len; > + file_update_time(file); > while (len > 0) { > sector_t IV; > unsigned size; > @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil > > set_fs(get_ds()); > bw = file->f_op->write(file, buf, len, &pos); > + file_update_time(file); ->write should have already updated the times, no? > set_fs(old_fs); > if (likely(bw == len)) > return 0; > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-19 10:22 ` Miklos Szeredi @ 2008-01-19 15:49 ` Matt Mackall 0 siblings, 0 replies; 43+ messages in thread From: Matt Mackall @ 2008-01-19 15:49 UTC (permalink / raw) To: Miklos Szeredi Cc: riel, torvalds, ioe-lkml, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, ksm, staubach, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Sat, 2008-01-19 at 11:22 +0100, Miklos Szeredi wrote: > > Reminds me, I've got a patch here for addressing that problem with loop mounts: > > > > Writes to loop should update the mtime of the underlying file. > > > > Signed-off-by: Matt Mackall <mpm@selenic.com> > > > > Index: l/drivers/block/loop.c > > =================================================================== > > --- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600 > > +++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600 > > @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d > > offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1); > > bv_offs = bvec->bv_offset; > > len = bvec->bv_len; > > + file_update_time(file); > > while (len > 0) { > > sector_t IV; > > unsigned size; > > @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil > > > > set_fs(get_ds()); > > bw = file->f_op->write(file, buf, len, &pos); > > + file_update_time(file); > > ->write should have already updated the times, no? Yes, this second case is redundant. Still needed in the first case. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 43+ messages in thread
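With the redundant hunk dropped, the loop change reduces to the do_lo_send_aops() hunk alone. Shown here only as a sketch of the trimmed version, not a repost of the patch:

--- l.orig/drivers/block/loop.c
+++ l/drivers/block/loop.c
@@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
 	offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
 	bv_offs = bvec->bv_offset;
 	len = bvec->bv_len;
+	file_update_time(file);
 	while (len > 0) {
 		sector_t IV;
 		unsigned size;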
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 22:47 ` Linus Torvalds 2008-01-18 22:54 ` Rik van Riel @ 2008-01-21 14:25 ` Peter Staubach 2008-01-21 14:36 ` Anton Salikhmetov 1 sibling, 1 reply; 43+ messages in thread From: Peter Staubach @ 2008-01-21 14:25 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Oeser, Miklos Szeredi, peterz, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch Linus Torvalds wrote: > On Fri, 18 Jan 2008, Ingo Oeser wrote: > >> Can we get "if the write to the page hits the disk, the mtime has hit the disk >> already no less than SOME_GRANULARITY before"? >> >> That is very important for computer forensics. Esp. in saving your ass! >> >> Ok, now back again to making that fast :-) >> > > I certainly don't mind it if we have some tighter guarantees, but what I'd > want is: > > - keep it simple. Let's face it, Linux has never ever given those > guarantees before, and it's not is if anybody has really cared. Even > now, the issue seems to be more about paper standards conformance than > anything else. > > I have been working on getting something supported here for because I have some very large Wall Street customers who do care about getting the mtime updated because their backups are getting corrupted. They are incomplete because although their applications update files, they don't get backed up because the mtime never changes. > - I get worried about people playing around with the dirty bit in > particular. We have had some really rather nasty bugs here. Most of > which are totally impossible to trigger under normal loads (for > example the old random-access utorrent writable mmap issue from about > a year ago). > > So these two issues - the big red danger signs flashing in my brain, > coupled with the fact that no application has apparently ever really > noticed in the last 15 years - just makes it a case where I'd like each > step of the way to be obvious and simple and no larger than really > absolutely necessary. Simple is good. However, too simple is not good. I would suggest that we implement file time updates which make sense and if they happen to follow POSIX, then nifty, otherwise, oh well. Thanx... ps ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-21 14:25 ` Peter Staubach @ 2008-01-21 14:36 ` Anton Salikhmetov 0 siblings, 0 replies; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-21 14:36 UTC (permalink / raw) To: Peter Staubach Cc: Linus Torvalds, Ingo Oeser, Miklos Szeredi, peterz, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, jesper.juhl, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/21, Peter Staubach <staubach@redhat.com>: > Linus Torvalds wrote: > > On Fri, 18 Jan 2008, Ingo Oeser wrote: > > > >> Can we get "if the write to the page hits the disk, the mtime has hit the disk > >> already no less than SOME_GRANULARITY before"? > >> > >> That is very important for computer forensics. Esp. in saving your ass! > >> > >> Ok, now back again to making that fast :-) > >> > > > > I certainly don't mind it if we have some tighter guarantees, but what I'd > > want is: > > > > - keep it simple. Let's face it, Linux has never ever given those > > guarantees before, and it's not is if anybody has really cared. Even > > now, the issue seems to be more about paper standards conformance than > > anything else. > > > > > > I have been working on getting something supported here for > because I have some very large Wall Street customers who do > care about getting the mtime updated because their backups > are getting corrupted. They are incomplete because although > their applications update files, they don't get backed up > because the mtime never changes. > > > - I get worried about people playing around with the dirty bit in > > particular. We have had some really rather nasty bugs here. Most of > > which are totally impossible to trigger under normal loads (for > > example the old random-access utorrent writable mmap issue from about > > a year ago). > > > > So these two issues - the big red danger signs flashing in my brain, > > coupled with the fact that no application has apparently ever really > > noticed in the last 15 years - just makes it a case where I'd like each > > step of the way to be obvious and simple and no larger than really > > absolutely necessary. > > Simple is good. However, too simple is not good. I would suggest > that we implement file time updates which make sense and if they > happen to follow POSIX, then nifty, otherwise, oh well. Thank you very much for your support, Peter! I'm going to submit the design document, the next version of the patch series, and the performance tests results soon. > > Thanx... > > ps > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 10:15 ` Peter Zijlstra 2008-01-18 10:25 ` Peter Zijlstra @ 2008-01-18 10:38 ` Miklos Szeredi 2008-01-18 11:00 ` Peter Zijlstra 1 sibling, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 10:38 UTC (permalink / raw) To: a.p.zijlstra Cc: miklos, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote: > > > > diff --git a/mm/msync.c b/mm/msync.c > > > index a4de868..a49af28 100644 > > > --- a/mm/msync.c > > > +++ b/mm/msync.c > > > @@ -13,11 +13,33 @@ > > > #include <linux/syscalls.h> > > > > > > /* > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > > > + * It will force a pagefault on the next write access. > > > + */ > > > +static void vma_wrprotect(struct vm_area_struct *vma) > > > +{ > > > + unsigned long addr; > > > + > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > > > + spinlock_t *ptl; > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > > > + pud_t *pud = pud_offset(pgd, addr); > > > + pmd_t *pmd = pmd_offset(pud, addr); > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > > + > > > + if (pte_dirty(*pte) && pte_write(*pte)) > > > + *pte = pte_wrprotect(*pte); > > > + pte_unmap_unlock(pte, ptl); > > > + } > > > +} > > > > What about ram based filesystems? They don't start out with read-only > > pte's, so I think they don't want them read-protected now either. > > Unless this is essential for correct mtime/ctime accounting on these > > filesystems (I don't think it really is). But then the mapping should > > start out read-only as well, otherwise the time update will only work > > after an msync(MS_ASYNC). > > page_mkclean() has all the needed logic for this, it also walks the rmap > and cleans out all other users, which I think is needed too for > consistencies sake: > > Process A Process B > > mmap(foo.txt) mmap(foo.txt) > > dirty page > dirty page > > msync(MS_ASYNC) > > dirty page > > msync(MS_ASYNC) <--- now what?! Nothing. I think it's perfectly acceptable behavior, if msync in process A doesn't care about any dirtying in process B. > All in all, that sounds rather expensive.. Right. The advantage of Anton's current approach, is that it's at least simple, and possibly not so expensive, while providing same quite sane semantics for MS_ASYNC vs. mtime updates. Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 10:38 ` Miklos Szeredi @ 2008-01-18 11:00 ` Peter Zijlstra 2008-01-18 11:17 ` Miklos Szeredi 0 siblings, 1 reply; 43+ messages in thread From: Peter Zijlstra @ 2008-01-18 11:00 UTC (permalink / raw) To: Miklos Szeredi Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 2008-01-18 at 11:38 +0100, Miklos Szeredi wrote: > > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote: > > > > > > diff --git a/mm/msync.c b/mm/msync.c > > > > index a4de868..a49af28 100644 > > > > --- a/mm/msync.c > > > > +++ b/mm/msync.c > > > > @@ -13,11 +13,33 @@ > > > > #include <linux/syscalls.h> > > > > > > > > /* > > > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only. > > > > + * It will force a pagefault on the next write access. > > > > + */ > > > > +static void vma_wrprotect(struct vm_area_struct *vma) > > > > +{ > > > > + unsigned long addr; > > > > + > > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > > > > + spinlock_t *ptl; > > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > > > > + pud_t *pud = pud_offset(pgd, addr); > > > > + pmd_t *pmd = pmd_offset(pud, addr); > > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > > > + > > > > + if (pte_dirty(*pte) && pte_write(*pte)) > > > > + *pte = pte_wrprotect(*pte); > > > > + pte_unmap_unlock(pte, ptl); > > > > + } > > > > +} > > > > > > What about ram based filesystems? They don't start out with read-only > > > pte's, so I think they don't want them read-protected now either. > > > Unless this is essential for correct mtime/ctime accounting on these > > > filesystems (I don't think it really is). But then the mapping should > > > start out read-only as well, otherwise the time update will only work > > > after an msync(MS_ASYNC). > > > > page_mkclean() has all the needed logic for this, it also walks the rmap > > and cleans out all other users, which I think is needed too for > > consistencies sake: > > > > Process A Process B > > > > mmap(foo.txt) mmap(foo.txt) > > > > dirty page > > dirty page > > > > msync(MS_ASYNC) > > > > dirty page > > > > msync(MS_ASYNC) <--- now what?! 
how about: diff --git a/mm/msync.c b/mm/msync.c index 144a757..a1b3fc6 100644 --- a/mm/msync.c +++ b/mm/msync.c @@ -14,6 +14,122 @@ #include <linux/syscalls.h> #include <linux/sched.h> +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm, + unsigned long addr, unsigned long end) +{ + pte_t *pte; + spinlock_t *ptl; + + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + arch_enter_lazy_mmu_mode(); + do { + pte_t ptent = *pte; + + if (pte_none(ptent)) + continue; + + if (!pte_present(ptent)) + continue; + + if (pte_dirty(ptent) && pte_write(ptent)) { + flush_cache_page(vma, addr, pte_pfn(ptent)); + ptent = ptep_clear_flush(vma, addr, pte); + ptent = pte_wrprotect(ptent); + set_pte_at(vma->vm_mnm, addr, pte, ptent); + } + } while (pte++, addr += PAGE_SIZE, addr != end); + arch_leave_lazy_mmu_mode(); + pte_unmap_unlock(pte - 1, ptl); + + return addr; +} + +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud, + unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (pmd_none_or_clear_bad(pmd)) + continue; + next = masync_pte_range(vma, pmd, addr, next); + } while (pmd++, addr = next, addr != end); + + return addr; +} + +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd, + unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + + pud = pud_offset(pgd, addr); + do { + next = pud_addr_end(addr, end); + if (pud_none_or_clear_bad(pud)) + continue; + next = masync_pmd_range(vma, pud, addr, next); + } while (pud++, addr = next, addr != end); + + return addr; +} + +unsigned long masync_pgd_range() +{ + pgd_t *pgd; + unsigned long next; + + pgd = pgd_offset(vma->vm_mm, addr); + do { + next = pgd_addr_end(addr, end); + if (pgd_none_of_clear_bad(pgd)) + continue; + next = masync_pud_range(vma, pgd, addr, next); + } while (pgd++, addr = next, addr != end); + + return addr; +} + +int masync_vma_one(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + if (start < vma->vm_start) + start = vma->vm_start; + + if (end > vma->vm_end) + end = vma->vm_end; + + masync_pgd_range(vma, start, end); + + return 0; +} + +int masync_vma(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + struct address_space *mapping; + struct vm_area_struct *vma_iter; + + if (!(vma->vm_flags & VM_SHARED)) + return 0; + + mapping = vma->vm_file->f_mapping; + + if (!mapping_cap_account_dirty(mapping)) + return 0; + + spin_lock(&mapping->i_mmap_lock); + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end) + masync_vma_one(vma_iter, start, end); + spin_unlock(&mapping->i_mmap_lock); + + return 0; +} + /* * MS_SYNC syncs the entire file - including mappings. * ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 11:00 ` Peter Zijlstra @ 2008-01-18 11:17 ` Miklos Szeredi 2008-01-18 11:23 ` Peter Zijlstra 0 siblings, 1 reply; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 11:17 UTC (permalink / raw) To: a.p.zijlstra Cc: miklos, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > diff --git a/mm/msync.c b/mm/msync.c > index 144a757..a1b3fc6 100644 > --- a/mm/msync.c > +++ b/mm/msync.c > @@ -14,6 +14,122 @@ > #include <linux/syscalls.h> > #include <linux/sched.h> > > +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm, > + unsigned long addr, unsigned long end) > +{ > + pte_t *pte; > + spinlock_t *ptl; > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > + arch_enter_lazy_mmu_mode(); > + do { > + pte_t ptent = *pte; > + > + if (pte_none(ptent)) > + continue; > + > + if (!pte_present(ptent)) > + continue; > + > + if (pte_dirty(ptent) && pte_write(ptent)) { > + flush_cache_page(vma, addr, pte_pfn(ptent)); Hmm, I'm not sure flush_cache_page() is needed. Or does does dirty data in the cache somehow interfere with the page protection? > + ptent = ptep_clear_flush(vma, addr, pte); > + ptent = pte_wrprotect(ptent); > + set_pte_at(vma->vm_mnm, addr, pte, ptent); > + } > + } while (pte++, addr += PAGE_SIZE, addr != end); > + arch_leave_lazy_mmu_mode(); > + pte_unmap_unlock(pte - 1, ptl); > + > + return addr; > +} > + > +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud, > + unsigned long addr, unsigned long end) > +{ > + pmd_t *pmd; > + unsigned long next; > + > + pmd = pmd_offset(pud, addr); > + do { > + next = pmd_addr_end(addr, end); > + if (pmd_none_or_clear_bad(pmd)) > + continue; > + next = masync_pte_range(vma, pmd, addr, next); > + } while (pmd++, addr = next, addr != end); > + > + return addr; > +} > + > +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd, > + unsigned long addr, unsigned long end) > +{ > + pud_t *pud; > + unsigned long next; > + > + pud = pud_offset(pgd, addr); > + do { > + next = pud_addr_end(addr, end); > + if (pud_none_or_clear_bad(pud)) > + continue; > + next = masync_pmd_range(vma, pud, addr, next); > + } while (pud++, addr = next, addr != end); > + > + return addr; > +} > + > +unsigned long masync_pgd_range() > +{ > + pgd_t *pgd; > + unsigned long next; > + > + pgd = pgd_offset(vma->vm_mm, addr); > + do { > + next = pgd_addr_end(addr, end); > + if (pgd_none_of_clear_bad(pgd)) > + continue; > + next = masync_pud_range(vma, pgd, addr, next); > + } while (pgd++, addr = next, addr != end); > + > + return addr; > +} > + > +int masync_vma_one(struct vm_area_struct *vma, > + unsigned long start, unsigned long end) > +{ > + if (start < vma->vm_start) > + start = vma->vm_start; > + > + if (end > vma->vm_end) > + end = vma->vm_end; > + > + masync_pgd_range(vma, start, end); > + > + return 0; > +} > + > +int masync_vma(struct vm_area_struct *vma, > + unsigned long start, unsigned long end) > +{ > + struct address_space *mapping; > + struct vm_area_struct *vma_iter; > + > + if (!(vma->vm_flags & VM_SHARED)) > + return 0; > + > + mapping = vma->vm_file->f_mapping; > + > + if (!mapping_cap_account_dirty(mapping)) > + return 0; > + > + spin_lock(&mapping->i_mmap_lock); > + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end) > + masync_vma_one(vma_iter, start, end); > + spin_unlock(&mapping->i_mmap_lock); This is 
holding i_mmap_lock for possibly quite long. Isn't that going to cause problems? Miklos > + > > + return 0; > +} > + > /* > * MS_SYNC syncs the entire file - including mappings. > * > > > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 11:17 ` Miklos Szeredi @ 2008-01-18 11:23 ` Peter Zijlstra 2008-01-18 11:36 ` Miklos Szeredi 0 siblings, 1 reply; 43+ messages in thread From: Peter Zijlstra @ 2008-01-18 11:23 UTC (permalink / raw) To: Miklos Szeredi Cc: salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch On Fri, 2008-01-18 at 12:17 +0100, Miklos Szeredi wrote: > > diff --git a/mm/msync.c b/mm/msync.c > > index 144a757..a1b3fc6 100644 > > --- a/mm/msync.c > > +++ b/mm/msync.c > > @@ -14,6 +14,122 @@ > > #include <linux/syscalls.h> > > #include <linux/sched.h> > > > > +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm, > > + unsigned long addr, unsigned long end) > > +{ > > + pte_t *pte; > > + spinlock_t *ptl; > > + > > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + arch_enter_lazy_mmu_mode(); > > + do { > > + pte_t ptent = *pte; > > + > > + if (pte_none(ptent)) > > + continue; > > + > > + if (!pte_present(ptent)) > > + continue; > > + > > + if (pte_dirty(ptent) && pte_write(ptent)) { > > + flush_cache_page(vma, addr, pte_pfn(ptent)); > > Hmm, I'm not sure flush_cache_page() is needed. Or does does dirty > data in the cache somehow interfere with the page protection? No, just being paranoid.. > > + ptent = ptep_clear_flush(vma, addr, pte); > > + ptent = pte_wrprotect(ptent); > > + set_pte_at(vma->vm_mnm, addr, pte, ptent); > > + } > > + } while (pte++, addr += PAGE_SIZE, addr != end); > > + arch_leave_lazy_mmu_mode(); > > + pte_unmap_unlock(pte - 1, ptl); > > + > > + return addr; > > +} > > + > > +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud, > > + unsigned long addr, unsigned long end) > > +{ > > + pmd_t *pmd; > > + unsigned long next; > > + > > + pmd = pmd_offset(pud, addr); > > + do { > > + next = pmd_addr_end(addr, end); > > + if (pmd_none_or_clear_bad(pmd)) > > + continue; > > + next = masync_pte_range(vma, pmd, addr, next); > > + } while (pmd++, addr = next, addr != end); > > + > > + return addr; > > +} > > + > > +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd, > > + unsigned long addr, unsigned long end) > > +{ > > + pud_t *pud; > > + unsigned long next; > > + > > + pud = pud_offset(pgd, addr); > > + do { > > + next = pud_addr_end(addr, end); > > + if (pud_none_or_clear_bad(pud)) > > + continue; > > + next = masync_pmd_range(vma, pud, addr, next); > > + } while (pud++, addr = next, addr != end); > > + > > + return addr; > > +} > > + > > +unsigned long masync_pgd_range() > > +{ > > + pgd_t *pgd; > > + unsigned long next; > > + > > + pgd = pgd_offset(vma->vm_mm, addr); > > + do { > > + next = pgd_addr_end(addr, end); > > + if (pgd_none_of_clear_bad(pgd)) > > + continue; > > + next = masync_pud_range(vma, pgd, addr, next); > > + } while (pgd++, addr = next, addr != end); > > + > > + return addr; > > +} > > + > > +int masync_vma_one(struct vm_area_struct *vma, > > + unsigned long start, unsigned long end) > > +{ > > + if (start < vma->vm_start) > > + start = vma->vm_start; > > + > > + if (end > vma->vm_end) > > + end = vma->vm_end; > > + > > + masync_pgd_range(vma, start, end); > > + > > + return 0; > > +} > > + > > +int masync_vma(struct vm_area_struct *vma, > > + unsigned long start, unsigned long end) > > +{ > > + struct address_space *mapping; > > + struct vm_area_struct *vma_iter; > > + > > + if (!(vma->vm_flags & VM_SHARED)) > > + return 0; > > + > > 
+ mapping = vma->vm_file->f_mapping; > > + > > + if (!mapping_cap_account_dirty(mapping)) > > + return 0; > > + > > + spin_lock(&mapping->i_mmap_lock); > > + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end) > > + masync_vma_one(vma_iter, start, end); > > + spin_unlock(&mapping->i_mmap_lock); > > This is hoding i_mmap_lock for possibly quite long. Isn't that going > to cause problems? Possibly, I didn't see a quick way to break that iteration. >From a quick glance at prio_tree.c the iterator isn't valid anymore after releasing i_mmap_lock. Fixing that would be,.. 'fun'. I also realized I forgot to copy/paste the prio_tree_iter declaration and ought to make all these functions static. But for a quick draft it conveys the idea pretty well, I guess :-) ^ permalink raw reply [flat|nested] 43+ messages in thread
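For anyone following the draft closely, it also carries a few transcription slips besides the missing prio_tree_iter declaration: "pmd_t *pdm" for "pmd_t *pmd", "vma->vm_mnm" for "vma->vm_mm", "pgd_none_of_clear_bad" for "pgd_none_or_clear_bad", and a masync_pgd_range() that lost its (vma, addr, end) argument list. With those repaired and the iterator declared, the rmap-walking entry point would look roughly like the sketch below (same assumptions as the draft itself, including the still-open question of how long i_mmap_lock is held):

static int masync_vma(struct vm_area_struct *vma,
		      unsigned long start, unsigned long end)
{
	struct address_space *mapping;
	struct vm_area_struct *vma_iter;
	struct prio_tree_iter iter;	/* the declaration missing from the draft */

	if (!(vma->vm_flags & VM_SHARED))
		return 0;

	mapping = vma->vm_file->f_mapping;
	if (!mapping_cap_account_dirty(mapping))
		return 0;

	spin_lock(&mapping->i_mmap_lock);
	vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
		masync_vma_one(vma_iter, start, end);
	spin_unlock(&mapping->i_mmap_lock);

	return 0;
}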
* Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files 2008-01-18 11:23 ` Peter Zijlstra @ 2008-01-18 11:36 ` Miklos Szeredi 0 siblings, 0 replies; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 11:36 UTC (permalink / raw) To: a.p.zijlstra Cc: miklos, salikhmetov, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, akpm, protasnb, r.e.wolff, hidave.darkstar, hch > Possibly, I didn't see a quick way to break that iteration. > >From a quick glance at prio_tree.c the iterator isn't valid anymore > after releasing i_mmap_lock. Fixing that would be,.. 'fun'. Maybe i_mmap_lock isn't needed at all, since msync holds mmap_sem, which protects the prio tree as well, no? > I also realized I forgot to copy/paste the prio_tree_iter declaration > and ought to make all these functions static. > > But for a quick draft it conveys the idea pretty well, I guess :-) Yes :) There could also be nasty performance corner cases, like having a huge file mapped thousands of times, and having only a couple of pages dirtied between MS_ASYNC invocations. Then most of that page table walking would be just unnecessary overhead. There's something to be said for walking only the dirty pages, and doing page_mkclean on them, even if in some cases that would be slower. But I have a strong feeling of deja vu, and last time it ended with Andrew not liking the whole thing... Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
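The "walk only the dirty pages" alternative can be sketched as well. The helper below is an illustration of the idea rather than a posted patch; it assumes the caller has already translated the msync() range into page offsets within the file, looks dirty pages up through the radix tree's dirty tag, and lets page_mkclean() write-protect every mapping of each page:

static void masync_clean_pages(struct address_space *mapping,
			       pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;

	pagevec_init(&pvec, 0);
	while (start <= end) {
		unsigned int i, nr;

		nr = pagevec_lookup_tag(&pvec, mapping, &start,
					PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE);
		if (!nr)
			break;
		for (i = 0; i < nr; i++) {
			struct page *page = pvec.pages[i];

			if (page->index > end)
				break;
			lock_page(page);
			if (page->mapping == mapping)
				/* write-protects all PTEs mapping this page */
				page_mkclean(page);
			unlock_page(page);
		}
		pagevec_release(&pvec);
		if (i < nr)
			break;	/* walked past the end of the range */
	}
}

Compared with a full page-table walk this touches only pages that are actually dirty, at the cost of taking the page lock and walking the rmap for each of them, which is exactly the trade-off described above.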
* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times 2008-01-17 22:31 [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Anton Salikhmetov 2008-01-17 22:31 ` [PATCH -v6 1/2] Massive code cleanup of sys_msync() Anton Salikhmetov 2008-01-17 22:31 ` [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files Anton Salikhmetov @ 2008-01-18 9:40 ` Miklos Szeredi 2008-01-18 10:31 ` Anton Salikhmetov 2008-01-18 19:48 ` Anton Salikhmetov 2 siblings, 2 replies; 43+ messages in thread From: Miklos Szeredi @ 2008-01-18 9:40 UTC (permalink / raw) To: salikhmetov Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, miklos, r.e.wolff, hidave.darkstar, hch > 4. Performance test was done using the program available from the > following link: > > http://bugzilla.kernel.org/attachment.cgi?id=14493 > > Result: the impact of the changes was negligible for files of a few > hundred megabytes. Could you also test with ext4 and post some numbers? Afaik, ext4 uses nanosecond timestamps, so the time updating code would be exercised more during the page faults. What about performance impact on msync(MS_ASYNC)? Could you please do some measurment of that as well? Thanks, Miklos ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times 2008-01-18 9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi @ 2008-01-18 10:31 ` Anton Salikhmetov 2008-01-18 19:48 ` Anton Salikhmetov 1 sibling, 0 replies; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 10:31 UTC (permalink / raw) To: Miklos Szeredi Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, r.e.wolff, hidave.darkstar, hch 2008/1/18, Miklos Szeredi <miklos@szeredi.hu>: > > 4. Performance test was done using the program available from the > > following link: > > > > http://bugzilla.kernel.org/attachment.cgi?id=14493 > > > > Result: the impact of the changes was negligible for files of a few > > hundred megabytes. > > Could you also test with ext4 and post some numbers? Afaik, ext4 uses > nanosecond timestamps, so the time updating code would be exercised > more during the page faults. > > What about performance impact on msync(MS_ASYNC)? Could you please do > some measurment of that as well? I'll do the measurements for the MS_ASYNC case and for the Ext4 filesystem. > > Thanks, > Miklos > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times 2008-01-18 9:40 ` [PATCH -v6 0/2] Fixing the issue with memory-mapped file times Miklos Szeredi 2008-01-18 10:31 ` Anton Salikhmetov @ 2008-01-18 19:48 ` Anton Salikhmetov 2008-01-19 10:45 ` Miklos Szeredi 1 sibling, 1 reply; 43+ messages in thread From: Anton Salikhmetov @ 2008-01-18 19:48 UTC (permalink / raw) To: Miklos Szeredi Cc: linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, r.e.wolff, hidave.darkstar, hch
2008/1/18, Miklos Szeredi <miklos@szeredi.hu>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers? Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)? Could you please do
> some measurment of that as well?

Did a quick test on an ext4 partition. This is how it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin 1200662360 1200662360 1200662353
write 1200662361 1200662361 1200662353
mmap 1200662361 1200662361 1200662362
b 1200662363 1200662363 1200662362
msync b 1200662363 1200662363 1200662362
c 1200662365 1200662365 1200662362
msync c 1200662365 1200662365 1200662362
d 1200662367 1200662367 1200662362
munmap 1200662367 1200662367 1200662362
close 1200662367 1200662367 1200662362
sync 1200662367 1200662367 1200662362
debian:~/miklos# mount | grep /mnt
/root/image.ext4 on /mnt type ext4dev (rw,loop=/dev/loop0)

> What about performance impact on msync(MS_ASYNC)? Could you please do
> some measurment of that as well?

Following are the results of the measurements. Here are the relevant
portions of the test program:

>>>
#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
        tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
        p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(&tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(&tv_stop, NULL);
<<<

For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.
2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB), Memory: 15.98 GB.
3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs. Memory: 4G.

The tests of my solution were performed using the following platform:
A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R)
Core(TM)2 Duo CPU T7300 @ 2.00GHz. Further referred to as "the first case".

The following tables give the time difference between the two calls to
gettimeofday(). The test program was run three times in a row with a
delay of one second between consecutive runs. On Linux systems, the
following commands were issued prior to running the tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 30000 >/proc/sys/vm/dirty_expire_centisecs
sync

Table 1. Reference platforms.
------------------------------------------------------------
|            | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD    |
------------------------------------------------------------
| First run  | 263405 usec   | 202283 usec   | 90 SECONDS |
------------------------------------------------------------
| Second run | 262253 usec   | 172837 usec   | 90 SECONDS |
------------------------------------------------------------
| Third run  | 238465 usec   | 238465 usec   | 90 SECONDS |
------------------------------------------------------------

It looks like FreeBSD is a clear outlier here. Note that FreeBSD
showed an almost linear dependence of the time spent in the
msync(MS_ASYNC) call on the file size.

Table 2. The Qemu system. File size is 512M.

---------------------------------------------------
|            | Before the patch | After the patch |
---------------------------------------------------
| First run  | 35 usec          | 5852 usec       |
---------------------------------------------------
| Second run | 35 usec          | 4444 usec       |
---------------------------------------------------
| Third run  | 35 usec          | 6330 usec       |
---------------------------------------------------

I think that the data above prove the viability of the solution I
presented. Also, I guess that this bug fix is most probably ready to
go upstream. Please apply the sixth version of my solution.

>
> Thanks,
> Miklos
>
>
^ permalink raw reply [flat|nested] 43+ messages in thread
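For anyone wanting to reproduce the MS_ASYNC timing above, a self-contained variant of the quoted fragment might look like the program below. The complete test program is only available as the bugzilla attachment, so the file handling, defaults, and output format here are filled in as assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>

#define FILE_SIZE (1024 * 1024 * 512)

int main(int argc, char *argv[])
{
	struct timeval tv_start, tv_stop;
	volatile char tmp;
	char *p;
	long i, usec;
	int fd;

	fd = open(argc > 1 ? argv[1] : "testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0) {
		perror("open/ftruncate");
		return 1;
	}
	p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Bring the pages in */
	for (i = 0; i < FILE_SIZE; i += 4096)
		tmp = p[i];
	/* Dirty the pages */
	for (i = 0; i < FILE_SIZE; i += 4096)
		p[i] = i;
	/* How long did we spend in msync(MS_ASYNC)? */
	gettimeofday(&tv_start, NULL);
	msync(p, FILE_SIZE, MS_ASYNC);
	gettimeofday(&tv_stop, NULL);
	usec = (tv_stop.tv_sec - tv_start.tv_sec) * 1000000L +
	       (tv_stop.tv_usec - tv_start.tv_usec);
	printf("msync(MS_ASYNC): %ld usec\n", usec);
	munmap(p, FILE_SIZE);
	close(fd);
	return 0;
}

Running it against a file on each filesystem under test, and repeating a few times as in the tables above, should give comparable numbers with and without the patch applied.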
* Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times 2008-01-18 19:48 ` Anton Salikhmetov @ 2008-01-19 10:45 ` Miklos Szeredi 0 siblings, 0 replies; 43+ messages in thread From: Miklos Szeredi @ 2008-01-19 10:45 UTC (permalink / raw) To: salikhmetov Cc: miklos, linux-mm, jakob, linux-kernel, valdis.kletnieks, riel, ksm, staubach, jesper.juhl, torvalds, a.p.zijlstra, akpm, protasnb, r.e.wolff, hidave.darkstar, hch
> 2008/1/18, Miklos Szeredi <miklos@szeredi.hu>:
> > > 4. Performance test was done using the program available from the
> > > following link:
> > >
> > > http://bugzilla.kernel.org/attachment.cgi?id=14493
> > >
> > > Result: the impact of the changes was negligible for files of a few
> > > hundred megabytes.
> >
> > Could you also test with ext4 and post some numbers? Afaik, ext4 uses
> > nanosecond timestamps, so the time updating code would be exercised
> > more during the page faults.
> >
> > What about performance impact on msync(MS_ASYNC)? Could you please do
> > some measurment of that as well?
>
> Did a quick test on an ext4 partition. This is how it looks like:

Thanks for running these tests. I was more interested in the slowdown
on ext4 (checked with the above mentioned program). Can you do such a
test as well, and post resulting times with and without the patch?

> Table 1. Reference platforms.
>
> ------------------------------------------------------------
> |            | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD    |
> ------------------------------------------------------------
> | First run  | 263405 usec   | 202283 usec   | 90 SECONDS |
> ------------------------------------------------------------
> | Second run | 262253 usec   | 172837 usec   | 90 SECONDS |
> ------------------------------------------------------------
> | Third run  | 238465 usec   | 238465 usec   | 90 SECONDS |
> ------------------------------------------------------------
>
> It looks like FreeBSD is a clear outlier here. Note that FreeBSD
> showed an almost linear dependence of the time spent in the
> msync(MS_ASYNC) call on the file size.
>
> Table 2. The Qemu system. File size is 512M.
>
> ---------------------------------------------------
> |            | Before the patch | After the patch |
> ---------------------------------------------------
> | First run  | 35 usec          | 5852 usec       |
> ---------------------------------------------------
> | Second run | 35 usec          | 4444 usec       |
> ---------------------------------------------------
> | Third run  | 35 usec          | 6330 usec       |
> ---------------------------------------------------

Interesting.

Thanks,
Miklos

^ permalink raw reply [flat|nested] 43+ messages in thread