* [patch 1/5] avoid tlb gather restarts.
From: Martin Schwidefsky @ 2007-06-29 13:55 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: Martin Schwidefsky
[-- Attachment #1: 002-flush-restarts.diff --]
[-- Type: text/plain, Size: 1309 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
If need_resched() is false it is unnecessary to call tlb_finish_mmu()
and tlb_gather_mmu() for each vma in unmap_vmas(). Moving the tlb gather
restart under the if that contains the cond_resched() will avoid
unnecessary tlb flush operations that are triggered by tlb_finish_mmu()
and tlb_gather_mmu().
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
mm/memory.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff -urpN linux-2.6/mm/memory.c linux-2.6-patched/mm/memory.c
--- linux-2.6/mm/memory.c 2007-06-29 15:44:08.000000000 +0200
+++ linux-2.6-patched/mm/memory.c 2007-06-29 15:44:08.000000000 +0200
@@ -851,19 +851,18 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
- tlb_finish_mmu(*tlbp, tlb_start, start);
-
if (need_resched() ||
(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
+ tlb_finish_mmu(*tlbp, tlb_start, start);
if (i_mmap_lock) {
*tlbp = NULL;
goto out;
}
cond_resched();
+ *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+ tlb_start_valid = 0;
}
- *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
- tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [patch 1/5] avoid tlb gather restarts.
From: Hugh Dickins @ 2007-06-29 18:56 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: linux-kernel, linux-mm
I don't dare comment on your page_mkclean_one patch (5/5),
that dirty page business has grown too subtle for me.
Your cleanups 2-4 look good, especially the mm_types.h one (how
confident are you that everything builds?), and I'm glad we can
now lay ptep_establish to rest. Though I think you may have
missed removing a __HAVE_ARCH_PTEP... from frv at least?
But this one...
On Fri, 29 Jun 2007, Martin Schwidefsky wrote:
> If need_resched() is false it is unnecessary to call tlb_finish_mmu()
> and tlb_gather_mmu() for each vma in unmap_vmas(). Moving the tlb gather
> restart under the if that contains the cond_resched() will avoid
> unnecessary tlb flush operations that are triggered by tlb_finish_mmu()
> and tlb_gather_mmu().
>
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Sorry, no. It looks reasonable, but unmap_vmas is treading a delicate
and uncomfortable line between hi-performance and lo-latency: you've
chosen to improve performance at the expense of latency.
You think you're just moving the finish/gather to where they're
actually necessary; but the thing is, that per-cpu struct mmu_gather
is liable to accumulate a lot of unpreemptible work for the future
tlb_finish_mmu, particularly when anon pages are associated with swap.
So although there may be no need to resched right now, if we keep on
gathering more and more without flushing, we'll be very unresponsive
when a resched is needed later on. Hence Ingo's ZAP_BLOCK_SIZE to
split it up, small when CONFIG_PREEMPT, more reasonable but still
limited when not.
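(For reference, the mm/memory.c of that era defines the block size roughly as below; quoted from memory, so treat it as a sketch rather than the exact source:)

#ifdef CONFIG_PREEMPT
# define ZAP_BLOCK_SIZE	(8 * PAGE_SIZE)
#else
/* No preempt: go for improved straight-line efficiency */
# define ZAP_BLOCK_SIZE	(1024 * PAGE_SIZE)
#endif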
I expect there is some tinkering which could be done to improve it a
little; but my ambition has always been to eliminate ZAP_BLOCK_SIZE,
get away from the per-cpu'ness of the mmu_gather, and make unmap_vmas
preemptible. But the i_mmap_lock case, and the per-arch variations
in TLB flushing, have forever stalled me.
Hugh
> ---
>
> mm/memory.c | 7 +++----
> 1 files changed, 3 insertions(+), 4 deletions(-)
>
> diff -urpN linux-2.6/mm/memory.c linux-2.6-patched/mm/memory.c
> --- linux-2.6/mm/memory.c 2007-06-29 15:44:08.000000000 +0200
> +++ linux-2.6-patched/mm/memory.c 2007-06-29 15:44:08.000000000 +0200
> @@ -851,19 +851,18 @@ unsigned long unmap_vmas(struct mmu_gath
> break;
> }
>
> - tlb_finish_mmu(*tlbp, tlb_start, start);
> -
> if (need_resched() ||
> (i_mmap_lock && need_lockbreak(i_mmap_lock))) {
> + tlb_finish_mmu(*tlbp, tlb_start, start);
> if (i_mmap_lock) {
> *tlbp = NULL;
> goto out;
> }
> cond_resched();
> + *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
> + tlb_start_valid = 0;
> }
>
> - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
> - tlb_start_valid = 0;
> zap_work = ZAP_BLOCK_SIZE;
> }
> }
* Re: [patch 1/5] avoid tlb gather restarts.
From: Martin Schwidefsky @ 2007-06-29 21:19 UTC (permalink / raw)
To: Hugh Dickins; +Cc: linux-kernel, linux-mm
On Fri, 2007-06-29 at 19:56 +0100, Hugh Dickins wrote:
> I don't dare comment on your page_mkclean_one patch (5/5),
> that dirty page business has grown too subtle for me.
Oh yes, the dirty handling is tricky. I had to fix a really nasty bug
with it lately. As for page_mkclean_one the difference is that it
doesn't claim a page is dirty if only the write protect bit has not been
set. If we manage to lose dirty bits from ptes and have to rely on the
write protect bit to take over the job, then we have a different problem
altogether, no?
> Your cleanups 2-4 look good, especially the mm_types.h one (how
> confident are you that everything builds?), and I'm glad we can
> now lay ptep_establish to rest. Though I think you may have
> missed removing a __HAVE_ARCH_PTEP... from frv at least?
Ok, thanks for the review. I'll take a look at frv to see if I missed
something.
> But this one...
>
> On Fri, 29 Jun 2007, Martin Schwidefsky wrote:
>
> > If need_resched() is false it is unnecessary to call tlb_finish_mmu()
> > and tlb_gather_mmu() for each vma in unmap_vmas(). Moving the tlb gather
> > restart under the if that contains the cond_resched() will avoid
> > unnecessary tlb flush operations that are triggered by tlb_finish_mmu()
> > and tlb_gather_mmu().
> >
> > Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>
> Sorry, no. It looks reasonable, but unmap_vmas is treading a delicate
> and uncomfortable line between hi-performance and lo-latency: you've
> chosen to improve performance at the expense of latency.
That is true, my only concern had been performance. You likely have a
point here.
> You think you're just moving the finish/gather to where they're
> actually necessary; but the thing is, that per-cpu struct mmu_gather
> is liable to accumulate a lot of unpreemptible work for the future
> tlb_finish_mmu, particularly when anon pages are associated with swap.
Hmm, ok, so you are saying that we should do a flush at the end of each
vma.
> So although there may be no need to resched right now, if we keep on
> gathering more and more without flushing, we'll be very unresponsive
> when a resched is needed later on. Hence Ingo's ZAP_BLOCK_SIZE to
> split it up, small when CONFIG_PREEMPT, more reasonable but still
> limited when not.
Would it be acceptable to call tlb_flush_mmu instead of the
tlb_finish_mmu / tlb_gather_mmu pair if the condition around
cond_resched evaluates to false?
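(For illustration, a rough sketch of that idea against the loop tail shown in the patch above — not a tested change, and it assumes every architecture provides a tlb_flush_mmu() with the asm-generic signature:)

		if (need_resched() ||
		    (i_mmap_lock && need_lockbreak(i_mmap_lock))) {
			tlb_finish_mmu(*tlbp, tlb_start, start);
			if (i_mmap_lock) {
				*tlbp = NULL;
				goto out;
			}
			cond_resched();
			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
			tlb_start_valid = 0;
		} else {
			/* drain the gathered pages without a full restart */
			tlb_flush_mmu(*tlbp, tlb_start, start);
			tlb_start_valid = 0;
		}
		zap_work = ZAP_BLOCK_SIZE;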
The background for this change is that I'm working on another patch that
will change the tlb flushing for s390 quite a bit. We won't have
anything to flush with tlb_finish_mmu because we will either flush all
tlbs with tlb_gather_mmu or each pte separately. The pages will always be
freed immediately. If we are forced to restart the tlb gather then we'll
do multiple flush_tlb_mm because the information that we already flushed
everything is lost with tlb_finish_mmu.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [patch 1/5] avoid tlb gather restarts.
From: Hugh Dickins @ 2007-06-30 13:16 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: linux-kernel, linux-mm
On Fri, 29 Jun 2007, Martin Schwidefsky wrote:
> On Fri, 2007-06-29 at 19:56 +0100, Hugh Dickins wrote:
> > I don't dare comment on your page_mkclean_one patch (5/5),
> > that dirty page business has grown too subtle for me.
>
> Oh yes, the dirty handling is tricky....
I'll move that discussion over to 5/5 and Cc Peter
(sorry I was too lazy to do so in the first place).
> > On Fri, 29 Jun 2007, Martin Schwidefsky wrote:
> > You think you're just moving the finish/gather to where they're
> > actually necessary; but the thing is, that per-cpu struct mmu_gather
> > is liable to accumulate a lot of unpreemptible work for the future
> > tlb_finish_mmu, particularly when anon pages are associated with swap.
>
> Hmm, ok, so you are saying that we should do a flush at the end of each
> vma.
I think of it as doing a flush every ZAP_BLOCK_SIZE, with the imperfect
structure of the loop forcing perhaps an early flush at the end of each
vma: I seem to assume large vmas, and you to assume small ones.
IIRC, the common case for doing multiple vmas here is exit, when it
ends up that the TLB flush can often be skipped because already done
by the switch from exiting task; so the premature flush per vma doesn't
matter much. But treat that claim with maximum scepticism: I've not
rechecked it, several aspects may be wrong. What I do remember is
that (at least on i386) there's a lot less actual TLB flushing done
here than it appears from the outside.
> > So although there may be no need to resched right now, if we keep on
> > gathering more and more without flushing, we'll be very unresponsive
> > when a resched is needed later on. Hence Ingo's ZAP_BLOCK_SIZE to
> > split it up, small when CONFIG_PREEMPT, more reasonable but still
> > limited when not.
>
> Would it be acceptable to call tlb_flush_mmu instead of the
> tlb_finish_mmu / tlb_gather_mmu pair if the condition around
> cond_resched evaluates to false?
That sounds a good idea, yes, that should be fine. But beware,
tlb_flush_mmu is an internal detail of the asm-generic/tlb.h method
and perhaps some others, it currently doesn't exist on some arches.
I think you just need to add a simple one to arm & arm26, and take
the "ia64_" off the ia64 one. powerpc and sparc64 go about it all
a bit differently, but it should be easy to give them one too.
There may be some others missing.
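(For reference, the asm-generic/tlb.h helper looks roughly like this; an arch-specific version mainly has to clear the pending-flush state, invalidate the TLB and free any gathered pages:)

static inline void
tlb_flush_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
	if (!tlb->need_flush)
		return;
	tlb->need_flush = 0;
	tlb_flush(tlb);
	if (!tlb_fast_mode(tlb)) {
		free_pages_and_swap_cache(tlb->pages, tlb->nr);
		tlb->nr = 0;
	}
}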
> The background for this change is that I'm working on another patch that
> will change the tlb flushing for s390 quite a bit. We won't have
> anything to flush with tlb_finish_mmu because we will either flush all
> tlbs with tlb_gather_mmu or each pte separately. The pages will always be
> freed immediately. If we are forced to restart the tlb gather then we'll
> do multiple flush_tlb_mm because the information that we already flushed
> everything is lost with tlb_finish_mmu.
Thanks for the info. Sounds like we may have trouble ahead when
rearranging this stuff, easy to forget s390 from our assumptions:
keep watch!
Hugh
* [patch 2/5] remove ptep_establish.
From: Martin Schwidefsky @ 2007-06-29 13:55 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: Martin Schwidefsky
[-- Attachment #1: 003-ptep-establish.diff --]
[-- Type: text/plain, Size: 2990 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
The last user of ptep_establish in mm/ is long gone. Remove the
architecture primitive as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
include/asm-generic/pgtable.h | 19 -------------------
include/asm-i386/pgtable.h | 11 -----------
include/asm-ia64/pgtable.h | 6 ++++--
3 files changed, 4 insertions(+), 32 deletions(-)
diff -urpN linux-2.6/include/asm-generic/pgtable.h linux-2.6-patched/include/asm-generic/pgtable.h
--- linux-2.6/include/asm-generic/pgtable.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-generic/pgtable.h 2007-06-29 15:44:10.000000000 +0200
@@ -3,25 +3,6 @@
#ifndef __ASSEMBLY__
-#ifndef __HAVE_ARCH_PTEP_ESTABLISH
-/*
- * Establish a new mapping:
- * - flush the old one
- * - update the page tables
- * - inform the TLB about the new one
- *
- * We hold the mm semaphore for reading, and the pte lock.
- *
- * Note: the old pte is known to not be writable, so we don't need to
- * worry about dirty bits etc getting lost.
- */
-#define ptep_establish(__vma, __address, __ptep, __entry) \
-do { \
- set_pte_at((__vma)->vm_mm, (__address), __ptep, __entry); \
- flush_tlb_page(__vma, __address); \
-} while (0)
-#endif
-
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
* Largely same as above, but only sets the access flags (dirty,
diff -urpN linux-2.6/include/asm-i386/pgtable.h linux-2.6-patched/include/asm-i386/pgtable.h
--- linux-2.6/include/asm-i386/pgtable.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-i386/pgtable.h 2007-06-29 15:44:10.000000000 +0200
@@ -317,17 +317,6 @@ static inline pte_t native_local_ptep_ge
__ret; \
})
-/*
- * Rules for using ptep_establish: the pte MUST be a user pte, and
- * must be a present->present transition.
- */
-#define __HAVE_ARCH_PTEP_ESTABLISH
-#define ptep_establish(vma, address, ptep, pteval) \
-do { \
- set_pte_present((vma)->vm_mm, address, ptep, pteval); \
- flush_tlb_page(vma, address); \
-} while (0)
-
#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define ptep_clear_flush_dirty(vma, address, ptep) \
({ \
diff -urpN linux-2.6/include/asm-ia64/pgtable.h linux-2.6-patched/include/asm-ia64/pgtable.h
--- linux-2.6/include/asm-ia64/pgtable.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-ia64/pgtable.h 2007-06-29 15:44:10.000000000 +0200
@@ -546,8 +546,10 @@ extern void lazy_mmu_prot_update (pte_t
# define ptep_set_access_flags(__vma, __addr, __ptep, __entry, __safely_writable) \
({ \
int __changed = !pte_same(*(__ptep), __entry); \
- if (__changed) \
- ptep_establish(__vma, __addr, __ptep, __entry); \
+ if (__changed) { \
+ set_pte_at((__vma)->vm_mm, (__addr), __ptep, __entry); \
+ flush_tlb_page(__vma, __addr); \
+ } \
__changed; \
})
#endif
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* [patch 3/5] remove ptep_test_and_clear_dirty and ptep_clear_flush_dirty.
From: Martin Schwidefsky @ 2007-06-29 13:55 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: Martin Schwidefsky
[-- Attachment #1: 004-ptep-clear-dirty.diff --]
[-- Type: text/plain, Size: 14173 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Nobody is using ptep_test_and_clear_dirty and ptep_clear_flush_dirty.
Remove the functions from all architectures.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
include/asm-frv/pgtable.h | 7 -------
include/asm-generic/pgtable.h | 25 -------------------------
include/asm-i386/pgtable.h | 21 ---------------------
include/asm-ia64/pgtable.h | 17 -----------------
include/asm-m32r/pgtable.h | 6 ------
include/asm-parisc/pgtable.h | 16 ----------------
include/asm-powerpc/pgtable-ppc32.h | 7 -------
include/asm-powerpc/pgtable-ppc64.h | 31 -------------------------------
include/asm-ppc/pgtable.h | 7 -------
include/asm-s390/pgtable.h | 15 ---------------
include/asm-x86_64/pgtable.h | 8 --------
include/asm-xtensa/pgtable.h | 12 ------------
12 files changed, 172 deletions(-)
diff -urpN linux-2.6/include/asm-frv/pgtable.h linux-2.6-patched/include/asm-frv/pgtable.h
--- linux-2.6/include/asm-frv/pgtable.h 2007-05-09 09:58:15.000000000 +0200
+++ linux-2.6-patched/include/asm-frv/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -394,13 +394,6 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte &= ~_PAGE_WP; return pte; }
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- int i = test_and_clear_bit(_PAGE_BIT_DIRTY, ptep);
- asm volatile("dcf %M0" :: "U"(*ptep));
- return i;
-}
-
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
int i = test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep);
diff -urpN linux-2.6/include/asm-generic/pgtable.h linux-2.6-patched/include/asm-generic/pgtable.h
--- linux-2.6/include/asm-generic/pgtable.h 2007-06-29 15:44:11.000000000 +0200
+++ linux-2.6-patched/include/asm-generic/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -49,31 +49,6 @@
})
#endif
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep) \
-({ \
- pte_t __pte = *__ptep; \
- int r = 1; \
- if (!pte_dirty(__pte)) \
- r = 0; \
- else \
- set_pte_at((__vma)->vm_mm, (__address), (__ptep), \
- pte_mkclean(__pte)); \
- r; \
-})
-#endif
-
-#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep) \
-({ \
- int __dirty; \
- __dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep); \
- if (__dirty) \
- flush_tlb_page(__vma, __address); \
- __dirty; \
-})
-#endif
-
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear(__mm, __address, __ptep) \
({ \
diff -urpN linux-2.6/include/asm-i386/pgtable.h linux-2.6-patched/include/asm-i386/pgtable.h
--- linux-2.6/include/asm-i386/pgtable.h 2007-06-29 15:44:11.000000000 +0200
+++ linux-2.6-patched/include/asm-i386/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -295,17 +295,6 @@ static inline pte_t native_local_ptep_ge
__changed; \
})
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(vma, addr, ptep) ({ \
- int __ret = 0; \
- if (pte_dirty(*(ptep))) \
- __ret = test_and_clear_bit(_PAGE_BIT_DIRTY, \
- &(ptep)->pte_low); \
- if (__ret) \
- pte_update((vma)->vm_mm, addr, ptep); \
- __ret; \
-})
-
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define ptep_test_and_clear_young(vma, addr, ptep) ({ \
int __ret = 0; \
@@ -317,16 +306,6 @@ static inline pte_t native_local_ptep_ge
__ret; \
})
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(vma, address, ptep) \
-({ \
- int __dirty; \
- __dirty = ptep_test_and_clear_dirty((vma), (address), (ptep)); \
- if (__dirty) \
- flush_tlb_page(vma, address); \
- __dirty; \
-})
-
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young(vma, address, ptep) \
({ \
diff -urpN linux-2.6/include/asm-ia64/pgtable.h linux-2.6-patched/include/asm-ia64/pgtable.h
--- linux-2.6/include/asm-ia64/pgtable.h 2007-06-29 15:44:11.000000000 +0200
+++ linux-2.6-patched/include/asm-ia64/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -398,22 +398,6 @@ ptep_test_and_clear_young (struct vm_are
#endif
}
-static inline int
-ptep_test_and_clear_dirty (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
-#ifdef CONFIG_SMP
- if (!pte_dirty(*ptep))
- return 0;
- return test_and_clear_bit(_PAGE_D_BIT, ptep);
-#else
- pte_t pte = *ptep;
- if (!pte_dirty(pte))
- return 0;
- set_pte_at(vma->vm_mm, addr, ptep, pte_mkclean(pte));
- return 1;
-#endif
-}
-
static inline pte_t
ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
@@ -593,7 +577,6 @@ extern void lazy_mmu_prot_update (pte_t
#endif
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
diff -urpN linux-2.6/include/asm-m32r/pgtable.h linux-2.6-patched/include/asm-m32r/pgtable.h
--- linux-2.6/include/asm-m32r/pgtable.h 2007-05-12 20:16:10.000000000 +0200
+++ linux-2.6-patched/include/asm-m32r/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -284,11 +284,6 @@ static inline pte_t pte_mkwrite(pte_t pt
return pte;
}
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep);
-}
-
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep);
@@ -382,7 +377,6 @@ static inline void pmd_set(pmd_t * pmdp,
remap_pfn_range(vma, vaddr, pfn, size, prot)
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
diff -urpN linux-2.6/include/asm-parisc/pgtable.h linux-2.6-patched/include/asm-parisc/pgtable.h
--- linux-2.6/include/asm-parisc/pgtable.h 2007-05-09 09:58:15.000000000 +0200
+++ linux-2.6-patched/include/asm-parisc/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -451,21 +451,6 @@ static inline int ptep_test_and_clear_yo
#endif
}
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
-#ifdef CONFIG_SMP
- if (!pte_dirty(*ptep))
- return 0;
- return test_and_clear_bit(xlate_pabit(_PAGE_DIRTY_BIT), &pte_val(*ptep));
-#else
- pte_t pte = *ptep;
- if (!pte_dirty(pte))
- return 0;
- set_pte_at(vma->vm_mm, addr, ptep, pte_mkclean(pte));
- return 1;
-#endif
-}
-
extern spinlock_t pa_dbit_lock;
struct mm_struct;
@@ -533,7 +518,6 @@ static inline void ptep_set_wrprotect(st
#define HAVE_ARCH_UNMAPPED_AREA
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
diff -urpN linux-2.6/include/asm-powerpc/pgtable-ppc32.h linux-2.6-patched/include/asm-powerpc/pgtable-ppc32.h
--- linux-2.6/include/asm-powerpc/pgtable-ppc32.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-powerpc/pgtable-ppc32.h 2007-06-29 15:44:11.000000000 +0200
@@ -643,13 +643,6 @@ static inline int __ptep_test_and_clear_
#define ptep_test_and_clear_young(__vma, __addr, __ptep) \
__ptep_test_and_clear_young((__vma)->vm_mm->context.id, __addr, __ptep)
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
-{
- return (pte_update(ptep, (_PAGE_DIRTY | _PAGE_HWWRITE), 0) & _PAGE_DIRTY) != 0;
-}
-
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
diff -urpN linux-2.6/include/asm-powerpc/pgtable-ppc64.h linux-2.6-patched/include/asm-powerpc/pgtable-ppc64.h
--- linux-2.6/include/asm-powerpc/pgtable-ppc64.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-powerpc/pgtable-ppc64.h 2007-06-29 15:44:11.000000000 +0200
@@ -307,29 +307,6 @@ static inline int __ptep_test_and_clear_
__r; \
})
-/*
- * On RW/DIRTY bit transitions we can avoid flushing the hpte. For the
- * moment we always flush but we need to fix hpte_update and test if the
- * optimisation is worth it.
- */
-static inline int __ptep_test_and_clear_dirty(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
-{
- unsigned long old;
-
- if ((pte_val(*ptep) & _PAGE_DIRTY) == 0)
- return 0;
- old = pte_update(mm, addr, ptep, _PAGE_DIRTY, 0);
- return (old & _PAGE_DIRTY) != 0;
-}
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __addr, __ptep) \
-({ \
- int __r; \
- __r = __ptep_test_and_clear_dirty((__vma)->vm_mm, __addr, __ptep); \
- __r; \
-})
-
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
@@ -357,14 +334,6 @@ static inline void ptep_set_wrprotect(st
__young; \
})
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep) \
-({ \
- int __dirty = __ptep_test_and_clear_dirty((__vma)->vm_mm, __address, \
- __ptep); \
- __dirty; \
-})
-
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
diff -urpN linux-2.6/include/asm-ppc/pgtable.h linux-2.6-patched/include/asm-ppc/pgtable.h
--- linux-2.6/include/asm-ppc/pgtable.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-ppc/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -664,13 +664,6 @@ static inline int __ptep_test_and_clear_
#define ptep_test_and_clear_young(__vma, __addr, __ptep) \
__ptep_test_and_clear_young((__vma)->vm_mm->context.id, __addr, __ptep)
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
-{
- return (pte_update(ptep, (_PAGE_DIRTY | _PAGE_HWWRITE), 0) & _PAGE_DIRTY) != 0;
-}
-
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
diff -urpN linux-2.6/include/asm-s390/pgtable.h linux-2.6-patched/include/asm-s390/pgtable.h
--- linux-2.6/include/asm-s390/pgtable.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-s390/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -677,19 +677,6 @@ ptep_clear_flush_young(struct vm_area_st
return ptep_test_and_clear_young(vma, address, ptep);
}
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- return 0;
-}
-
-static inline int
-ptep_clear_flush_dirty(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep)
-{
- /* No need to flush TLB; bits are in storage key */
- return ptep_test_and_clear_dirty(vma, address, ptep);
-}
-
static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
pte_t pte = *ptep;
@@ -952,8 +939,6 @@ extern void memmap_init(unsigned long, i
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
diff -urpN linux-2.6/include/asm-x86_64/pgtable.h linux-2.6-patched/include/asm-x86_64/pgtable.h
--- linux-2.6/include/asm-x86_64/pgtable.h 2007-06-18 09:43:22.000000000 +0200
+++ linux-2.6-patched/include/asm-x86_64/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -290,13 +290,6 @@ static inline pte_t pte_clrhuge(pte_t pt
struct vm_area_struct;
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- if (!pte_dirty(*ptep))
- return 0;
- return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
-}
-
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
if (!pte_young(*ptep))
@@ -433,7 +426,6 @@ extern int kern_addr_valid(unsigned long
(((o) & (1UL << (__VIRTUAL_MASK_SHIFT-1))) ? ((o) | (~__VIRTUAL_MASK)) : (o))
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
diff -urpN linux-2.6/include/asm-xtensa/pgtable.h linux-2.6-patched/include/asm-xtensa/pgtable.h
--- linux-2.6/include/asm-xtensa/pgtable.h 2006-12-11 10:25:32.000000000 +0100
+++ linux-2.6-patched/include/asm-xtensa/pgtable.h 2007-06-29 15:44:11.000000000 +0200
@@ -270,17 +270,6 @@ ptep_test_and_clear_young(struct vm_area
return 1;
}
-static inline int
-ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr,
- pte_t *ptep)
-{
- pte_t pte = *ptep;
- if (!pte_dirty(pte))
- return 0;
- update_pte(ptep, pte_mkclean(pte));
- return 1;
-}
-
static inline pte_t
ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
@@ -421,7 +410,6 @@ typedef pte_t *pte_addr_t;
#endif /* !defined (__ASSEMBLY__) */
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [patch 3/5] remove ptep_test_and_clear_dirty and ptep_clear_flush_dirty.
From: Zachary Amsden @ 2007-07-03 1:29 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: linux-kernel, linux-mm, Hugh Dickins
Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
>
> Nobody is using ptep_test_and_clear_dirty and ptep_clear_flush_dirty.
> Remove the functions from all architectures.
>
>
> -static inline int
> -ptep_test_and_clear_dirty (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> -{
> -#ifdef CONFIG_SMP
> - if (!pte_dirty(*ptep))
> - return 0;
> - return test_and_clear_bit(_PAGE_D_BIT, ptep);
> -#else
> - pte_t pte = *ptep;
> - if (!pte_dirty(pte))
> - return 0;
> - set_pte_at(vma->vm_mm, addr, ptep, pte_mkclean(pte));
> - return 1;
> -#endif
> -}
I've not followed all the changes lately - what is the current protocol
for clearing the dirty bit? Is it simply pte_clear followed by set or is it
not done at all? At least for i386 and virtualization, we had several
optimizations to the test_and_clear path that are not possible with a
pte_clear / set_pte approach.
Zach
* Re: [patch 3/5] remove ptep_test_and_clear_dirty and ptep_clear_flush_dirty.
From: Martin Schwidefsky @ 2007-07-03 7:26 UTC (permalink / raw)
To: Zachary Amsden; +Cc: linux-kernel, linux-mm, Hugh Dickins
On Mon, 2007-07-02 at 18:29 -0700, Zachary Amsden wrote:
> > -static inline int
> > -ptep_test_and_clear_dirty (struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
> > -{
> > -#ifdef CONFIG_SMP
> > - if (!pte_dirty(*ptep))
> > - return 0;
> > - return test_and_clear_bit(_PAGE_D_BIT, ptep);
> > -#else
> > - pte_t pte = *ptep;
> > - if (!pte_dirty(pte))
> > - return 0;
> > - set_pte_at(vma->vm_mm, addr, ptep, pte_mkclean(pte));
> > - return 1;
> > -#endif
> > -}
>
> I've not followed all the changes lately - what is the current protocol
> for clearing the dirty bit? Is it simply pte_clear followed by set or is it
> not done at all? At least for i386 and virtualization, we had several
> optimizations to the test_and_clear path that are not possible with a
> pte_clear / set_pte approach.
Imho with a sequence of ptep_get_and_clear, pte_wrprotect, set_pte_at.
One of the reasons why ptep_test_and_clear_dirty doesn't make sense
anymore is the shared dirty page tracking. You never just test and clear
the dirty bit, the latest code always sets the write protect bit as
well.
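(A minimal sketch of that sequence, close to what mm/rmap.c does when cleaning a pte; the variable names are illustrative, not a drop-in replacement:)

	pte_t entry;

	entry = ptep_get_and_clear(mm, address, ptep);	/* atomically read and clear the pte */
	entry = pte_wrprotect(entry);			/* the next write must fault again */
	entry = pte_mkclean(entry);			/* drop the dirty bit, if the arch keeps one */
	set_pte_at(mm, address, ptep, entry);
	flush_tlb_page(vma, address);			/* make the write protection visible */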
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* [patch 4/5] move mm_struct and vm_area_struct.
From: Martin Schwidefsky @ 2007-06-29 13:55 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: Martin Schwidefsky
[-- Attachment #1: 005-mm-types.diff --]
[-- Type: text/plain, Size: 13175 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Move the definitions of struct mm_struct and struct vm_area_struct
to include/linux/mm_types.h. This allows more functions in
asm/pgtable.h and friends to be defined with inline assemblies instead of macros.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
include/linux/mm.h | 63 --------------------
include/linux/mm_types.h | 143 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 74 ------------------------
3 files changed, 144 insertions(+), 136 deletions(-)
diff -urpN linux-2.6/include/linux/mm.h linux-2.6-patched/include/linux/mm.h
--- linux-2.6/include/linux/mm.h 2007-06-29 15:44:08.000000000 +0200
+++ linux-2.6-patched/include/linux/mm.h 2007-06-29 15:44:12.000000000 +0200
@@ -51,69 +51,6 @@ extern int sysctl_legacy_va_layout;
* mmap() functions).
*/
-/*
- * This struct defines a memory VMM memory area. There is one of these
- * per VM-area/task. A VM area is any part of the process virtual memory
- * space that has a special rule for the page-fault handlers (ie a shared
- * library, the executable area etc).
- */
-struct vm_area_struct {
- struct mm_struct * vm_mm; /* The address space we belong to. */
- unsigned long vm_start; /* Our start address within vm_mm. */
- unsigned long vm_end; /* The first byte after our end address
- within vm_mm. */
-
- /* linked list of VM areas per task, sorted by address */
- struct vm_area_struct *vm_next;
-
- pgprot_t vm_page_prot; /* Access permissions of this VMA. */
- unsigned long vm_flags; /* Flags, listed below. */
-
- struct rb_node vm_rb;
-
- /*
- * For areas with an address space and backing store,
- * linkage into the address_space->i_mmap prio tree, or
- * linkage to the list of like vmas hanging off its node, or
- * linkage of vma in the address_space->i_mmap_nonlinear list.
- */
- union {
- struct {
- struct list_head list;
- void *parent; /* aligns with prio_tree_node parent */
- struct vm_area_struct *head;
- } vm_set;
-
- struct raw_prio_tree_node prio_tree_node;
- } shared;
-
- /*
- * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
- * list, after a COW of one of the file pages. A MAP_SHARED vma
- * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
- * or brk vma (with NULL file) can only be in an anon_vma list.
- */
- struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
- struct anon_vma *anon_vma; /* Serialized by page_table_lock */
-
- /* Function pointers to deal with this struct. */
- struct vm_operations_struct * vm_ops;
-
- /* Information about our backing store: */
- unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
- units, *not* PAGE_CACHE_SIZE */
- struct file * vm_file; /* File we map to (can be NULL). */
- void * vm_private_data; /* was vm_pte (shared mem) */
- unsigned long vm_truncate_count;/* truncate_count or restart_addr */
-
-#ifndef CONFIG_MMU
- atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
-#endif
-#ifdef CONFIG_NUMA
- struct mempolicy *vm_policy; /* NUMA policy for the VMA */
-#endif
-};
-
extern struct kmem_cache *vm_area_cachep;
/*
diff -urpN linux-2.6/include/linux/mm_types.h linux-2.6-patched/include/linux/mm_types.h
--- linux-2.6/include/linux/mm_types.h 2007-05-11 09:19:04.000000000 +0200
+++ linux-2.6-patched/include/linux/mm_types.h 2007-06-29 15:44:12.000000000 +0200
@@ -1,13 +1,25 @@
#ifndef _LINUX_MM_TYPES_H
#define _LINUX_MM_TYPES_H
+#include <linux/auxvec.h> /* For AT_VECTOR_SIZE */
#include <linux/types.h>
#include <linux/threads.h>
#include <linux/list.h>
#include <linux/spinlock.h>
+#include <linux/prio_tree.h>
+#include <linux/rbtree.h>
+#include <linux/rwsem.h>
+#include <linux/completion.h>
+#include <asm/mmu.h>
struct address_space;
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+typedef atomic_long_t mm_counter_t;
+#else /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
+typedef unsigned long mm_counter_t;
+#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -80,4 +92,135 @@ struct page {
#endif /* WANT_PAGE_VIRTUAL */
};
+/*
+ * This struct defines a memory VMM memory area. There is one of these
+ * per VM-area/task. A VM area is any part of the process virtual memory
+ * space that has a special rule for the page-fault handlers (ie a shared
+ * library, the executable area etc).
+ */
+struct vm_area_struct {
+ struct mm_struct * vm_mm; /* The address space we belong to. */
+ unsigned long vm_start; /* Our start address within vm_mm. */
+ unsigned long vm_end; /* The first byte after our end address
+ within vm_mm. */
+
+ /* linked list of VM areas per task, sorted by address */
+ struct vm_area_struct *vm_next;
+
+ pgprot_t vm_page_prot; /* Access permissions of this VMA. */
+ unsigned long vm_flags; /* Flags, listed below. */
+
+ struct rb_node vm_rb;
+
+ /*
+ * For areas with an address space and backing store,
+ * linkage into the address_space->i_mmap prio tree, or
+ * linkage to the list of like vmas hanging off its node, or
+ * linkage of vma in the address_space->i_mmap_nonlinear list.
+ */
+ union {
+ struct {
+ struct list_head list;
+ void *parent; /* aligns with prio_tree_node parent */
+ struct vm_area_struct *head;
+ } vm_set;
+
+ struct raw_prio_tree_node prio_tree_node;
+ } shared;
+
+ /*
+ * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
+ * list, after a COW of one of the file pages. A MAP_SHARED vma
+ * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
+ * or brk vma (with NULL file) can only be in an anon_vma list.
+ */
+ struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
+ struct anon_vma *anon_vma; /* Serialized by page_table_lock */
+
+ /* Function pointers to deal with this struct. */
+ struct vm_operations_struct * vm_ops;
+
+ /* Information about our backing store: */
+ unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
+ units, *not* PAGE_CACHE_SIZE */
+ struct file * vm_file; /* File we map to (can be NULL). */
+ void * vm_private_data; /* was vm_pte (shared mem) */
+ unsigned long vm_truncate_count;/* truncate_count or restart_addr */
+
+#ifndef CONFIG_MMU
+ atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
+#endif
+#ifdef CONFIG_NUMA
+ struct mempolicy *vm_policy; /* NUMA policy for the VMA */
+#endif
+};
+
+struct mm_struct {
+ struct vm_area_struct * mmap; /* list of VMAs */
+ struct rb_root mm_rb;
+ struct vm_area_struct * mmap_cache; /* last find_vma result */
+ unsigned long (*get_unmapped_area) (struct file *filp,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags);
+ void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
+ unsigned long mmap_base; /* base of mmap area */
+ unsigned long task_size; /* size of task vm space */
+ unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
+ unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
+ pgd_t * pgd;
+ atomic_t mm_users; /* How many users with user space? */
+ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
+ int map_count; /* number of VMAs */
+ struct rw_semaphore mmap_sem;
+ spinlock_t page_table_lock; /* Protects page tables and some counters */
+
+ struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
+ * together off init_mm.mmlist, and are protected
+ * by mmlist_lock
+ */
+
+ /* Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ mm_counter_t _file_rss;
+ mm_counter_t _anon_rss;
+
+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */
+
+ unsigned long total_vm, locked_vm, shared_vm, exec_vm;
+ unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;
+
+ unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+
+ cpumask_t cpu_vm_mask;
+
+ /* Architecture-specific MM context */
+ mm_context_t context;
+
+ /* Swap token stuff */
+ /*
+ * Last value of global fault stamp as seen by this process.
+ * In other words, this value gives an indication of how long
+ * it has been since this task got the token.
+ * Look at mm/thrash.c
+ */
+ unsigned int faultstamp;
+ unsigned int token_priority;
+ unsigned int last_interval;
+
+ unsigned char dumpable:2;
+
+ /* coredumping support */
+ int core_waiters;
+ struct completion *core_startup_done, core_done;
+
+ /* aio bits */
+ rwlock_t ioctx_list_lock;
+ struct kioctx *ioctx_list;
+};
+
#endif /* _LINUX_MM_TYPES_H */
diff -urpN linux-2.6/include/linux/sched.h linux-2.6-patched/include/linux/sched.h
--- linux-2.6/include/linux/sched.h 2007-06-09 12:24:04.000000000 +0200
+++ linux-2.6-patched/include/linux/sched.h 2007-06-29 15:44:12.000000000 +0200
@@ -1,8 +1,6 @@
#ifndef _LINUX_SCHED_H
#define _LINUX_SCHED_H
-#include <linux/auxvec.h> /* For AT_VECTOR_SIZE */
-
/*
* cloning flags:
*/
@@ -54,12 +52,12 @@ struct sched_param {
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/nodemask.h>
+#include <linux/mm_types.h>
#include <asm/system.h>
#include <asm/semaphore.h>
#include <asm/page.h>
#include <asm/ptrace.h>
-#include <asm/mmu.h>
#include <asm/cputime.h>
#include <linux/smp.h>
@@ -292,7 +290,6 @@ extern void arch_unmap_area_topdown(stru
#define add_mm_counter(mm, member, value) atomic_long_add(value, &(mm)->_##member)
#define inc_mm_counter(mm, member) atomic_long_inc(&(mm)->_##member)
#define dec_mm_counter(mm, member) atomic_long_dec(&(mm)->_##member)
-typedef atomic_long_t mm_counter_t;
#else /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
/*
@@ -304,7 +301,6 @@ typedef atomic_long_t mm_counter_t;
#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
#define inc_mm_counter(mm, member) (mm)->_##member++
#define dec_mm_counter(mm, member) (mm)->_##member--
-typedef unsigned long mm_counter_t;
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
@@ -320,74 +316,6 @@ typedef unsigned long mm_counter_t;
(mm)->hiwater_vm = (mm)->total_vm; \
} while (0)
-struct mm_struct {
- struct vm_area_struct * mmap; /* list of VMAs */
- struct rb_root mm_rb;
- struct vm_area_struct * mmap_cache; /* last find_vma result */
- unsigned long (*get_unmapped_area) (struct file *filp,
- unsigned long addr, unsigned long len,
- unsigned long pgoff, unsigned long flags);
- void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
- unsigned long mmap_base; /* base of mmap area */
- unsigned long task_size; /* size of task vm space */
- unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
- unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
- pgd_t * pgd;
- atomic_t mm_users; /* How many users with user space? */
- atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
- int map_count; /* number of VMAs */
- struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables and some counters */
-
- struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
- * together off init_mm.mmlist, and are protected
- * by mmlist_lock
- */
-
- /* Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- mm_counter_t _file_rss;
- mm_counter_t _anon_rss;
-
- unsigned long hiwater_rss; /* High-watermark of RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
-
- unsigned long total_vm, locked_vm, shared_vm, exec_vm;
- unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
-
- unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
-
- cpumask_t cpu_vm_mask;
-
- /* Architecture-specific MM context */
- mm_context_t context;
-
- /* Swap token stuff */
- /*
- * Last value of global fault stamp as seen by this process.
- * In other words, this value gives an indication of how long
- * it has been since this task got the token.
- * Look at mm/thrash.c
- */
- unsigned int faultstamp;
- unsigned int token_priority;
- unsigned int last_interval;
-
- unsigned char dumpable:2;
-
- /* coredumping support */
- int core_waiters;
- struct completion *core_startup_done, core_done;
-
- /* aio bits */
- rwlock_t ioctx_list_lock;
- struct kioctx *ioctx_list;
-};
-
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* [patch 5/5] Optimize page_mkclean_one
From: Martin Schwidefsky @ 2007-06-29 13:55 UTC (permalink / raw)
To: linux-kernel, linux-mm; +Cc: Martin Schwidefsky
[-- Attachment #1: 006-page-mkclean.diff --]
[-- Type: text/plain, Size: 1156 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
page_mkclean_one is used to clear the dirty bit and to set the write
protect bit of a pte. In addition it returns true if the pte either
has been dirty or if it has been writable. As far as I can see the
function should return true only if the pte has been dirty, or page
writeback will needlessly write a clean page.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
mm/rmap.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletion(-)
diff -urpN linux-2.6/mm/rmap.c linux-2.6-patched/mm/rmap.c
--- linux-2.6/mm/rmap.c 2007-06-29 09:58:33.000000000 +0200
+++ linux-2.6-patched/mm/rmap.c 2007-06-29 15:44:58.000000000 +0200
@@ -433,11 +433,12 @@ static int page_mkclean_one(struct page
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
+ if (pte_dirty(entry))
+ ret = 1;
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
lazy_mmu_prot_update(entry);
- ret = 1;
}
pte_unmap_unlock(pte, ptl);
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [patch 5/5] Optimize page_mkclean_one
From: Hugh Dickins @ 2007-06-30 14:04 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: Peter Zijlstra, linux-kernel, linux-mm
On Fri, 29 Jun 2007, Martin Schwidefsky wrote:
> On Fri, 2007-06-29 at 19:56 +0100, Hugh Dickins wrote:
> > I don't dare comment on your page_mkclean_one patch (5/5),
> > that dirty page business has grown too subtle for me.
>
> Oh yes, the dirty handling is tricky. I had to fix a really nasty bug
> with it lately. As for page_mkclean_one the difference is that it
> doesn't claim a page is dirty if only the write protect bit has not been
> set. If we manage to lose dirty bits from ptes and have to rely on the
> write protect bit to take over the job, then we have a different problem
> altogether, no?
[Moving that over from 1/5 discussion].
Expect you're right, but I _really_ don't want to comment, when I don't
understand that "|| pte_write" in the first place, and don't know the
consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.
Peter?
My suspicion is that the "|| pte_write" is precisely to cover your
s390 case where pte is never dirty (it may even have been me who got
Peter to put it in for that reason). In which case your patch would
be fine - though I think it'd be improved a lot by a comment or
rearrangement or new macro in place of the pte_dirty || pte_write
line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
and use that - its former use in msync has gone away now).
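(The pte_maybe_dirty Hugh mentions is defined in asm-generic/pgtable.h along these lines — the exact guard macro has changed over time, so take this as a sketch: architectures with a hardware dirty bit fall back to pte_dirty(), architectures like s390 that track dirtiness elsewhere must assume "maybe dirty":)

#ifndef __HAVE_ARCH_PAGE_TEST_DIRTY
#define pte_maybe_dirty(pte)		pte_dirty(pte)
#else
#define pte_maybe_dirty(pte)		(1)
#endif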
Hugh
On Fri, 29 Jun 2007, Martin Schwidefsky wrote:
> page_mkclean_one is used to clear the dirty bit and to set the write
> protect bit of a pte. In addition it returns true if the pte either
> has been dirty or if it has been writable. As far as I can see the
> function should return true only if the pte has been dirty, or page
> writeback will needlessly write a clean page.
>
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>
> mm/rmap.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletion(-)
>
> diff -urpN linux-2.6/mm/rmap.c linux-2.6-patched/mm/rmap.c
> --- linux-2.6/mm/rmap.c 2007-06-29 09:58:33.000000000 +0200
> +++ linux-2.6-patched/mm/rmap.c 2007-06-29 15:44:58.000000000 +0200
> @@ -433,11 +433,12 @@ static int page_mkclean_one(struct page
>
> flush_cache_page(vma, address, pte_pfn(*pte));
> entry = ptep_clear_flush(vma, address, pte);
> + if (pte_dirty(entry))
> + ret = 1;
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
> set_pte_at(mm, address, pte, entry);
> lazy_mmu_prot_update(entry);
> - ret = 1;
> }
>
> pte_unmap_unlock(pte, ptl);
* Re: [patch 5/5] Optimize page_mkclean_one
From: Martin Schwidefsky @ 2007-07-01 7:15 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Peter Zijlstra, linux-kernel, linux-mm
On Sat, 2007-06-30 at 15:04 +0100, Hugh Dickins wrote:
> > Oh yes, the dirty handling is tricky. I had to fix a really nasty bug
> > with it lately. As for page_mkclean_one the difference is that it
> > doesn't claim a page is dirty if only the write protect bit has not been
> > set. If we manage to lose dirty bits from ptes and have to rely on the
> > write protect bit to take over the job, then we have a different problem
> > altogether, no ?
>
> [Moving that over from 1/5 discussion].
>
> Expect you're right, but I _really_ don't want to comment, when I don't
> understand that "|| pte_write" in the first place, and don't know the
> consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.
The pte_write() part is for the shared dirty page tracking. If you want
to make sure that a max of x% of your pages are dirty then you cannot
allow more than x% to be writable. That's why page_mkclean_one
clears the dirty bit and makes the page read-only.
> My suspicion is that the "|| pte_write" is precisely to cover your
> s390 case where pte is never dirty (it may even have been me who got
> Peter to put it in for that reason). In which case your patch would
> be fine - though I think it'd be improved a lot by a comment or
> rearrangement or new macro in place of the pte_dirty || pte_write
> line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
> and use that - its former use in msync has gone away now).
No, s390 is covered by the page_test_dirty / page_clear_dirty pair in
page_mkclean.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
* Re: [patch 5/5] Optimize page_mkclean_one
From: Hugh Dickins @ 2007-07-01 8:54 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: Peter Zijlstra, linux-kernel, linux-mm
On Sun, 1 Jul 2007, Martin Schwidefsky wrote:
> >
> > Expect you're right, but I _really_ don't want to comment, when I don't
> > understand that "|| pte_write" in the first place, and don't know the
> > consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.
>
> The pte_write() part is for the shared dirty page tracking. If you want
> to make sure that a max of x% of your pages are dirty then you cannot
> allow more than x% to be writable. That's why page_mkclean_one
> clears the dirty bit and makes the page read-only.
The whole of page_mkclean_one is for the dirty page tracking: so it's
obvious why it tests pte_dirty, but not obvious why it tests pte_write.
>
> > My suspicion is that the "|| pte_write" is precisely to cover your
> > s390 case where pte is never dirty (it may even have been me who got
> > Peter to put it in for that reason). In which case your patch would
> > be fine - though I think it'd be improved a lot by a comment or
> > rearrangement or new macro in place of the pte_dirty || pte_write
> > line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
> > and use that - its former use in msync has gone away now).
>
> No, s390 is covered by the page_test_dirty / page_clear_dirty pair in
> page_mkclean.
That's where its dirty page count comes from, yes: but since the s390
pte_dirty just says no, if page_mkclean_one tested only pte_dirty,
then it wouldn't do anything on s390, and in particular wouldn't
write protect the ptes to re-enforce dirty counting from then on.
So in answering your denials, I grow more confident that the pte_write
test is precisely for the s390 case. Though it might also be to cover
some defect in the write-protection scheme on other arches.
Come to think of it, would your patch really make any difference?
Although page_mkclean's "count" of dirty ptes on s390 will be nonsense,
that count would anyway be unknown, and it's only used as a boolean;
and now I don't think your patch changes the boolean value - if any
pte is found writable (and if the scheme is working) that implies
that the page was written to, and so should give the same answer
as the page_test_dirty.
But I could easily be overlooking something: Peter will recall.
Hugh
* Re: [patch 5/5] Optimize page_mkclean_one
From: Peter Zijlstra @ 2007-07-01 13:27 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Martin Schwidefsky, linux-kernel, linux-mm
On Sun, 2007-07-01 at 09:54 +0100, Hugh Dickins wrote:
> But I could easily be overlooking something: Peter will recall.
/me tries to get his brain up to speed after the OLS closing party :-)
I did both pte_dirty and pte_write because I was extra careful. One
_should_ imply the other, but since we'll be clearing both, I thought it
prudent to also check both.
I will have to think on this a little more, but I'm currently of the
opinion that the optimisation is not correct. But I'll have a thorough
look at s390 again when I get home.
* Re: [patch 5/5] Optimize page_mkclean_one
From: Martin Schwidefsky @ 2007-07-02 7:07 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Hugh Dickins, linux-kernel, linux-mm
On Sun, 2007-07-01 at 15:27 +0200, Peter Zijlstra wrote:
> > But I could easily be overlooking something: Peter will recall.
>
> /me tries to get his brain up to speed after the OLS closing party :-)
Oh-oh, the Black Thorn party :-)
> I did both pte_dirty and pte_write because I was extra careful. One
> _should_ imply the other, but since we'll be clearing both, I thought it
> prudent to also check both.
Just ran a little experiment: I've added a simple WARN_ON(ret == 0) to
page_mkclean after the page_test_dirty() check to see if there are cases
where the page is dirty and all ptes are read-only. A little stress run
including massive swap did not print a single warning.
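The hack itself was nothing more than this, sitting in page_mkclean
(shown only to make clear where the check went, not meant for merging;
assuming the page_test_dirty block discussed earlier in the thread):

	if (page_test_dirty(page)) {
		page_clear_dirty(page);
		/* storage key dirty although no pte was dirty or writable? */
		WARN_ON(ret == 0);
		ret = 1;
	}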
> I will have to think on this a little more, but I'm currently of the
> opinion that the optimisation is not correct. But I'll have a thorough
> look at s390 again when I get home.
I think the patch is correct, although I am beginning to doubt that it
has any effect.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch 5/5] Optimize page_mkclean_one
2007-07-01 8:54 ` Hugh Dickins
2007-07-01 13:27 ` Peter Zijlstra
@ 2007-07-01 19:50 ` Martin Schwidefsky
1 sibling, 0 replies; 19+ messages in thread
From: Martin Schwidefsky @ 2007-07-01 19:50 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Peter Zijlstra, linux-kernel, linux-mm
On Sun, 2007-07-01 at 09:54 +0100, Hugh Dickins wrote:
> On Sun, 1 Jul 2007, Martin Schwidefsky wrote:
> > >
> > > Expect you're right, but I _really_ don't want to comment, when I don't
> > > understand that "|| pte_write" in the first place, and don't know the
> > > consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.
> >
> > The pte_write() part is for the shared dirty page tracking. If you want
> > to make sure that a max of x% of your pages are dirty then you cannot
> > allow more than x% to be writable. That's why page_mkclean_one
> > clears the dirty bit and makes the page read-only.
>
> The whole of page_mkclean_one is for the dirty page tracking: so it's
> obvious why it tests pte_dirty, but not obvious why it tests pte_write.
Yes, the pte_write call is needed for shared dirty page tracking. In
particular it's needed for s390, but for corner cases where a page is
writable yet not dirty it might be needed for other architectures as
well.
> > > My suspicion is that the "|| pte_write" is precisely to cover your
> > > s390 case where pte is never dirty (it may even have been me who got
> > > Peter to put it in for that reason). In which case your patch would
> > > be fine - though I think it'd be improved a lot by a comment or
> > > rearrangement or new macro in place of the pte_dirty || pte_write
> > > line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
> > > and use that - its former use in msync has gone away now).
> >
> > No, s390 is covered by the page_test_dirty / page_clear_dirty pair in
> > page_mkclean.
>
> That's where its dirty page count comes from, yes: but since the s390
> pte_dirty just says no, if page_mkclean_one tested only pte_dirty,
> then it wouldn't do anything on s390, and in particular wouldn't
> write protect the ptes to re-enforce dirty counting from then on.
Yes, I definitely agree that the pte_write check is required for s390.
> So in answering your denials, I grow more confident that the pte_write
> test is precisely for the s390 case. Though it might also be to cover
> some defect in the write-protection scheme on other arches.
Well, here I'm not so sure. You need the implication
pte_write() == true -> pte_dirty() == true
to be able to skip the pte_write check for architectures that keep their
dirty bits in the pte. Is this really true for all corner-cases?
> Come to think of it, would your patch really make any difference?
> Although page_mkclean's "count" of dirty ptes on s390 will be nonsense,
> that count would anyway be unknown, and it's only used as a boolean;
> and now I don't think your patch changes the boolean value - if any
> pte is found writable (and if the scheme is working) that implies
> that the page was written to, and so should give the same answer
> as the page_test_dirty.
Whether the patch makes a difference or not depends on code outside of
page_mkclean_one. The additional check for pte_dirty will make the
function less subtle, as it will no longer depend on code outside of it.
With the additional check for pte_dirty the function does the following:
1) make the pte clean, 2) make the pte read-only, 3) return true if the
pte has been marked dirty.
Without the check the function does 1), 2) as above and 3) return true
if the pte has been marked dirty or has been writable.
I find it easier to understand the semantics of the function with the
additional check. But that is only me ..
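Written out as code the difference is only in what feeds the return
value; with the additional check the body of page_mkclean_one becomes
roughly (a sketch only, matching the numbered steps above):

	if (pte_dirty(*pte) || pte_write(*pte)) {
		pte_t entry;

		flush_cache_page(vma, address, pte_pfn(*pte));
		entry = ptep_clear_flush(vma, address, pte);
		if (pte_dirty(entry))
			ret = 1;		/* 3) really dirty only */
		entry = pte_mkclean(entry);	/* 1) make the pte clean */
		entry = pte_wrprotect(entry);	/* 2) make it read-only */
		set_pte_at(mm, address, pte, entry);
	}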
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [patch 5/5] Optimize page_mkclean_one
2007-06-29 13:55 ` [patch 5/5] Optimize page_mkclean_one Martin Schwidefsky
2007-06-30 14:04 ` Hugh Dickins
@ 2007-07-01 10:29 ` Miklos Szeredi
1 sibling, 0 replies; 19+ messages in thread
From: Miklos Szeredi @ 2007-07-01 10:29 UTC (permalink / raw)
To: schwidefsky; +Cc: linux-kernel, linux-mm, schwidefsky
> page_mkclean_one is used to clear the dirty bit and to set the write
> protect bit of a pte. In addition it returns true if the pte either
> has been dirty or if it has been writable. As far as I can see the
> function should return true only if the pte has been dirty, or page
> writeback will needlessly write a clean page.
There are some weird cases, like for example get_user_pages(), when
the pte takes a write fault and the page is modified, but the pte
doesn't become dirty, because the page is written through the kernel
mapping.
In the get_user_pages() case the page itself is dirtied, so your patch
probably doesn't break that. But I'm not sure whether there aren't other
similar cases that the pte_write() check is taking care of.
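To illustrate the kind of sequence I mean (names like addr, data and len
are made up, this is just the usual pattern, not code from any particular
caller):

	struct page *page;
	void *kaddr;

	/* fault the user page in for writing: this makes the pte writable */
	get_user_pages(current, current->mm, addr, 1, 1, 0, &page, NULL);

	/* the store itself goes through the kernel mapping, so it cannot
	 * set the dirty bit in the user pte */
	kaddr = kmap(page);
	memcpy(kaddr, data, len);
	kunmap(page);

	/* instead the caller dirties the struct page explicitly */
	set_page_dirty(page);
	page_cache_release(page);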
And anyway if the dirty page tracking works correctly, your patch
won't optimize anything, since the pte will _only_ become writable if
the page was dirtied.
So in fact normally pte_dirty() and pte_write() should be equivalent,
except for some weird cases.
Miklos
^ permalink raw reply [flat|nested] 19+ messages in thread
* [patch 4/5] move mm_struct and vm_area_struct.
2007-07-03 11:18 [patch 0/5] some mm improvements + s390 tlb flush Martin Schwidefsky
@ 2007-07-03 11:18 ` Martin Schwidefsky
0 siblings, 0 replies; 19+ messages in thread
From: Martin Schwidefsky @ 2007-07-03 11:18 UTC (permalink / raw)
To: akpm, hugh, peterz; +Cc: linux-kernel, linux-mm, Martin Schwidefsky
[-- Attachment #1: 004-mm-types.diff --]
[-- Type: text/plain, Size: 13254 bytes --]
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Move the definitions of struct mm_struct and struct vm_area_struct
to include/linux/mm_types.h. This allows more functions in
asm/pgtable.h and friends to be defined with inline assembly instead of
as macros.
Compile tested on i386, ia64, powerpc, powerpc64, s390-32, s390-64
and x86_64.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
include/linux/mm.h | 63 --------------------
include/linux/mm_types.h | 143 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 74 ------------------------
3 files changed, 144 insertions(+), 136 deletions(-)
diff -urpN linux-2.6/include/linux/mm.h linux-2.6-patched/include/linux/mm.h
--- linux-2.6/include/linux/mm.h 2007-06-22 14:11:55.000000000 +0200
+++ linux-2.6-patched/include/linux/mm.h 2007-07-03 12:56:50.000000000 +0200
@@ -51,69 +51,6 @@ extern int sysctl_legacy_va_layout;
* mmap() functions).
*/
-/*
- * This struct defines a memory VMM memory area. There is one of these
- * per VM-area/task. A VM area is any part of the process virtual memory
- * space that has a special rule for the page-fault handlers (ie a shared
- * library, the executable area etc).
- */
-struct vm_area_struct {
- struct mm_struct * vm_mm; /* The address space we belong to. */
- unsigned long vm_start; /* Our start address within vm_mm. */
- unsigned long vm_end; /* The first byte after our end address
- within vm_mm. */
-
- /* linked list of VM areas per task, sorted by address */
- struct vm_area_struct *vm_next;
-
- pgprot_t vm_page_prot; /* Access permissions of this VMA. */
- unsigned long vm_flags; /* Flags, listed below. */
-
- struct rb_node vm_rb;
-
- /*
- * For areas with an address space and backing store,
- * linkage into the address_space->i_mmap prio tree, or
- * linkage to the list of like vmas hanging off its node, or
- * linkage of vma in the address_space->i_mmap_nonlinear list.
- */
- union {
- struct {
- struct list_head list;
- void *parent; /* aligns with prio_tree_node parent */
- struct vm_area_struct *head;
- } vm_set;
-
- struct raw_prio_tree_node prio_tree_node;
- } shared;
-
- /*
- * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
- * list, after a COW of one of the file pages. A MAP_SHARED vma
- * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
- * or brk vma (with NULL file) can only be in an anon_vma list.
- */
- struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
- struct anon_vma *anon_vma; /* Serialized by page_table_lock */
-
- /* Function pointers to deal with this struct. */
- struct vm_operations_struct * vm_ops;
-
- /* Information about our backing store: */
- unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
- units, *not* PAGE_CACHE_SIZE */
- struct file * vm_file; /* File we map to (can be NULL). */
- void * vm_private_data; /* was vm_pte (shared mem) */
- unsigned long vm_truncate_count;/* truncate_count or restart_addr */
-
-#ifndef CONFIG_MMU
- atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
-#endif
-#ifdef CONFIG_NUMA
- struct mempolicy *vm_policy; /* NUMA policy for the VMA */
-#endif
-};
-
extern struct kmem_cache *vm_area_cachep;
/*
diff -urpN linux-2.6/include/linux/mm_types.h linux-2.6-patched/include/linux/mm_types.h
--- linux-2.6/include/linux/mm_types.h 2007-05-11 09:19:04.000000000 +0200
+++ linux-2.6-patched/include/linux/mm_types.h 2007-07-03 12:56:50.000000000 +0200
@@ -1,13 +1,25 @@
#ifndef _LINUX_MM_TYPES_H
#define _LINUX_MM_TYPES_H
+#include <linux/auxvec.h> /* For AT_VECTOR_SIZE */
#include <linux/types.h>
#include <linux/threads.h>
#include <linux/list.h>
#include <linux/spinlock.h>
+#include <linux/prio_tree.h>
+#include <linux/rbtree.h>
+#include <linux/rwsem.h>
+#include <linux/completion.h>
+#include <asm/mmu.h>
struct address_space;
+#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
+typedef atomic_long_t mm_counter_t;
+#else /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
+typedef unsigned long mm_counter_t;
+#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -80,4 +92,135 @@ struct page {
#endif /* WANT_PAGE_VIRTUAL */
};
+/*
+ * This struct defines a memory VMM memory area. There is one of these
+ * per VM-area/task. A VM area is any part of the process virtual memory
+ * space that has a special rule for the page-fault handlers (ie a shared
+ * library, the executable area etc).
+ */
+struct vm_area_struct {
+ struct mm_struct * vm_mm; /* The address space we belong to. */
+ unsigned long vm_start; /* Our start address within vm_mm. */
+ unsigned long vm_end; /* The first byte after our end address
+ within vm_mm. */
+
+ /* linked list of VM areas per task, sorted by address */
+ struct vm_area_struct *vm_next;
+
+ pgprot_t vm_page_prot; /* Access permissions of this VMA. */
+ unsigned long vm_flags; /* Flags, listed below. */
+
+ struct rb_node vm_rb;
+
+ /*
+ * For areas with an address space and backing store,
+ * linkage into the address_space->i_mmap prio tree, or
+ * linkage to the list of like vmas hanging off its node, or
+ * linkage of vma in the address_space->i_mmap_nonlinear list.
+ */
+ union {
+ struct {
+ struct list_head list;
+ void *parent; /* aligns with prio_tree_node parent */
+ struct vm_area_struct *head;
+ } vm_set;
+
+ struct raw_prio_tree_node prio_tree_node;
+ } shared;
+
+ /*
+ * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
+ * list, after a COW of one of the file pages. A MAP_SHARED vma
+ * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
+ * or brk vma (with NULL file) can only be in an anon_vma list.
+ */
+ struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
+ struct anon_vma *anon_vma; /* Serialized by page_table_lock */
+
+ /* Function pointers to deal with this struct. */
+ struct vm_operations_struct * vm_ops;
+
+ /* Information about our backing store: */
+ unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
+ units, *not* PAGE_CACHE_SIZE */
+ struct file * vm_file; /* File we map to (can be NULL). */
+ void * vm_private_data; /* was vm_pte (shared mem) */
+ unsigned long vm_truncate_count;/* truncate_count or restart_addr */
+
+#ifndef CONFIG_MMU
+ atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */
+#endif
+#ifdef CONFIG_NUMA
+ struct mempolicy *vm_policy; /* NUMA policy for the VMA */
+#endif
+};
+
+struct mm_struct {
+ struct vm_area_struct * mmap; /* list of VMAs */
+ struct rb_root mm_rb;
+ struct vm_area_struct * mmap_cache; /* last find_vma result */
+ unsigned long (*get_unmapped_area) (struct file *filp,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags);
+ void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
+ unsigned long mmap_base; /* base of mmap area */
+ unsigned long task_size; /* size of task vm space */
+ unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
+ unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
+ pgd_t * pgd;
+ atomic_t mm_users; /* How many users with user space? */
+ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
+ int map_count; /* number of VMAs */
+ struct rw_semaphore mmap_sem;
+ spinlock_t page_table_lock; /* Protects page tables and some counters */
+
+ struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
+ * together off init_mm.mmlist, and are protected
+ * by mmlist_lock
+ */
+
+ /* Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ mm_counter_t _file_rss;
+ mm_counter_t _anon_rss;
+
+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */
+
+ unsigned long total_vm, locked_vm, shared_vm, exec_vm;
+ unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;
+
+ unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+
+ cpumask_t cpu_vm_mask;
+
+ /* Architecture-specific MM context */
+ mm_context_t context;
+
+ /* Swap token stuff */
+ /*
+ * Last value of global fault stamp as seen by this process.
+ * In other words, this value gives an indication of how long
+ * it has been since this task got the token.
+ * Look at mm/thrash.c
+ */
+ unsigned int faultstamp;
+ unsigned int token_priority;
+ unsigned int last_interval;
+
+ unsigned char dumpable:2;
+
+ /* coredumping support */
+ int core_waiters;
+ struct completion *core_startup_done, core_done;
+
+ /* aio bits */
+ rwlock_t ioctx_list_lock;
+ struct kioctx *ioctx_list;
+};
+
#endif /* _LINUX_MM_TYPES_H */
diff -urpN linux-2.6/include/linux/sched.h linux-2.6-patched/include/linux/sched.h
--- linux-2.6/include/linux/sched.h 2007-06-09 12:24:04.000000000 +0200
+++ linux-2.6-patched/include/linux/sched.h 2007-07-03 12:56:50.000000000 +0200
@@ -1,8 +1,6 @@
#ifndef _LINUX_SCHED_H
#define _LINUX_SCHED_H
-#include <linux/auxvec.h> /* For AT_VECTOR_SIZE */
-
/*
* cloning flags:
*/
@@ -54,12 +52,12 @@ struct sched_param {
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/nodemask.h>
+#include <linux/mm_types.h>
#include <asm/system.h>
#include <asm/semaphore.h>
#include <asm/page.h>
#include <asm/ptrace.h>
-#include <asm/mmu.h>
#include <asm/cputime.h>
#include <linux/smp.h>
@@ -292,7 +290,6 @@ extern void arch_unmap_area_topdown(stru
#define add_mm_counter(mm, member, value) atomic_long_add(value, &(mm)->_##member)
#define inc_mm_counter(mm, member) atomic_long_inc(&(mm)->_##member)
#define dec_mm_counter(mm, member) atomic_long_dec(&(mm)->_##member)
-typedef atomic_long_t mm_counter_t;
#else /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
/*
@@ -304,7 +301,6 @@ typedef atomic_long_t mm_counter_t;
#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
#define inc_mm_counter(mm, member) (mm)->_##member++
#define dec_mm_counter(mm, member) (mm)->_##member--
-typedef unsigned long mm_counter_t;
#endif /* NR_CPUS < CONFIG_SPLIT_PTLOCK_CPUS */
@@ -320,74 +316,6 @@ typedef unsigned long mm_counter_t;
(mm)->hiwater_vm = (mm)->total_vm; \
} while (0)
-struct mm_struct {
- struct vm_area_struct * mmap; /* list of VMAs */
- struct rb_root mm_rb;
- struct vm_area_struct * mmap_cache; /* last find_vma result */
- unsigned long (*get_unmapped_area) (struct file *filp,
- unsigned long addr, unsigned long len,
- unsigned long pgoff, unsigned long flags);
- void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
- unsigned long mmap_base; /* base of mmap area */
- unsigned long task_size; /* size of task vm space */
- unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
- unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
- pgd_t * pgd;
- atomic_t mm_users; /* How many users with user space? */
- atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
- int map_count; /* number of VMAs */
- struct rw_semaphore mmap_sem;
- spinlock_t page_table_lock; /* Protects page tables and some counters */
-
- struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
- * together off init_mm.mmlist, and are protected
- * by mmlist_lock
- */
-
- /* Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- mm_counter_t _file_rss;
- mm_counter_t _anon_rss;
-
- unsigned long hiwater_rss; /* High-watermark of RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
-
- unsigned long total_vm, locked_vm, shared_vm, exec_vm;
- unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
-
- unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
-
- cpumask_t cpu_vm_mask;
-
- /* Architecture-specific MM context */
- mm_context_t context;
-
- /* Swap token stuff */
- /*
- * Last value of global fault stamp as seen by this process.
- * In other words, this value gives an indication of how long
- * it has been since this task got the token.
- * Look at mm/thrash.c
- */
- unsigned int faultstamp;
- unsigned int token_priority;
- unsigned int last_interval;
-
- unsigned char dumpable:2;
-
- /* coredumping support */
- int core_waiters;
- struct completion *core_startup_done, core_done;
-
- /* aio bits */
- rwlock_t ioctx_list_lock;
- struct kioctx *ioctx_list;
-};
-
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
^ permalink raw reply [flat|nested] 19+ messages in thread