LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
@ 2007-03-26  6:37 David Rientjes
  2007-03-26  9:08 ` Andrew Morton
  2007-03-26 20:24 ` Zachary Amsden
  0 siblings, 2 replies; 14+ messages in thread
From: David Rientjes @ 2007-03-26  6:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, Zachary Amsden, linux-kernel

	Date: Sun, 25 Mar 2007 23:07:43 -0800
	From: Zachary Amsden <zach@vmware.com>

	If you actually clear the bit, you need to:

	+         pte_update_defer(vma->vm_mm, addr, ptep);

	The reason is, when updating PTEs, the hypervisor must be notified.  Using
	atomic operations to do this is fine for all hypervisors I am aware of.
	However, for hypervisors which shadow page tables, if these PTE
	modifications are not trapped, you need a post-modification call to fulfill
	the update of the shadow page table.

Cc: Zachary Amsden <zach@vmware.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/asm-i386/pgtable.h |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -287,18 +287,24 @@ do {									\
 static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma,
 					    unsigned long addr, pte_t *ptep)
 {
-	if (!pte_dirty(*ptep))
-		return 0;
-	return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low);
+	int ret = 0;
+	if (pte_dirty(*ptep))
+		ret = test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low);
+	if (ret)
+		pte_update_defer(vma->vm_mm, addr, ptep);
+	return ret;
 }
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long addr, pte_t *ptep)
 {
-	if (!pte_young(*ptep))
-		return 0;
-	return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low);
+	int ret = 0;
+	if (pte_young(*ptep))
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low);
+	if (ret)
+		pte_update_defer(vma->vm_mm, addr, ptep);
+	return ret;
 }
 
 /*
@@ -317,10 +323,8 @@ do {									\
 ({									\
 	int __dirty;							\
 	__dirty = ptep_test_and_clear_dirty((vma), (address), (ptep));	\
-	if (__dirty) {							\
-		pte_update_defer((vma)->vm_mm, (address), (ptep));	\
+	if (__dirty)							\
 		flush_tlb_page(vma, address);				\
-	}								\
 	__dirty;							\
 })
 
@@ -329,10 +333,8 @@ do {									\
 ({									\
 	int __young;							\
 	__young = ptep_test_and_clear_young((vma), (address), (ptep));	\
-	if (__young) {							\
-		pte_update_defer((vma)->vm_mm, (address), (ptep));	\
+	if (__young)							\
 		flush_tlb_page(vma, address);				\
-	}								\
 	__young;							\
 })
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-03-26  6:37 [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young} David Rientjes
@ 2007-03-26  9:08 ` Andrew Morton
  2007-03-26 20:24 ` Zachary Amsden
  1 sibling, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2007-03-26  9:08 UTC (permalink / raw)
  To: David Rientjes; +Cc: Hugh Dickins, Zachary Amsden, linux-kernel

On Sun, 25 Mar 2007 23:37:08 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> 	Date: Sun, 25 Mar 2007 23:07:43 -0800
> 	From: Zachary Amsden <zach@vmware.com>
> 
> 	If you actually clear the bit, you need to:
> 
> 	+         pte_update_defer(vma->vm_mm, addr, ptep);
> 
> 	The reason is, when updating PTEs, the hypervisor must be notified.  Using
> 	atomic operations to do this is fine for all hypervisors I am aware of.
> 	However, for hypervisors which shadow page tables, if these PTE
> 	modifications are not trapped, you need a post-modification call to fulfill
> 	the update of the shadow page table.
> 
> Cc: Zachary Amsden <zach@vmware.com>
> Cc: Hugh Dickins <hugh@veritas.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/asm-i386/pgtable.h |   26 ++++++++++++++------------
>  1 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
> --- a/include/asm-i386/pgtable.h
> +++ b/include/asm-i386/pgtable.h
> @@ -287,18 +287,24 @@ do {									\
>  static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma,
>  					    unsigned long addr, pte_t *ptep)
>  {
> -	if (!pte_dirty(*ptep))
> -		return 0;
> -	return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low);
> +	int ret = 0;
> +	if (pte_dirty(*ptep))
> +		ret = test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low);
> +	if (ret)
> +		pte_update_defer(vma->vm_mm, addr, ptep);
> +	return ret;
>  }
>  

i386 allmodconfig:

include/asm/pgtable.h: In function 'ptep_test_and_clear_dirty':
include/asm/pgtable.h:294: error: dereferencing pointer to incomplete type
include/asm/pgtable.h: In function 'ptep_test_and_clear_young':
include/asm/pgtable.h:306: error: dereferencing pointer to incomplete type

Due to vm_area_struct.

Turning it into a stinky macro fixes it.


diff -puN include/asm-i386/pgtable.h~i386-use-pte_update_defer-in-ptep_test_and_clear_dirtyyoung-fix include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h~i386-use-pte_update_defer-in-ptep_test_and_clear_dirtyyoung-fix
+++ a/include/asm-i386/pgtable.h
@@ -284,28 +284,24 @@ do {									\
 } while (0)
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma,
-					    unsigned long addr, pte_t *ptep)
-{
-	int ret = 0;
-	if (pte_dirty(*ptep))
-		ret = test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low);
-	if (ret)
-		pte_update_defer(vma->vm_mm, addr, ptep);
-	return ret;
-}
+#define ptep_test_and_clear_dirty(vma, addr, ptep) ({			\
+	int ret = 0;							\
+	if (pte_dirty(*ptep))						\
+		ret = test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); \
+	if (ret)							\
+		pte_update_defer(vma->vm_mm, addr, ptep);		\
+	ret;								\
+})
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
-					    unsigned long addr, pte_t *ptep)
-{
-	int ret = 0;
-	if (pte_young(*ptep))
-		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low);
-	if (ret)
-		pte_update_defer(vma->vm_mm, addr, ptep);
-	return ret;
-}
+#define ptep_test_and_clear_young(vma, addr, ptep) ({			\
+	int ret = 0;							\
+	if (pte_young(*ptep))						\
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); \
+	if (ret)							\
+		pte_update_defer(vma->vm_mm, addr, ptep);		\
+	ret;								\
+})
 
 /*
  * Rules for using ptep_establish: the pte MUST be a user pte, and
_


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-03-26  6:37 [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young} David Rientjes
  2007-03-26  9:08 ` Andrew Morton
@ 2007-03-26 20:24 ` Zachary Amsden
  2007-04-13 17:54   ` Hugh Dickins
  1 sibling, 1 reply; 14+ messages in thread
From: Zachary Amsden @ 2007-03-26 20:24 UTC (permalink / raw)
  To: David Rientjes; +Cc: Andrew Morton, Hugh Dickins, linux-kernel

David Rientjes wrote:
> 	Date: Sun, 25 Mar 2007 23:07:43 -0800
> 	From: Zachary Amsden <zach@vmware.com>
>
> 	If you actually clear the bit, you need to:
>
> 	+         pte_update_defer(vma->vm_mm, addr, ptep);
>
> 	The reason is, when updating PTEs, the hypervisor must be notified.  Using
> 	atomic operations to do this is fine for all hypervisors I am aware of.
> 	However, for hypervisors which shadow page tables, if these PTE
> 	modifications are not trapped, you need a post-modification call to fulfill
> 	the update of the shadow page table.
>
> Cc: Zachary Amsden <zach@vmware.com>
> Cc: Hugh Dickins <hugh@veritas.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
>   

Acked-by: Zachary Amsden <zach@vmware.com>

David, thanks for cleaning this up.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-03-26 20:24 ` Zachary Amsden
@ 2007-04-13 17:54   ` Hugh Dickins
  2007-04-13 19:05     ` Zachary Amsden
  2007-04-13 19:27     ` Zachary Amsden
  0 siblings, 2 replies; 14+ messages in thread
From: Hugh Dickins @ 2007-04-13 17:54 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: David Rientjes, Andrew Morton, linux-kernel

On Mon, 26 Mar 2007, Zachary Amsden wrote:
> David Rientjes wrote:
> >  Date: Sun, 25 Mar 2007 23:07:43 -0800
> >  From: Zachary Amsden <zach@vmware.com>
> >
> >  If you actually clear the bit, you need to:
> >
> >  +         pte_update_defer(vma->vm_mm, addr, ptep);
> >
> >  The reason is, when updating PTEs, the hypervisor must be notified.  Using
> >  atomic operations to do this is fine for all hypervisors I am aware of.
> >  However, for hypervisors which shadow page tables, if these PTE
> >  modifications are not trapped, you need a post-modification call to fulfill
> >  the update of the shadow page table.
> >
> > Cc: Zachary Amsden <zach@vmware.com>
> > Cc: Hugh Dickins <hugh@veritas.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> >   
> 
> Acked-by: Zachary Amsden <zach@vmware.com>
> 
> David, thanks for cleaning this up.

Zach, while looking at your recent patches, I ran across the comment
on pte_update_defer, and where it was being used, and now think that
David's patch is actually incorrect.  Previously pte_update_defer
was being used where a flush_tlb_page followed immediately after
within the same macro; with David's patch, mm's clear_refs_pte_range
is calling ptep_test_and_clear_young (including pte_update_defer) on
several ptes, then unlocking the page table, and later flushing TLB.
That's exactly wrong for pte_update_defer, isn't it?

I do find the name confusing (pte_update_defer is for use only when
the TLB update is not deferred, right?): indeed, its distinction from
pte_update very prone to errors of this kind.  I'm all for deferring
work, but is there some way the right thing could be done automatically?
Or, how much would be lost if we just replaced all the pte_update_defers
by pte_updates?

I'm not proposing any patch myself: just bringing to your attention.

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-13 17:54   ` Hugh Dickins
@ 2007-04-13 19:05     ` Zachary Amsden
  2007-04-13 19:27     ` Zachary Amsden
  1 sibling, 0 replies; 14+ messages in thread
From: Zachary Amsden @ 2007-04-13 19:05 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Rientjes, Andrew Morton, linux-kernel

Hugh Dickins wrote:
> Zach, while looking at your recent patches, I ran across the comment
> on pte_update_defer, and where it was being used, and now think that
> David's patch is actually incorrect.  Previously pte_update_defer
> was being used where a flush_tlb_page followed immediately after
> within the same macro; with David's patch, mm's clear_refs_pte_range
> is calling ptep_test_and_clear_young (including pte_update_defer) on
> several ptes, then unlocking the page table, and later flushing TLB.
> That's exactly wrong for pte_update_defer, isn't it?
>   

Yes, that is correct.  But in the context of kernel usage, it is not a 
problem.  Note that ptep_test_and_clear_xxx are purely optimizing 
definitions for specific architectures; they are never called outside of 
asm-generic pgtable code, and when they are, they are called from 
ptep_clear_flush_xxx.  So the requisite flush still dominates the lazy / 
deferred / what have you pte update.  I think this is actually the major 
reason I had done the hack to begin with - defining 
__HAVE_ARCH_PTEP_TEST_AND_CLEAR_XXX without actually defining the 
functions prevented any accidental usage of pte_update_defer where the 
flush was omitted.

In correct uses which actually change PTE bits that matter to hardware, 
a flush will (almost*) always be required for hardware correctness.  
This property minimizes the potential damage that can be caused by 
misusing the pte_update_defer or decoupling it from the flush.  It does 
nothing to stop SMP races, however, as in cases where the flush happens 
outside of the page table spinlock, an SMP race is possible on the 
shadow page tables.  So it is better to keep the flush as close as 
possible to the deferred update.

(* RW -> RO,  OLD -> YOUNG, DIRTY -> CLEAN, USER -> SUPERVISOR, EXEC -> 
NX, CACHEABLE -> PCD all require a TLB flush.  Linux never makes 
transitions in the opposite direction by using live page table bit 
manipulation - which might change the PTE without requiring a flush.  It 
turns out, a flush is still a good idea, as transitions in the opposite 
direction represent a permission upgrade, and leaving a more restrictive 
TLB entry around could cause a spurious fault, which might be suppressed 
depending on the architecture, but still good to avoid.)

> I do find the name confusing (pte_update_defer is for use only when
> the TLB update is not deferred, right?): indeed, its distinction from
> pte_update very prone to errors of this kind.  I'm all for deferring
> work, but is there some way the right thing could be done automatically?
> Or, how much would be lost if we just replaced all the pte_update_defers
> by pte_updates?
>   

I agree the name is confusing; I struggled to find a better name (as 
lazy is already overridden in several ways in mm code).  What would be a 
good way to say pte update which is delayed until subsumed by an 
immediate pte operation (explicit non-delayed update or flush)?

In light of the property above (flush required for correctness), what 
about a name like mark_tlb_pte_async - to indicate that the TLB and PTE 
are now out of sync - and require a flush to be brought in sync.  Some 
architectures simply don't need to care to implement them - and some 
generic code can be written to verify that unsynchronized tlbs are 
always flushed before returning from fault handlers or to userspace.

> I'm not proposing any patch myself: just bringing to your attention.
>   

Thanks!  I'm pretty sure we're still correct, for now, but without care, 
it is easy to diverge from that.  I'll keep a watchful eye on the 
pagetable and mm code (as always) and try to come up with a more 
comprehensive solution.

Zach

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-13 17:54   ` Hugh Dickins
  2007-04-13 19:05     ` Zachary Amsden
@ 2007-04-13 19:27     ` Zachary Amsden
  2007-04-13 20:24       ` Hugh Dickins
  1 sibling, 1 reply; 14+ messages in thread
From: Zachary Amsden @ 2007-04-13 19:27 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Rientjes, Andrew Morton, linux-kernel

Hugh Dickins wrote:
> Zach, while looking at your recent patches, I ran across the comment
> on pte_update_defer, and where it was being used, and now think that
> David's patch is actually incorrect.  Previously pte_update_defer
> was being used where a flush_tlb_page followed immediately after
> within the same macro; with David's patch, mm's clear_refs_pte_range
> is calling ptep_test_and_clear_young (including pte_update_defer) on
> several ptes, then unlocking the page table, and later flushing TLB.
> That's exactly wrong for pte_update_defer, isn't it?
>   

Ok, disregard most of my last e-mail.  It is fine to decouple the flush 
from the update, as long as they stay close enough that you can reason 
they happen together.  I guess I hadn't seen the other parts of the 
patch which release the page table spinlock in between the two, and 
somehow missed it again when responding to the above as I got too 
excited explaining why the decoupling is ok.  It is not ok to release 
the spinlock when using shadow page tables on SMP.  There are some 
rather complex races that can result.  Here's one case:

 CPU-0                    CPU-1
-----------------------  ---------------------------
test_and_clear_dirty(x)
spin_unlock(ptl)
                         write address mapped by X
                         (harware updates dirty bit)
                         spin_lock(ptl)
                         set_pte_wrprotect(x)
                         flush
flush

Now, the write protected pte which maps a dirty page gets broken in two 
ways; it is unclear if dirty bit or entiry PTE from CPU-0 is deferred 
until flush, so either write protected PTE for modified page loses the 
dirty bit (BAD!), or write protected PTE loses both dirty and write 
protect bits (VERY BAD!).

To prevent this, we need a flush before dropping the spinlock.  If that 
gets too complicated, we can drop the defer logic and just use 
pte_update instead, which notifies the hypervisor immediately of the 
mapping change.

Zach

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-13 19:27     ` Zachary Amsden
@ 2007-04-13 20:24       ` Hugh Dickins
  2007-04-13 20:59         ` Zachary Amsden
  0 siblings, 1 reply; 14+ messages in thread
From: Hugh Dickins @ 2007-04-13 20:24 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: David Rientjes, Andrew Morton, linux-kernel

On Fri, 13 Apr 2007, Zachary Amsden wrote:
> Hugh Dickins wrote:
> > Zach, while looking at your recent patches, I ran across the comment
> > on pte_update_defer, and where it was being used, and now think that
> > David's patch is actually incorrect.  Previously pte_update_defer
> > was being used where a flush_tlb_page followed immediately after
> > within the same macro; with David's patch, mm's clear_refs_pte_range
> > is calling ptep_test_and_clear_young (including pte_update_defer) on
> > several ptes, then unlocking the page table, and later flushing TLB.
> > That's exactly wrong for pte_update_defer, isn't it?
> >   
> 
> Ok, disregard most of my last e-mail.

Phew!  That's a lot quicker than digesting it ;)
But thanks for going to so much trouble.

> It is fine to decouple the flush from
> the update, as long as they stay close enough that you can reason they happen
> together.  I guess I hadn't seen the other parts of the patch which release
> the page table spinlock in between the two, and somehow missed it again when
> responding to the above as I got too excited explaining why the decoupling is
> ok.  It is not ok to release the spinlock when using shadow page tables on
> SMP.  There are some rather complex races that can result.  Here's one case:
> 
> CPU-0                    CPU-1
> -----------------------  ---------------------------
> test_and_clear_dirty(x)
> spin_unlock(ptl)
>                         write address mapped by X
>                         (harware updates dirty bit)
>                         spin_lock(ptl)
>                         set_pte_wrprotect(x)
>                         flush
> flush
> 
> Now, the write protected pte which maps a dirty page gets broken in two ways;
> it is unclear if dirty bit or entiry PTE from CPU-0 is deferred until flush,
> so either write protected PTE for modified page loses the dirty bit (BAD!), or
> write protected PTE loses both dirty and write protect bits (VERY BAD!).
> 
> To prevent this, we need a flush before dropping the spinlock.  If that gets
> too complicated, we can drop the defer logic and just use pte_update instead,
> which notifies the hypervisor immediately of the mapping change.

David (clear_refs_pte_range) is only using ptep_test_and_clear_young,
though he did change the ptep_test_and_clear_dirty definition to be
consistent with it.  old/young is never so serious as clean/dirty, so
it may be that there's very little problem with what's in there now;
it just becomes a shame if the wrong decision gets made too often e.g.
if the misflushing is such that his clear_youngs never really take
effect.  I simply cannot tell whether or not that's the case myself.

But once the pte_update_defers get moved away from their flush_tlb_pages,
as is the case now, it feels like we're on thin ice.

Actually, I don't really get pte_update_defer at all: I can understand
wanting to defer a call down, but it appears to be a call down to say
we're deferring the shadow update?  Why not just do it with pte_update?
I'd be happier without it until this could be restructured more safely
(but my incomprehension is not necessarily the best guiding principle).

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-13 20:24       ` Hugh Dickins
@ 2007-04-13 20:59         ` Zachary Amsden
  2007-04-16 17:59           ` Hugh Dickins
  0 siblings, 1 reply; 14+ messages in thread
From: Zachary Amsden @ 2007-04-13 20:59 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Rientjes, Andrew Morton, linux-kernel

Hugh Dickins wrote:
> David (clear_refs_pte_range) is only using ptep_test_and_clear_young,
> though he did change the ptep_test_and_clear_dirty definition to be
> consistent with it.  old/young is never so serious as clean/dirty, so
> it may be that there's very little problem with what's in there now;
> it just becomes a shame if the wrong decision gets made too often e.g.
> if the misflushing is such that his clear_youngs never really take
> effect.  I simply cannot tell whether or not that's the case myself.
>   

If the logic is too difficult to reason about without entering ganda 
bherundasana, then it is simpler to just drop the _defer suffix.  Old / 
young actually is just as serious, not because A-bit is critical 
(although misvirtualizing it will do strange things to the page 
clocking), but because the entire PTE update is delayed - and thus an 
intervening PTE update which remaps the physical page can come in before 
the A-bit update is flushed.  The A-bit update carries the PPN of the 
original physical page, destroying the remap.

> But once the pte_update_defers get moved away from their flush_tlb_pages,
> as is the case now, it feels like we're on thin ice.
>   

Yes, that is why I think I did the original hack of having #defines 
without implementations - to ensure the flush stayed coupled with the 
use of pte_update_defer.

> Actually, I don't really get pte_update_defer at all: I can understand
> wanting to defer a call down, but it appears to be a call down to say
> we're deferring the shadow update?  Why not just do it with pte_update?
> I'd be happier without it until this could be restructured more safely
> (but my incomprehension is not necessarily the best guiding principle).
>   

So pte_update does the update immediately; it is always safe.  For 
select cases, where there is an immediate pte_update + flush 
combination, it is beneficial to combine the two into one hypercall, 
saving the several thousand cycle round trip into the hypervisor from 
happening twice.  This is why pte_update_defer was used close to the 
flushes.  I thought David was just defining the ptep_test_and_clear_xxx 
functions properly; I didn't realize he was actually using them to 
decouple the flush across a spinlock or logically complex code region.  
If that is the case, I agree pte_update instead of defer is absolutely 
the right thing to do.

I'm a little worried about the TLB consistency here, on real hardware.  
If you drop the spinlock, you could end up preempted.  If the preemption 
runs a kernel thread, which accesses the same memory for which you just 
modified the page table mappings, the TLB could still have a stale 
entry.  So, a clear of a dirty bit might happen in a writeback thread, 
but get preempted before the flush, then the hardware TLB still has a 
dirty entry.  This means writes through the same virtual address will 
skip the A/D bit update in the pte.  Now, for regular user memory 
access, that might not be a problem, but if it is the kernel touching 
user memory through kmap, the kernel virtual address might well ..  I'm 
off in the weeds.  But still slightly concerned about leaving that TLB 
entry stale across a possible preemption.

As an aside, in the tree I'm looking at, I don't see any users of 
ptep_clear_flush_dirty at all.. doesn't that mean writeback which 
undirties pages will leave the PTEs with latent dirty bits, causing them 
to get written back again when the process exits?  Or does rmap take 
care of this.  Ahh, I just discovered page_mkclean_one.  Never mind.  
But still no users of ptep_clear_flush_dirty.

Zach

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-13 20:59         ` Zachary Amsden
@ 2007-04-16 17:59           ` Hugh Dickins
  2007-04-16 18:51             ` David Rientjes
  2007-04-16 22:00             ` Zachary Amsden
  0 siblings, 2 replies; 14+ messages in thread
From: Hugh Dickins @ 2007-04-16 17:59 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: David Rientjes, Andrew Morton, linux-kernel

On Fri, 13 Apr 2007, Zachary Amsden wrote:
> 
> If the logic is too difficult to reason about without entering ganda
> bherundasana, then it is simpler to just drop the _defer suffix.  Old / young
> actually is just as serious, not because A-bit is critical (although
> misvirtualizing it will do strange things to the page clocking), but because
> the entire PTE update is delayed - and thus an intervening PTE update which
> remaps the physical page can come in before the A-bit update is flushed.  The
> A-bit update carries the PPN of the original physical page, destroying the
> remap.

Hmm, that is serious.  Perhaps a dangerous divergence from how real
hardware would work, which will get you into trouble down the line?
When, as now, core mm people make some change, failing to take your
case into account.

> So pte_update does the update immediately; it is always safe.  For select
> cases, where there is an immediate pte_update + flush combination, it is
> beneficial to combine the two into one hypercall, saving the several thousand
> cycle round trip into the hypervisor from happening twice.  This is why
> pte_update_defer was used close to the flushes.  I thought David was just
> defining the ptep_test_and_clear_xxx functions properly; I didn't realize he
> was actually using them to decouple the flush across a spinlock or logically
> complex code region.  If that is the case, I agree pte_update instead of defer
> is absolutely the right thing to do.

You're right to want to defer your pte_updates, David is right to want
to batch his TLB flushes.  It bothers me that you have a surprising case,
and that unless you abandon your optimization, it imposes a new constraint
on how to proceed in common code (without #ifdef'ing around).

But perhaps in this case David might concede that the longer we delay
the TLB flush, the more likely a referenced bit is to be missed - that is,
it gets cleared from the pte, but if that page is accessed again before
the TLB is flushed, the processor may well omit to reinstate the accessed
bit, and our stats drift away from reality.

Compromise patch below: would that be satisfactory to you, David?

> 
> I'm a little worried about the TLB consistency here, on real hardware.  If you
> drop the spinlock, you could end up preempted.  If the preemption runs a
> kernel thread, which accesses the same memory for which you just modified the
> page table mappings, the TLB could still have a stale entry.  So, a clear of a
> dirty bit might happen in a writeback thread, but get preempted before the
> flush, then the hardware TLB still has a dirty entry.  This means writes
> through the same virtual address will skip the A/D bit update in the pte.
> Now, for regular user memory access, that might not be a problem, but if it is
> the kernel touching user memory through kmap, the kernel virtual address might
> well ..  I'm off in the weeds.  But still slightly concerned about leaving
> that TLB entry stale across a possible preemption.

I don't see your kernel/kmap case, but certainly it could be a problem
for regular user memory access: the dirty bit could get lost that way,
just as the accessed bit can.  If the accessed bit gets lost, memory
management just makes a bad decision (one amidst many!) about which
page to recycle; if the dirty bit gets lost, user data can vanish
(though usually the write of the dirty page will happen long after
both modifications have been included, with no visible problem).

> 
> As an aside, in the tree I'm looking at, I don't see any users of
> ptep_clear_flush_dirty at all.. doesn't that mean writeback which undirties
> pages will leave the PTEs with latent dirty bits, causing them to get written
> back again when the process exits?  Or does rmap take care of this.  Ahh, I
> just discovered page_mkclean_one.  Never mind.  But still no users of
> ptep_clear_flush_dirty.

ptep_clear_flush_dirty used to be used in msync.c, but page_mkclean_one
etc. has now displaced that use.  Perhaps it'll come back into use again:
it's a primitive we do expect to have around.

Hugh

--- 2.6.21-rc6-mm1/fs/proc/task_mmu.c	2007-04-09 12:18:01.000000000 +0100
+++ linux/fs/proc/task_mmu.c	2007-04-16 18:27:58.000000000 +0100
@@ -432,6 +432,8 @@ static int clear_refs_pte_range(pmd_t *p
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
+	unsigned long flush_start;
+	unsigned long flush_end = 0;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -444,9 +446,22 @@ static int clear_refs_pte_range(pmd_t *p
 			continue;
 
 		/* Clear accessed and referenced bits. */
-		ptep_test_and_clear_young(vma, addr, pte);
+		if (ptep_test_and_clear_young(vma, addr, pte)) {
+			if (!flush_end)
+				flush_start = addr;
+			flush_end = addr + PAGE_SIZE;
+		}
 		ClearPageReferenced(page);
 	}
+	/*
+	 * If we're running over a shadow mode hypervisor, pte updates have
+	 * been deferred, and corruption might ensue if we don't flush TLB
+	 * before dropping page table lock.  In other cases, although repeated
+	 * TLB flushes will be wasteful while this mm is running, the longer
+	 * we delay them, the more likely a referenced bit is to be missed.
+	 */
+	if (flush_end)
+		flush_tlb_range(vma, flush_start, flush_end);
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 	return 0;
@@ -481,7 +496,6 @@ static ssize_t clear_refs_write(struct f
 			if (!is_vm_hugetlb_page(vma))
 				walk_page_range(mm, vma->vm_start, vma->vm_end,
 						&clear_refs_walk, vma);
-		flush_tlb_mm(mm);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-16 17:59           ` Hugh Dickins
@ 2007-04-16 18:51             ` David Rientjes
  2007-04-16 19:04               ` Hugh Dickins
  2007-04-16 22:00             ` Zachary Amsden
  1 sibling, 1 reply; 14+ messages in thread
From: David Rientjes @ 2007-04-16 18:51 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Zachary Amsden, Andrew Morton, linux-kernel

On Mon, 16 Apr 2007, Hugh Dickins wrote:

> When, as now, core mm people make some change, failing to take your
> case into account.
> 

That's exactly what happened here: include/asm-i386/pgtable.h is 
advertising both __HAVE_ARCH_PTEP_TEST_AND_CLEAR_{DIRTY,YOUNG} without 
actually having them.  No consideration is going to be given to such a 
hack when clearing A/D bits in generic code.  If the optimization for a 
specific architecture is not possible for an atomic test and clear of A/D 
bits generically, then it's not an optimization at all.  It's a bug.

> You're right to want to defer your pte_updates, David is right to want
> to batch his TLB flushes.  It bothers me that you have a surprising case,
> and that unless you abandon your optimization, it imposes a new constraint
> on how to proceed in common code (without #ifdef'ing around).
> 

Batching the TLB flushes was really a no brainer: there was a significant 
speed-up in /proc/pid/clear_refs by using flush_tlb_mm() over 
flush_tlb_page().  If there's an "optimization" for paravirt that gets in 
the way by defering the updated A/D bit, then it needs to be eliminated.  
It was merged because there were no generic uses of 
ptep_test_and_clear_{dirty,young}, but now there are.

If it was really considered to be an "optimization," which is was 
advertised to be in the changelog associated with a very trivial patch, 
then removing it by definition would not introduce new constraints that 
would need to be patched.  I'd like to see a patch without the defer and 
see how invasive it truly is on generic code because it was not originally 
merged that invasively.

> Compromise patch below: would that be satisfactory to you, David?
> 

I really like the patch, but for perhaps a slightly different reason: 
we're only flushing ranges that have been shown to need it.  We aren't 
completely flushing the entire mm which is likely to be excessive in 
situations where we're actually using /proc/pid/clear_refs in combination 
with /proc/pid/smaps for memory footprint approximation (i.e. it's on a 
fine granularity).

		David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-16 18:51             ` David Rientjes
@ 2007-04-16 19:04               ` Hugh Dickins
  2007-04-16 19:20                 ` David Rientjes
  0 siblings, 1 reply; 14+ messages in thread
From: Hugh Dickins @ 2007-04-16 19:04 UTC (permalink / raw)
  To: David Rientjes; +Cc: Zachary Amsden, Andrew Morton, linux-kernel

On Mon, 16 Apr 2007, David Rientjes wrote:
> 
> > Compromise patch below: would that be satisfactory to you, David?
> 
> I really like the patch, but for perhaps a slightly different reason: 
> we're only flushing ranges that have been shown to need it.  We aren't 
> completely flushing the entire mm which is likely to be excessive in 
> situations where we're actually using /proc/pid/clear_refs in combination 
> with /proc/pid/smaps for memory footprint approximation (i.e. it's on a 
> fine granularity).

It would be more of a reason to like the patch, if more architectures
actually implemented flush_tlb_range as anything different from
flush_tlb_mm ;)  Sadly, few can do better than flush_tlb_mm: ia64
is the exception I remember, and maybe a couple of others.  I put
flush_tlb_range there merely because it seems more appropriate,
but it's rather deceptive.

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-16 19:04               ` Hugh Dickins
@ 2007-04-16 19:20                 ` David Rientjes
  2007-04-16 22:08                   ` Zachary Amsden
  0 siblings, 1 reply; 14+ messages in thread
From: David Rientjes @ 2007-04-16 19:20 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Zachary Amsden, Andrew Morton, linux-kernel

On Mon, 16 Apr 2007, Hugh Dickins wrote:

> > > Compromise patch below: would that be satisfactory to you, David?
> > 
> > I really like the patch, but for perhaps a slightly different reason: 
> > we're only flushing ranges that have been shown to need it.  We aren't 
> > completely flushing the entire mm which is likely to be excessive in 
> > situations where we're actually using /proc/pid/clear_refs in combination 
> > with /proc/pid/smaps for memory footprint approximation (i.e. it's on a 
> > fine granularity).
> 
> It would be more of a reason to like the patch, if more architectures
> actually implemented flush_tlb_range as anything different from
> flush_tlb_mm ;)  Sadly, few can do better than flush_tlb_mm: ia64
> is the exception I remember, and maybe a couple of others.  I put
> flush_tlb_range there merely because it seems more appropriate,
> but it's rather deceptive.
> 

Sure, but what I really like about the patch is that we're only flushing 
something if !flush_end in the first place.  So we can eliminate any TLB 
flushing if that VMA didn't need it; that's a change from the current 
behavior.  And since the most obvious use-case for /proc/pid/clear_refs is 
in conjunction with /proc/pid/smaps for approximating memory footprint, 
we'll end up saving TLB flushes because the granularity with which that 
measurement is taken is usually very fine.

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-16 17:59           ` Hugh Dickins
  2007-04-16 18:51             ` David Rientjes
@ 2007-04-16 22:00             ` Zachary Amsden
  1 sibling, 0 replies; 14+ messages in thread
From: Zachary Amsden @ 2007-04-16 22:00 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: David Rientjes, Andrew Morton, linux-kernel

Hugh Dickins wrote:
> You're right to want to defer your pte_updates, David is right to want
> to batch his TLB flushes.  It bothers me that you have a surprising case,
> and that unless you abandon your optimization, it imposes a new constraint
> on how to proceed in common code (without #ifdef'ing around).
>
> But perhaps in this case David might concede that the longer we delay
> the TLB flush, the more likely a referenced bit is to be missed - that is,
> it gets cleared from the pte, but if that page is accessed again before
> the TLB is flushed, the processor may well omit to reinstate the accessed
> bit, and our stats drift away from reality.
>
> Compromise patch below: would that be satisfactory to you, David?
>   

Although I appreciate the heroics, you needn't do this on our account; 
the win of a couple thousand cycles is not worth the cost in complexity, 
IMHO, and the penalty on native quite potentially overshadows this.  If 
you still issue the flush inside the spinlock, as required for this 
paravirt optimization, you are taking the risk of holding the spinlock 
an extra long time while issuing a TLB shootdown - which means waiting 
for an IPI.

It might not matter that much on i386, but on big iron (or realtime) 
systems, this could have significant negative scaling effects for 
workloads where the page page table was hot on some set of CPUs (say, 
remapping file pages for database access).  In time, the benefits of 
this optimization to the hypervisor will decrease, while the benefits of 
optimizing the other way for shorter spinlock time may increase, both in 
a VM and on native hardware.

So I would rather just drop the pte_update_defer down to a pte_update if 
the flush is not immediately following - as it is nice and simply 
correct without getting in the way.  I see there is more in this thread 
that I haven't read yet, so I preemptively reserve the right to issue an 
invalidation of this opinion...

Thanks,

Zach

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
  2007-04-16 19:20                 ` David Rientjes
@ 2007-04-16 22:08                   ` Zachary Amsden
  0 siblings, 0 replies; 14+ messages in thread
From: Zachary Amsden @ 2007-04-16 22:08 UTC (permalink / raw)
  To: David Rientjes; +Cc: Hugh Dickins, Andrew Morton, linux-kernel

David Rientjes wrote:
> Sure, but what I really like about the patch is that we're only flushing 
> something if !flush_end in the first place.  So we can eliminate any TLB 
> flushing if that VMA didn't need it; that's a change from the current 
> behavior.  And since the most obvious use-case for /proc/pid/clear_refs is 
> in conjunction with /proc/pid/smaps for approximating memory footprint, 
> we'll end up saving TLB flushes because the granularity with which that 
> measurement is taken is usually very fine.
>
> Acked-by: David Rientjes <rientjes@google.com>
>   

I like the patch even better if you still batch the flushes, but keep 
the !flush_end machinery.  If I read it correctly, flush_start stays at 
the lower bound for the whole function, so it is still accurate later.  
And with the flush outside the spinlock, contention time is lower.

Thanks,

Zach

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2007-04-16 22:12 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-26  6:37 [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young} David Rientjes
2007-03-26  9:08 ` Andrew Morton
2007-03-26 20:24 ` Zachary Amsden
2007-04-13 17:54   ` Hugh Dickins
2007-04-13 19:05     ` Zachary Amsden
2007-04-13 19:27     ` Zachary Amsden
2007-04-13 20:24       ` Hugh Dickins
2007-04-13 20:59         ` Zachary Amsden
2007-04-16 17:59           ` Hugh Dickins
2007-04-16 18:51             ` David Rientjes
2007-04-16 19:04               ` Hugh Dickins
2007-04-16 19:20                 ` David Rientjes
2007-04-16 22:08                   ` Zachary Amsden
2007-04-16 22:00             ` Zachary Amsden

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).