LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] ppc64: Fix possible race with set_pte on a present PTE
@ 2004-05-24  3:29 Benjamin Herrenschmidt
  2004-05-24  3:47 ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  3:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, Linux Kernel list

(RESENT)

There is a subtle race which can cause set_pte to be called on ppc64 on
a PTE that is already present (that normally doesn't happen for us) and
which itself, in the proper race condition, can trigger a duplicate hash
entry to be added to the hash table (very bad).

This fixes it by making sure we trigger the actual flush of the batch
whenever set_pte is called on a present PTE, before putting the new PTE
in.

===== include/asm-ppc64/pgtable.h 1.32 vs edited =====
--- 1.32/include/asm-ppc64/pgtable.h	Fri Apr  9 03:30:57 2004
+++ edited/include/asm-ppc64/pgtable.h	Thu May 20 15:47:40 2004
@@ -396,8 +396,10 @@
  */
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	if (pte_present(*ptep))
+	if (pte_present(*ptep)) {
 		pte_clear(ptep);
+		flush_tlb_pending();
+	}
 	*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
 }
 



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  3:29 [PATCH] ppc64: Fix possible race with set_pte on a present PTE Benjamin Herrenschmidt
@ 2004-05-24  3:47 ` Linus Torvalds
  2004-05-24  4:13   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-24  3:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Andrew Morton, Linux Kernel list



On Mon, 24 May 2004, Benjamin Herrenschmidt wrote:
> 
> There is a subtle race which can cause set_pte to be called on ppc64 on
> a PTE that is already present (that normally doesn't happen for us) and
> which itself, in the proper race condition, can trigger a duplicate hash
> entry to be added to the hash table (very bad).

So how exactly can the pte already be present? It's definitely illegal, 
since if that actually happened, that would imply a memory leak (whatever 
previous page was there just got silently dropped). 

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  3:47 ` Linus Torvalds
@ 2004-05-24  4:13   ` Benjamin Herrenschmidt
  2004-05-24  4:36     ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  4:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel list

On Mon, 2004-05-24 at 13:47, Linus Torvalds wrote:
> On Mon, 24 May 2004, Benjamin Herrenschmidt wrote:
> > 
> > There is a subtle race which can cause set_pte to be called on ppc64 on
> > a PTE that is already present (that normally doesn't happen for us) and
> > which itself, in the proper race condition, can trigger a duplicate hash
> > entry to be added to the hash table (very bad).
> 
> So how exactly can the pte already be present? It's definitely illegal, 
> since if that actually happened, that would imply a memory leak (whatever 
> previous page was there just got silently dropped). 

Paulus and I identified a couple of cases in the page fault path. One typical is
the software accessed bit thing at the end of handle_pte_fault() where we cab
do

	entry = pte_mkyoung(entry);
	ptep_establish(vma, address, pte, entry);
	update_mmu_cache(vma, address, entry);

On a PTE that is present. That normally shouldn't happen as since the PTE is
present, we shouldn't have reached do_page_fault() in the first place, but
there is a window where that can happen, though that would be broken userspace
code.

Typically, you can have a thread faulting on a page. It goes through hash_page,
doesn't find the entry, and gets to do_page_fault(). However, just before it
takes the mm sem, another thread actually mmap's that page in. Thus we end up
in handle_pte_fault() with a present PTE which has a valid mapping already.

The risk here is that since we have a present PTE, we can at any time (another
thread/cpu ?) get it into the hash table. Our set_pte would then possibly replace
the PTE valid entry (with the same valid PTE entry) except that we lost the
HASH_PTE bit and hash index, thus we lose track of the one already in the hash
table in any. That mean we leave a dangling PTE in the hash, which is a very
bad thing.

I agree the race is very small and only possible with broken userland code I
suppose, but it could create all sort of bad things with the kernel, so it
needs to be fixed anyway. (It can panic on iSeries for example, or cause
undefined MMU behaviour).

There might be other similar cases where we set_pte a present PTE, that's
the one we have analyzed.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  4:13   ` Benjamin Herrenschmidt
@ 2004-05-24  4:36     ` Linus Torvalds
  2004-05-24  4:44       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-24  4:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Andrew Morton, Linux Kernel list



On Mon, 24 May 2004, Benjamin Herrenschmidt wrote:
> 
> Typically, you can have a thread faulting on a page. It goes through hash_page,
> doesn't find the entry, and gets to do_page_fault(). However, just before it
> takes the mm sem, another thread actually mmap's that page in. Thus we end up
> in handle_pte_fault() with a present PTE which has a valid mapping already.

Ok, with you so far. But I don't see how you get to set_pte() that way, 
since handle_pte_fault() will re-test the pte_present() bit, and never 
even try to set_pte() if one already exists. Hmm?

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  4:36     ` Linus Torvalds
@ 2004-05-24  4:44       ` Benjamin Herrenschmidt
  2004-05-24  5:10         ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  4:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel list

On Mon, 2004-05-24 at 14:36, Linus Torvalds wrote:
> On Mon, 24 May 2004, Benjamin Herrenschmidt wrote:
> > 
> > Typically, you can have a thread faulting on a page. It goes through hash_page,
> > doesn't find the entry, and gets to do_page_fault(). However, just before it
> > takes the mm sem, another thread actually mmap's that page in. Thus we end up
> > in handle_pte_fault() with a present PTE which has a valid mapping already.
> 
> Ok, with you so far. But I don't see how you get to set_pte() that way, 
> since handle_pte_fault() will re-test the pte_present() bit, and never 
> even try to set_pte() if one already exists. Hmm?

No, it doesn't. It tests it, if !present, it goes to do_no_page,
do_file_page or do_swap_page, but the,i if present, it does:

	entry = pte_mkyoung(entry);
	ptep_establish(vma, address, pte, entry);
	update_mmu_cache(vma, address, entry);
	pte_unmap(pte);
	spin_unlock(&mm->page_table_lock);
	return VM_FAULT_MINOR;

Which is, I think, the software PAGE_ACCESSED path used on some archs.

(ptep_establish does set_pte)

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  4:44       ` Benjamin Herrenschmidt
@ 2004-05-24  5:10         ` Linus Torvalds
  2004-05-24  5:34           ` Benjamin Herrenschmidt
                             ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-24  5:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm


On Mon, 24 May 2004, Benjamin Herrenschmidt wrote:
> 
> No, it doesn't. It tests it, if !present, it goes to do_no_page,
> do_file_page or do_swap_page, but the,i if present, it does:
> 
> 	entry = pte_mkyoung(entry);
> 	ptep_establish(vma, address, pte, entry);
> 	update_mmu_cache(vma, address, entry);
> 	pte_unmap(pte);
> 	spin_unlock(&mm->page_table_lock);
> 	return VM_FAULT_MINOR;
> 
> Which is, I think, the software PAGE_ACCESSED path used on some archs.
> 
> (ptep_establish does set_pte)

Ahh.. That's a bug, methinks.

The reason it's a bug is that if you do this, you can lose the dirty bit 
being written on some other CPU asynchronously.

In other words, I think it's pretty much always a bug to do a "set_pte()"
with an existing pte in place, exactly because you lose information. You
are trying to cover up the bug in ppc64-specific code, but I think that
what you found is actually a (really really) unlikely race condition that
can have serious consequences.

Or am I missing something else?

[ grep grep grep ]

Looks like "break_cow()" and "do_wp_page()" are safe, but only because
they always sets the dirty bit, and any other bits end up being pretty 
much "don't care if we miss an accessed bit update" or something.

Hmm. Maybe I'm wrong. If this really is buggy, it's been buggy this way 
basically forever. That code is _not_ new, it's some of the oldes code in 
the whole VM since the original three-level code rewrite. I think. Of 
course, back then SMP wasn't an issue, and this seems to have survived all 
the SMP fixes.

Who else has been working on the page tables that could verify this for 
me? Ingo? Ben LaHaise? I forget who even worked on this, because it's so 
long ago we went through all the atomicity issues with the page table 
updates on SMP. There may be some reason that I'm overlooking that 
explains why I'm full of sh*t.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  5:10         ` Linus Torvalds
@ 2004-05-24  5:34           ` Benjamin Herrenschmidt
  2004-05-24  5:38             ` Benjamin Herrenschmidt
  2004-05-24  7:39           ` Ingo Molnar
  2004-05-25  3:43           ` Andrea Arcangeli
  2 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  5:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm


> Ahh.. That's a bug, methinks.
> 
> The reason it's a bug is that if you do this, you can lose the dirty bit 
> being written on some other CPU asynchronously.

Hrm... right indeed.

> In other words, I think it's pretty much always a bug to do a "set_pte()"
> with an existing pte in place, exactly because you lose information. You
> are trying to cover up the bug in ppc64-specific code, but I think that
> what you found is actually a (really really) unlikely race condition that
> can have serious consequences.
> 
> Or am I missing something else?

Well, the original scenario triggering that from userland is, imho, so
broken, that we may just not care losing that dirty bit ... Oh well :)
Anyway, apply my patch. If pte is not present, this will have no effect,
if it is, it makes sure we never leave a stale HPTE in the hash, which
is fatal in far worse ways.

> [ grep grep grep ]
> 
> Looks like "break_cow()" and "do_wp_page()" are safe, but only because
> they always sets the dirty bit, and any other bits end up being pretty 
> much "don't care if we miss an accessed bit update" or something.
> 
> Hmm. Maybe I'm wrong. If this really is buggy, it's been buggy this way 
> basically forever. That code is _not_ new, it's some of the oldes code in 
> the whole VM since the original three-level code rewrite. I think. Of 
> course, back then SMP wasn't an issue, and this seems to have survived all 
> the SMP fixes.
> 
> Who else has been working on the page tables that could verify this for 
> me? Ingo? Ben LaHaise? I forget who even worked on this, because it's so 
> long ago we went through all the atomicity issues with the page table 
> updates on SMP. There may be some reason that I'm overlooking that 
> explains why I'm full of sh*t.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  5:34           ` Benjamin Herrenschmidt
@ 2004-05-24  5:38             ` Benjamin Herrenschmidt
  2004-05-24  5:52               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  5:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm


> Well, the original scenario triggering that from userland is, imho, so
> broken, that we may just not care losing that dirty bit ... Oh well :)
> Anyway, apply my patch. If pte is not present, this will have no effect,
> if it is, it makes sure we never leave a stale HPTE in the hash, which
> is fatal in far worse ways.

Hrm... Or maybe I should just do in set_pte something like

 BUG_ON(pte_present(ptep))

That would make me sleep better ;)

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  7:39           ` Ingo Molnar
@ 2004-05-24  5:39             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  5:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel list, Ben LaHaise, linux-mm

On Mon, 2004-05-24 at 17:39, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > Who else has been working on the page tables that could verify this
> > for me? Ingo? Ben LaHaise? I forget who even worked on this, because
> > it's so long ago we went through all the atomicity issues with the
> > page table updates on SMP. There may be some reason that I'm
> > overlooking that explains why I'm full of sh*t.
> 
> Ben's the master of atomic dirty pte updates! :)

Precise which Ben :) Well, in this case it's obvious but still.. :)

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  5:38             ` Benjamin Herrenschmidt
@ 2004-05-24  5:52               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-24  5:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm

On Mon, 2004-05-24 at 15:38, Benjamin Herrenschmidt wrote:
> > Well, the original scenario triggering that from userland is, imho, so
> > broken, that we may just not care losing that dirty bit ... Oh well :)
> > Anyway, apply my patch. If pte is not present, this will have no effect,
> > if it is, it makes sure we never leave a stale HPTE in the hash, which
> > is fatal in far worse ways.
> 
> Hrm... Or maybe I should just do in set_pte something like
> 
>  BUG_ON(pte_present(ptep))
> 
> That would make me sleep better ;)

 ... And would not work in the case you mentioned where we set it with
the dirty bit set... for write protect, we have a separate function now,
though. But I'm a bit paranoid with those PTE manipulations, so I think
it would be better to play it the safe way and keep my original patch.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  5:10         ` Linus Torvalds
  2004-05-24  5:34           ` Benjamin Herrenschmidt
@ 2004-05-24  7:39           ` Ingo Molnar
  2004-05-24  5:39             ` Benjamin Herrenschmidt
  2004-05-25  3:43           ` Andrea Arcangeli
  2 siblings, 1 reply; 79+ messages in thread
From: Ingo Molnar @ 2004-05-24  7:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ben LaHaise, linux-mm


* Linus Torvalds <torvalds@osdl.org> wrote:

> Who else has been working on the page tables that could verify this
> for me? Ingo? Ben LaHaise? I forget who even worked on this, because
> it's so long ago we went through all the atomicity issues with the
> page table updates on SMP. There may be some reason that I'm
> overlooking that explains why I'm full of sh*t.

Ben's the master of atomic dirty pte updates! :)

	Ingo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-24  5:10         ` Linus Torvalds
  2004-05-24  5:34           ` Benjamin Herrenschmidt
  2004-05-24  7:39           ` Ingo Molnar
@ 2004-05-25  3:43           ` Andrea Arcangeli
  2004-05-25  4:00             ` Linus Torvalds
  2 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  3:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm

On Sun, May 23, 2004 at 10:10:16PM -0700, Linus Torvalds wrote:
> what you found is actually a (really really) unlikely race condition that
> can have serious consequences.

agreed, it could in theory lose the dirty bit.

> course, back then SMP wasn't an issue, and this seems to have survived all 
> the SMP fixes.

looks like to trigger this two threads needs to page fault on the same
not-present MAP_SHARED page at the same time and the second thread must be a
read-fault. The first thread will establish the pte writeable and
dirty, then the first thread releases the page_table_lock, then the VM
takes the page_table_lock (from kswapd or similar) then the VM transfers
the dirty bit from pte to page and it even starts the and completes the
I/O before the the read-fault has a chance to execute. Then after the
I/O is completed the second thread finally takes the page_table_lock and
reads the pte. Then the first thread writes to the page from userspace
marking the pte dirty , and finally the second thread completes the
page_fault running pte_establish and losing the dirty bit on the page.

Maybe it can trigger even without I/O in between, not sure, certainly it
can happen in the above scenario, but the I/O window is quite huge for
not being able to take a spinlock before the I/O is started.

The below patch should fix it, the only problem is that it can screwup
some arch that might use page-faults to keep track of the accessed bit,
I think for istance ia64 might do that, not sure if it's using it in
linux though. anyways the below should be the optimal fix for x86 and
x86-64 at least and the other archs will not corrupt memory anymore too
(at worst they will live lock in userspace, and the livelock should be
solvable gracefully with sigkill since it's an userspace one).

warning, patch is absolutely untested.

Index: mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.154
diff -u -p -r1.154 memory.c
--- mm/memory.c	20 Apr 2004 14:22:03 -0000	1.154
+++ mm/memory.c	25 May 2004 03:04:48 -0000
@@ -1665,10 +1665,30 @@ static inline int handle_pte_fault(struc
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
 
 		entry = pte_mkdirty(entry);
+
+		/*
+		 * We can overwrite the pte not atomically only if this is a
+		 * write fault or we can lose the dirty bit. The dirty and the
+		 * accessed bit are the only two things that can change form
+		 * under us even if we hold the page_table_lock, they're
+		 * set by hardware while other threads runs in userspace.
+		 *
+		 * This could cause an infinite loop with a read-fault if some
+		 * arch tries to keep track of the accessed bit with a kernel
+		 * page fault (but keeping track of the accessed bit with
+		 * kernel exceptions sounds very inefficient anyways).
+		 * If some arch infinite loops, they will have to implement
+		 * some sort of atomic ptep_stablish where they atomically
+		 * compare and swap, then we'll change the common code here
+		 * to learn how to atomic swap the pte.
+		 *
+		 * x86 and x86-64 are sure optimal without atomic ops here and
+		 * with this simple code.
+		 */
+		entry = pte_mkyoung(entry);
+		ptep_establish(vma, address, pte, entry);
+		update_mmu_cache(vma, address, entry);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry);
-	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_MINOR;


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  3:43           ` Andrea Arcangeli
@ 2004-05-25  4:00             ` Linus Torvalds
  2004-05-25  4:17               ` Benjamin Herrenschmidt
                                 ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25  4:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group



On Tue, 25 May 2004, Andrea Arcangeli wrote:
> 
> The below patch should fix it, the only problem is that it can screwup
> some arch that might use page-faults to keep track of the accessed bit,

Indeed. At least alpha does this - that's where this code came from. SO 
this will cause infinite page faults on alpha and any other "accessed bit 
in software" architectures.

Not good.

I suspect we should just make a "ptep_set_bits()" inline function that 
_atomically_ does "set the dirty/accessed bits". On x86, it would be a 
simple

		asm("lock ; orl %1,%0"
			:"m" (*ptep)
			:"r" (entry));

and similarly on most other architectures it should be quite easy to do 
the equivalent. You can always do it with a simple compare-and-exchange 
loop, something any SMP-capable architecture should have.

Of course, arguably we can actually optimize this by "knowing" that it is
safe to set the dirty bit, so then we don't even need an atomic operation,
we just need one atomic write.  So we only actually need the atomic op for 
the accessed bit case, and if we make the write-case be totally separate..

Anybody willing to write up a patch for a few architectures? Is there any 
architecture out there that would have a problem with this?

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:00             ` Linus Torvalds
@ 2004-05-25  4:17               ` Benjamin Herrenschmidt
  2004-05-25  4:37                 ` Andrea Arcangeli
  2004-05-25  4:20               ` Andrea Arcangeli
  2004-05-25 11:44               ` Matthew Wilcox
  2 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-25  4:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Andrew Morton, Linux Kernel list, Ingo Molnar,
	Ben LaHaise, linux-mm, Architectures Group


> and similarly on most other architectures it should be quite easy to do 
> the equivalent. You can always do it with a simple compare-and-exchange 
> loop, something any SMP-capable architecture should have.
> 
> Of course, arguably we can actually optimize this by "knowing" that it is
> safe to set the dirty bit, so then we don't even need an atomic operation,
> we just need one atomic write.  So we only actually need the atomic op for 
> the accessed bit case, and if we make the write-case be totally separate..

Looks good ! That gives us a guarantee that set_pte is never ever called
on a present PTE (thus letting set_pte be non-atomic) and we can safely
BUG_ON(pte_present(*ptep)) in it, right ?
 
Note that having different set_dirty and set_accessed may be useful for
some archs, thouh I agree a single atomic operation is enough on ppc
too, I also want to make sure nobody ever gets the idea of using that
for anything but those 2 bits... Well, that's a matter of taste, go for
what you prefer.

ppc64 version of this would look like

static inline unsigned long ptep_set_bits(pte_t *p, unsigned long set)
{
	unsigned long old, tmp;

	__asm__ __volatile__(
	"1:	ldarx	%0,0,%3\n\
	or	%1,%0,%4 \n\
	stdcx.	%1,0,%3 \n\
	bne-	1b"
	: "=&r" (old), "=&r" (tmp), "=m" (*p)
	: "r" (p), "r" (clr), "m" (*p)
	: "cc" );
	return old;
}

ppc32 would be:

#define ptep_set_bits(p, bits) pte_update(p, 0, bits)

> Anybody willing to write up a patch for a few architectures? Is there any 
> architecture out there that would have a problem with this?
> 
> 		Linus
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:00             ` Linus Torvalds
  2004-05-25  4:17               ` Benjamin Herrenschmidt
@ 2004-05-25  4:20               ` Andrea Arcangeli
  2004-05-25  4:39                 ` Linus Torvalds
  2004-05-25  4:43                 ` David Mosberger
  2004-05-25 11:44               ` Matthew Wilcox
  2 siblings, 2 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  4:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group

On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 25 May 2004, Andrea Arcangeli wrote:
> > 
> > The below patch should fix it, the only problem is that it can screwup
> > some arch that might use page-faults to keep track of the accessed bit,
> 
> Indeed. At least alpha does this - that's where this code came from. SO 
> this will cause infinite page faults on alpha and any other "accessed bit 
> in software" architectures.

as you say the alpha has no accessed bit at all in hardware, so
it cannot generate page faults. 

"accessed bit in software" is fine with my fix.

the only architecture that has the accessed bit in _hardware_ via page
faults I know is ia64, but I don't know if it has a mode to set it
without page faults and how it is implementing the accessed bit in linux.

Everything else either has it in hardware trasparently set by the cpu
(like x86) in the pte, or in software (like alpha). Either ways it's
fine.

> I suspect we should just make a "ptep_set_bits()" inline function that 
> _atomically_ does "set the dirty/accessed bits". On x86, it would be a 
> simple
> 
> 		asm("lock ; orl %1,%0"
> 			:"m" (*ptep)
> 			:"r" (entry));
>
> and similarly on most other architectures it should be quite easy to do 
> the equivalent. You can always do it with a simple compare-and-exchange 
> loop, something any SMP-capable architecture should have.
> 
> Of course, arguably we can actually optimize this by "knowing" that it is
> safe to set the dirty bit, so then we don't even need an atomic operation,
> we just need one atomic write.  So we only actually need the atomic op for 
> the accessed bit case, and if we make the write-case be totally separate..

on x86 there's no point to use atomic ops in the write_access path and
there's no point to do anything in the read path.

The only issue are the archs that may generate page faults for young bit
clear and for those we should simply add a:

	ptep_atomic_set_young(pte)

right before the pte_unmap, the tlb flush is not needed of course (they
are generating active page faults so they will re-check).

But before adding the above operation, I'd wait to hear if any
architecture is really using _hardware_ page faults for the accessed bit
in linux.

Then it also depends if this is the only buggy place or if there are
others, but certainly this one doesn't need atomic ops on x86* and alpha
(that's the only two I'm 100% sure about ;).

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:17               ` Benjamin Herrenschmidt
@ 2004-05-25  4:37                 ` Andrea Arcangeli
  2004-05-25  4:40                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  4:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel list, Ingo Molnar,
	Ben LaHaise, linux-mm, Architectures Group

On Tue, May 25, 2004 at 02:17:41PM +1000, Benjamin Herrenschmidt wrote:
> on a present PTE (thus letting set_pte be non-atomic) and we can safely
> BUG_ON(pte_present(*ptep)) in it, right ?

set_pte is used even to mark the pte non present, so you can forget
about using BUG_ON(pte_present(*ptep)) anywhere in set_pte regardless of
how we fix this race (see mm/objrmap.c:unmap_pte_page()). If you want to
trap for it you should add a set_pte_present and use it at least in
objrmap.c during the paging.

> ppc64 version of this would look like
> 
> static inline unsigned long ptep_set_bits(pte_t *p, unsigned long set)
> {
> 	unsigned long old, tmp;
> 
> 	__asm__ __volatile__(
> 	"1:	ldarx	%0,0,%3\n\
> 	or	%1,%0,%4 \n\
> 	stdcx.	%1,0,%3 \n\
> 	bne-	1b"
> 	: "=&r" (old), "=&r" (tmp), "=m" (*p)
> 	: "r" (p), "r" (clr), "m" (*p)
> 	: "cc" );
> 	return old;
> }
> 
> ppc32 would be:
> 
> #define ptep_set_bits(p, bits) pte_update(p, 0, bits)

unless you are generating page faults if the young bit is clear, this
will only slowdown compared to my simpler approch.

However if some arch is using page faults to set the young bit in
hardware (not in software), then slowing micro-down x86 and others might
be an option to share all the common code, but we could easily avoid
smp locking in x86 and alpha by threating the young-bit-page-fault archs
differently too. 

Would be nice to hear from the ia64 folks what they're doing w.r.t. to
the young bit, I think ia64 is the only one providing the young bit with
an hardware page fault.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:20               ` Andrea Arcangeli
@ 2004-05-25  4:39                 ` Linus Torvalds
  2004-05-25  4:44                   ` Linus Torvalds
  2004-05-25  4:50                   ` Andrea Arcangeli
  2004-05-25  4:43                 ` David Mosberger
  1 sibling, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25  4:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group



On Tue, 25 May 2004, Andrea Arcangeli wrote:

> On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Tue, 25 May 2004, Andrea Arcangeli wrote:
> > > 
> > > The below patch should fix it, the only problem is that it can screwup
> > > some arch that might use page-faults to keep track of the accessed bit,
> > 
> > Indeed. At least alpha does this - that's where this code came from. SO 
> > this will cause infinite page faults on alpha and any other "accessed bit 
> > in software" architectures.
> 
> as you say the alpha has no accessed bit at all in hardware, so
> it cannot generate page faults. 

It _does_ generate page faults.

We do the accessed bit by clearing the "user readable" thing (or
something. I forget the exact details, and I'm too lazy to check it out).  
And a page won't be _really_ readable until it has been marked young.

If you don't mark it young, you'll get infinite page faults.

That's how we do the accessed bit.

> "accessed bit in software" is fine with my fix.

NO IT IS NOT.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:37                 ` Andrea Arcangeli
@ 2004-05-25  4:40                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-25  4:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Andrew Morton, Linux Kernel list, Ingo Molnar,
	Ben LaHaise, linux-mm, Architectures Group

On Tue, 2004-05-25 at 14:37, Andrea Arcangeli wrote:
> On Tue, May 25, 2004 at 02:17:41PM +1000, Benjamin Herrenschmidt wrote:
> > on a present PTE (thus letting set_pte be non-atomic) and we can safely
> > BUG_ON(pte_present(*ptep)) in it, right ?
> 
> set_pte is used even to mark the pte non present, so you can forget
> about using BUG_ON(pte_present(*ptep)) anywhere in set_pte regardless of
> how we fix this race (see mm/objrmap.c:unmap_pte_page()). If you want to
> trap for it you should add a set_pte_present and use it at least in
> objrmap.c during the paging.

Isn't this the work of pte_clear ? It's quite important to be very
careful about such PTE manipulations on ppc & ppc64 or we can wreck
everything by losting the hash state bits in there.

> unless you are generating page faults if the young bit is clear, this
> will only slowdown compared to my simpler approch.
> 
> However if some arch is using page faults to set the young bit in
> hardware (not in software), then slowing micro-down x86 and others might
> be an option to share all the common code, but we could easily avoid
> smp locking in x86 and alpha by threating the young-bit-page-fault archs
> differently too. 
> 
> Would be nice to hear from the ia64 folks what they're doing w.r.t. to
> the young bit, I think ia64 is the only one providing the young bit with
> an hardware page fault.
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:20               ` Andrea Arcangeli
  2004-05-25  4:39                 ` Linus Torvalds
@ 2004-05-25  4:43                 ` David Mosberger
  2004-05-25  4:53                   ` Andrea Arcangeli
  1 sibling, 1 reply; 79+ messages in thread
From: David Mosberger @ 2004-05-25  4:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

>>>>> On Tue, 25 May 2004 06:20:54 +0200, Andrea Arcangeli <andrea@suse.de> said:

  Andrea> the only architecture that has the accessed bit in
  Andrea> _hardware_ via page faults I know is ia64, but I don't know
  Andrea> if it has a mode to set it without page faults

No, it doesn't.

  Andrea> and how it is implementing the accessed bit in linux.

If the "accessed" or "dirty" bits are zero, accessing/writing the
page will cause a fault which will be handled in a low-level
fault handler.  The Linux version of these handlers simply turn
on the respective bit.  See daccess_bit(), iaccess_bit(), and dirty_bit()
in arch/ia64/kernel/ivt.S.

Note: I'm on travel and haven't seen the context of this discussion
and don't expect to have time to think about this until I return on
Thursday.  So if you don't hear from me, it's not because I'm ignoring
you... ;-)

	--david

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:39                 ` Linus Torvalds
@ 2004-05-25  4:44                   ` Linus Torvalds
  2004-05-25  4:59                     ` Andrea Arcangeli
  2004-05-25  4:50                   ` Andrea Arcangeli
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25  4:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group



On Mon, 24 May 2004, Linus Torvalds wrote:
>
> We do the accessed bit by clearing the "user readable" thing (or
> something. I forget the exact details, and I'm too lazy to check it out).  

Yup. Lookie here:

	#define __ACCESS_BITS   (_PAGE_ACCESSED | _PAGE_KRE | _PAGE_URE)
	extern inline pte_t pte_mkold(pte_t pte)        { pte_val(pte) &= ~(__ACCESS_BITS); return pte; }

Notice how an "old" pte won't be readable. Then, when we take the page 
fault, we'll do

	extern inline pte_t pte_mkyoung(pte_t pte)      { pte_val(pte) |= __ACCESS_BITS; return pte; }

and now the pte is readable again.

In other words, we absolutely _have_ to do the "pte_mkyoung()" part in the
page fault, or an "old" pte will never become readable again (unless it's
accessed with a write rather than a read, which will then happen to make
it young again).

I'm not quite senile yet.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:39                 ` Linus Torvalds
  2004-05-25  4:44                   ` Linus Torvalds
@ 2004-05-25  4:50                   ` Andrea Arcangeli
  2004-05-25  4:59                     ` Linus Torvalds
  1 sibling, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  4:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group

On Mon, May 24, 2004 at 09:39:05PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 25 May 2004, Andrea Arcangeli wrote:
> 
> > On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote:
> > > 
> > > 
> > > On Tue, 25 May 2004, Andrea Arcangeli wrote:
> > > > 
> > > > The below patch should fix it, the only problem is that it can screwup
> > > > some arch that might use page-faults to keep track of the accessed bit,
> > > 
> > > Indeed. At least alpha does this - that's where this code came from. SO 
> > > this will cause infinite page faults on alpha and any other "accessed bit 
> > > in software" architectures.
> > 
> > as you say the alpha has no accessed bit at all in hardware, so
> > it cannot generate page faults. 
> 
> It _does_ generate page faults.
> 
> We do the accessed bit by clearing the "user readable" thing (or
> something. I forget the exact details, and I'm too lazy to check it out).  
> And a page won't be _really_ readable until it has been marked young.
> 
> If you don't mark it young, you'll get infinite page faults.
> 
> That's how we do the accessed bit.

no, that's not what I remeber from alpha, alpha always sets the young
bit as soon as it sets the pte from non-null to something. No matter
what kind of pte is that (readonly/writeonly/whatever). Then this bit is
cleared by the VM but nothing ever happens again. The accessed bit is
quite useless infact.

> > "accessed bit in software" is fine with my fix.
> 
> NO IT IS NOT.

It definitely is, there's no way the architecture can loop on a bit that
doesn't exist as far as the hardware is concerned.

see the code:

/* .. and these are ours ... */
#define _PAGE_DIRTY	0x20000
#define _PAGE_ACCESSED	0x40000
#define _PAGE_FILE	0x80000	/* set:pagecache, unset:swap */

extern inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
extern inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= __ACCESS_BITS; return pte; }

So alpha will work fine, there's absolutely no way pte_mkyoung can avoid
the next page fault, period.

furthermore there is no way to keep the accessed bit coherent without
hardware support, I tried that while I was working on the alpha some
year ago but there's no way. The accessed bit in software is only useful
to give a second chance to pages paged in recently (if useful at all).

There are two ways to do the hardware support, one is what x86 does,
that is to set the bit without a page fault that enters kernel. The
other way is an option that ia64 has to generate even the page fault.

My fix works fine for x86* and alpha. It may not work for ia64 if it's
using hardware page faults to set the young bit. comments on this detail
from the ia64 developers would be welcome.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:43                 ` David Mosberger
@ 2004-05-25  4:53                   ` Andrea Arcangeli
  2004-05-27 21:56                     ` David Mosberger
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  4:53 UTC (permalink / raw)
  To: davidm
  Cc: Linus Torvalds, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

On Mon, May 24, 2004 at 09:43:00PM -0700, David Mosberger wrote:
> >>>>> On Tue, 25 May 2004 06:20:54 +0200, Andrea Arcangeli <andrea@suse.de> said:
> 
>   Andrea> the only architecture that has the accessed bit in
>   Andrea> _hardware_ via page faults I know is ia64, but I don't know
>   Andrea> if it has a mode to set it without page faults
> 
> No, it doesn't.
> 
>   Andrea> and how it is implementing the accessed bit in linux.
> 
> If the "accessed" or "dirty" bits are zero, accessing/writing the
> page will cause a fault which will be handled in a low-level
> fault handler.  The Linux version of these handlers simply turn
> on the respective bit.  See daccess_bit(), iaccess_bit(), and dirty_bit()
> in arch/ia64/kernel/ivt.S.

so you mean, this is being set in the arch section before ever reaching
handle_mm_fault? in such case my fix should work fine for ia64 too.

> Note: I'm on travel and haven't seen the context of this discussion
> and don't expect to have time to think about this until I return on
> Thursday.  So if you don't hear from me, it's not because I'm ignoring
> you... ;-)

take your time ;) thanks a lot for the above hints about those ivt.S
functions (though I don't speak ia64 asm very well ;)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:50                   ` Andrea Arcangeli
@ 2004-05-25  4:59                     ` Linus Torvalds
  0 siblings, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25  4:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group



On Tue, 25 May 2004, Andrea Arcangeli wrote:
> 
> no, that's not what I remeber from alpha, alpha always sets the young
> bit as soon as it sets the pte from non-null to something.

Yes.

However, whtn the page is _aged_ later, the young bits will be cleared.

And _that_ is when you will now start getting infinite page faults, 
because with your patch the young bits will never be set again on a normal 
read.

See what I'm saying?

Your patch literally leaves the page table alone on pure reads. Which is 
not acceptable, since the page being marked old was what caused the 
read-fault in the first place.

But yes, pages will be young by default, so you won't ever actually _see_ 
this behaviour until you start having memory pressure and VM reclaim 
starts trying to age the things.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:44                   ` Linus Torvalds
@ 2004-05-25  4:59                     ` Andrea Arcangeli
  2004-05-25  5:09                       ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  4:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group

On Mon, May 24, 2004 at 09:44:08PM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 24 May 2004, Linus Torvalds wrote:
> >
> > We do the accessed bit by clearing the "user readable" thing (or
> > something. I forget the exact details, and I'm too lazy to check it out).  
> 
> Yup. Lookie here:
> 
> 	#define __ACCESS_BITS   (_PAGE_ACCESSED | _PAGE_KRE | _PAGE_URE)
> 	extern inline pte_t pte_mkold(pte_t pte)        { pte_val(pte) &= ~(__ACCESS_BITS); return pte; }
> 
> Notice how an "old" pte won't be readable. Then, when we take the page 
> fault, we'll do
> 
> 	extern inline pte_t pte_mkyoung(pte_t pte)      { pte_val(pte) |= __ACCESS_BITS; return pte; }
> 
> and now the pte is readable again.
> 
> In other words, we absolutely _have_ to do the "pte_mkyoung()" part in the
> page fault, or an "old" pte will never become readable again (unless it's
> accessed with a write rather than a read, which will then happen to make
> it young again).
> 
> I'm not quite senile yet.

I see, sorry I was wrong. I misread this code a long time ago and I
noticed how the young bit works only now. Infact I always wondered if
the young bit was useful at all. So it was possible to emulate it after
all. However I wonder what happens for PROT_WRITE? How can you make a
mapping only writeable if the mk_young marks it readable? That's why I
misread it without even imagining it was setting the readable bit at the
same time of the young bit.

so while ia64 may not even need to set the young bit, alpha needs it.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:59                     ` Andrea Arcangeli
@ 2004-05-25  5:09                       ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25  5:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list,
	Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group

> all. However I wonder what happens for PROT_WRITE? How can you make a

I understood now how it works with PROT_WRITE too, it's not FOR but URE
being tweaked together with ACCESSED. This has been a very big misread I
did when I was doing alpha stuff some year ago. that's why I was so
confident it was only setting it during the first page fault and never
clearing it again. Sounds good that it can be emulated fully, I thought
it wasn't even feasible at all.

thanks a lot for pointing out this huge mistake.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:00             ` Linus Torvalds
  2004-05-25  4:17               ` Benjamin Herrenschmidt
  2004-05-25  4:20               ` Andrea Arcangeli
@ 2004-05-25 11:44               ` Matthew Wilcox
  2004-05-25 14:48                 ` Linus Torvalds
  2 siblings, 1 reply; 79+ messages in thread
From: Matthew Wilcox @ 2004-05-25 11:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote:
> I suspect we should just make a "ptep_set_bits()" inline function that 
> _atomically_ does "set the dirty/accessed bits". On x86, it would be a 
> simple
> 
> 		asm("lock ; orl %1,%0"
> 			:"m" (*ptep)
> 			:"r" (entry));
> 
> and similarly on most other architectures it should be quite easy to do 
> the equivalent. You can always do it with a simple compare-and-exchange 
> loop, something any SMP-capable architecture should have.

... but PA doesn't.  Just load-and-clear-word (and its 64-bit equivalent
in 64-bit mode).  And that word has to be 16-byte aligned.  What race
are we protecting against?  If it's like xchg() and we only need to
protect against a racing xchg() and not a reader, we can just reuse the
global array of hashed spinlocks we have for that.

> Of course, arguably we can actually optimize this by "knowing" that it is
> safe to set the dirty bit, so then we don't even need an atomic operation,
> we just need one atomic write.  So we only actually need the atomic op for 
> the accessed bit case, and if we make the write-case be totally separate..

Ah, atomic writes we can do.  That's easy.  I think all Linux architectures
support atomic writes to naturally aligned addresses, don't they?

> Anybody willing to write up a patch for a few architectures? Is there any 
> architecture out there that would have a problem with this?
> 
> 		Linus

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 11:44               ` Matthew Wilcox
@ 2004-05-25 14:48                 ` Linus Torvalds
  2004-05-25 15:35                   ` Keith M Wesolowski
  2004-05-25 21:27                   ` Andrea Arcangeli
  0 siblings, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 14:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrea Arcangeli, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group



On Tue, 25 May 2004, Matthew Wilcox wrote:

> On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote:
> > I suspect we should just make a "ptep_set_bits()" inline function that 
> > _atomically_ does "set the dirty/accessed bits". On x86, it would be a 
> > simple
> > 
> > 		asm("lock ; orl %1,%0"
> > 			:"m" (*ptep)
> > 			:"r" (entry));
> > 
> > and similarly on most other architectures it should be quite easy to do 
> > the equivalent. You can always do it with a simple compare-and-exchange 
> > loop, something any SMP-capable architecture should have.
> 
> ... but PA doesn't.  Just load-and-clear-word (and its 64-bit equivalent
> in 64-bit mode).  And that word has to be 16-byte aligned.

Wow. And this architecture claims to support SMP? 

> What race are we protecting against?  If it's like xchg() and we only
> need to protect against a racing xchg() and not a reader, we can just
> reuse the global array of hashed spinlocks we have for that.

The race is:
 - one CPU sets the dirty bit (possibly with a hardware walker, but I 
   guess on PA it's probably done in sw)
 - the other CPU sets the accessed bit in sw as part of the 
   "handle_pte_fault()" processing.

Right now we set the accessed bit with a simple "ptep_establish()", which 
will use "set_pte()", which is just a regular write. So setting the 
accessed bit will basically be a nonatomic sequence of

 - read pte entry
 - entry = pte_mkyoung(entry)
 - set_pte(entry)

which is all done under the mm->page_table_lock, but which does NOT 
protect against any hardware page-table walkers or any asynchronous sw 
walkers (if anybody does them).

Basically, the suggestion is to replace the "set_pte()" with something 
that is safe against anything else that updates the page tables (whether 
software or hardware). If only core kernel code does that, then you should 
already be fine, since the page-table spinlock should already be held by 
all updaters.

NOTE! One really easy approach would be to say that we never mix software 
updates of the accessed bit with hw updates, and just have a rule that if 
the architecture does accessed-bit updates in hardware (and can thus race 
with us doing them in software _despite_ the fact that we hold the page 
table lock), then we just don't do the update at all. 

We'd just pass in a flag to "ptep_establish()" to tell it whether we
changed the dirty bit or not. It would be "write_access" in
handle_pte_fault(), and 1 in the other two cases.

> Ah, atomic writes we can do.  That's easy.  I think all Linux architectures
> support atomic writes to naturally aligned addresses, don't they?

Yes. You'd really have to work at it _not_ to support them ;)

However, the atomic write case only helps in the case when we update _all_ 
the bits that hw walkers can update, 

			Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 14:48                 ` Linus Torvalds
@ 2004-05-25 15:35                   ` Keith M Wesolowski
  2004-05-25 16:19                     ` Linus Torvalds
  2004-05-25 21:27                   ` Andrea Arcangeli
  1 sibling, 1 reply; 79+ messages in thread
From: Keith M Wesolowski @ 2004-05-25 15:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrea Arcangeli, Benjamin Herrenschmidt,
	Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise,
	linux-mm, Architectures Group

On Tue, May 25, 2004 at 07:48:24AM -0700, Linus Torvalds wrote:

> > > the equivalent. You can always do it with a simple compare-and-exchange 
> > > loop, something any SMP-capable architecture should have.
> 
> The race is:
>  - one CPU sets the dirty bit (possibly with a hardware walker, but I 
>    guess on PA it's probably done in sw)
>  - the other CPU sets the accessed bit in sw as part of the 
>    "handle_pte_fault()" processing.
> 
> Right now we set the accessed bit with a simple "ptep_establish()", which 
> will use "set_pte()", which is just a regular write. So setting the 
> accessed bit will basically be a nonatomic sequence of
> 
>  - read pte entry
>  - entry = pte_mkyoung(entry)
>  - set_pte(entry)
> 
> which is all done under the mm->page_table_lock, but which does NOT 
> protect against any hardware page-table walkers or any asynchronous sw 
> walkers (if anybody does them).

Some sparc32 CPUs are also vulnerable to this race; in fact the
supersparc manual describes it specifically and even outlines the
compare-exchange loop using our rotten swap instruction.  In our case,
the race is with a hardware walker.

-- 
Keith M Wesolowski

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 15:35                   ` Keith M Wesolowski
@ 2004-05-25 16:19                     ` Linus Torvalds
  2004-05-25 17:25                       ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 16:19 UTC (permalink / raw)
  To: Keith M Wesolowski
  Cc: Matthew Wilcox, Andrea Arcangeli, Benjamin Herrenschmidt,
	Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise,
	linux-mm, Architectures Group



On Tue, 25 May 2004, Keith M Wesolowski wrote:
> 
> Some sparc32 CPUs are also vulnerable to this race; in fact the
> supersparc manual describes it specifically and even outlines the
> compare-exchange loop using our rotten swap instruction.  In our case,
> the race is with a hardware walker.

Yes, but the sparc32 page tables are not the same as the linux kernel page 
tables, so in your case it's a different path and a different page table. 
Only the shared case really matters (ie things that do hw/microcode walk 
of a page table _tree_ not a hash).

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 16:19                     ` Linus Torvalds
@ 2004-05-25 17:25                       ` David S. Miller
  2004-05-25 17:49                         ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2004-05-25 17:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch

On Tue, 25 May 2004 09:19:52 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> On Tue, 25 May 2004, Keith M Wesolowski wrote:
> > 
> > Some sparc32 CPUs are also vulnerable to this race; in fact the
> > supersparc manual describes it specifically and even outlines the
> > compare-exchange loop using our rotten swap instruction.  In our case,
> > the race is with a hardware walker.
> 
> Yes, but the sparc32 page tables are not the same as the linux kernel page 
> tables, so in your case it's a different path and a different page table. 
> Only the shared case really matters (ie things that do hw/microcode walk 
> of a page table _tree_ not a hash).

Not true on 32-bit Sparc sun4m systems, it's exactly like i386 except
the hardware is stupid and we only have an atomic swap instruction.
:-)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 17:25                       ` David S. Miller
@ 2004-05-25 17:49                         ` Linus Torvalds
  2004-05-25 17:54                           ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 17:49 UTC (permalink / raw)
  To: David S. Miller
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch



On Tue, 25 May 2004, David S. Miller wrote:
> 
> Not true on 32-bit Sparc sun4m systems, it's exactly like i386 except
> the hardware is stupid and we only have an atomic swap instruction.
> :-)

Ouch.

Ok. My second suggestion ("don't bother with accessed bit on hw that does 
it on its own") should work fine there too, I guess.

So what I can tell, the fix is really something like this (this does both 
x86 and ppc64 just to show how two different approaches would handle it, 
but I have literally _tested_ neither).

What do people think?

Ben, does this fix your problem? Does my ppc64 assembly code look sane? 
Shamelessly stolen from the function above it, and it seems to compile ;)

		Linus

-----
===== include/asm-generic/pgtable.h 1.3 vs edited =====
--- 1.3/include/asm-generic/pgtable.h	Sun Jan 18 22:35:59 2004
+++ edited/include/asm-generic/pgtable.h	Tue May 25 10:39:38 2004
@@ -10,9 +10,9 @@
  *
  * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
  */
-#define ptep_establish(__vma, __address, __ptep, __entry)		\
+#define ptep_establish(__vma, __address, __ptep, __entry, __dirty)	\
 do {									\
-	set_pte(__ptep, __entry);					\
+	ptep_update_dirty_accessed(__ptep, __entry, __dirty);		\
 	flush_tlb_page(__vma, __address);				\
 } while (0)
 #endif
===== include/asm-i386/pgtable.h 1.44 vs edited =====
--- 1.44/include/asm-i386/pgtable.h	Sat May 22 14:56:24 2004
+++ edited/include/asm-i386/pgtable.h	Tue May 25 10:39:56 2004
@@ -225,6 +225,12 @@
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
 
+static inline void ptep_update_dirty_accessed(pte_t *ptep, pte_t entry, int dirty)
+{
+	if (dirty)
+		set_pte(ptep, entry);
+}
+
 /*
  * Macro to mark a page protection value as "uncacheable".  On processors which do not support
  * it, this is a no-op.
===== include/asm-ppc64/pgtable.h 1.33 vs edited =====
--- 1.33/include/asm-ppc64/pgtable.h	Sat May 22 14:56:24 2004
+++ edited/include/asm-ppc64/pgtable.h	Tue May 25 10:45:58 2004
@@ -306,6 +306,21 @@
 	return old;
 }
 
+static inline void ptep_update_dirty_accessed(pte_t *p, pte_t entry, int dirty)
+{
+	unsigned long old, tmp;
+
+	__asm__ __volatile__(
+	"1:	ldarx   %0,0,%3\n\
+	or	%1,%0,%4 \n\
+	stdcx.	%1,0,%3 \n\
+	bne-	1b"
+	: "=&r" (old), "=&r" (tmp), "=m" (*p)
+	: "r" (p), "r" (pte_val(entry)), "m" (*p)
+	: "cc" );
+}
+
+
 /* PTE updating functions */
 extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot);
 
===== mm/memory.c 1.176 vs edited =====
--- 1.176/mm/memory.c	Sat May 22 14:56:30 2004
+++ edited/mm/memory.c	Tue May 25 10:37:19 2004
@@ -1004,7 +1004,7 @@
 	flush_cache_page(vma, address);
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
-	ptep_establish(vma, address, page_table, entry);
+	ptep_establish(vma, address, page_table, entry, 1);
 	update_mmu_cache(vma, address, entry);
 }
 
@@ -1056,7 +1056,7 @@
 			flush_cache_page(vma, address);
 			entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
 					      vma);
-			ptep_establish(vma, address, page_table, entry);
+			ptep_establish(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1646,7 +1646,7 @@
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry);
+	ptep_establish(vma, address, pte, entry, write_access);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 17:49                         ` Linus Torvalds
@ 2004-05-25 17:54                           ` David S. Miller
  2004-05-25 18:05                             ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2004-05-25 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch

On Tue, 25 May 2004 10:49:21 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> So what I can tell, the fix is really something like this (this does both 
> x86 and ppc64 just to show how two different approaches would handle it, 
> but I have literally _tested_ neither).
> 
> What do people think?

So on sparc32 sun4m we'd implement ptep_update_dirty_accessed() with
some kind of loop using the swap instruction?  That's in fact what
I've always wanted, someway to easily integrate the usage of such
a loop so that we could handle this problem on such systems.

Keith?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 17:54                           ` David S. Miller
@ 2004-05-25 18:05                             ` Linus Torvalds
  2004-05-25 20:30                               ` Linus Torvalds
                                                 ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 18:05 UTC (permalink / raw)
  To: David S. Miller
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch



On Tue, 25 May 2004, David S. Miller wrote:
> On Tue, 25 May 2004 10:49:21 -0700 (PDT)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > So what I can tell, the fix is really something like this (this does both 
> > x86 and ppc64 just to show how two different approaches would handle it, 
> > but I have literally _tested_ neither).
> > 
> > What do people think?
> 
> So on sparc32 sun4m we'd implement ptep_update_dirty_accessed() with
> some kind of loop using the swap instruction?

Yes. Except that if everybody else uses atomic updates (including the hw 
walkers), _and_ "dirty" is true, then you can optimize that case to just 
to an atomic write (since we don't care what the previous contents were, 
and everybody else is guaranteed to honor the fact that we set all the 
bits.

(And an independent optimization is obviously to not do the store at all
if it is already has the new value, although that _should_ be the rare 
case, since if that was true I don't see why you got a page fault in the 
first place).

So _if_ such an atomic loop is fundamentally expensive for some reason, it 
should be perfectly ok to do

	if (dirty) {
		one atomic write with all the bits set;
	} else {
		cmpxchg until successful;
	}

Oh - btw - my suggested patch was totally broken for ppc64, because that 
"ptep_update_dirty_accessed()" thing obviously also needs to that damn 
hpte_update() crud etc. 

BenH - I'm leaving that ppc64 code to somebody knows what the hell he is
doing. Ie you or Anton or something. Ok? I can act as a collector the
different architecture things for that "ptep_update_dirty_accessed()"
function.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 18:05                             ` Linus Torvalds
@ 2004-05-25 20:30                               ` Linus Torvalds
  2004-05-25 20:35                               ` David S. Miller
  2004-05-25 21:40                               ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 20:30 UTC (permalink / raw)
  To: David S. Miller
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch



On Tue, 25 May 2004, Linus Torvalds wrote:
> 
> BenH - I'm leaving that ppc64 code to somebody knows what the hell he is
> doing. Ie you or Anton or something. Ok? I can act as a collector the
> different architecture things for that "ptep_update_dirty_accessed()"
> function.

Following up to myself.

I just committed a couple of trivial changesets that allows any 
architecture to re-define its own "ptep_update_dirty_accessed()" method.

The default one (if none is defined by the architecture) is just

	#ifndef ptep_update_dirty_accessed
	#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry)
	#endif

ie no change in behaviour. As an example of an alternate strategy, this is 
the one I committed for x86:

	#define ptep_update_dirty_accessed(__ptep, __entry, __dirty)	\
		do {							\
			if (__dirty) set_pte(__ptep, __entry);		\
		} while (0)

which is valid if the architecture updates its own accessed bits.

I just realized that for x86 the _clever_ way of doing this (for highmem
machines) is actually to only update the low word, which makes for much
better code for the PAE case (and still does exactle the same for the
non-PAE case):

	#define ptep_update_dirty_accessed(__ptep, __entry, __dirty)		\
		do {								\
			if (__dirty) (__ptep)->pte_low = (__entry).pte_low;	\
		} while (0)

but I haven't actually tested this.

Anybody willing to test the x86 PAE optimization?

In the meantime, other architectures can now fix their dirty/accessed bit
setting any way they damn well please.

			Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 18:05                             ` Linus Torvalds
  2004-05-25 20:30                               ` Linus Torvalds
@ 2004-05-25 20:35                               ` David S. Miller
  2004-05-25 20:49                                 ` Linus Torvalds
  2004-05-25 21:40                               ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2004-05-25 20:35 UTC (permalink / raw)
  To: Linus Torvalds, wesolows
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch

On Tue, 25 May 2004 11:05:09 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> On Tue, 25 May 2004, David S. Miller wrote:
> > So on sparc32 sun4m we'd implement ptep_update_dirty_accessed() with
> > some kind of loop using the swap instruction?
> 
> Yes. Except that if everybody else uses atomic updates (including the hw 
> walkers), _and_ "dirty" is true, then you can optimize that case to just 
> to an atomic write (since we don't care what the previous contents were, 
> and everybody else is guaranteed to honor the fact that we set all the 
> bits.
> 
> (And an independent optimization is obviously to not do the store at all
> if it is already has the new value, although that _should_ be the rare 
> case, since if that was true I don't see why you got a page fault in the 
> first place).
> 
> So _if_ such an atomic loop is fundamentally expensive for some reason, it 
> should be perfectly ok to do
> 
> 	if (dirty) {
> 		one atomic write with all the bits set;
> 	} else {
> 		cmpxchg until successful;
> 	}

Hmmm, do you understand how broken the sparc hardware is? :-)

Seriously, the issue is that the MMU writes back access/dirty bits
asynchronously, does not do a relookup when it writes these bits back
into the PTE (like x86 and others do) it actually stores away the PTE
physical address and writes into the PTE using that, and finally as
previously mentioned we lack a cmpxchg we only have raw SWAP.

Keith W. can verify this, he has my old SuperSPARC manual which explains
all of this stuff.  Keith you might want to quote that little "atomic PTE
update sequence" piece of code that's in the manual for Linus.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 20:35                               ` David S. Miller
@ 2004-05-25 20:49                                 ` Linus Torvalds
  2004-05-25 20:57                                   ` David S. Miller
  2004-05-26  6:20                                   ` Keith M Wesolowski
  0 siblings, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 20:49 UTC (permalink / raw)
  To: David S. Miller
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch



On Tue, 25 May 2004, David S. Miller wrote:
> 
> Hmmm, do you understand how broken the sparc hardware is? :-)

I seem to always repress that part.

> Seriously, the issue is that the MMU writes back access/dirty bits
> asynchronously, does not do a relookup when it writes these bits back
> into the PTE (like x86 and others do) it actually stores away the PTE
> physical address and writes into the PTE using that, and finally as
> previously mentioned we lack a cmpxchg we only have raw SWAP.

Ok. Still, that doesn't sound too bad. I assume that the pte write has to 
be atomic anyway (ie it doesn't walk the page tables, but it clearly _has_ 
to do an atomic "read-modify-update" or just the _hardware_ updates might 
race against each other and one CPU loses the dirty bit when another CPU 
writes back the accessed bit).

So I really think it should be safe to just do a regular write when you 
update both bits, because you know that no other CPU will at least _clear_ 
any bits. Hmm?

So it sounds like the SWAP loop basically ends up being just something 
like

	val = pte_value(entry);
	for (;;) {
		oldval = SWAP(ptep, val);
		/*
		 * If we wrote a value that had the same or more bits set 
		 * than the old value, we're ok...
		 */
		if (!(oldval & ~val))
			break;
		/*
		 * ..otherwise we need to write the "or" of all bits and
		 * try again.
		 */
		val |= oldval;
	}

Right? Note that the reason we can do the "dirty and accessed bit both 
set" case with a simple write is exactly because that's already the 
"maximal" bits anybody can write to the thing, so we don't need to loop, 
we can just write it directly.

I realize that this isn't as simple as the x86 solution (or most
everything else, for that matter ;), but it's by no means totally
unreasonable. Or are there even _more_ gotchas that I've missed?

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 20:49                                 ` Linus Torvalds
@ 2004-05-25 20:57                                   ` David S. Miller
  2004-05-26  6:20                                   ` Keith M Wesolowski
  1 sibling, 0 replies; 79+ messages in thread
From: David S. Miller @ 2004-05-25 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl,
	linux-mm, linux-arch

On Tue, 25 May 2004 13:49:57 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> So it sounds like the SWAP loop basically ends up being just something 
> like
 ...
> Right? Note that the reason we can do the "dirty and accessed bit both 
> set" case with a simple write is exactly because that's already the 
> "maximal" bits anybody can write to the thing, so we don't need to loop, 
> we can just write it directly.

I believe so.  So it's still possible for the mmu to write something
with less bits, ie. say we're adding dirty, we write dirty+accessed but the
TLB has the pre-dirty PTE and writes that with just the accessed bit set.

And this is exactly what you code is trying to handle.  Perfect.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 14:48                 ` Linus Torvalds
  2004-05-25 15:35                   ` Keith M Wesolowski
@ 2004-05-25 21:27                   ` Andrea Arcangeli
  2004-05-25 21:43                     ` Linus Torvalds
  2004-05-25 21:44                     ` Andrea Arcangeli
  1 sibling, 2 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25 21:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

On Tue, May 25, 2004 at 07:48:24AM -0700, Linus Torvalds wrote:
> We'd just pass in a flag to "ptep_establish()" to tell it whether we
> changed the dirty bit or not. It would be "write_access" in
> handle_pte_fault(), and 1 in the other two cases.

I want to show you another approch. This effectively make the end of
handle_pte_fault a noop for x86 and x86-64 and it implements two
operations ptep_handle_[young|dirty]_page_fault(ptep) that allows
architectures with a page-faulting dirty and/or young bits to resolve the
page fault inside their own architecture (since the alpha architecture
generates a page fault for both, there's not even need of atomic
instructions anywhere). The only brainer case is an architecture
generating a page fault for the young bit and not for the dirty bit. In
such case they would need to use something like this:

			entry = ptep_get_and_clear(pte);
			set_pte(pte, pte_modify(entry, newprot));

Again no atomic instructions. Though with an atomic instruction 
like a set_mask64(__DIRTY_BITS, &ptep->pte) such an architecture may run
faster than the above, but the above is guaranteed to work as far as
mprotect works too.

I believe using ptep_establish in handle_mm_fault makes little sense,
to make the most obvious example there's no need of flushing the tlb in
handle_mm_fault to resolve young or dirty page faults. Not a big deal
for x86 and x86-64 that reaches that path only if a race happens, but on
alpha we shouldn't flush the tlb. If some weird architecture really need
to flush the tlb they still can inside
ptep_handle_[young|dirty]_page_fault.

As for set_pte, there's quite a lot of legitimate stuff using it it on
present ptes, even not dirty ones (see zap_pte_* too).

The last issue is ptep_establish, we're flushing the pte in do_wp_page
inside ptep_establish again for no good reason. Those suprious tlb
flushes may even trigger IPIs (this time in x86 smp too even with
processes), so I'd really like to remove the explicit flush in
do_wp_page, however this will likely break s390 but I don't understand
s390 so I'll leave it broken for now (at least to show you this
alternative and to hear comments if it's as broken as the previous one).

Really my basic point for going this direction is that it makes little
sense to me to use a tlb-flushing ptep_establish in both handle_mm_fault
and do_wp_page, where no tlb flush is required in x86, x86-64 and alpha,
so I do see a window for improving performance. Yeah, I'm assuming all
the cpus are smart enough to automatically re-read the pte from memory
if the information they have in the tlb asks for a page-fault. I also
added a trap for non-dirty pte in ptep_establish but it's quite
superflous since there's only 1 ptep_establish caller left.

patch works for me on x86 and it should work on alpha too this time ;).

The really scary thing about this patch is the s390 ptep_establish.

Index: linux-2.5/include/asm-alpha/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-alpha/pgtable.h,v
retrieving revision 1.22
diff -u -p -r1.22 pgtable.h
--- linux-2.5/include/asm-alpha/pgtable.h	23 May 2004 05:02:25 -0000	1.22
+++ linux-2.5/include/asm-alpha/pgtable.h	25 May 2004 20:34:11 -0000
@@ -267,6 +267,28 @@ extern inline pte_t pte_mkexec(pte_t pte
 extern inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= __DIRTY_BITS; return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= __ACCESS_BITS; return pte; }
 
+static inline void ptep_handle_young_page_fault(pte_t *ptep)
+{
+	/*
+	 * WARNING: this is safe only because the dirty bit
+	 * cannot change trasparently in hardware in the pte.
+	 * If the dirty bit in the pte would be set trasparently
+	 * by the CPU this should be implemented with set_bit()
+	 * or with a new set_mask64() implementing an atomic "or".
+	 */
+	set_pte(ptep, pte_mkyoung(*ptep));
+}
+
+static inline void ptep_handle_dirty_page_fault(pte_t *ptep)
+{
+	/*
+	 * This can run without atomic instructions even if
+	 * the young bit is set in hardware by the architecture.
+	 * Losing the young bit is not important.
+	 */
+	set_pte(ptep, pte_mkdirty(*ptep));
+}
+
 #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address))
 
 /* to find an entry in a kernel page-table-directory */
Index: linux-2.5/include/asm-generic/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v
retrieving revision 1.4
diff -u -p -r1.4 pgtable.h
--- linux-2.5/include/asm-generic/pgtable.h	19 Jan 2004 18:43:03 -0000	1.4
+++ linux-2.5/include/asm-generic/pgtable.h	25 May 2004 20:38:39 -0000
@@ -12,6 +12,7 @@
  */
 #define ptep_establish(__vma, __address, __ptep, __entry)		\
 do {									\
+	BUG_ON(!pte_dirty(__entry));					\
 	set_pte(__ptep, __entry);					\
 	flush_tlb_page(__vma, __address);				\
 } while (0)
Index: linux-2.5/include/asm-i386/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable.h,v
retrieving revision 1.42
diff -u -p -r1.42 pgtable.h
--- linux-2.5/include/asm-i386/pgtable.h	23 May 2004 05:02:25 -0000	1.42
+++ linux-2.5/include/asm-i386/pgtable.h	25 May 2004 20:27:00 -0000
@@ -220,7 +220,19 @@ static inline pte_t pte_mkdirty(pte_t pt
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 
+/*
+ * This is used only to set the dirty bit during a "dirty" page fault.
+ * Nothing to do on x86 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_dirty_page_fault(ptep) do { } while (0)
 static inline  int ptep_test_and_clear_dirty(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
+/*
+ * This is used only to set the young bit during a "young" page fault.
+ * Nothing to do on x86 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_young_page_fault(ptep) do { } while (0)
 static inline  int ptep_test_and_clear_young(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); }
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
Index: linux-2.5/include/asm-x86_64/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-x86_64/pgtable.h,v
retrieving revision 1.27
diff -u -p -r1.27 pgtable.h
--- linux-2.5/include/asm-x86_64/pgtable.h	23 May 2004 05:02:25 -0000	1.27
+++ linux-2.5/include/asm-x86_64/pgtable.h	25 May 2004 20:27:28 -0000
@@ -262,7 +262,19 @@ extern inline pte_t pte_mkexec(pte_t pte
 extern inline pte_t pte_mkdirty(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
 extern inline pte_t pte_mkwrite(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
+/*
+ * This is used only to set the dirty bit during a "dirty" page fault.
+ * Nothing to do on x86-64 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_dirty_page_fault(ptep) do { } while (0)
 static inline  int ptep_test_and_clear_dirty(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); }
+/*
+ * This is used only to set the young bit during a "young" page fault.
+ * Nothing to do on x86-64 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_young_page_fault(ptep) do { } while(0)
 static inline  int ptep_test_and_clear_young(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); }
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, ptep); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, ptep); }
Index: linux-2.5/mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.169
diff -u -p -r1.169 memory.c
--- linux-2.5/mm/memory.c	23 May 2004 05:10:11 -0000	1.169
+++ linux-2.5/mm/memory.c	25 May 2004 20:28:23 -0000
@@ -1643,10 +1643,9 @@ static inline int handle_pte_fault(struc
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
 
-		entry = pte_mkdirty(entry);
+		ptep_handle_dirty_page_fault(pte);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry);
+	ptep_handle_young_page_fault(pte);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);
Signed-off-by: Andrea Arcangeli <andrea@suse.de>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 18:05                             ` Linus Torvalds
  2004-05-25 20:30                               ` Linus Torvalds
  2004-05-25 20:35                               ` David S. Miller
@ 2004-05-25 21:40                               ` Benjamin Herrenschmidt
  2004-05-25 21:54                                 ` Linus Torvalds
  2004-05-25 22:09                                 ` Linus Torvalds
  2 siblings, 2 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-25 21:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list


> Oh - btw - my suggested patch was totally broken for ppc64, because that 
> "ptep_update_dirty_accessed()" thing obviously also needs to that damn 
> hpte_update() crud etc. 
> 
> BenH - I'm leaving that ppc64 code to somebody knows what the hell he is
> doing. Ie you or Anton or something. Ok? I can act as a collector the
> different architecture things for that "ptep_update_dirty_accessed()"
> function.

Well, just setting one of those 2 bits doesn't require a hash table
invalidate as long as nothing else changes.

We do dirty by mapping r/o in the hash table, and accessed on hash
faults (our clear_young triggers a flush). So just setting those bits
in the linux PTE without touching the hash table is fine, we'll just
possibly take an extra fault on the next write or access, but that
might not be much slower than going to the hash update the permissions
directly

The original problem I have with set_pte is that our current
implementation of set_pte will overwrite the entire PTE, possibly losing
the bits that indicate that there is a copy in the hash and its index in
the hash bucket. So if set_pte is called on a PTE that is present, we
must flush it properly first as we will lose track of the hash one when
overriding. hpte_update() will simply add the old PTE to a batch which
is then flushed by either the mmu gather batch end, or by a call to
flush_tlb_*  

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:27                   ` Andrea Arcangeli
@ 2004-05-25 21:43                     ` Linus Torvalds
  2004-05-25 21:55                       ` Andrea Arcangeli
  2004-05-25 21:44                     ` Andrea Arcangeli
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 21:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group



On Tue, 25 May 2004, Andrea Arcangeli wrote:
> 
> 			entry = ptep_get_and_clear(pte);
> 			set_pte(pte, pte_modify(entry, newprot));
> 
> Again no atomic instructions.

Well, actually, that "ptep_get_and_clear()" is actually an atomic 
instruction. Or at least it had _better_ be.

> I believe using ptep_establish in handle_mm_fault makes little sense,
> to make the most obvious example there's no need of flushing the tlb in
> handle_mm_fault to resolve young or dirty page faults. Not a big deal
> for x86 and x86-64 that reaches that path only if a race happens, but on
> alpha we shouldn't flush the tlb. If some weird architecture really need
> to flush the tlb they still can inside
> ptep_handle_[young|dirty]_page_fault.

Actually, especially on alpha we _do_ need to flush the TLB.

Think about it: the TLB clearly contains the right virt->physical mapping, 
but the TLB contains the wrong "control bits". 

If we don't flush the TLB, the TLB will _continue_ to contain the wrong 
control bits.

And as you saw earlier, if those bits are wrong, you get some really nasty 
behaviour, like infinite page faults. If the TLB still says that the page 
is non-readable, even though _memory_ says it is readable, it doesn't much 
matter that we updated the page tables correctly in memory. The CPU will 
use the TLB.

So that TLB flush actually _is_ fundamental. 

Arguably we could remove it from x86. On the other hand, arguably it 
doesn't actually matter on x86, so..

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:27                   ` Andrea Arcangeli
  2004-05-25 21:43                     ` Linus Torvalds
@ 2004-05-25 21:44                     ` Andrea Arcangeli
  1 sibling, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

in the previous email I attached a slightly older version of the patch
that didn't optimize the spurious tlb flush away from do_wp_page yet,
this is the latest version I tested:

Index: linux-2.5/include/asm-alpha/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-alpha/pgtable.h,v
retrieving revision 1.22
diff -u -p -r1.22 pgtable.h
--- linux-2.5/include/asm-alpha/pgtable.h	23 May 2004 05:02:25 -0000	1.22
+++ linux-2.5/include/asm-alpha/pgtable.h	25 May 2004 20:34:11 -0000
@@ -267,6 +267,28 @@ extern inline pte_t pte_mkexec(pte_t pte
 extern inline pte_t pte_mkdirty(pte_t pte)	{ pte_val(pte) |= __DIRTY_BITS; return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ pte_val(pte) |= __ACCESS_BITS; return pte; }
 
+static inline void ptep_handle_young_page_fault(pte_t *ptep)
+{
+	/*
+	 * WARNING: this is safe only because the dirty bit
+	 * cannot change trasparently in hardware in the pte.
+	 * If the dirty bit in the pte would be set trasparently
+	 * by the CPU this should be implemented with set_bit()
+	 * or with a new set_mask64() implementing an atomic "or".
+	 */
+	set_pte(ptep, pte_mkyoung(*ptep));
+}
+
+static inline void ptep_handle_dirty_page_fault(pte_t *ptep)
+{
+	/*
+	 * This can run without atomic instructions even if
+	 * the young bit is set in hardware by the architecture.
+	 * Losing the young bit is not important.
+	 */
+	set_pte(ptep, pte_mkdirty(*ptep));
+}
+
 #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address))
 
 /* to find an entry in a kernel page-table-directory */
Index: linux-2.5/include/asm-generic/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v
retrieving revision 1.4
diff -u -p -r1.4 pgtable.h
--- linux-2.5/include/asm-generic/pgtable.h	19 Jan 2004 18:43:03 -0000	1.4
+++ linux-2.5/include/asm-generic/pgtable.h	25 May 2004 20:38:39 -0000
@@ -12,6 +12,7 @@
  */
 #define ptep_establish(__vma, __address, __ptep, __entry)		\
 do {									\
+	BUG_ON(!pte_dirty(__entry));					\
 	set_pte(__ptep, __entry);					\
 	flush_tlb_page(__vma, __address);				\
 } while (0)
Index: linux-2.5/include/asm-i386/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable.h,v
retrieving revision 1.42
diff -u -p -r1.42 pgtable.h
--- linux-2.5/include/asm-i386/pgtable.h	23 May 2004 05:02:25 -0000	1.42
+++ linux-2.5/include/asm-i386/pgtable.h	25 May 2004 20:27:00 -0000
@@ -220,7 +220,19 @@ static inline pte_t pte_mkdirty(pte_t pt
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 
+/*
+ * This is used only to set the dirty bit during a "dirty" page fault.
+ * Nothing to do on x86 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_dirty_page_fault(ptep) do { } while (0)
 static inline  int ptep_test_and_clear_dirty(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
+/*
+ * This is used only to set the young bit during a "young" page fault.
+ * Nothing to do on x86 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_young_page_fault(ptep) do { } while (0)
 static inline  int ptep_test_and_clear_young(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); }
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
Index: linux-2.5/include/asm-x86_64/pgtable.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-x86_64/pgtable.h,v
retrieving revision 1.27
diff -u -p -r1.27 pgtable.h
--- linux-2.5/include/asm-x86_64/pgtable.h	23 May 2004 05:02:25 -0000	1.27
+++ linux-2.5/include/asm-x86_64/pgtable.h	25 May 2004 20:27:28 -0000
@@ -262,7 +262,19 @@ extern inline pte_t pte_mkexec(pte_t pte
 extern inline pte_t pte_mkdirty(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
 extern inline pte_t pte_mkwrite(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
+/*
+ * This is used only to set the dirty bit during a "dirty" page fault.
+ * Nothing to do on x86-64 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_dirty_page_fault(ptep) do { } while (0)
 static inline  int ptep_test_and_clear_dirty(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); }
+/*
+ * This is used only to set the young bit during a "young" page fault.
+ * Nothing to do on x86-64 since it's set by the hardware transparently
+ * and it cannot generate page faults.
+ */
+#define ptep_handle_young_page_fault(ptep) do { } while(0)
 static inline  int ptep_test_and_clear_young(pte_t *ptep)	{ return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); }
 static inline void ptep_set_wrprotect(pte_t *ptep)		{ clear_bit(_PAGE_BIT_RW, ptep); }
 static inline void ptep_mkdirty(pte_t *ptep)			{ set_bit(_PAGE_BIT_DIRTY, ptep); }
Index: linux-2.5/mm/memory.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v
retrieving revision 1.169
diff -u -p -r1.169 memory.c
--- linux-2.5/mm/memory.c	23 May 2004 05:10:11 -0000	1.169
+++ linux-2.5/mm/memory.c	25 May 2004 21:20:11 -0000
@@ -1056,7 +1056,7 @@ static int do_wp_page(struct mm_struct *
 			flush_cache_page(vma, address);
 			entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
 					      vma);
-			ptep_establish(vma, address, page_table, entry);
+			set_pte(page_table, entry);
 			update_mmu_cache(vma, address, entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1643,10 +1643,9 @@ static inline int handle_pte_fault(struc
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
 
-		entry = pte_mkdirty(entry);
+		ptep_handle_dirty_page_fault(pte);
 	}
-	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry);
+	ptep_handle_young_page_fault(pte);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);
Signed-off-by: Andrea Arcangeli <andrea@suse.de>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:40                               ` Benjamin Herrenschmidt
@ 2004-05-25 21:54                                 ` Linus Torvalds
  2004-05-25 22:00                                   ` Linus Torvalds
  2004-05-25 22:05                                   ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Benjamin Herrenschmidt
  2004-05-25 22:09                                 ` Linus Torvalds
  1 sibling, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 21:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> 
> Well, just setting one of those 2 bits doesn't require a hash table
> invalidate as long as nothing else changes.

Ok. And nothing ever writes to the SW page tables outside the page table 
lock, right? So on ppc64, we could just do

	#define ptep_update_dirty_accessed(ptep, entry, dirty) \
		*(ptep) = (entry)

and be done with it. No?

I'm not going to do it without a big ack from you.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:43                     ` Linus Torvalds
@ 2004-05-25 21:55                       ` Andrea Arcangeli
  2004-05-25 22:01                         ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

On Tue, May 25, 2004 at 02:43:46PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 25 May 2004, Andrea Arcangeli wrote:
> > 
> > 			entry = ptep_get_and_clear(pte);
> > 			set_pte(pte, pte_modify(entry, newprot));
> > 
> > Again no atomic instructions.
> 
> Well, actually, that "ptep_get_and_clear()" is actually an atomic 
> instruction. Or at least it had _better_ be.

sure, I really meant no "new" atomic instruction.

> > I believe using ptep_establish in handle_mm_fault makes little sense,
> > to make the most obvious example there's no need of flushing the tlb in
> > handle_mm_fault to resolve young or dirty page faults. Not a big deal
> > for x86 and x86-64 that reaches that path only if a race happens, but on
> > alpha we shouldn't flush the tlb. If some weird architecture really need
> > to flush the tlb they still can inside
> > ptep_handle_[young|dirty]_page_fault.
> 
> Actually, especially on alpha we _do_ need to flush the TLB.
> 
> Think about it: the TLB clearly contains the right virt->physical mapping, 
> but the TLB contains the wrong "control bits". 
> 
> If we don't flush the TLB, the TLB will _continue_ to contain the wrong 
> control bits.

I expected the pal code to re-read the pte if the control bits asked for
page fault, like it must happen if the control bits are set to
non-present. This latter this must be true or linux wouldn't run at all
on alpha.  We certainly don't flush the tlb after marking the page from
non-present to present, example in do_anonymous_page. Anyways if the pal
code behaves like that specifically with the KWE/UWE/KRE/URE and not
with the PAGE_VALID bit, I obviously agree have to flush the tlb. I just
didn't expect it, though I admit I couldn't easily find specs about it.

> And as you saw earlier, if those bits are wrong, you get some really nasty 
> behaviour, like infinite page faults. If the TLB still says that the page 
> is non-readable, even though _memory_ says it is readable, it doesn't much 
> matter that we updated the page tables correctly in memory. The CPU will 
> use the TLB.
> 
> So that TLB flush actually _is_ fundamental. 
> 
> Arguably we could remove it from x86. On the other hand, arguably it 
> doesn't actually matter on x86, so..

it doesn't matter in handle_mm_fault but it does matter in do_wp_page.
since we've to mess with ptep_establish to fix this race, it would be
nice to remove such flush_tlb_page from do_wp_page. Or is the x86 tlb
also not refilling the pte automatically if the control bits asks for
page-fault? In mprotect obviously we've to flush since we can upgrade
and downgrade the control bits, but in do_wp_page we only ugprade, so
there should be no need of tlb flush. I'll try to find some
documentation about this to be sure, at least for x86 it should be easy
to find.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:54                                 ` Linus Torvalds
@ 2004-05-25 22:00                                   ` Linus Torvalds
  2004-05-25 22:07                                     ` Benjamin Herrenschmidt
  2004-05-25 22:05                                   ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Benjamin Herrenschmidt
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 22:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Tue, 25 May 2004, Linus Torvalds wrote:
> 
> 
> On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> > 
> > Well, just setting one of those 2 bits doesn't require a hash table
> > invalidate as long as nothing else changes.
> 
> Ok. And nothing ever writes to the SW page tables outside the page table 
> lock, right? So on ppc64, we could just do
> 
> 	#define ptep_update_dirty_accessed(ptep, entry, dirty) \
> 		*(ptep) = (entry)
> 
> and be done with it. No?

No. I'm an idiot. We do need to keep the "hashed" bit atomically, so it
would need to be something on the order of

	static inline void ptep_update_dirty_accessed(pte_t *ptep, pte_t entry, int dirty)
	{
		unsigned long bits = pte_value(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED);
		unsigned long tmp;

		__asm__ __volatile__(
	"1:	lwarx	%0,0,%3\n\
		or	%0,%2,%0\n\
		stwcx.	%0,0,%3\n\
		bne-	1b"
		:"=&r" (tmp), "=m" (*ptep)
		:"r" (bits), "r" (ptep)
		:"cc");
	}

	/* Make asm-generic/pgtable.h know about it.. */
	#define ptep_update_dirty_accessed ptep_update_dirty_accessed

or whatever.

Anyway, I don't feel competent enough to commit it, so..

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:55                       ` Andrea Arcangeli
@ 2004-05-25 22:01                         ` Linus Torvalds
  2004-05-25 22:18                           ` Ivan Kokshaysky
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 22:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group



On Tue, 25 May 2004, Andrea Arcangeli wrote:
> 
> I expected the pal code to re-read the pte if the control bits asked for
> page fault, like it must happen if the control bits are set to
> non-present.

That may or may not be true. I _think_ it wasn't true.

> This latter this must be true or linux wouldn't run at all
> on alpha.

A "not-present" fault is a totally different fault from a "protection 
fault". Only the not-present fault ends up walking the page tables, if I 
remember correctly.

The PAL-code sources are out there somewhere, so I guess this should be 
easy to check if I wasn't so lazy.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:54                                 ` Linus Torvalds
  2004-05-25 22:00                                   ` Linus Torvalds
@ 2004-05-25 22:05                                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-25 22:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 07:54, Linus Torvalds wrote:
> On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> > 
> > Well, just setting one of those 2 bits doesn't require a hash table
> > invalidate as long as nothing else changes.
> 
> Ok. And nothing ever writes to the SW page tables outside the page table 
> lock, right? So on ppc64, we could just do
> 
> 	#define ptep_update_dirty_accessed(ptep, entry, dirty) \
> 		*(ptep) = (entry)
> 
> and be done with it. No?
> 
> I'm not going to do it without a big ack from you.

No. The hash fault path will update the PTE dirty/accessed on a hash miss
exception without holding the page table lock (acts a bit like a HW TLB
as far as linux is concerned). That's why it needs to be atomic.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:00                                   ` Linus Torvalds
@ 2004-05-25 22:07                                     ` Benjamin Herrenschmidt
  2004-05-25 22:14                                       ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-25 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list


> 	static inline void ptep_update_dirty_accessed(pte_t *ptep, pte_t entry, int dirty)
> 	{
> 		unsigned long bits = pte_value(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED);
> 		unsigned long tmp;
> 
> 		__asm__ __volatile__(
> 	"1:	lwarx	%0,0,%3\n\
> 		or	%0,%2,%0\n\
> 		stwcx.	%0,0,%3\n\
> 		bne-	1b"
> 		:"=&r" (tmp), "=m" (*ptep)
> 		:"r" (bits), "r" (ptep)
> 		:"cc");
> 	}
> 
> 	/* Make asm-generic/pgtable.h know about it.. */
> 	#define ptep_update_dirty_accessed ptep_update_dirty_accessed

That looks good on a first look, I need to re-read the whole discussion since
yesterday through to make sure I didn't miss anything. Note that I'd rather
call the function ptep_set_* than ptep_update_* to make clear that it can
only ever be used to _set_ those bits.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 21:40                               ` Benjamin Herrenschmidt
  2004-05-25 21:54                                 ` Linus Torvalds
@ 2004-05-25 22:09                                 ` Linus Torvalds
  2004-05-25 22:19                                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 22:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> 
> Well, just setting one of those 2 bits doesn't require a hash table
> invalidate as long as nothing else changes.

I'm starting to doubt this, because:

> We do dirty by mapping r/o in the hash table, and accessed on hash
> faults (our clear_young triggers a flush). So just setting those bits
> in the linux PTE without touching the hash table is fine, we'll just
> possibly take an extra fault on the next write or access, but that
> might not be much slower than going to the hash update the permissions
> directly.

But if we don't update the hash tables, how will the TLB entry _ever_ say 
that the page is writable? So we won't take just _one_ extra fault on the 
next write, we'll _keep_ taking them, since the hash tables will continue 
to claim that the page is read-only, even if the linux sw page tables say 
it is writable.

So I think the code needs to invalidate the hash after having updated the 
pte. No?

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:07                                     ` Benjamin Herrenschmidt
@ 2004-05-25 22:14                                       ` Linus Torvalds
  2004-05-26  0:21                                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 22:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
>
> Note that I'd rather call the function ptep_set_* than ptep_update_* to
> make clear that it can only ever be used to _set_ those bits.

Good point.

Too late.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:01                         ` Linus Torvalds
@ 2004-05-25 22:18                           ` Ivan Kokshaysky
  2004-05-25 22:42                             ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Ivan Kokshaysky @ 2004-05-25 22:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Matthew Wilcox, Benjamin Herrenschmidt,
	Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise,
	linux-mm, Architectures Group

On Tue, May 25, 2004 at 03:01:55PM -0700, Linus Torvalds wrote:
> A "not-present" fault is a totally different fault from a "protection 
> fault". Only the not-present fault ends up walking the page tables, if I 
> remember correctly.

Precisely. The architecture reference manual says:
"Additionally, when the software changes any part (except the software
field) of a *valid* PTE, it must also execute a tbi instruction."

Ivan.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:09                                 ` Linus Torvalds
@ 2004-05-25 22:19                                   ` Benjamin Herrenschmidt
  2004-05-25 22:24                                     ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-25 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 08:09, Linus Torvalds wrote:

> But if we don't update the hash tables, how will the TLB entry _ever_ say 
> that the page is writable? So we won't take just _one_ extra fault on the 
> next write, we'll _keep_ taking them, since the hash tables will continue 
> to claim that the page is read-only, even if the linux sw page tables say 
> it is writable.
> 
> So I think the code needs to invalidate the hash after having updated the 
> pte. No?

No, we'll take a hash fault, not a page fault. The hash fault is an asm
fast path, which in this case, will update the HPTE RW permission when
the PTE has PAGE_RW (and will set PAGE_DIRTY again, but that's fine).

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:19                                   ` Benjamin Herrenschmidt
@ 2004-05-25 22:24                                     ` Linus Torvalds
  0 siblings, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-25 22:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> > 
> > So I think the code needs to invalidate the hash after having updated the 
> > pte. No?
> 
> No, we'll take a hash fault, not a page fault. The hash fault is an asm
> fast path, which in this case, will update the HPTE RW permission when
> the PTE has PAGE_RW (and will set PAGE_DIRTY again, but that's fine).

Ahh. Goodie. Then the "simple" atomic bitset probably works. Assuming I 
translated the asm/atomic.h stuff correctly.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:18                           ` Ivan Kokshaysky
@ 2004-05-25 22:42                             ` Andrea Arcangeli
  2004-05-26  2:26                               ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-25 22:42 UTC (permalink / raw)
  To: Ivan Kokshaysky
  Cc: Linus Torvalds, Matthew Wilcox, Benjamin Herrenschmidt,
	Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise,
	linux-mm, Architectures Group

On Wed, May 26, 2004 at 02:18:45AM +0400, Ivan Kokshaysky wrote:
> On Tue, May 25, 2004 at 03:01:55PM -0700, Linus Torvalds wrote:
> > A "not-present" fault is a totally different fault from a "protection 
> > fault". Only the not-present fault ends up walking the page tables, if I 
> > remember correctly.
> 
> Precisely. The architecture reference manual says:
> "Additionally, when the software changes any part (except the software
> field) of a *valid* PTE, it must also execute a tbi instruction."

thanks for checking.

after various searching on the x86 docs I found:

	Whenever a page-directory or page-table entry is changed (including when
	the present flag is set to zero), the operating-system must immediately
	invalidate the corresponding entry in the TLB so that it can be updated
	the next time the entry is referenced.

according to the above we'd need to flush the tlb even in
do_anonymous_page on x86, or am I reading it wrong? We're not really
doing that, is that a bug? I'd be very surprised if we overlooked x86
wasting some time in some page fault loop, I guess it works like the
alpha in practice even if the specs tells us we've to flush.

anyways to make things work right with my approch I'd need to flush the
tlb after the handle_*_page_fault operations (they could return 1
if a flush is required before returning from the page fault) and I
should resurrect pte_establish in do_wp_page. but then I certainly agree
leaving ptep_establish in handle_mm_fault is fine if we've to flush the
tlb anyways, so I'm not going to update my patch unless anybody prefers
it for any other reason I don't see.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:14                                       ` Linus Torvalds
@ 2004-05-26  0:21                                         ` Benjamin Herrenschmidt
  2004-05-26  0:50                                           ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  0:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 08:14, Linus Torvalds wrote:
> On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> >
> > Note that I'd rather call the function ptep_set_* than ptep_update_* to
> > make clear that it can only ever be used to _set_ those bits.
> 
> Good point.
> 
> Too late.

Heh, I can still send a patch "fixing" it if you want ;)

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  0:21                                         ` Benjamin Herrenschmidt
@ 2004-05-26  0:50                                           ` Linus Torvalds
  2004-05-26  3:25                                             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26  0:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> 
> Heh, I can still send a patch "fixing" it if you want ;)

If you include a tested version of the ppc64 fix, I'd likely apply that.

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 22:42                             ` Andrea Arcangeli
@ 2004-05-26  2:26                               ` Linus Torvalds
  2004-05-26  7:06                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26  2:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ivan Kokshaysky, Matthew Wilcox, Benjamin Herrenschmidt,
	Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise,
	linux-mm, Architectures Group



On Wed, 26 May 2004, Andrea Arcangeli wrote:
> 
> after various searching on the x86 docs I found:
> 
> 	Whenever a page-directory or page-table entry is changed (including when
> 	the present flag is set to zero), the operating-system must immediately
> 	invalidate the corresponding entry in the TLB so that it can be updated
> 	the next time the entry is referenced.
> 
> according to the above we'd need to flush the tlb even in
> do_anonymous_page on x86, or am I reading it wrong?

You're reading it wrong.

The "including when the present flag is set to zero" part does not mean 
that the present flag was zero _before_, it means "is being set to zero" 
as in "having been non-zero before that".

Anytime the P flag was clear _before_, we don't need to invalidate, 
because non-present entries are not cached in the TLB.

			Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  0:50                                           ` Linus Torvalds
@ 2004-05-26  3:25                                             ` Benjamin Herrenschmidt
  2004-05-26  4:08                                               ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  3:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 10:50, Linus Torvalds wrote: 
> On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> > 
> > Heh, I can still send a patch "fixing" it if you want ;)
> 
> If you include a tested version of the ppc64 fix, I'd likely apply that.

Ok, that doesn't work. Trigger BUG_ON in rmap.c:421

I think we are using ptep_establish for more than just setting those
2 bits (like for setting up the new PTE in break_cow and such while
the current implementationL/definition is more like just setting
those bits and nothing else....

If we end up doing more than setting those bits, we need to flush the
PTE completely from the hash etc....

Anyway, here's the non working patch, including the rename.

Ben.

===== include/asm-generic/pgtable.h 1.5 vs edited =====
--- 1.5/include/asm-generic/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-generic/pgtable.h	2004-05-26 10:58:17 +10:00
@@ -3,8 +3,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_ESTABLISH
 
-#ifndef ptep_update_dirty_accessed
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry)
+#ifndef ptep_set_dirty_accessed
+#define ptep_set_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry)
 #endif
 
 /*
@@ -17,7 +17,7 @@
  */
 #define ptep_establish(__vma, __address, __ptep, __entry, __dirty)	\
 do {									\
-	ptep_update_dirty_accessed(__ptep, __entry, __dirty);		\
+	ptep_set_dirty_accessed(__ptep, __entry, __dirty);		\
 	flush_tlb_page(__vma, __address);				\
 } while (0)
 #endif
===== include/asm-i386/pgtable.h 1.45 vs edited =====
--- 1.45/include/asm-i386/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-i386/pgtable.h	2004-05-26 10:59:12 +10:00
@@ -325,7 +325,7 @@
  * bit at the same time.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty)	\
+#define ptep_set_dirty_accessed(__ptep, __entry, __dirty)	\
 	do {							\
 		if (__dirty) set_pte(__ptep, __entry);		\
 	} while (0)
===== include/asm-ppc64/pgtable.h 1.33 vs edited =====
--- 1.33/include/asm-ppc64/pgtable.h	2004-05-23 07:56:24 +10:00
+++ edited/include/asm-ppc64/pgtable.h	2004-05-26 13:09:59 +10:00
@@ -306,7 +306,10 @@
 	return old;
 }
 
-/* PTE updating functions */
+/* PTE updating functions, this function puts the PTE in the
+ * batch, doesn't actually triggers the hash flush immediately,
+ * you need to call flush_tlb_pending() to do that.
+ */
 extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot);
 
 static inline int ptep_test_and_clear_young(pte_t *ptep)
@@ -318,7 +321,7 @@
 	old = pte_update(ptep, _PAGE_ACCESSED);
 	if (old & _PAGE_HASHPTE) {
 		hpte_update(ptep, old, 0);
-		flush_tlb_pending();	/* XXX generic code doesn't flush */
+		flush_tlb_pending();
 	}
 	return (old & _PAGE_ACCESSED) != 0;
 }
@@ -396,10 +399,36 @@
  */
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	if (pte_present(*ptep))
+	if (pte_present(*ptep)) {
 		pte_clear(ptep);
+		flush_tlb_pending();
+	}
 	*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
 }
+
+/* Set the dirty and/or accessed bits atomically in a linux PTE, this
+ * function doesn't need to flush the hash entry
+ */
+static inline void ptep_set_dirty_accessed(pte_t *ptep, pte_t entry, int dirty)
+{
+	unsigned long bits = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED);
+	unsigned long tmp;
+
+	WARN_ON(!pte_present(*ptep));
+
+	__asm__ __volatile__(
+	       "1:     ldarx   %0,0,%3\n\
+                or      %0,%2,%0\n\
+                stdcx.  %0,0,%3\n\
+                bne-    1b"
+	:"=&r" (tmp), "=m" (*ptep)
+	:"r" (bits), "r" (ptep)
+	:"cc");
+}
+
+/* Make asm-generic/pgtable.h know about it.. */
+#define ptep_set_dirty_accessed ptep_set_dirty_accessed
+
 
 /*
  * Macro to mark a page protection value as "uncacheable".



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  3:25                                             ` Benjamin Herrenschmidt
@ 2004-05-26  4:08                                               ` Linus Torvalds
  2004-05-26  4:12                                                 ` Benjamin Herrenschmidt
  2004-05-26  4:46                                                 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26  4:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> 
> I think we are using ptep_establish for more than just setting those
> 2 bits (like for setting up the new PTE in break_cow and such while
> the current implementationL/definition is more like just setting
> those bits and nothing else....

You're right. We do use it on the do_wp_page() path, and there we actually 
use a whole new page in the "break_cow()" case. That case is in fact 
fundamentally different from the other ones.

So we should probably break up the "ptep_establish()" into its two pieces,
since the callers don't actually want to do the same thing. One really
wants to do a "clear old one, set a totally new one", and the two other
places want to actually update just the dirty and accessed bits.

In fact, the only non-generic user of "ptep_establish()" (s390) didn't 
want to use the generic version exactly because of this very conceptual 
bug. It uses "ptep_clear_flush()" for the replacement case, which actually 
makes sense.

So does it work if you do this appended patch first? This is a real 
cleanup, and I think it will allow us to get rid of the s390-specific code 
in ptep_establish(). Along with hopefully fixing your problem too.

After this, we should be able to have a BUG() in "set_pte()" if the entry 
wasn't clear before (assuming the arch doesn't use set_pte() for the dirty 
updates etc).

		Linus

---
===== mm/memory.c 1.177 vs edited =====
--- 1.177/mm/memory.c	Tue May 25 12:37:09 2004
+++ edited/mm/memory.c	Tue May 25 21:04:49 2004
@@ -1004,7 +1004,10 @@
 	flush_cache_page(vma, address);
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
-	ptep_establish(vma, address, page_table, entry, 1);
+
+	/* Get rid of the old entry, replace with new */
+	ptep_clear_flush(vma, address, page_table);
+	set_pte(page_table, entry);
 	update_mmu_cache(vma, address, entry);
 }
 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:08                                               ` Linus Torvalds
@ 2004-05-26  4:12                                                 ` Benjamin Herrenschmidt
  2004-05-26  4:18                                                   ` Benjamin Herrenschmidt
  2004-05-26  4:28                                                   ` Linus Torvalds
  2004-05-26  4:46                                                 ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  4:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 14:08, Linus Torvalds wrote:

> You're right. We do use it on the do_wp_page() path, and there we actually 
> use a whole new page in the "break_cow()" case. That case is in fact 
> fundamentally different from the other ones.
> 
> So we should probably break up the "ptep_establish()" into its two pieces,
> since the callers don't actually want to do the same thing. One really
> wants to do a "clear old one, set a totally new one", and the two other
> places want to actually update just the dirty and accessed bits.

The first one could still be called "pte_establish" ... I mean, it makes
little sense to continue calling "pte_establish" something  that just
does set one of those 2 bits... And the flush done by pte_establish in
this case is useless on ppc... so I'd rather kill pte_establish
completely instead and define it's arch (or generic) impl. of
ptep_set_dirty_accessed() responsibility to do the TLB flush if
necessary, no ? (well, a call to it on ppc isn't expensive if we didn't
add anything to the batch anyway...)

> In fact, the only non-generic user of "ptep_establish()" (s390) didn't 
> want to use the generic version exactly because of this very conceptual 
> bug. It uses "ptep_clear_flush()" for the replacement case, which actually 
> makes sense.
> 
> So does it work if you do this appended patch first? This is a real 
> cleanup, and I think it will allow us to get rid of the s390-specific code 
> in ptep_establish(). Along with hopefully fixing your problem too.
> 
> After this, we should be able to have a BUG() in "set_pte()" if the entry 
> wasn't clear before (assuming the arch doesn't use set_pte() for the dirty 
> updates etc).

Ok, I'll give it a spin.

> 		Linus
> 
> ---
> ===== mm/memory.c 1.177 vs edited =====
> --- 1.177/mm/memory.c	Tue May 25 12:37:09 2004
> +++ edited/mm/memory.c	Tue May 25 21:04:49 2004
> @@ -1004,7 +1004,10 @@
>  	flush_cache_page(vma, address);
>  	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
>  			      vma);
> -	ptep_establish(vma, address, page_table, entry, 1);
> +
> +	/* Get rid of the old entry, replace with new */
> +	ptep_clear_flush(vma, address, page_table);
> +	set_pte(page_table, entry);
>  	update_mmu_cache(vma, address, entry);
>  }
>  
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:12                                                 ` Benjamin Herrenschmidt
@ 2004-05-26  4:18                                                   ` Benjamin Herrenschmidt
  2004-05-26  4:50                                                     ` Linus Torvalds
  2004-05-26  4:28                                                   ` Linus Torvalds
  1 sibling, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  4:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 14:12, Benjamin Herrenschmidt wrote:
> On Wed, 2004-05-26 at 14:08, Linus Torvalds wrote:
> 
> > You're right. We do use it on the do_wp_page() path, and there we actually 
> > use a whole new page in the "break_cow()" case. That case is in fact 
> > fundamentally different from the other ones.
> > 
> > So we should probably break up the "ptep_establish()" into its two pieces,
> > since the callers don't actually want to do the same thing. One really
> > wants to do a "clear old one, set a totally new one", and the two other
> > places want to actually update just the dirty and accessed bits.

Hrm... Still dies, some kind of loop it seems, I'll have a look

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:12                                                 ` Benjamin Herrenschmidt
  2004-05-26  4:18                                                   ` Benjamin Herrenschmidt
@ 2004-05-26  4:28                                                   ` Linus Torvalds
  1 sibling, 0 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26  4:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> 
> The first one could still be called "ptep_establish"

Yes.

And the other places should be called "ptep_set_dirty_access()" (and you 
might verify that those bits are the only ones that change, if you want to 
make sure).

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:08                                               ` Linus Torvalds
  2004-05-26  4:12                                                 ` Benjamin Herrenschmidt
@ 2004-05-26  4:46                                                 ` Benjamin Herrenschmidt
  2004-05-26  4:54                                                   ` Linus Torvalds
  1 sibling, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  4:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

Ok, your patch just missed the do_wp_page() case which needs the same
medication as break_cow(), which leaves us with only one caller of
ptep_establish... the one which just sets those 2 bits and shouldn't
be named ptep establish at all ;)

What do we do ?

I'd rather have ptep_establish do the ptep_clear_flush & set_pte, and
the later just do the bit flip, but then, the arch (and possibly
generic) implementation of that bit flip must also do the TLB flush
when necessary (it's not on ppc).

Ben.  



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:50                                                     ` Linus Torvalds
@ 2004-05-26  4:49                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  4:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 14:50, Linus Torvalds wrote:
> On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> >
> > Hrm... Still dies, some kind of loop it seems, I'll have a look
> 
> Are you sure the it's not just taking infinite page fault because we keep 
> reloading the old value from the hash tables? That "hash fault" thing 
> still doesn't convince me. Why should the hash-refill fastpath ever look 
> at the software page tables?

Where do you think the hash PTE is filled from ? :)

We even use one of the linux PTE bits as a lock bit during the hash
refill (the PAGE_BUSY bit) to avoid a race where we can end up filling
more than one hash entry from the same linux PTE.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:18                                                   ` Benjamin Herrenschmidt
@ 2004-05-26  4:50                                                     ` Linus Torvalds
  2004-05-26  4:49                                                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26  4:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
>
> Hrm... Still dies, some kind of loop it seems, I'll have a look

Are you sure the it's not just taking infinite page fault because we keep 
reloading the old value from the hash tables? That "hash fault" thing 
still doesn't convince me. Why should the hash-refill fastpath ever look 
at the software page tables?

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:46                                                 ` Benjamin Herrenschmidt
@ 2004-05-26  4:54                                                   ` Linus Torvalds
  2004-05-26  4:55                                                     ` Benjamin Herrenschmidt
                                                                       ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26  4:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list



On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
>
> Ok, your patch just missed the do_wp_page() case which needs the same
> medication as break_cow(), which leaves us with only one caller of
> ptep_establish... the one which just sets those 2 bits and shouldn't
> be named ptep establish at all ;)

Actually, the do_wp_page() one doesn't change the page, but it potentially
changes _three_ bits: accessed, dirty _and_ writable.

But the fix on ppc64 should be to add the writable bit to the mask of 
things, and rename the function to reflect that fact.

> I'd rather have ptep_establish do the ptep_clear_flush & set_pte, and
> the later just do the bit flip, but then, the arch (and possibly
> generic) implementation of that bit flip must also do the TLB flush
> when necessary (it's not on ppc).

I agree on the renaming, but please don't use the ptep_clear_flush() in 
do_wp_page(), because it really isn't changing the virtual mapping, it's 
only a (slightly) extended case of the same old "set another bit".

		Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:54                                                   ` Linus Torvalds
@ 2004-05-26  4:55                                                     ` Benjamin Herrenschmidt
  2004-05-26  5:41                                                     ` Benjamin Herrenschmidt
  2004-05-26  5:59                                                     ` [PATCH] (signoff) " Benjamin Herrenschmidt
  2 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  4:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 14:54, Linus Torvalds wrote:
> On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
> >
> > Ok, your patch just missed the do_wp_page() case which needs the same
> > medication as break_cow(), which leaves us with only one caller of
> > ptep_establish... the one which just sets those 2 bits and shouldn't
> > be named ptep establish at all ;)
> 
> Actually, the do_wp_page() one doesn't change the page, but it potentially
> changes _three_ bits: accessed, dirty _and_ writable.

Hrm... that would work if we only ever set writable, so yes.

> But the fix on ppc64 should be to add the writable bit to the mask of 
> things, and rename the function to reflect that fact.
> 
> > I'd rather have ptep_establish do the ptep_clear_flush & set_pte, and
> > the later just do the bit flip, but then, the arch (and possibly
> > generic) implementation of that bit flip must also do the TLB flush
> > when necessary (it's not on ppc).
> 
> I agree on the renaming, but please don't use the ptep_clear_flush() in 
> do_wp_page(), because it really isn't changing the virtual mapping, it's 
> only a (slightly) extended case of the same old "set another bit".

Ok, I just wasn't sure what this function was all about.

I'll cook a new patch including all the changes we discussed, may take some
time, but hopefully you'll get it today.

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:54                                                   ` Linus Torvalds
  2004-05-26  4:55                                                     ` Benjamin Herrenschmidt
@ 2004-05-26  5:41                                                     ` Benjamin Herrenschmidt
  2004-05-26  5:59                                                     ` [PATCH] (signoff) " Benjamin Herrenschmidt
  2 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  5:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

Ok, here it is. I blasted ptep_establish completely and the macro
thing and changed the function to ptep_set_access_flags() which
becomes also responsible for the flush if necesary.

Overall, I feel it's more consistent this way with the other stuff
in there.

I did the ppc64 impl, the x86 one (hope I got it right). I still need to
do ppc32 and I suppose s390 must be fixed now that ptep_estabish is gone
but I'll leave that to someone who understand something about these things ;)

Here it is, boots on g5 :

===== include/asm-generic/pgtable.h 1.5 vs edited =====
--- 1.5/include/asm-generic/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-generic/pgtable.h	2004-05-26 15:37:02 +10:00
@@ -1,24 +1,11 @@
 #ifndef _ASM_GENERIC_PGTABLE_H
 #define _ASM_GENERIC_PGTABLE_H
 
-#ifndef __HAVE_ARCH_PTEP_ESTABLISH
-
-#ifndef ptep_update_dirty_accessed
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry)
-#endif
-
-/*
- * Establish a new mapping:
- *  - flush the old one
- *  - update the page tables
- *  - inform the TLB about the new one
- *
- * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
- */
-#define ptep_establish(__vma, __address, __ptep, __entry, __dirty)	\
-do {									\
-	ptep_update_dirty_accessed(__ptep, __entry, __dirty);		\
-	flush_tlb_page(__vma, __address);				\
+#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+do {				  					  \
+	set_pte(__ptep, __entry);					  \
+	flush_tlb_page(__vma, __address);				  \
 } while (0)
 #endif
 
===== include/asm-i386/pgtable.h 1.45 vs edited =====
--- 1.45/include/asm-i386/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-i386/pgtable.h	2004-05-26 15:34:57 +10:00
@@ -325,9 +325,13 @@
  * bit at the same time.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty)	\
-	do {							\
-		if (__dirty) set_pte(__ptep, __entry);		\
+#define  __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+	do {								  \
+		if (__dirty) {						  \
+			set_pte(__ptep, __entry);		  	  \
+			flush_tlb_page(__vma, __address);		  \
+		}							  \
 	} while (0)
 
 /* Encode and de-code a swap entry */
===== include/asm-ppc64/pgtable.h 1.33 vs edited =====
--- 1.33/include/asm-ppc64/pgtable.h	2004-05-23 07:56:24 +10:00
+++ edited/include/asm-ppc64/pgtable.h	2004-05-26 15:32:38 +10:00
@@ -306,7 +306,10 @@
 	return old;
 }
 
-/* PTE updating functions */
+/* PTE updating functions, this function puts the PTE in the
+ * batch, doesn't actually triggers the hash flush immediately,
+ * you need to call flush_tlb_pending() to do that.
+ */
 extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot);
 
 static inline int ptep_test_and_clear_young(pte_t *ptep)
@@ -318,7 +321,7 @@
 	old = pte_update(ptep, _PAGE_ACCESSED);
 	if (old & _PAGE_HASHPTE) {
 		hpte_update(ptep, old, 0);
-		flush_tlb_pending();	/* XXX generic code doesn't flush */
+		flush_tlb_pending();
 	}
 	return (old & _PAGE_ACCESSED) != 0;
 }
@@ -396,10 +399,34 @@
  */
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	if (pte_present(*ptep))
+	if (pte_present(*ptep)) {
 		pte_clear(ptep);
+		flush_tlb_pending();
+	}
 	*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
 }
+
+/* Set the dirty and/or accessed bits atomically in a linux PTE, this
+ * function doesn't need to flush the hash entry
+ */
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty)
+{
+	unsigned long bits = pte_val(entry) &
+		(_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW);
+	unsigned long tmp;
+
+	__asm__ __volatile__(
+	       "1:     ldarx   %0,0,%3\n\
+                or      %0,%2,%0\n\
+                stdcx.  %0,0,%3\n\
+                bne-    1b"
+	:"=&r" (tmp), "=m" (*ptep)
+	:"r" (bits), "r" (ptep)
+	:"cc");
+}
+#define  ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+        __ptep_set_access_flags(__ptep, __entry, __dirty)
 
 /*
  * Macro to mark a page protection value as "uncacheable".
===== mm/memory.c 1.177 vs edited =====
--- 1.177/mm/memory.c	2004-05-26 05:37:09 +10:00
+++ edited/mm/memory.c	2004-05-26 15:35:40 +10:00
@@ -1004,7 +1004,11 @@
 	flush_cache_page(vma, address);
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
-	ptep_establish(vma, address, page_table, entry, 1);
+
+	/* Get rid of the old entry, replace with new */
+	ptep_clear_flush(vma, address, page_table);
+	set_pte(page_table, entry);
+
 	update_mmu_cache(vma, address, entry);
 }
 
@@ -1056,7 +1060,7 @@
 			flush_cache_page(vma, address);
 			entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
 					      vma);
-			ptep_establish(vma, address, page_table, entry, 1);
+       			ptep_set_access_flags(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1646,7 +1650,7 @@
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry, write_access);
+	ptep_set_access_flags(vma, address, pte, entry, write_access);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] (signoff) ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  4:54                                                   ` Linus Torvalds
  2004-05-26  4:55                                                     ` Benjamin Herrenschmidt
  2004-05-26  5:41                                                     ` Benjamin Herrenschmidt
@ 2004-05-26  5:59                                                     ` Benjamin Herrenschmidt
  2004-05-26  6:55                                                       ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  5:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

(Same with signoff :)

Ok, here it is. I blasted ptep_establish completely and the macro
thing and changed the function to ptep_set_access_flags() which
becomes also responsible for the flush if necesary.

Overall, I feel it's more consistent this way with the other stuff
in there.

I did the ppc64 impl, the x86 one (hope I got it right). I still need to
do ppc32 and I suppose s390 must be fixed now that ptep_estabish is gone
but I'll leave that to someone who understand something about these things ;)

Here it is, boots on g5 :

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

--- 1.5/include/asm-generic/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-generic/pgtable.h	2004-05-26 15:37:02 +10:00
@@ -1,24 +1,11 @@
 #ifndef _ASM_GENERIC_PGTABLE_H
 #define _ASM_GENERIC_PGTABLE_H
 
-#ifndef __HAVE_ARCH_PTEP_ESTABLISH
-
-#ifndef ptep_update_dirty_accessed
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry)
-#endif
-
-/*
- * Establish a new mapping:
- *  - flush the old one
- *  - update the page tables
- *  - inform the TLB about the new one
- *
- * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
- */
-#define ptep_establish(__vma, __address, __ptep, __entry, __dirty)	\
-do {									\
-	ptep_update_dirty_accessed(__ptep, __entry, __dirty);		\
-	flush_tlb_page(__vma, __address);				\
+#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+do {				  					  \
+	set_pte(__ptep, __entry);					  \
+	flush_tlb_page(__vma, __address);				  \
 } while (0)
 #endif
 
===== include/asm-i386/pgtable.h 1.45 vs edited =====
--- 1.45/include/asm-i386/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-i386/pgtable.h	2004-05-26 15:34:57 +10:00
@@ -325,9 +325,13 @@
  * bit at the same time.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty)	\
-	do {							\
-		if (__dirty) set_pte(__ptep, __entry);		\
+#define  __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+	do {								  \
+		if (__dirty) {						  \
+			set_pte(__ptep, __entry);		  	  \
+			flush_tlb_page(__vma, __address);		  \
+		}							  \
 	} while (0)
 
 /* Encode and de-code a swap entry */
===== include/asm-ppc64/pgtable.h 1.33 vs edited =====
--- 1.33/include/asm-ppc64/pgtable.h	2004-05-23 07:56:24 +10:00
+++ edited/include/asm-ppc64/pgtable.h	2004-05-26 15:32:38 +10:00
@@ -306,7 +306,10 @@
 	return old;
 }
 
-/* PTE updating functions */
+/* PTE updating functions, this function puts the PTE in the
+ * batch, doesn't actually triggers the hash flush immediately,
+ * you need to call flush_tlb_pending() to do that.
+ */
 extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot);
 
 static inline int ptep_test_and_clear_young(pte_t *ptep)
@@ -318,7 +321,7 @@
 	old = pte_update(ptep, _PAGE_ACCESSED);
 	if (old & _PAGE_HASHPTE) {
 		hpte_update(ptep, old, 0);
-		flush_tlb_pending();	/* XXX generic code doesn't flush */
+		flush_tlb_pending();
 	}
 	return (old & _PAGE_ACCESSED) != 0;
 }
@@ -396,10 +399,34 @@
  */
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	if (pte_present(*ptep))
+	if (pte_present(*ptep)) {
 		pte_clear(ptep);
+		flush_tlb_pending();
+	}
 	*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
 }
+
+/* Set the dirty and/or accessed bits atomically in a linux PTE, this
+ * function doesn't need to flush the hash entry
+ */
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty)
+{
+	unsigned long bits = pte_val(entry) &
+		(_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW);
+	unsigned long tmp;
+
+	__asm__ __volatile__(
+	       "1:     ldarx   %0,0,%3\n\
+                or      %0,%2,%0\n\
+                stdcx.  %0,0,%3\n\
+                bne-    1b"
+	:"=&r" (tmp), "=m" (*ptep)
+	:"r" (bits), "r" (ptep)
+	:"cc");
+}
+#define  ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+        __ptep_set_access_flags(__ptep, __entry, __dirty)
 
 /*
  * Macro to mark a page protection value as "uncacheable".
===== mm/memory.c 1.177 vs edited =====
--- 1.177/mm/memory.c	2004-05-26 05:37:09 +10:00
+++ edited/mm/memory.c	2004-05-26 15:35:40 +10:00
@@ -1004,7 +1004,11 @@
 	flush_cache_page(vma, address);
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
-	ptep_establish(vma, address, page_table, entry, 1);
+
+	/* Get rid of the old entry, replace with new */
+	ptep_clear_flush(vma, address, page_table);
+	set_pte(page_table, entry);
+
 	update_mmu_cache(vma, address, entry);
 }
 
@@ -1056,7 +1060,7 @@
 			flush_cache_page(vma, address);
 			entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
 					      vma);
-			ptep_establish(vma, address, page_table, entry, 1);
+       			ptep_set_access_flags(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1646,7 +1650,7 @@
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry, write_access);
+	ptep_set_access_flags(vma, address, pte, entry, write_access);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25 20:49                                 ` Linus Torvalds
  2004-05-25 20:57                                   ` David S. Miller
@ 2004-05-26  6:20                                   ` Keith M Wesolowski
  1 sibling, 0 replies; 79+ messages in thread
From: Keith M Wesolowski @ 2004-05-26  6:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, willy, andrea, benh, akpm, linux-kernel, mingo,
	bcrl, linux-mm

On Tue, May 25, 2004 at 01:49:57PM -0700, Linus Torvalds wrote:

> Ok. Still, that doesn't sound too bad. I assume that the pte write has to 
> be atomic anyway (ie it doesn't walk the page tables, but it clearly _has_ 
> to do an atomic "read-modify-update" or just the _hardware_ updates might 
> race against each other and one CPU loses the dirty bit when another CPU 
> writes back the accessed bit).

Correct, we don't have to worry about that.  The various chips either
(a) lock the bus during the entire tablewalk, or (b) set the accessed
bit atomically and use a normal write when setting both bits.

> So I really think it should be safe to just do a regular write when you 
> update both bits, because you know that no other CPU will at least _clear_ 
> any bits. Hmm?

Yes, and this is how our hardware at least works.

> So it sounds like the SWAP loop basically ends up being just something 
> like
> 
> 	val = pte_value(entry);
> 	for (;;) {
> 		oldval = SWAP(ptep, val);
> 		/*
> 		 * If we wrote a value that had the same or more bits set 
> 		 * than the old value, we're ok...
> 		 */
> 		if (!(oldval & ~val))
> 			break;
> 		/*
> 		 * ..otherwise we need to write the "or" of all bits and
> 		 * try again.
> 		 */
> 		val |= oldval;
> 	}
> 
> Right? Note that the reason we can do the "dirty and accessed bit both 
> set" case with a simple write is exactly because that's already the 
> "maximal" bits anybody can write to the thing, so we don't need to loop, 
> we can just write it directly.

The documentation actually provides a more general implementation for
any kind of pte update.  In this case it would logically reduce to
your outline above.  The actual outline in the manual is:

	val = pte_val(entry);
	for(;;) {
		/* Set the pte to 0 to discourage others */
		oldval = 0;
		SWAP(ptep, oldval);
		/* Flush this entry from all MMUs */
		flush_tlb_entry(ptep);
		/* Gather the existing bits */
		val |= oldval;
		/* Did another MMU monkey with it since? */
		if (!*ptep)
			break;
	}
	set_pte(ptep, entry);

which seems far more complicated than necessary just to set these 1 or
2 bits.

-- 
Keith M Wesolowski

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] (signoff) ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  5:59                                                     ` [PATCH] (signoff) " Benjamin Herrenschmidt
@ 2004-05-26  6:55                                                       ` Benjamin Herrenschmidt
  2004-05-26  7:11                                                         ` [PATCH] ppc32 implementation of ptep_set_access_flags Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  6:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David S. Miller, wesolows, willy, Andrea Arcangeli,
	Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm,
	Linux Arch list

On Wed, 2004-05-26 at 15:59, Benjamin Herrenschmidt wrote:
> (Same with signoff :)
> 
> Ok, here it is. I blasted ptep_establish completely and the macro
> thing and changed the function to ptep_set_access_flags() which
> becomes also responsible for the flush if necesary.

Ok, ppc64 implementation is bogus, I just completely forgot that
we have to also atomically wait for busy to be cleared. (Or we
would race with the hash refill which does set busy atomically
at the beginning of the opertation, but later updates the linux
PTE non atomically & clears it).

Brown paper bag since I'm the one who implemented the hash refill
this way in the first place =P

Here's a fixed version:

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

--- 1.5/include/asm-generic/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-generic/pgtable.h	2004-05-26 15:37:02 +10:00
@@ -1,24 +1,11 @@
 #ifndef _ASM_GENERIC_PGTABLE_H
 #define _ASM_GENERIC_PGTABLE_H
 
-#ifndef __HAVE_ARCH_PTEP_ESTABLISH
-
-#ifndef ptep_update_dirty_accessed
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry)
-#endif
-
-/*
- * Establish a new mapping:
- *  - flush the old one
- *  - update the page tables
- *  - inform the TLB about the new one
- *
- * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
- */
-#define ptep_establish(__vma, __address, __ptep, __entry, __dirty)	\
-do {									\
-	ptep_update_dirty_accessed(__ptep, __entry, __dirty);		\
-	flush_tlb_page(__vma, __address);				\
+#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+do {				  					  \
+	set_pte(__ptep, __entry);					  \
+	flush_tlb_page(__vma, __address);				  \
 } while (0)
 #endif
 
===== include/asm-i386/pgtable.h 1.45 vs edited =====
--- 1.45/include/asm-i386/pgtable.h	2004-05-26 06:04:54 +10:00
+++ edited/include/asm-i386/pgtable.h	2004-05-26 15:34:57 +10:00
@@ -325,9 +325,13 @@
  * bit at the same time.
  */
 #define update_mmu_cache(vma,address,pte) do { } while (0)
-#define ptep_update_dirty_accessed(__ptep, __entry, __dirty)	\
-	do {							\
-		if (__dirty) set_pte(__ptep, __entry);		\
+#define  __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+	do {								  \
+		if (__dirty) {						  \
+			set_pte(__ptep, __entry);		  	  \
+			flush_tlb_page(__vma, __address);		  \
+		}							  \
 	} while (0)
 
 /* Encode and de-code a swap entry */
===== include/asm-ppc64/pgtable.h 1.33 vs edited =====
--- 1.33/include/asm-ppc64/pgtable.h	2004-05-23 07:56:24 +10:00
+++ edited/include/asm-ppc64/pgtable.h	2004-05-26 16:47:05 +10:00
@@ -306,7 +306,10 @@
 	return old;
 }
 
-/* PTE updating functions */
+/* PTE updating functions, this function puts the PTE in the
+ * batch, doesn't actually triggers the hash flush immediately,
+ * you need to call flush_tlb_pending() to do that.
+ */
 extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot);
 
 static inline int ptep_test_and_clear_young(pte_t *ptep)
@@ -318,7 +321,7 @@
 	old = pte_update(ptep, _PAGE_ACCESSED);
 	if (old & _PAGE_HASHPTE) {
 		hpte_update(ptep, old, 0);
-		flush_tlb_pending();	/* XXX generic code doesn't flush */
+		flush_tlb_pending();
 	}
 	return (old & _PAGE_ACCESSED) != 0;
 }
@@ -396,10 +399,36 @@
  */
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	if (pte_present(*ptep))
+	if (pte_present(*ptep)) {
 		pte_clear(ptep);
+		flush_tlb_pending();
+	}
 	*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
 }
+
+/* Set the dirty and/or accessed bits atomically in a linux PTE, this
+ * function doesn't need to flush the hash entry
+ */
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty)
+{
+	unsigned long bits = pte_val(entry) &
+		(_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW);
+	unsigned long old, tmp;
+
+	__asm__ __volatile__(
+        "1:     ldarx   %0,0,%4\n\
+	        andi.	%1,%0,%6\n\
+	        bne-	1b \n\
+                or      %0,%3,%0\n\
+                stdcx.  %0,0,%4\n\
+                bne-    1b"
+	:"=&r" (old), "=&r" (tmp), "=m" (*ptep)
+	:"r" (bits), "r" (ptep), "m" (ptep), "i" (_PAGE_BUSY)
+	:"cc");
+}
+#define  ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+        __ptep_set_access_flags(__ptep, __entry, __dirty)
 
 /*
  * Macro to mark a page protection value as "uncacheable".
===== mm/memory.c 1.177 vs edited =====
--- 1.177/mm/memory.c	2004-05-26 05:37:09 +10:00
+++ edited/mm/memory.c	2004-05-26 15:35:40 +10:00
@@ -1004,7 +1004,11 @@
 	flush_cache_page(vma, address);
 	entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)),
 			      vma);
-	ptep_establish(vma, address, page_table, entry, 1);
+
+	/* Get rid of the old entry, replace with new */
+	ptep_clear_flush(vma, address, page_table);
+	set_pte(page_table, entry);
+
 	update_mmu_cache(vma, address, entry);
 }
 
@@ -1056,7 +1060,7 @@
 			flush_cache_page(vma, address);
 			entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
 					      vma);
-			ptep_establish(vma, address, page_table, entry, 1);
+       			ptep_set_access_flags(vma, address, page_table, entry, 1);
 			update_mmu_cache(vma, address, entry);
 			pte_unmap(page_table);
 			spin_unlock(&mm->page_table_lock);
@@ -1646,7 +1650,7 @@
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	ptep_establish(vma, address, pte, entry, write_access);
+	ptep_set_access_flags(vma, address, pte, entry, write_access);
 	update_mmu_cache(vma, address, entry);
 	pte_unmap(pte);
 	spin_unlock(&mm->page_table_lock);



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-26  2:26                               ` Linus Torvalds
@ 2004-05-26  7:06                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 79+ messages in thread
From: Andrea Arcangeli @ 2004-05-26  7:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ivan Kokshaysky, Matthew Wilcox, Benjamin Herrenschmidt,
	Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise,
	linux-mm, Architectures Group

On Tue, May 25, 2004 at 07:26:21PM -0700, Linus Torvalds wrote:
> You're reading it wrong.
> 
> The "including when the present flag is set to zero" part does not mean 
> that the present flag was zero _before_, it means "is being set to zero" 
> as in "having been non-zero before that".

"having been non-zero before that" makes a lot more sense indeed, the
wording in the specs wasn't the best IMHO.  Interestingly the
ptep_establish at the end of handle_pte_fault would have hidden any
double fault completely, nobody but a tracer would have noticed that,
but it made very little sense that non-present entries can be cached.
It's all clear now thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH] ppc32 implementation of ptep_set_access_flags
  2004-05-26  6:55                                                       ` Benjamin Herrenschmidt
@ 2004-05-26  7:11                                                         ` Benjamin Herrenschmidt
  2004-05-26 15:22                                                           ` Linus Torvalds
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26  7:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel list

Here's the ppc32 implementation of ptep_set_access_flags:

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

===== include/asm-ppc/pgtable.h 1.32 vs edited =====
--- 1.32/include/asm-ppc/pgtable.h	2004-05-23 07:56:24 +10:00
+++ edited/include/asm-ppc/pgtable.h	2004-05-26 17:07:35 +10:00
@@ -548,6 +548,16 @@
 	pte_update(ptep, 0, _PAGE_DIRTY);
 }
 
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty)
+{
+	unsigned long bits = pte_val(entry) &
+		(_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW);
+	pte_update(ptep, 0, bits);
+}
+#define  ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
+        __ptep_set_access_flags(__ptep, __entry, __dirty)
+
 /*
  * Macro to mark a page protection value as "uncacheable".
  */



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc32 implementation of ptep_set_access_flags
  2004-05-26  7:11                                                         ` [PATCH] ppc32 implementation of ptep_set_access_flags Benjamin Herrenschmidt
@ 2004-05-26 15:22                                                           ` Linus Torvalds
  2004-05-26 18:49                                                             ` David S. Miller
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Linus Torvalds @ 2004-05-26 15:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Andrew Morton, Linux Kernel list


On Wed, 26 May 2004, Benjamin Herrenschmidt wrote:
>
> Here's the ppc32 implementation of ptep_set_access_flags:

Ok. I modified the way things are done a bit, to make it easier
to keep architectures that haven't been updated yet in working order.

What I did was to basically split up the old "ptep_establish()" into a new 
"ptep_establish()" that is only used for COW, and your 
"ptep_update_accessed_bits()", which is used for the other cases.

I also left the default implementations in <asm-generic/pgtable.h> as 
exactly the same as the default implementation used to be for the old 
"ptep_establish()", so architectures that have not been updated to
take advantage of the split should work the way they always did. Except 
s390, which now gets the default function for the accessed bits update 
(which should be at least pretty close to correct for s390 too, I think 
the problem for s390 was the COW-case).

The "new" rules (well, they aren't new, but now they are explicitly
spelled out) for this thing are:

 - ptep_establish(__vma, __address, __ptep, __entry)

	Establish a new mapping:
	 - flush the old one
	 - update the page tables
	 - inform the TLB about the new one

	We hold the mm semaphore for reading and vma->vm_mm->page_table_lock.

	Note: the old pte is known to not be writable, so we don't need to
	worry about dirty bits etc getting lost.

 - ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty)

	Largely same as above, but only sets the access flags (dirty,
	accessed, and writable). Furthermore, we know it always gets set
	to a "more permissive" setting, which allows most architectures
	to optimize this.

and right now they both just default to

	set_pte(__ptep, __entry);
	flush_tlb_page(__vma, __address);

unless overridden by the architecture. 

			Linus

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc32 implementation of ptep_set_access_flags
  2004-05-26 15:22                                                           ` Linus Torvalds
@ 2004-05-26 18:49                                                             ` David S. Miller
  2004-05-26 21:43                                                             ` Benjamin Herrenschmidt
  2004-05-28  1:29                                                             ` David Mosberger
  2 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2004-05-26 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: benh, akpm, linux-kernel

On Wed, 26 May 2004 08:22:35 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> The "new" rules (well, they aren't new, but now they are explicitly
> spelled out) for this thing are

Looks great to me.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc32 implementation of ptep_set_access_flags
  2004-05-26 15:22                                                           ` Linus Torvalds
  2004-05-26 18:49                                                             ` David S. Miller
@ 2004-05-26 21:43                                                             ` Benjamin Herrenschmidt
  2004-05-28  1:29                                                             ` David Mosberger
  2 siblings, 0 replies; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-26 21:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel list

On Thu, 2004-05-27 at 01:22, Linus Torvalds wrote:

> The "new" rules (well, they aren't new, but now they are explicitly
> spelled out) for this thing are:

Looks fine. I'll dbl check everything on the g5 once I reach the office

Ben.



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-25  4:53                   ` Andrea Arcangeli
@ 2004-05-27 21:56                     ` David Mosberger
  2004-05-27 22:00                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 79+ messages in thread
From: David Mosberger @ 2004-05-27 21:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: davidm, Linus Torvalds, Benjamin Herrenschmidt, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

>>>>> On Tue, 25 May 2004 06:53:22 +0200, Andrea Arcangeli <andrea@suse.de> said:

  >> If the "accessed" or "dirty" bits are zero, accessing/writing the
  >> page will cause a fault which will be handled in a low-level
  >> fault handler.  The Linux version of these handlers simply turn
  >> on the respective bit.  See daccess_bit(), iaccess_bit(), and dirty_bit()
  >> in arch/ia64/kernel/ivt.S.

  Andrea> so you mean, this is being set in the arch section before
  Andrea> ever reaching handle_mm_fault?

Correct.  The low-level fault handlers set the ACCESSED/DIRTY bits
with an atomic compare-and-exchange (on SMP).  They don't (normally)
bubble up all the way to the Linux page-fault handler.

	--david

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-27 21:56                     ` David Mosberger
@ 2004-05-27 22:00                       ` Benjamin Herrenschmidt
  2004-05-27 22:12                         ` David Mosberger
  0 siblings, 1 reply; 79+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-27 22:00 UTC (permalink / raw)
  To: davidm
  Cc: Andrea Arcangeli, Linus Torvalds, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

On Fri, 2004-05-28 at 07:56, David Mosberger wrote:
> >>>>> On Tue, 25 May 2004 06:53:22 +0200, Andrea Arcangeli <andrea@suse.de> said:
> 
>   >> If the "accessed" or "dirty" bits are zero, accessing/writing the
>   >> page will cause a fault which will be handled in a low-level
>   >> fault handler.  The Linux version of these handlers simply turn
>   >> on the respective bit.  See daccess_bit(), iaccess_bit(), and dirty_bit()
>   >> in arch/ia64/kernel/ivt.S.
> 
>   Andrea> so you mean, this is being set in the arch section before
>   Andrea> ever reaching handle_mm_fault?
> 
> Correct.  The low-level fault handlers set the ACCESSED/DIRTY bits
> with an atomic compare-and-exchange (on SMP).  They don't (normally)
> bubble up all the way to the Linux page-fault handler.

Same for PPC, but we still have a race where we can reach that code, see
my initial mail (typically, because your low level code, for obvious
reasons, doesn't take the mm semaphore, thus the page may have been
mapped right after you decide to go to do_page_fault and before it takes
the mm sem).

Ben.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
  2004-05-27 22:00                       ` Benjamin Herrenschmidt
@ 2004-05-27 22:12                         ` David Mosberger
  0 siblings, 0 replies; 79+ messages in thread
From: David Mosberger @ 2004-05-27 22:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: davidm, Andrea Arcangeli, Linus Torvalds, Andrew Morton,
	Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm,
	Architectures Group

>>>>> On Fri, 28 May 2004 08:00:54 +1000, Benjamin Herrenschmidt <benh@kernel.crashing.org> said:

  Benjamin> Same for PPC, but we still have a race where we can reach
  Benjamin> that code, see my initial mail (typically, because your
  Benjamin> low level code, for obvious reasons, doesn't take the mm
  Benjamin> semaphore, thus the page may have been mapped right after
  Benjamin> you decide to go to do_page_fault and before it takes the
  Benjamin> mm sem).

Yes; I was only explaining how this works on ia64.

	--david

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH] ppc32 implementation of ptep_set_access_flags
  2004-05-26 15:22                                                           ` Linus Torvalds
  2004-05-26 18:49                                                             ` David S. Miller
  2004-05-26 21:43                                                             ` Benjamin Herrenschmidt
@ 2004-05-28  1:29                                                             ` David Mosberger
  2 siblings, 0 replies; 79+ messages in thread
From: David Mosberger @ 2004-05-28  1:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list

>>>>> On Wed, 26 May 2004 08:22:35 -0700 (PDT), Linus Torvalds <torvalds@osdl.org> said:

  Linus> What I did was to basically split up the old
  Linus> "ptep_establish()" into a new "ptep_establish()" that is only
  Linus> used for COW, and your "ptep_update_accessed_bits()", which
  Linus> is used for the other cases.

Below is an update for ia64.  I hope I didn't miss anything.

	--david

ia64: Implement race-free ptep_set_access_flags()

Signed-off-by: davidm@hpl.hp.com

===== include/asm-ia64/pgtable.h 1.40 vs edited =====
--- 1.40/include/asm-ia64/pgtable.h	Sat May 22 14:56:24 2004
+++ edited/include/asm-ia64/pgtable.h	Thu May 27 17:12:04 2004
@@ -8,7 +8,7 @@
  * This hopefully works with any (fixed) IA-64 page-size, as defined
  * in <asm/page.h> (currently 8192).
  *
- * Copyright (C) 1998-2003 Hewlett-Packard Co
+ * Copyright (C) 1998-2004 Hewlett-Packard Co
  *	David Mosberger-Tang <davidm@hpl.hp.com>
  */
 
@@ -475,6 +475,42 @@
  * flushing that may be necessary.
  */
 extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+/*
+ * Update PTEP with ENTRY, which is guaranteed to be a less
+ * restrictive PTE.  That is, ENTRY may have the ACCESSED, DIRTY, and
+ * WRITABLE bits turned on, when the value at PTEP did not.  The
+ * WRITABLE bit may only be turned if SAFELY_WRITABLE is TRUE.
+ *
+ * SAFELY_WRITABLE is TRUE if we can update the value at PTEP without
+ * having to worry about races.  On SMP machines, there are only two
+ * cases where this is true:
+ *
+ *	(1) *PTEP has the PRESENT bit turned OFF
+ *	(2) ENTRY has the DIRTY bit turned ON
+ *
+ * On ia64, we could implement this routine with a cmpxchg()-loop
+ * which ORs in the _PAGE_A/_PAGE_D bit if they're set in ENTRY.
+ * However, like on x86, we can get a more streamlined version by
+ * observing that it is OK to drop ACCESSED bit updates when
+ * SAFELY_WRITABLE is FALSE.  Besides being rare, all that would do is
+ * result in an extra Access-bit fault, which would then turn on the
+ * ACCESSED bit in the low-level fault handler (iaccess_bit or
+ * daccess_bit in ivt.S).
+ */
+#ifdef CONFIG_SMP
+# define ptep_set_access_flags(__vma, __addr, __ptep, __entry, __safely_writable)	\
+do {											\
+	if (__safely_writable) {							\
+		set_pte(__ptep, __entry);						\
+		flush_tlb_page(__vma, __addr);						\
+	}										\
+} while (0)
+#else
+# define ptep_set_access_flags(__vma, __addr, __ptep, __entry, __safely_writable)	\
+	ptep_establish(__vma, __addr, __ptep, __entry)
+#endif
 
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
   /* arch mem_map init routine is needed due to holes in a virtual mem_map */

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2004-05-28  1:29 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-05-24  3:29 [PATCH] ppc64: Fix possible race with set_pte on a present PTE Benjamin Herrenschmidt
2004-05-24  3:47 ` Linus Torvalds
2004-05-24  4:13   ` Benjamin Herrenschmidt
2004-05-24  4:36     ` Linus Torvalds
2004-05-24  4:44       ` Benjamin Herrenschmidt
2004-05-24  5:10         ` Linus Torvalds
2004-05-24  5:34           ` Benjamin Herrenschmidt
2004-05-24  5:38             ` Benjamin Herrenschmidt
2004-05-24  5:52               ` Benjamin Herrenschmidt
2004-05-24  7:39           ` Ingo Molnar
2004-05-24  5:39             ` Benjamin Herrenschmidt
2004-05-25  3:43           ` Andrea Arcangeli
2004-05-25  4:00             ` Linus Torvalds
2004-05-25  4:17               ` Benjamin Herrenschmidt
2004-05-25  4:37                 ` Andrea Arcangeli
2004-05-25  4:40                   ` Benjamin Herrenschmidt
2004-05-25  4:20               ` Andrea Arcangeli
2004-05-25  4:39                 ` Linus Torvalds
2004-05-25  4:44                   ` Linus Torvalds
2004-05-25  4:59                     ` Andrea Arcangeli
2004-05-25  5:09                       ` Andrea Arcangeli
2004-05-25  4:50                   ` Andrea Arcangeli
2004-05-25  4:59                     ` Linus Torvalds
2004-05-25  4:43                 ` David Mosberger
2004-05-25  4:53                   ` Andrea Arcangeli
2004-05-27 21:56                     ` David Mosberger
2004-05-27 22:00                       ` Benjamin Herrenschmidt
2004-05-27 22:12                         ` David Mosberger
2004-05-25 11:44               ` Matthew Wilcox
2004-05-25 14:48                 ` Linus Torvalds
2004-05-25 15:35                   ` Keith M Wesolowski
2004-05-25 16:19                     ` Linus Torvalds
2004-05-25 17:25                       ` David S. Miller
2004-05-25 17:49                         ` Linus Torvalds
2004-05-25 17:54                           ` David S. Miller
2004-05-25 18:05                             ` Linus Torvalds
2004-05-25 20:30                               ` Linus Torvalds
2004-05-25 20:35                               ` David S. Miller
2004-05-25 20:49                                 ` Linus Torvalds
2004-05-25 20:57                                   ` David S. Miller
2004-05-26  6:20                                   ` Keith M Wesolowski
2004-05-25 21:40                               ` Benjamin Herrenschmidt
2004-05-25 21:54                                 ` Linus Torvalds
2004-05-25 22:00                                   ` Linus Torvalds
2004-05-25 22:07                                     ` Benjamin Herrenschmidt
2004-05-25 22:14                                       ` Linus Torvalds
2004-05-26  0:21                                         ` Benjamin Herrenschmidt
2004-05-26  0:50                                           ` Linus Torvalds
2004-05-26  3:25                                             ` Benjamin Herrenschmidt
2004-05-26  4:08                                               ` Linus Torvalds
2004-05-26  4:12                                                 ` Benjamin Herrenschmidt
2004-05-26  4:18                                                   ` Benjamin Herrenschmidt
2004-05-26  4:50                                                     ` Linus Torvalds
2004-05-26  4:49                                                       ` Benjamin Herrenschmidt
2004-05-26  4:28                                                   ` Linus Torvalds
2004-05-26  4:46                                                 ` Benjamin Herrenschmidt
2004-05-26  4:54                                                   ` Linus Torvalds
2004-05-26  4:55                                                     ` Benjamin Herrenschmidt
2004-05-26  5:41                                                     ` Benjamin Herrenschmidt
2004-05-26  5:59                                                     ` [PATCH] (signoff) " Benjamin Herrenschmidt
2004-05-26  6:55                                                       ` Benjamin Herrenschmidt
2004-05-26  7:11                                                         ` [PATCH] ppc32 implementation of ptep_set_access_flags Benjamin Herrenschmidt
2004-05-26 15:22                                                           ` Linus Torvalds
2004-05-26 18:49                                                             ` David S. Miller
2004-05-26 21:43                                                             ` Benjamin Herrenschmidt
2004-05-28  1:29                                                             ` David Mosberger
2004-05-25 22:05                                   ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Benjamin Herrenschmidt
2004-05-25 22:09                                 ` Linus Torvalds
2004-05-25 22:19                                   ` Benjamin Herrenschmidt
2004-05-25 22:24                                     ` Linus Torvalds
2004-05-25 21:27                   ` Andrea Arcangeli
2004-05-25 21:43                     ` Linus Torvalds
2004-05-25 21:55                       ` Andrea Arcangeli
2004-05-25 22:01                         ` Linus Torvalds
2004-05-25 22:18                           ` Ivan Kokshaysky
2004-05-25 22:42                             ` Andrea Arcangeli
2004-05-26  2:26                               ` Linus Torvalds
2004-05-26  7:06                                 ` Andrea Arcangeli
2004-05-25 21:44                     ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).