LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
@ 2006-12-06 16:43 David Howells
  2006-12-06 17:21 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 76+ messages in thread
From: David Howells @ 2006-12-06 16:43 UTC (permalink / raw)
  To: torvalds, akpm, linux-arm-kernel; +Cc: linux-kernel, linux-arch, dhowells

From: David Howells <dhowells@redhat.com>

Implement generic UP cmpxchg() where an arch doesn't otherwise support it.
This assuming that the arch doesn't have support SMP without providing its own
cmpxchg() implementation.

This is required because cmpxchg() is used by the reduced work queue patches to
adjust the management data in a work_struct.


Also provide ARMv6 with a cmpxchg() implementation using LDREX/STREXEQ.  Pre-v6
ARM doesn't support SMP according to ARM's atomic.h, so the generic
IRQ-disablement based cmpxchg() is entirely adequate there (if it isn't, then
atomic_cmpxchg() is also broken on ARM).


Furthermore, if the generic IRQ-disablement cmpxchg() is being used, then a
preprocessor symbol is defined (ARCH_USING_IRQ_BASED_CMPXCHG) so that code
using it can make alternative arrangements where such would be more efficient.

__queue_work() in kernel/workqueue.c then makes use of this as it disables
interrupts before attempting to modify the work queue pointer in the work item.

With the reduction patches, cmpxchg() is called indirectly from within
__queue_work(), and so it's more efficient to use __queue_work()'s interrupt
disablement and just make the desired change directly if the aforementioned
ARCH_USING_IRQ_BASED_CMPXCHG is set rather than using cmpxchg() at all.

Unfortunately, the compiler's optimiser can't discard excessive interrupt
disablement, so optimisation here has to be done manually, if at all.

Signed-Off-By: David Howells <dhowells@redhat.com>
---

 include/asm-arm/system.h      |   40 ++++++++++++++++++++++++++++++++++
 include/asm-generic/cmpxchg.h |   48 +++++++++++++++++++++++++++++++++++++++++
 kernel/workqueue.c            |   37 +++++++++++++++++++++++++++++++-
 3 files changed, 124 insertions(+), 1 deletions(-)

diff --git a/include/asm-arm/system.h b/include/asm-arm/system.h
index f05fbe3..f16e42d 100644
--- a/include/asm-arm/system.h
+++ b/include/asm-arm/system.h
@@ -325,6 +325,46 @@ #endif
 extern void disable_hlt(void);
 extern void enable_hlt(void);
 
+/*
+ * We only implement cmpxchg in ASM on ARMv6 where we have LDREX/STREX
+ * available, and we only implement it for word-sized exchanges
+ */
+#if __LINUX_ARM_ARCH__ >= 6
+extern void __bad_cmpxchg(volatile void *, int);
+
+#define cmpxchg(ptr, old, new)						\
+({									\
+	__typeof__ (ptr) ____p = (ptr);					\
+	__typeof__(*ptr) ____old = (old);				\
+	__typeof__(*ptr) ____new = (new);				\
+	__typeof__(*ptr) ____oldval;					\
+	__typeof__(*ptr) ____res;					\
+									\
+	switch (sizeof(____res)) {					\
+	case 4:								\
+		do {							\
+			__asm__ __volatile__("@ cmpxchg\n"		\
+			"ldrex	%1, [%2]\n"				\
+			"mov	%0, #0\n"				\
+			"teq	%1, %3\n"				\
+			"strexeq %0, %4, [%2]\n"			\
+			: "=&r" (____res), "=&r" (____oldval)		\
+			: "r" (____p), "Ir" (____old), "r" (____new)	\
+			: "cc");					\
+		} while(____res);					\
+		break;							\
+	default:							\
+		__bad_cmpxchg(____p, sizeof(____res));			\
+		____oldval = 0;						\
+		break;							\
+	}								\
+	____oldval;							\
+})
+
+#else
+#include <asm-generic/cmpxchg.h>
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #define arch_align_stack(x) (x)
diff --git a/include/asm-generic/cmpxchg.h b/include/asm-generic/cmpxchg.h
new file mode 100644
index 0000000..4316092
--- /dev/null
+++ b/include/asm-generic/cmpxchg.h
@@ -0,0 +1,48 @@
+/* Generic cmpxchg for those arches that don't implement it themselves
+ *
+ * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_CMPXCHG_H
+#define _ASM_GENERIC_CMPXCHG_H
+
+#if !defined(cmpxchg) && !defined(CONFIG_SMP)
+
+/**
+ * cmpxchg - Atomically conditionally exchange one value for another.
+ * @ptr - Pointer to the value to be altered.
+ * @old - The value to change from.
+ * @new - The value to change to.
+ *
+ * This function atomically compares the current value at the word pointed to
+ * by @ptr, and if it's the same as @old, changes it to @new.  If it's not the
+ * same then it's left unchanged.
+ *
+ * The value that was in the word pointed to by @ptr is returned, whether or
+ * not it was changed to @new.
+ */
+#define cmpxchg(ptr, old, new)			\
+({						\
+	unsigned long ____flags;		\
+	__typeof__ (ptr) ____p = (ptr);		\
+	__typeof__(*ptr) ____old = (old);	\
+	__typeof__(*ptr) ____new = (new);	\
+	__typeof__(*ptr) ____res;		\
+	raw_local_irq_save(____flags);		\
+	____res = *____p;			\
+	if (likely(____res == (____old)))	\
+		*____p = (____new);		\
+	raw_local_irq_restore(____flags);	\
+	____res;				\
+})
+
+#define ARCH_USING_IRQ_BASED_CMPXCHG
+
+#endif /* !cmpxchg && !SMP */
+#endif /* _ASM_GENERIC_CMPXCHG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8d1e7cb..fdf0c30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -80,6 +80,11 @@ static inline int is_single_threaded(str
 	return list_empty(&wq->list);
 }
 
+/*
+ * set the work queue pointer in the work item's management data
+ * - must retain the NAR flag, but the pending flag can just be set
+ */
+#ifndef ARCH_USING_IRQ_BASED_CMPXCHG
 static inline void set_wq_data(struct work_struct *work, void *wq)
 {
 	unsigned long new, old, res;
@@ -98,6 +103,36 @@ static inline void set_wq_data(struct wo
 	}
 }
 
+#define __set_wq_data(w, wq) set_wq_data((w), (wq))
+
+#else
+/*
+ * set the work queue pointer when the caller guarantees atomicity by disabling
+ * interrupts on a UP box (where there's no CMPXCHG equivalent)
+ */
+static inline void __set_wq_data(struct work_struct *work, void *wq)
+{
+	unsigned long tmp;
+
+	tmp = work->management & WORK_STRUCT_FLAG_MASK;
+	tmp |= (unsigned long) wq;
+	tmp |= 1UL << WORK_STRUCT_PENDING;
+	work->management = tmp;
+}
+
+static inline void set_wq_data(struct work_struct *work, void *wq)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__set_wq_data(work, wq);
+	local_irq_restore(flags);
+}
+#endif
+
+/*
+ * get the workqueue data from the work item's management data
+ */
 static inline void *get_wq_data(struct work_struct *work)
 {
 	return (void *) (work->management & WORK_STRUCT_WQ_DATA_MASK);
@@ -110,7 +145,7 @@ static void __queue_work(struct cpu_work
 	unsigned long flags;
 
 	spin_lock_irqsave(&cwq->lock, flags);
-	set_wq_data(work, cwq);
+	__set_wq_data(work, cwq);
 	list_add_tail(&work->entry, &cwq->worklist);
 	cwq->insert_sequence++;
 	wake_up(&cwq->more_work);

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 16:43 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it David Howells
@ 2006-12-06 17:21 ` Linus Torvalds
  2006-12-06 18:56 ` Christoph Lameter
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 17:21 UTC (permalink / raw)
  To: David Howells; +Cc: akpm, linux-arm-kernel, linux-kernel, linux-arch



On Wed, 6 Dec 2006, David Howells wrote:
> 
> Implement generic UP cmpxchg() where an arch doesn't otherwise support it.
> This assuming that the arch doesn't have support SMP without providing its own
> cmpxchg() implementation.

This is too ugly to live. At least the kernel/workqueue.c part.

The requirement that everybody implement a workable cmpxchg (and falling 
back to spinlocks if required as per the atomic bit operations) looks ok, 
but don't show that in generic code.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 16:43 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it David Howells
  2006-12-06 17:21 ` Linus Torvalds
@ 2006-12-06 18:56 ` Christoph Lameter
  2006-12-06 19:00   ` Russell King
  2006-12-06 19:05   ` Linus Torvalds
  2006-12-06 19:12 ` Lennert Buytenhek
  2006-12-06 19:47 ` David Howells
  3 siblings, 2 replies; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 18:56 UTC (permalink / raw)
  To: David Howells; +Cc: torvalds, akpm, linux-arm-kernel, linux-kernel, linux-arch

I'd really appreciate a cmpxchg that is generically available for 
all arches. It will allow lockless implementation for various performance 
criticial portions of the kernel.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 18:56 ` Christoph Lameter
@ 2006-12-06 19:00   ` Russell King
  2006-12-06 19:16     ` Christoph Lameter
  2006-12-06 19:05   ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-06 19:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-kernel,
	linux-arch

On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> I'd really appreciate a cmpxchg that is generically available for 
> all arches. It will allow lockless implementation for various performance 
> criticial portions of the kernel.

Let's recap on cmpxchg.

For CPUs with no atomic operation other than SWP, it is not lockless.
For CPUs with load locked + store conditional, it is expensive.

I've shown in the past that a generic implementation based around ll/sc
can be cheaply implemented using cmpxchg.  The reverse is *not* true.

If you want an operation for performance critical portions of the
kernel, please allow architecture maintainers the freedom to use their
best performance enhancements.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 18:56 ` Christoph Lameter
  2006-12-06 19:00   ` Russell King
@ 2006-12-06 19:05   ` Linus Torvalds
  2006-12-06 19:08     ` Al Viro
                       ` (3 more replies)
  1 sibling, 4 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:05 UTC (permalink / raw)
  To: Christoph Lameter, Russell King
  Cc: David Howells, Andrew Morton, linux-arm-kernel,
	Linux Kernel Mailing List, linux-arch



On Wed, 6 Dec 2006, Christoph Lameter wrote:
>
> I'd really appreciate a cmpxchg that is generically available for 
> all arches. It will allow lockless implementation for various performance 
> criticial portions of the kernel.

I suspect ARM may have been the last one without one, no?

That said, cmpxchg won't necessarily be "high-performance" unless the hw 
supports it naturally in hardware, so..

Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
like to get an ack from the ARM maintainer before applying it, but it 
looked ok.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:05   ` Linus Torvalds
@ 2006-12-06 19:08     ` Al Viro
  2006-12-06 19:25       ` Linus Torvalds
  2006-12-07  1:09       ` David Miller
  2006-12-06 19:26     ` Matthew Wilcox
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 76+ messages in thread
From: Al Viro @ 2006-12-06 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?

No.  sparc32 doesn't have one, for instance.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 16:43 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it David Howells
  2006-12-06 17:21 ` Linus Torvalds
  2006-12-06 18:56 ` Christoph Lameter
@ 2006-12-06 19:12 ` Lennert Buytenhek
  2006-12-06 19:47 ` David Howells
  3 siblings, 0 replies; 76+ messages in thread
From: Lennert Buytenhek @ 2006-12-06 19:12 UTC (permalink / raw)
  To: David Howells; +Cc: torvalds, akpm, linux-arm-kernel, linux-arch, linux-kernel

On Wed, Dec 06, 2006 at 04:43:14PM +0000, David Howells wrote:

> Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,

That's not quite true, there exist ARMv5 processors that in theory
can support SMP.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:00   ` Russell King
@ 2006-12-06 19:16     ` Christoph Lameter
  2006-12-06 19:28       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 19:16 UTC (permalink / raw)
  To: Russell King
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-kernel,
	linux-arch

On Wed, 6 Dec 2006, Russell King wrote:

> On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> Let's recap on cmpxchg.
> 
> For CPUs with no atomic operation other than SWP, it is not lockless.

But then its also just requires disable/enable interrupts on UP which may 
be cheaper than an atomic operation.

> For CPUs with load locked + store conditional, it is expensive.

Because it locks the bus? I am not that familiar with those architectures 
but it seems that those will have a general problem anyways.

> If you want an operation for performance critical portions of the
> kernel, please allow architecture maintainers the freedom to use their
> best performance enhancements.

And thereby denying the kernel developers to use a simple atomic SMP 
operation? Adding additional defines for each arch and each performance 
critical piece of kernel logic?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:08     ` Al Viro
@ 2006-12-06 19:25       ` Linus Torvalds
  2006-12-06 19:29         ` Matthew Wilcox
  2006-12-06 19:43         ` David Howells
  2006-12-07  1:09       ` David Miller
  1 sibling, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Lameter, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch



On Wed, 6 Dec 2006, Al Viro wrote:
> 
> No.  sparc32 doesn't have one, for instance.

Ok. For SMP-safety, it's important that any architecture that can't do 
this needs to _share_ the same spinlock (on SMP only, of course) that it 
uses for the bitops. 

It would be good (but perhaps not as strict a requirement) if the atomic 
counters also use the same lock. But that is probably impossible on 
sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a 
cmpxchg() on an atomic_t would be a bug.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:05   ` Linus Torvalds
  2006-12-06 19:08     ` Al Viro
@ 2006-12-06 19:26     ` Matthew Wilcox
  2006-12-06 19:29       ` Christoph Lameter
                         ` (2 more replies)
  2006-12-06 20:00     ` Russell King
  2006-12-07 15:06     ` Russell King
  3 siblings, 3 replies; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?

It's just been pointed out to me that the parisc one isn't safe.

<dhowells> imagine variable X is set to 3
<dhowells> CPU A issues cmpxchg(&X, 3, 5)
<dhowells> you'd expect that to change X to 5
<dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
and it setting X?

Given parisc's paucity of atomic operations (load-and-zero-32bit and
load-and-zero-64bit), cmpxchg() is impossible to implement safely.
There has to be something we can hook to exclude another processor
modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
atomic_set locked against it.

Of course, using cmpxchg() isn't really lockless.  It's just hidden
locking.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:16     ` Christoph Lameter
@ 2006-12-06 19:28       ` Linus Torvalds
  2006-12-06 19:58       ` Russell King
  2006-12-07  0:46       ` Ralf Baechle
  2 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Russell King, David Howells, akpm, linux-arm-kernel,
	linux-kernel, linux-arch



On Wed, 6 Dec 2006, Christoph Lameter wrote:
> 
> > For CPUs with load locked + store conditional, it is expensive.
> 
> Because it locks the bus? I am not that familiar with those architectures 
> but it seems that those will have a general problem anyways.

load_locked + store_conditional should _not_ be any more expensive than 
any atomic sequence will always be.

Atomic sequences in SMP are obviously never "cheap". But cmpxchg shouldn't 
be any more expensive than any other atomic sequence if you have 
load-locked and store-conditional.

There are obviously _implementation_ bugs. The early alpha's had such an 
atrocious ldl/stc that it wasn't even funny. That might be true in other 
implementations too, but it's definitely not cmpxchg-specific if so. It 
will affect _any_ atomic ops on such an architecture (atomic_inc() and 
friends)

			Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:25       ` Linus Torvalds
@ 2006-12-06 19:29         ` Matthew Wilcox
  2006-12-06 19:54           ` Linus Torvalds
  2006-12-06 19:43         ` David Howells
  1 sibling, 1 reply; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christoph Lameter, Russell King, David Howells,
	Andrew Morton, linux-arm-kernel, Linux Kernel Mailing List,
	linux-arch

On Wed, Dec 06, 2006 at 11:25:35AM -0800, Linus Torvalds wrote:
> Ok. For SMP-safety, it's important that any architecture that can't do 
> this needs to _share_ the same spinlock (on SMP only, of course) that it 
> uses for the bitops. 

That doesn't help, since assignment can't be guarded by any lock.

> It would be good (but perhaps not as strict a requirement) if the atomic 
> counters also use the same lock. But that is probably impossible on 
> sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a 
> cmpxchg() on an atomic_t would be a bug.

sparc32 switched over to the parisc way of doing things, so they could
expand their atomic_t to a full 32 bits.  They still have the old
atomic_24_t lying around for their arch-private use, but atomic_t uses a
hashed spinlock.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:26     ` Matthew Wilcox
@ 2006-12-06 19:29       ` Christoph Lameter
  2006-12-06 19:36         ` Matthew Wilcox
  2006-12-06 19:34       ` Linus Torvalds
  2006-12-06 19:45       ` David Howells
  2 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 19:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> It's just been pointed out to me that the parisc one isn't safe.
> 
> <dhowells> imagine variable X is set to 3
> <dhowells> CPU A issues cmpxchg(&X, 3, 5)
> <dhowells> you'd expect that to change X to 5
> <dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
> and it setting X?

The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and 
then other cpu performs a store before the next instruction.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:26     ` Matthew Wilcox
  2006-12-06 19:29       ` Christoph Lameter
@ 2006-12-06 19:34       ` Linus Torvalds
  2006-12-06 19:41         ` Matthew Wilcox
  2006-12-06 19:45       ` David Howells
  2 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch



On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> Given parisc's paucity of atomic operations (load-and-zero-32bit and
> load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> There has to be something we can hook to exclude another processor
> modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
> atomic_set locked against it.

How do you to the atomic bitops?

Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()" 
would be different. ANY cmpxchg() needs to be atomic - if it's not, 
there's no point to the operation at all, since you'd just write it as

	if (*p == x)
		*p = y;

instead, and not use cmpxchg(). 

So yes, architectures without native support (where "native" includes 
load-locked + store-conditional) always need to

 - on UP, just disable interrupts
 - on SMP, use a spinlock (with interrupts disabled), and share that 
   spinlock with all the other atomic ops (bitops at a minimum - the 
   "atomic_t" operations have traditionally been in another "locking 
   space" because of sparc32 historic braindamage, but I'd suggest 
   sharing the same spinlock between them all).

And yeah, it sucks. You _can_ (if you really want to) make the spinlock be 
hashed based on the address of the atomic data structure. That at least 
allows you to do _multiple_ spinlocks, but let's face it, your real 
problem is _likely_ going to be cacheline bouncing, not contention, and 
then using a hashed lock won't be likely to buy you all that much.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:29       ` Christoph Lameter
@ 2006-12-06 19:36         ` Matthew Wilcox
  2006-12-06 19:47           ` Christoph Lameter
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 19:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:29:42AM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> > It's just been pointed out to me that the parisc one isn't safe.
> > 
> > <dhowells> imagine variable X is set to 3
> > <dhowells> CPU A issues cmpxchg(&X, 3, 5)
> > <dhowells> you'd expect that to change X to 5
> > <dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
> > and it setting X?
> 
> The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and 
> then other cpu performs a store before the next instruction.

For someone who's advocating use of cmpxchg, it seems you don't
understand its semantics!  In the scenario dhowells pointed out, X would
be left set to 5.  X should have the value 6 under any legitimate
implementation:

CPU A		CPU B
cmpxchg(3,5)
		X = 6


CPU A		CPU B
		X = 6
cmpxhcg(3,5)


CPU A
cmpxchg(3,
		X = 6
5)


Given that even yourself got confused about how to use it, perhaps it's
not a good idea to expose this primitive to most programmers anyway?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:34       ` Linus Torvalds
@ 2006-12-06 19:41         ` Matthew Wilcox
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 19:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:34:52AM -0800, Linus Torvalds wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > Given parisc's paucity of atomic operations (load-and-zero-32bit and
> > load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> > There has to be something we can hook to exclude another processor
> > modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
> > atomic_set locked against it.
> 
> How do you to the atomic bitops?

The same way we do atomic_t.

What I hadn't realised (because I hadn't read dhowell's implementation
... because it hasn't shown up on git2.kernel.org yet) is that he
doesn't actually *use* this unlocked-assignment that would cause the
problem.  He uses bitops which use the same locks.

> Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()" 
> would be different. ANY cmpxchg() needs to be atomic - if it's not, 
> there's no point to the operation at all, since you'd just write it as
> 
> 	if (*p == x)
> 		*p = y;
> 
> instead, and not use cmpxchg(). 

The difference is that we can (and do) acquire a lock for atomic_set.
We can't acquire one for X = 6.

>  - on SMP, use a spinlock (with interrupts disabled), and share that 
>    spinlock with all the other atomic ops (bitops at a minimum - the 
>    "atomic_t" operations have traditionally been in another "locking 
>    space" because of sparc32 historic braindamage, but I'd suggest 
>    sharing the same spinlock between them all).

Yep, we agree.

> And yeah, it sucks. You _can_ (if you really want to) make the spinlock be 
> hashed based on the address of the atomic data structure. That at least 
> allows you to do _multiple_ spinlocks, but let's face it, your real 
> problem is _likely_ going to be cacheline bouncing, not contention, and 
> then using a hashed lock won't be likely to buy you all that much.

We do hash based on the address -- and we try to arrange things such
that different spinlocks are acquired for different cachelines.  I don't
know if anyone's benchmarked it recently to see how well it performs.
It's a bit of a waltzing bear [1] at times ;-)

[1] The wonder is not how well it waltzes, but that it waltzes at all

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-06 19:25       ` Linus Torvalds
  2006-12-06 19:29         ` Matthew Wilcox
@ 2006-12-06 19:43         ` David Howells
  1 sibling, 0 replies; 76+ messages in thread
From: David Howells @ 2006-12-06 19:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Christoph Lameter, Russell King,
	David Howells, Andrew Morton, linux-arm-kernel,
	Linux Kernel Mailing List, linux-arch

Matthew Wilcox <matthew@wil.cx> wrote:

> > Ok. For SMP-safety, it's important that any architecture that can't do 
> > this needs to _share_ the same spinlock (on SMP only, of course) that it 
> > uses for the bitops. 
> 
> That doesn't help, since assignment can't be guarded by any lock.

It's not a problem for workqueues, since the only direct assignment to the
management member variable is during initialisation.

But in general cmpxchg() might be a problem with respect to assignment.

atomic_cmpxchg() should be safe wrt atomic_set().

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-06 19:26     ` Matthew Wilcox
  2006-12-06 19:29       ` Christoph Lameter
  2006-12-06 19:34       ` Linus Torvalds
@ 2006-12-06 19:45       ` David Howells
  2 siblings, 0 replies; 76+ messages in thread
From: David Howells @ 2006-12-06 19:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Christoph Lameter, Russell King, David Howells,
	Andrew Morton, linux-arm-kernel, Linux Kernel Mailing List,
	linux-arch

Linus Torvalds <torvalds@osdl.org> wrote:

> Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()" 
> would be different. ANY cmpxchg() needs to be atomic - if it's not, 
> there's no point to the operation at all, since you'd just write it as

It's not that atomic_cmpxchg() is different to cmpxchg(), it's that
atomic_set() is different to direct assignment.

atomic_set() on PA-RISC, for example, has spinlocks in it.

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-06 16:43 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it David Howells
                   ` (2 preceding siblings ...)
  2006-12-06 19:12 ` Lennert Buytenhek
@ 2006-12-06 19:47 ` David Howells
  2006-12-06 20:09   ` Lennert Buytenhek
  3 siblings, 1 reply; 76+ messages in thread
From: David Howells @ 2006-12-06 19:47 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-arch,
	linux-kernel

Lennert Buytenhek <buytenh@wantstofly.org> wrote:

> > Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,
> 
> That's not quite true, there exist ARMv5 processors that in theory
> can support SMP.

I meant that the Linux ARM arch doesn't support it:

	#else /* ARM_ARCH_6 */

	#include <asm/system.h>

	#ifdef CONFIG_SMP
	#error SMP not supported on pre-ARMv6 CPUs
	#endif

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:36         ` Matthew Wilcox
@ 2006-12-06 19:47           ` Christoph Lameter
  2006-12-06 19:50             ` Matthew Wilcox
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 19:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> On Wed, Dec 06, 2006 at 11:29:42AM -0800, Christoph Lameter wrote:
> > On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > 
> > > It's just been pointed out to me that the parisc one isn't safe.
> > > 
> > > <dhowells> imagine variable X is set to 3
> > > <dhowells> CPU A issues cmpxchg(&X, 3, 5)
> > > <dhowells> you'd expect that to change X to 5
> > > <dhowells> but what if CPU B assigns 6 to X between cmpxchg reading X
> > > and it setting X?
> > 
> > The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and 
> > then other cpu performs a store before the next instruction.
> 
> For someone who's advocating use of cmpxchg, it seems you don't
> understand its semantics!  In the scenario dhowells pointed out, X would
> be left set to 5.  X should have the value 6 under any legitimate
> implementation:

Nope this is a UP implementation. There is no cpu B.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:47           ` Christoph Lameter
@ 2006-12-06 19:50             ` Matthew Wilcox
  2006-12-06 20:11               ` Christoph Lameter
  0 siblings, 1 reply; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:47:40AM -0800, Christoph Lameter wrote:
> Nope this is a UP implementation. There is no cpu B.

Follow the thread back.  I'm talking about parisc.  Machines exist (are
still being sold) with up to 128 cores.  I think the largest we've
confirmed work are 8 CPUs.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:29         ` Matthew Wilcox
@ 2006-12-06 19:54           ` Linus Torvalds
  2006-12-06 19:56             ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, Christoph Lameter, Russell King, David Howells,
	Andrew Morton, linux-arm-kernel, Linux Kernel Mailing List,
	linux-arch



On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> That doesn't help, since assignment can't be guarded by any lock.

True. Pure assignment will be lost, and is only ok for the case of a pure 
initializer (where nobody can see the old state).

Your problem will be, of course, that any architecture that does this in 
hardware will just DoTheRightThing, and as such, broken architectures with 
bad locking primitives will have to test and do source-level analysis 
more.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:54           ` Linus Torvalds
@ 2006-12-06 19:56             ` Linus Torvalds
  0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 19:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, Christoph Lameter, Russell King, David Howells,
	Andrew Morton, linux-arm-kernel, Linux Kernel Mailing List,
	linux-arch



On Wed, 6 Dec 2006, Linus Torvalds wrote:
> 
> Your problem will be, of course, that any architecture that does this in 
> hardware will just DoTheRightThing, and as such, broken architectures with 
> bad locking primitives will have to test and do source-level analysis 
> more.

(The underlying thread here being that the workqueue stuff _should_ be 
safe in this regard. But the "testing will not catch bugs like these is 
certainly true in general)

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:16     ` Christoph Lameter
  2006-12-06 19:28       ` Linus Torvalds
@ 2006-12-06 19:58       ` Russell King
  2006-12-06 21:36         ` Matthew Wilcox
  2006-12-07  9:31         ` Nick Piggin
  2006-12-07  0:46       ` Ralf Baechle
  2 siblings, 2 replies; 76+ messages in thread
From: Russell King @ 2006-12-06 19:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-kernel,
	linux-arch

On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Russell King wrote:
> 
> > On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> > > I'd really appreciate a cmpxchg that is generically available for 
> > > all arches. It will allow lockless implementation for various performance 
> > > criticial portions of the kernel.
> > 
> > Let's recap on cmpxchg.
> > 
> > For CPUs with no atomic operation other than SWP, it is not lockless.
> 
> But then its also just requires disable/enable interrupts on UP which may 
> be cheaper than an atomic operation.

No.  SWP is atomic on the CPU it's being issued on, especially wrt
interrupts.  Only on one ARM CPU (which is UP) does it have a
questionable use, and there we do it via interrupt disable/enable.

> > For CPUs with load locked + store conditional, it is expensive.
> 
> Because it locks the bus? I am not that familiar with those architectures 
> but it seems that those will have a general problem anyways.

No.  That certainly would be bad for performance.  I can talk
authoritively from the ARM implementation.

When you use a special "ldrex" (load exclusive) instruction, the
CPU remembers the "address + cpu" pair.  If another access occurs
to the same address, this state is reset.

Only if this state is preserved will a "strex" (store exclusive)
instruction succeed.  This instruction returns status indicating
whether it succeeded.

So, to implement cmpxchg, you need to do this:

	; r1 = temporary register
	; r2 = address
	; r4 = new value
	; r3 = returned status
	ldrex	r1, [r2]
	cmp	r1, old_value
	streqex	r3, r4, [r2]

> > If you want an operation for performance critical portions of the
> > kernel, please allow architecture maintainers the freedom to use their
> > best performance enhancements.
> 
> And thereby denying the kernel developers to use a simple atomic SMP 
> operation? Adding additional defines for each arch and each performance 
> critical piece of kernel logic?

No.  If you read what I said, you'll see that you can _cheaply_ use
cmpxchg in a ll/sc based implementation.  Take an atomic increment
operation.

	do {
		old = load_locked(addr);
	} while (store_exclusive(old, old + 1, addr);

On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
instruction, comparing the first arg, and if equal storing the second.
The "load_locked" macro becomes a standard pointer deref.  Ergo, x86
becomes:

	do {
		load value
		manipulate it
		conditional store
	} while not stored

On ll/sc, the load_locked() macro is the load locked instruction.  The
store_exclusive() macro is the exclusive store and it doesn't need to
use the first parameter at all.  Ergo, ARM becomes:

	do {
		ldrex r1, [r2]
		manipulate r1
		strex r0, r1, [r2]
	} while failed

Notice that both are optimal.

Now let's consider the cmpxchg case.

	do {
		val = *addr;
	} while (cmpxchg(val, val + 1, addr);

The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
entirely.  No impact what so ever.

Let's look at the ll/sc case.  The cmpxchg code implemented on this has
to reload the original value, compare it, if equal store the new value.
So:

	do {
		val = *addr;
		(r2 = addr, 
		ldrex r1, [r2]
		compare r1, r0
		strexeq r4, r3, [r2] (store exclusive if equal)
	} while store failed or comparecondition failed

Note how the cmpxchg has _forced_ the ll/sc implementation to become
more complex.

So, let's recap.

Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:05   ` Linus Torvalds
  2006-12-06 19:08     ` Al Viro
  2006-12-06 19:26     ` Matthew Wilcox
@ 2006-12-06 20:00     ` Russell King
  2006-12-07 15:06     ` Russell King
  3 siblings, 0 replies; 76+ messages in thread
From: Russell King @ 2006-12-06 20:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?
> 
> That said, cmpxchg won't necessarily be "high-performance" unless the hw 
> supports it naturally in hardware, so..
> 
> Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
> like to get an ack from the ARM maintainer before applying it, but it 
> looked ok.

I've only recently got in today, so it'll be a while (== tomorrow) before
I can get around to doing any tests.  (It's evening this side of the
pond.)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:47 ` David Howells
@ 2006-12-06 20:09   ` Lennert Buytenhek
  0 siblings, 0 replies; 76+ messages in thread
From: Lennert Buytenhek @ 2006-12-06 20:09 UTC (permalink / raw)
  To: David Howells; +Cc: torvalds, akpm, linux-arm-kernel, linux-arch, linux-kernel

On Wed, Dec 06, 2006 at 07:47:22PM +0000, David Howells wrote:

> > > Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,
> > 
> > That's not quite true, there exist ARMv5 processors that in theory
> > can support SMP.
> 
> I meant that the Linux ARM arch doesn't support it:

Ah, yes, not currently.  That's true.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:50             ` Matthew Wilcox
@ 2006-12-06 20:11               ` Christoph Lameter
  2006-12-06 20:17                 ` Matthew Wilcox
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 20:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> On Wed, Dec 06, 2006 at 11:47:40AM -0800, Christoph Lameter wrote:
> > Nope this is a UP implementation. There is no cpu B.
> 
> Follow the thread back.  I'm talking about parisc.  Machines exist (are
> still being sold) with up to 128 cores.  I think the largest we've
> confirmed work are 8 CPUs.

And you only have an atomic get and clear atomic op? Amazing. Is this 
really still in production?



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 20:11               ` Christoph Lameter
@ 2006-12-06 20:17                 ` Matthew Wilcox
  0 siblings, 0 replies; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 20:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Russell King, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 12:11:01PM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > Follow the thread back.  I'm talking about parisc.  Machines exist (are
> > still being sold) with up to 128 cores.  I think the largest we've
> > confirmed work are 8 CPUs.
> 
> And you only have an atomic get and clear atomic op? Amazing. Is this 
> really still in production?

Yes.  And yes.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:58       ` Russell King
@ 2006-12-06 21:36         ` Matthew Wilcox
  2006-12-06 21:52           ` Christoph Lameter
  2006-12-07  9:31         ` Nick Piggin
  1 sibling, 1 reply; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 21:36 UTC (permalink / raw)
  To: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Wed, Dec 06, 2006 at 07:58:20PM +0000, Russell King wrote:
> No.  If you read what I said, you'll see that you can _cheaply_ use
> cmpxchg in a ll/sc based implementation.  Take an atomic increment
> operation.
> 
> 	do {
> 		old = load_locked(addr);
> 	} while (store_exclusive(old, old + 1, addr);

[...]

> Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> architectures to produce optimal code.
> 
> Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> to produce optimal code and ll/sc non-optimal code.

And for those of us with only load-and-zero, that's simply:

#define load_locked(addr) spin_lock(hash(addr)), *addr
#define store_exclusive(addr, old, new) \
			*addr = new, spin_unlock(hash(addr)), 0

which is also optimal for us.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 21:36         ` Matthew Wilcox
@ 2006-12-06 21:52           ` Christoph Lameter
  2006-12-06 22:05             ` Matthew Wilcox
  2006-12-06 22:38             ` Linus Torvalds
  0 siblings, 2 replies; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 21:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-kernel,
	linux-arch

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> And for those of us with only load-and-zero, that's simply:
> 
> #define load_locked(addr) spin_lock(hash(addr)), *addr
> #define store_exclusive(addr, old, new) \
> 			*addr = new, spin_unlock(hash(addr)), 0
> 
> which is also optimal for us.

This means we tolerate the assignment race for SMP that was pointed out 
earlier?

cmpxchg emulation may then also be tolerable just replace the irq 
enable/disable in David's implementation with taking a spin lock?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 21:52           ` Christoph Lameter
@ 2006-12-06 22:05             ` Matthew Wilcox
  2006-12-06 22:15               ` Christoph Lameter
  2006-12-07  0:37               ` Roman Zippel
  2006-12-06 22:38             ` Linus Torvalds
  1 sibling, 2 replies; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-06 22:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-kernel,
	linux-arch

On Wed, Dec 06, 2006 at 01:52:20PM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> > And for those of us with only load-and-zero, that's simply:
> > 
> > #define load_locked(addr) spin_lock(hash(addr)), *addr
> > #define store_exclusive(addr, old, new) \
> > 			*addr = new, spin_unlock(hash(addr)), 0
> > 
> > which is also optimal for us.
> 
> This means we tolerate the assignment race for SMP that was pointed out 
> earlier?

What gave you that impression?  It simply wasn't part of this example.

To be honest, it'd be much easier if we only defined these operations on
atomic_t's.  We have all the infrastructure in place for them, and
they're fairly well understood.  If you need different sizes, I'm OK
with an atomic_pointer_t, or whatever.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 22:05             ` Matthew Wilcox
@ 2006-12-06 22:15               ` Christoph Lameter
  2006-12-07  0:37               ` Roman Zippel
  1 sibling, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2006-12-06 22:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, torvalds, akpm, linux-arm-kernel, linux-kernel,
	linux-arch

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> To be honest, it'd be much easier if we only defined these operations on
> atomic_t's.  We have all the infrastructure in place for them, and
> they're fairly well understood.  If you need different sizes, I'm OK
> with an atomic_pointer_t, or whatever.

An pointer is probably one of the most important entities to use with 
cmpxchg (aside from the ints and longs that we already support).

A pointer can be defined to point to any other type. So we would need
atomic_pointer_t(<type_pointed_to>)?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 21:52           ` Christoph Lameter
  2006-12-06 22:05             ` Matthew Wilcox
@ 2006-12-06 22:38             ` Linus Torvalds
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-06 22:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, David Howells, akpm, linux-arm-kernel,
	linux-kernel, linux-arch



On Wed, 6 Dec 2006, Christoph Lameter wrote:
> 
> This means we tolerate the assignment race for SMP that was pointed out 
> earlier?

The assignment "race" doesn't really exist in real code, I suspect.

There's two cases of assignment:

 - pure initialization. This one isn't racy, since it's literally the 
   initial setting of some data structure, and if it's visible in 
   uninitialized state on other CPU's before being initialized, you have 
   bigger problems than some silly assignment race ;)

   So the race doesn't exist for this class of assignment at all, and 
   that's the _common_ case of a pure assignment (ie the "we start out 
   with this field having value X")

 - mixing assignment on one CPU (and not using "atomic_set()") and using 
   "cmpxchg()". 

   This would be a problem on architectures that use an external spinlock 
   or something, but I don't really see how it could be a valid code 
   sequence ANYWAY.

   Any code that does this had better have some _higherlevel_ 
   serialization around that data structure, since while it's not a _bug_ 
   on architectures that do "cmpxchg()" in hardware (as far as the cmpxchg 
   itself is concerned), the fact is, in the absense of any other locking, 
   what the hell is such code supposed to _mean_?

In other words - the second case would be a bug in that it bypasses the 
serialization of the spinlock, but why would you ever do that anyway? What 
could _possibly_ be a valid use of such a "set value blindly" kind of 
sequence? Since another CPU is clearly doing something to the value (ie 
the cmpxchg), the "set it to value X" is just not a well-defined 
operation in the first place, since you don't know what the end result is 
(will it be "X", or will the "cmpxchg" on the other CPU have set it to 
something else?).

So I don't think the race really exists for any sane code anyway. Even if 
people don't use "atomic_set()".

But hey, maybe I'm just ignoring some really odd usage case.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 22:05             ` Matthew Wilcox
  2006-12-06 22:15               ` Christoph Lameter
@ 2006-12-07  0:37               ` Roman Zippel
  2006-12-07  0:54                 ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Roman Zippel @ 2006-12-07  0:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

Hi,

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> To be honest, it'd be much easier if we only defined these operations on
> atomic_t's.  We have all the infrastructure in place for them, and
> they're fairly well understood.  If you need different sizes, I'm OK
> with an atomic_pointer_t, or whatever.

FWIW Seconded.

bye, Roman

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:16     ` Christoph Lameter
  2006-12-06 19:28       ` Linus Torvalds
  2006-12-06 19:58       ` Russell King
@ 2006-12-07  0:46       ` Ralf Baechle
  2 siblings, 0 replies; 76+ messages in thread
From: Ralf Baechle @ 2006-12-07  0:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Russell King, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:

> But then its also just requires disable/enable interrupts on UP which may 
> be cheaper than an atomic operation.
> 
> > For CPUs with load locked + store conditional, it is expensive.
> 
> Because it locks the bus? I am not that familiar with those architectures 
> but it seems that those will have a general problem anyways.

On a decent implementation ll/sc will have the same cost as ordinary
non-atomic load and store instructions.  A likely uniprocessor
implementation uses a single flip-flop ("llbit") in the CPU which is set
by ll and cleared by any exception handler, especially interrupt.  A later
store conditional will then simply fail if that bit is cleared.  That
is extremly trivial stuff.  On SMP it's somewhat more complex; A
processor will have to remember the address used with ll and start
snooping the bus for writes to it.  The store conditional will then
go and upgrade the cache line to exclusive state if the llbit is still
set and perform the store.  The llbit would be cleared if the processor
has snooped any other write to the cacheline.  Details are fun but that's
bascially how it's implemented.

Of course load linked / store conditional are typically used in loops
so there is a little extra overhead from that especially where when the
branch is misspredicted.

Also note there is no locked cycle required to implement load linked /
store conditional.

  Ralf

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  0:37               ` Roman Zippel
@ 2006-12-07  0:54                 ` Linus Torvalds
  2006-12-07  1:05                   ` Roman Zippel
  2006-12-07  9:23                   ` Nick Piggin
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-07  0:54 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Matthew Wilcox, Christoph Lameter, David Howells, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Thu, 7 Dec 2006, Roman Zippel wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> > To be honest, it'd be much easier if we only defined these operations on
> > atomic_t's.  We have all the infrastructure in place for them, and
> > they're fairly well understood.  If you need different sizes, I'm OK
> > with an atomic_pointer_t, or whatever.
> 
> FWIW Seconded.

I disagree.

Any _real_ CPU will simply never care about _anything_ else than just the 
size of the datum in question. There's absolutely no point to only allow 
it on certain types, especially as we _know_ those certain types are 
already going to be more than one, and we also know that they are going to 
be different sizes. In other words, in reality, we have to handle a 
sizeable subset of the whole generic situation, and the "on certain types 
only" situation is only going to be awkward and irritating.

For example, would we have a different "cmpxchg_ptr()" function for the 
atomic pointer thing? With any reasonable CPU just depending on the _size_ 
of the type, I don't see what the problem is with just doing the 
bog-standard "cmpxchg_8/16/32/64" and having the bog-standard case- 
statement in a header file to do it all automatically for you, and then we 
don't need to worry about it.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  0:54                 ` Linus Torvalds
@ 2006-12-07  1:05                   ` Roman Zippel
  2006-12-07  1:18                     ` Linus Torvalds
  2006-12-07  9:23                   ` Nick Piggin
  1 sibling, 1 reply; 76+ messages in thread
From: Roman Zippel @ 2006-12-07  1:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Christoph Lameter, David Howells, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > > To be honest, it'd be much easier if we only defined these operations on
> > > atomic_t's.  We have all the infrastructure in place for them, and
> > > they're fairly well understood.  If you need different sizes, I'm OK
> > > with an atomic_pointer_t, or whatever.
> > 
> > FWIW Seconded.
> 
> I disagree.
> 
> Any _real_ CPU will simply never care about _anything_ else than just the 
> size of the datum in question.

..or alignment which a dedicated atomic type would allow to be attached.

bye, Roman

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:08     ` Al Viro
  2006-12-06 19:25       ` Linus Torvalds
@ 2006-12-07  1:09       ` David Miller
  1 sibling, 0 replies; 76+ messages in thread
From: David Miller @ 2006-12-07  1:09 UTC (permalink / raw)
  To: viro
  Cc: torvalds, clameter, rmk+lkml, dhowells, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

From: Al Viro <viro@ftp.linux.org.uk>
Date: Wed, 6 Dec 2006 19:08:28 +0000

> On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 6 Dec 2006, Christoph Lameter wrote:
> > >
> > > I'd really appreciate a cmpxchg that is generically available for 
> > > all arches. It will allow lockless implementation for various performance 
> > > criticial portions of the kernel.
> > 
> > I suspect ARM may have been the last one without one, no?
> 
> No.  sparc32 doesn't have one, for instance.

That's correct.  It has an atomic swap, but not a cmpxchg.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  1:05                   ` Roman Zippel
@ 2006-12-07  1:18                     ` Linus Torvalds
  2006-12-07  1:24                       ` Roman Zippel
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2006-12-07  1:18 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Matthew Wilcox, Christoph Lameter, David Howells, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Thu, 7 Dec 2006, Roman Zippel wrote:
> > 
> > Any _real_ CPU will simply never care about _anything_ else than just the 
> > size of the datum in question.
> 
> ..or alignment which a dedicated atomic type would allow to be attached.

Can you give any example of a real CPU where alignment matters?

Sure, it needs to be naturally aligned, but that's true of _any_ type in 
the kernel. We don't do unaligneds without "get_unaligned()" and friends.

Btw, if you want to leave out 8-bit and 16-bit data, that's fine. So 
generally you'd only need to handle 32-bit (and 64-bit on a 64-bit 
architecture) accesses anyway.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  1:18                     ` Linus Torvalds
@ 2006-12-07  1:24                       ` Roman Zippel
  2006-12-07  1:36                         ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Roman Zippel @ 2006-12-07  1:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Christoph Lameter, David Howells, Andrew Morton,
	linux-kernel, linux-arch

Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > > Any _real_ CPU will simply never care about _anything_ else than just the 
> > > size of the datum in question.
> > 
> > ..or alignment which a dedicated atomic type would allow to be attached.
> 
> Can you give any example of a real CPU where alignment matters?
> 
> Sure, it needs to be naturally aligned, but that's true of _any_ type in 
> the kernel. We don't do unaligneds without "get_unaligned()" and friends.

m68060 produces a trap for unaligned atomic access, unfortunately standard 
alignment is smaller than this.

bye, Roman

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  1:24                       ` Roman Zippel
@ 2006-12-07  1:36                         ` Linus Torvalds
  2006-12-07  1:44                           ` Matthew Wilcox
  2006-12-07  1:52                           ` Roman Zippel
  0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-07  1:36 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Matthew Wilcox, Christoph Lameter, David Howells, Andrew Morton,
	linux-kernel, linux-arch



On Thu, 7 Dec 2006, Roman Zippel wrote:
>
> m68060 produces a trap for unaligned atomic access, unfortunately standard 
> alignment is smaller than this.

Umm. on 68060, since it's 32-bit, you'd only have the 32-bit case. Are you 
saying that you can't do a 32-bit atomic access at any 32-bit aligned 
boundary? Or are you saying that gcc aligns normal 32-bit entities at 
16-bit alignment? Neither of those sound very likely.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  1:36                         ` Linus Torvalds
@ 2006-12-07  1:44                           ` Matthew Wilcox
  2006-12-07  2:09                             ` Douglas McNaught
  2006-12-07  1:52                           ` Roman Zippel
  1 sibling, 1 reply; 76+ messages in thread
From: Matthew Wilcox @ 2006-12-07  1:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Christoph Lameter, David Howells, Andrew Morton,
	linux-kernel, linux-arch

On Wed, Dec 06, 2006 at 05:36:29PM -0800, Linus Torvalds wrote:
> Or are you saying that gcc aligns normal 32-bit entities at 
> 16-bit alignment? Neither of those sound very likely.

alignof(u32) is 2 on m68k.  Crazy, huh?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  1:36                         ` Linus Torvalds
  2006-12-07  1:44                           ` Matthew Wilcox
@ 2006-12-07  1:52                           ` Roman Zippel
  1 sibling, 0 replies; 76+ messages in thread
From: Roman Zippel @ 2006-12-07  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Christoph Lameter, David Howells, Andrew Morton,
	linux-kernel, linux-arch

Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > m68060 produces a trap for unaligned atomic access, unfortunately standard 
> > alignment is smaller than this.
> 
> Umm. on 68060, since it's 32-bit, you'd only have the 32-bit case. Are you 
> saying that you can't do a 32-bit atomic access at any 32-bit aligned 
> boundary? Or are you saying that gcc aligns normal 32-bit entities at 
> 16-bit alignment? Neither of those sound very likely.

The latter.
And yes, I'm starting to hate it too, especially after I've seen all the 
weird bugs this causes in userspace, which then only trigger on m68k.
I'm thinking of changing it when switching to proper TLS support, but it's 
not there yet.
BTW there is another reason I actually like the atomic type - 
documentation. It makes it more clear that this intended to be used as 
atomic value, so an unhealthy mixture of different accesses is less likely 
and if they pile up at some place it's a good indicator to maybe switch 
back to spinlocks.

bye, Roman

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  1:44                           ` Matthew Wilcox
@ 2006-12-07  2:09                             ` Douglas McNaught
  0 siblings, 0 replies; 76+ messages in thread
From: Douglas McNaught @ 2006-12-07  2:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Roman Zippel, Christoph Lameter, David Howells,
	Andrew Morton, linux-kernel, linux-arch

Matthew Wilcox <matthew@wil.cx> writes:

> On Wed, Dec 06, 2006 at 05:36:29PM -0800, Linus Torvalds wrote:
>> Or are you saying that gcc aligns normal 32-bit entities at 
>> 16-bit alignment? Neither of those sound very likely.
>
> alignof(u32) is 2 on m68k.  Crazy, huh?

The original 68000 had a 16-bit bus (but 32-bit registers), which is
probably why it's that way.

-Doug

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  0:54                 ` Linus Torvalds
  2006-12-07  1:05                   ` Roman Zippel
@ 2006-12-07  9:23                   ` Nick Piggin
  1 sibling, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2006-12-07  9:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Roman Zippel, Matthew Wilcox, Christoph Lameter, David Howells,
	akpm, linux-arm-kernel, linux-kernel, linux-arch

Linus Torvalds wrote:
> 
> On Thu, 7 Dec 2006, Roman Zippel wrote:
> 
>>On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>>
>>
>>>To be honest, it'd be much easier if we only defined these operations on
>>>atomic_t's.  We have all the infrastructure in place for them, and
>>>they're fairly well understood.  If you need different sizes, I'm OK
>>>with an atomic_pointer_t, or whatever.
>>
>>FWIW Seconded.
> 
> 
> I disagree.
> 
> Any _real_ CPU will simply never care about _anything_ else than just the 
> size of the datum in question. There's absolutely no point to only allow 
> it on certain types, especially as we _know_ those certain types are 
> already going to be more than one, and we also know that they are going to 
> be different sizes. In other words, in reality, we have to handle a 
> sizeable subset of the whole generic situation, and the "on certain types 
> only" situation is only going to be awkward and irritating.
> 
> For example, would we have a different "cmpxchg_ptr()" function for the 
> atomic pointer thing? With any reasonable CPU just depending on the _size_ 
> of the type, I don't see what the problem is with just doing the 
> bog-standard "cmpxchg_8/16/32/64" and having the bog-standard case- 
> statement in a header file to do it all automatically for you, and then we 
> don't need to worry about it.

What's wrong with using atomic_cmpxchg? For unsigned long / pointers,
there is a patch to implement atomic_long_cmpxchg.

Some architectures simply can't implement cmpxchg on memory that may
be modified in arbitrary ways. If you add some more conditions to use
cmpxchg, then you weaken it (ie. can't use it for synchronisation
with userspace) and on top of that you don't get the easy static
checking that atomic_t gives you.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:58       ` Russell King
  2006-12-06 21:36         ` Matthew Wilcox
@ 2006-12-07  9:31         ` Nick Piggin
  2006-12-07 13:20           ` Ivan Kokshaysky
                             ` (3 more replies)
  1 sibling, 4 replies; 76+ messages in thread
From: Nick Piggin @ 2006-12-07  9:31 UTC (permalink / raw)
  To: Russell King
  Cc: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

Russell King wrote:
> On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:

> No.  If you read what I said, you'll see that you can _cheaply_ use
> cmpxchg in a ll/sc based implementation.  Take an atomic increment
> operation.
> 
> 	do {
> 		old = load_locked(addr);
> 	} while (store_exclusive(old, old + 1, addr);
> 
> On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
> instruction, comparing the first arg, and if equal storing the second.
> The "load_locked" macro becomes a standard pointer deref.  Ergo, x86
> becomes:
> 
> 	do {
> 		load value
> 		manipulate it
> 		conditional store
> 	} while not stored
> 
> On ll/sc, the load_locked() macro is the load locked instruction.  The
> store_exclusive() macro is the exclusive store and it doesn't need to
> use the first parameter at all.  Ergo, ARM becomes:
> 
> 	do {
> 		ldrex r1, [r2]
> 		manipulate r1
> 		strex r0, r1, [r2]
> 	} while failed
> 
> Notice that both are optimal.
> 
> Now let's consider the cmpxchg case.
> 
> 	do {
> 		val = *addr;
> 	} while (cmpxchg(val, val + 1, addr);
> 
> The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
> entirely.  No impact what so ever.
> 
> Let's look at the ll/sc case.  The cmpxchg code implemented on this has
> to reload the original value, compare it, if equal store the new value.
> So:
> 
> 	do {
> 		val = *addr;
> 		(r2 = addr, 
> 		ldrex r1, [r2]
> 		compare r1, r0
> 		strexeq r4, r3, [r2] (store exclusive if equal)
> 	} while store failed or comparecondition failed
> 
> Note how the cmpxchg has _forced_ the ll/sc implementation to become
> more complex.
> 
> So, let's recap.
> 
> Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> architectures to produce optimal code.
> 
> Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> to produce optimal code and ll/sc non-optimal code.
> 
> See my point?

Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all, and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.

It might be reasonable to implement this watered down version, but:
don't some architectures have restrictions on what instructions can
be issued between the ll and the sc?

But in general I agree with you, in that a higher level primitive is
preferable (eg. atomic_add_unless).

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  9:31         ` Nick Piggin
@ 2006-12-07 13:20           ` Ivan Kokshaysky
  2006-12-07 15:03           ` Russell King
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Ivan Kokshaysky @ 2006-12-07 13:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King, Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. On Alpha you cannot execute other load/stores, taken branches etc.
So in general case the code between ll and sc cannot be written in C...

Ivan.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  9:31         ` Nick Piggin
  2006-12-07 13:20           ` Ivan Kokshaysky
@ 2006-12-07 15:03           ` Russell King
  2006-12-08  1:18             ` Nick Piggin
  2006-12-07 15:36           ` Linus Torvalds
  2006-12-07 16:51           ` Ralf Baechle
  3 siblings, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-07 15:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> Russell King wrote:
> >On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
> 
> >No.  If you read what I said, you'll see that you can _cheaply_ use
> >cmpxchg in a ll/sc based implementation.  Take an atomic increment
> >operation.
> >
> >	do {
> >		old = load_locked(addr);
> >	} while (store_exclusive(old, old + 1, addr);
> >
> >On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
> >instruction, comparing the first arg, and if equal storing the second.
> >The "load_locked" macro becomes a standard pointer deref.  Ergo, x86
> >becomes:
> >
> >	do {
> >		load value
> >		manipulate it
> >		conditional store
> >	} while not stored
> >
> >On ll/sc, the load_locked() macro is the load locked instruction.  The
> >store_exclusive() macro is the exclusive store and it doesn't need to
> >use the first parameter at all.  Ergo, ARM becomes:
> >
> >	do {
> >		ldrex r1, [r2]
> >		manipulate r1
> >		strex r0, r1, [r2]
> >	} while failed
> >
> >Notice that both are optimal.
> >
> >Now let's consider the cmpxchg case.
> >
> >	do {
> >		val = *addr;
> >	} while (cmpxchg(val, val + 1, addr);
> >
> >The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
> >entirely.  No impact what so ever.
> >
> >Let's look at the ll/sc case.  The cmpxchg code implemented on this has
> >to reload the original value, compare it, if equal store the new value.
> >So:
> >
> >	do {
> >		val = *addr;
> >		(r2 = addr, 
> >		ldrex r1, [r2]
> >		compare r1, r0
> >		strexeq r4, r3, [r2] (store exclusive if equal)
> >	} while store failed or comparecondition failed
> >
> >Note how the cmpxchg has _forced_ the ll/sc implementation to become
> >more complex.
> >
> >So, let's recap.
> >
> >Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> >architectures to produce optimal code.
> >
> >Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> >to produce optimal code and ll/sc non-optimal code.
> >
> >See my point?
> 
> Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> load_locked is not locked at all,

Intentional - cmpxchg architectures don't generally have a load locked.

> and there can be interleaving writes
> between the load and cmpxchg which do not cause the store_conditional
> to fail.

In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.

I do not see any problem.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-06 19:05   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2006-12-06 20:00     ` Russell King
@ 2006-12-07 15:06     ` Russell King
  2006-12-08 15:32       ` Russell King
  3 siblings, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-07 15:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?
> 
> That said, cmpxchg won't necessarily be "high-performance" unless the hw 
> supports it naturally in hardware, so..
> 
> Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
> like to get an ack from the ARM maintainer before applying it, but it 
> looked ok.

Things seem to have moved on since this request.  Do I need to do
anything?  I dunno.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  9:31         ` Nick Piggin
  2006-12-07 13:20           ` Ivan Kokshaysky
  2006-12-07 15:03           ` Russell King
@ 2006-12-07 15:36           ` Linus Torvalds
  2006-12-07 16:51           ` Ralf Baechle
  3 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-07 15:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King, Christoph Lameter, David Howells, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Thu, 7 Dec 2006, Nick Piggin wrote:
> 
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. You really probably do not want to expose ll/sc on a C level because 
of this.

On alpha, the architecture manual says (I didn't go back and check, but 
I'm pretty sure) that a ld.l and st.c cannot have a taken branch in 
between then, for example. That basically means that you can't allow the 
compiler to reorder the basic blocks (which it often will with a 
while-loop).

Now, I actually suspect that this was not a microarchitectural flaw, and 
that a branch would _work_ there, and that the architecture manual was 
just being anal, but strictly speaking, it means that these things had 
better always be in assembly, and you can sadly not expose them (on alpha, 
at least) as higher-level primitives as such - you can only expose the 
end result ("cmpxchg" or similar).

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07  9:31         ` Nick Piggin
                             ` (2 preceding siblings ...)
  2006-12-07 15:36           ` Linus Torvalds
@ 2006-12-07 16:51           ` Ralf Baechle
  3 siblings, 0 replies; 76+ messages in thread
From: Ralf Baechle @ 2006-12-07 16:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Russell King, Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:

> Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> load_locked is not locked at all, and there can be interleaving writes
> between the load and cmpxchg which do not cause the store_conditional
> to fail.
> 
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

On MIPS the restriction is no loads or stores or sync instructions between
ll/sc.  Also there may be at most 2048 bytes between the address of the
ll and sc instructions.  Which means ll/sc sequences should better be
written in assembler or gcc might do a bit too creative things ...

  Ralf

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07 15:03           ` Russell King
@ 2006-12-08  1:18             ` Nick Piggin
  2006-12-08  8:56               ` Russell King
  0 siblings, 1 reply; 76+ messages in thread
From: Nick Piggin @ 2006-12-08  1:18 UTC (permalink / raw)
  To: Russell King
  Cc: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

Russell King wrote:
> On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:

>>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
>>>architectures to produce optimal code.
>>>
>>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
>>>to produce optimal code and ll/sc non-optimal code.
>>>
>>>See my point?
>>
>>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
>>load_locked is not locked at all,
> 
> 
> Intentional - cmpxchg architectures don't generally have a load locked.

Exactly, so it is wrong -- you can't implement that behaviour with
load + cmpxchg.

>>and there can be interleaving writes
>>between the load and cmpxchg which do not cause the store_conditional
>>to fail.
> 
> 
> In which case the cmpxchg fails and we do the atomic operation again,
> in exactly the same way that we do the operation again if the 'sc'
> fails in the ll/sc case.

Not if cmpxchg sees the same value, it won't fail, regardless of how
many writes have hit that memory address.

> I do not see any problem.

This was not the big problem -- as I said, if this was the only problem
we could opt for a "watered down" version that doesn't actually load
locked [the ll/sc interface would be much cooler than cmpxchg, IMO :)]

The main problem is the restrictions between the ll and sc. This is why
I implemented atomic_cmpxchg rather than atomic_ll/sc. However I agree
that in critical code, a higher level API should be used instead (eg.
see atomic_add_unless, which can be implemented optimally on RISCs).

Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08  1:18             ` Nick Piggin
@ 2006-12-08  8:56               ` Russell King
  2006-12-08 16:06                 ` Christoph Lameter
                                   ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Russell King @ 2006-12-08  8:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:
> Russell King wrote:
> >On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> 
> >>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> >>>architectures to produce optimal code.
> >>>
> >>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> >>>to produce optimal code and ll/sc non-optimal code.
> >>>
> >>>See my point?
> >>
> >>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> >>load_locked is not locked at all,
> >
> >
> >Intentional - cmpxchg architectures don't generally have a load locked.
> 
> Exactly, so it is wrong -- you can't implement that behaviour with
> load + cmpxchg.

I disagree.  I _have_ implemented the required behaviour.  I really
don't understand your point saying that it is wrong.

> >>and there can be interleaving writes
> >>between the load and cmpxchg which do not cause the store_conditional
> >>to fail.
> >
> >
> >In which case the cmpxchg fails and we do the atomic operation again,
> >in exactly the same way that we do the operation again if the 'sc'
> >fails in the ll/sc case.
> 
> Not if cmpxchg sees the same value, it won't fail, regardless of how
> many writes have hit that memory address.

Don't see anything wrong with that.  If that was a problem, atomic
implementations using cmpxchg on x86 would be impossible.

I think you're trying to implement ll/sc semantics on CPUs without
ll/sc which is exactly not what I'm trying to do.  I'd argue that's
impossible.

I'm trying to suggest a better implementation for atomic ops rather
than just bowing to this x86-centric "cmpxchg is the best, everyone
must implement it" mentality.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-07 15:06     ` Russell King
@ 2006-12-08 15:32       ` Russell King
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King @ 2006-12-08 15:32 UTC (permalink / raw)
  To: Linus Torvalds, Christoph Lameter, David Howells, Andrew Morton,
	linux-arm-kernel, Linux Kernel Mailing List, linux-arch

On Thu, Dec 07, 2006 at 03:06:06PM +0000, Russell King wrote:
> On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 6 Dec 2006, Christoph Lameter wrote:
> > >
> > > I'd really appreciate a cmpxchg that is generically available for 
> > > all arches. It will allow lockless implementation for various performance 
> > > criticial portions of the kernel.
> > 
> > I suspect ARM may have been the last one without one, no?
> > 
> > That said, cmpxchg won't necessarily be "high-performance" unless the hw 
> > supports it naturally in hardware, so..
> > 
> > Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
> > like to get an ack from the ARM maintainer before applying it, but it 
> > looked ok.
> 
> Things seem to have moved on since this request.  Do I need to do
> anything?  I dunno.

ARM's still broken.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08  8:56               ` Russell King
@ 2006-12-08 16:06                 ` Christoph Lameter
  2006-12-08 16:31                   ` Russell King
  2006-12-08 16:56                 ` David Howells
  2006-12-08 22:33                 ` Nick Piggin
  2 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-08 16:06 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, 8 Dec 2006, Russell King wrote:

> I'm trying to suggest a better implementation for atomic ops rather
> than just bowing to this x86-centric "cmpxchg is the best, everyone
> must implement it" mentality.

cmpxchg is the simplest solution to realize many other atomic operations 
and its widely available on a wide variety of platforms. It is the most 
universal atomic instruction that I know of. Other atomic operations may 
be more efficient but certainly cmpxchg is the most universal.

Having multiple instructions with restrictions of what can be done in 
between just complicates the use and seems to be arch specific. I have not 
seen a better solution. Are you really advocating the weirdly complex 
ll/sc be adopted by other architectures?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 16:06                 ` Christoph Lameter
@ 2006-12-08 16:31                   ` Russell King
  2006-12-08 16:43                     ` Christoph Lameter
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-08 16:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 08:06:23AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
> 
> > I'm trying to suggest a better implementation for atomic ops rather
> > than just bowing to this x86-centric "cmpxchg is the best, everyone
> > must implement it" mentality.
> 
> cmpxchg is the simplest solution to realize many other atomic operations 
> and its widely available on a wide variety of platforms. It is the most 
> universal atomic instruction that I know of. Other atomic operations may 
> be more efficient but certainly cmpxchg is the most universal.
> 
> Having multiple instructions with restrictions of what can be done in 
> between just complicates the use and seems to be arch specific. I have not 
> seen a better solution. Are you really advocating the weirdly complex 
> ll/sc be adopted by other architectures?

You're advocating cmpxchg is adopted by all architectures.  It isn't
available on many architectures, and those which it can be requires
unnecessarily complicated coding.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 16:31                   ` Russell King
@ 2006-12-08 16:43                     ` Christoph Lameter
  2006-12-08 16:47                       ` Russell King
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-08 16:43 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, 8 Dec 2006, Russell King wrote:

> You're advocating cmpxchg is adopted by all architectures.  It isn't
> available on many architectures, and those which it can be requires
> unnecessarily complicated coding.

Not having cmpxchg is even worse because it requires the introduction and 
maintenance of large sets of arch specific operations. Much more complex.



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 16:43                     ` Christoph Lameter
@ 2006-12-08 16:47                       ` Russell King
  2006-12-08 16:53                         ` Christoph Lameter
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-08 16:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 08:43:09AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
> 
> > You're advocating cmpxchg is adopted by all architectures.  It isn't
> > available on many architectures, and those which it can be requires
> > unnecessarily complicated coding.
> 
> Not having cmpxchg is even worse because it requires the introduction and 
> maintenance of large sets of arch specific operations. Much more complex.

And which bit of "not available on many architectures" have you not grasped
yet?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 16:47                       ` Russell King
@ 2006-12-08 16:53                         ` Christoph Lameter
  2006-12-08 16:58                           ` Russell King
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-08 16:53 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, 8 Dec 2006, Russell King wrote:

> > Not having cmpxchg is even worse because it requires the introduction and 
> > maintenance of large sets of arch specific operations. Much more complex.
> 
> And which bit of "not available on many architectures" have you not grasped
> yet?

We discussed various forms of emulating that functionality on this thread. 
Seems to work satisfactorily. You can discover the information you skipped 
by going back to some earlier messages of this thread.





^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-08  8:56               ` Russell King
  2006-12-08 16:06                 ` Christoph Lameter
@ 2006-12-08 16:56                 ` David Howells
  2006-12-08 17:06                   ` Christoph Lameter
  2006-12-08 18:46                   ` Linus Torvalds
  2006-12-08 22:33                 ` Nick Piggin
  2 siblings, 2 replies; 76+ messages in thread
From: David Howells @ 2006-12-08 16:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Russell King, Nick Piggin, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

Christoph Lameter <clameter@sgi.com> wrote:

> cmpxchg is the simplest solution to realize many other atomic operations and
> its widely available on a wide variety of platforms.

But by no means all.  Where it doesn't exist it can only be properly emulated
by something such as LL/SC if they are there.  If not, then you can't do a
safe cmpxchg on SMP.

> It is the most universal atomic instruction that I know of.

I think TAS-type things and XCHG-type things are more common.

In fact I think more things have LL/SC than have CMPXCHG.

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 16:53                         ` Christoph Lameter
@ 2006-12-08 16:58                           ` Russell King
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King @ 2006-12-08 16:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, David Howells, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 08:53:22AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
> 
> > > Not having cmpxchg is even worse because it requires the introduction and 
> > > maintenance of large sets of arch specific operations. Much more complex.
> > 
> > And which bit of "not available on many architectures" have you not grasped
> > yet?
> 
> We discussed various forms of emulating that functionality on this thread. 
> Seems to work satisfactorily. You can discover the information you skipped 
> by going back to some earlier messages of this thread.

Oh for god sake.  I've put forward a coherent argument.  I've disproved
every point put forward by people for my approach.

ARM continues to be broken because people like you are stuck with doing
things only one way.  Grow up and open your mind to other possibilities.

(It should be noted that LDREX/STREX is broken on some ARM implementations
at the moment - they will cause a livelock due to silicon bugs if both
CPUs have exactly the same number of cycles between LDREX and STREX.
This makes loops-within-loops implementations using cmpxchg _extremely_
expensive on ARM.)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-08 16:56                 ` David Howells
@ 2006-12-08 17:06                   ` Christoph Lameter
  2006-12-08 17:18                     ` Russell King
  2006-12-08 18:46                   ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-08 17:06 UTC (permalink / raw)
  To: David Howells
  Cc: Russell King, Nick Piggin, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, 8 Dec 2006, David Howells wrote:

> > It is the most universal atomic instruction that I know of.
> 
> I think TAS-type things and XCHG-type things are more common.

Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which 
all have one or the other form of cmpxchg (some issues with early sparc 
and i386).

And yes the xchg was the first multiprocessor instruction and therefore 
is also available on very old processors.

> In fact I think more things have LL/SC than have CMPXCHG.

LL/SC can be easily used to come up with a cmpxchg equivalent.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 17:06                   ` Christoph Lameter
@ 2006-12-08 17:18                     ` Russell King
  2006-12-08 17:23                       ` Christoph Lameter
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-08 17:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Howells, Nick Piggin, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 09:06:00AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, David Howells wrote:
> 
> > > It is the most universal atomic instruction that I know of.
> > 
> > I think TAS-type things and XCHG-type things are more common.
> 
> Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which 
> all have one or the other form of cmpxchg (some issues with early sparc 
> and i386).

According to the latest figures, 621 million ARM processors were
shipped in Q3, which equates to about 78 processors per second.
(taken from http://www.arm.com/news/15300.html).

I'd like to know the figures for these other so-called "popular
architectures".

But in any case that's an utterly irrelevant point to the argument
at hand.

> > In fact I think more things have LL/SC than have CMPXCHG.
> 
> LL/SC can be easily used to come up with a cmpxchg equivalent.

As proven previously the reverse is also true.  And as shown previously
the cheaper out of the two for all platforms is the LL/SC based
implementation, where the architecture specific implementation can
be _either_ LL/SC based or cmpxchg based depending on what is
supported in their hardware.

I'm going to keep saying this until people get it.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 17:18                     ` Russell King
@ 2006-12-08 17:23                       ` Christoph Lameter
  2006-12-08 19:15                         ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2006-12-08 17:23 UTC (permalink / raw)
  To: Russell King
  Cc: David Howells, Nick Piggin, torvalds, akpm, linux-arm-kernel,
	linux-kernel, linux-arch

On Fri, 8 Dec 2006, Russell King wrote:

> As proven previously the reverse is also true.  And as shown previously
> the cheaper out of the two for all platforms is the LL/SC based
> implementation, where the architecture specific implementation can
> be _either_ LL/SC based or cmpxchg based depending on what is
> supported in their hardware.

As also shown in this thread: There are restrictions on what you can do 
between ll/sc. You would not want to use C code there. ll/sc is an thing 
that needs to be restricted to asm code. So this is not a viable proposal 
at all. ll/sc is useful to construct various atomic functions but cannot 
be directly used in C code. cmpxchg can be effectively realized using 
ll/sc.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-08 16:56                 ` David Howells
  2006-12-08 17:06                   ` Christoph Lameter
@ 2006-12-08 18:46                   ` Linus Torvalds
  2006-12-08 19:04                     ` Russell King
  2006-12-11 11:04                     ` David Howells
  1 sibling, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-08 18:46 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Lameter, Russell King, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Fri, 8 Dec 2006, David Howells wrote:
> 
> In fact I think more things have LL/SC than have CMPXCHG.

But you cannot expose ll/sc to C, so that's a bogus argument.

If you do ll/sc, you need to program in assembly language.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 18:46                   ` Linus Torvalds
@ 2006-12-08 19:04                     ` Russell King
  2006-12-08 19:35                       ` Linus Torvalds
  2006-12-11 11:04                     ` David Howells
  1 sibling, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-08 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Christoph Lameter, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 10:46:17AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Dec 2006, David Howells wrote:
> > 
> > In fact I think more things have LL/SC than have CMPXCHG.
> 
> But you cannot expose ll/sc to C, so that's a bogus argument.

Yes you can.  Well, you can on ARM at least.  Between the load exclusive
you can do anything you like until you hit the store exclusive.  If you
haven't touched the location (with anything other than another load
exclusive) and no other CPU has issued a load exclusive, your store
exclusive succeeds.  Whether you have branches, loads or stores to
other locations, etc.  None of that matters.

Not so bogus as you make out.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 17:23                       ` Christoph Lameter
@ 2006-12-08 19:15                         ` Linus Torvalds
  2006-12-08 19:31                           ` Russell King
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2006-12-08 19:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Russell King, David Howells, Nick Piggin, akpm, linux-arm-kernel,
	linux-kernel, linux-arch



On Fri, 8 Dec 2006, Christoph Lameter wrote:
> 
> As also shown in this thread: There are restrictions on what you can do 
> between ll/sc

This, btw, is almost certainly true on ARM too.

There are three major reasons for restrictions on ll/sc:

 - bus-cycle induced things (eg variations of "you cannot do a store in 
   between the ll and the sc, because it will touch the cache and clear 
   the bit", where "the store" might be a load too, and "the cache" might
   be just "the bus interface")

 - trap handling usually clears the internal lock bit too, which means 
   that depending on the micro-architecture, even internal microtraps 
   (like even just branch misprediction, but more commonly things like TLB 
   misses etc) can cause a sc to always fail.

 - timing. Livelock in particular.

The last one is the one that hits everybody, regardless of 
microarchitecture. The rule may be that the LL/SC need to be within a 
certain number of cycles (which can be very small - like ten) in order to 
guarantee that the cacheline can't be stolen. 

All of which means that _nobody_ can really do this reliably in C. Even if 
there are no other microarchitectural rules (and it sounds like that might 
be true on ARM), the timing issue means that you can _still_ only use it 
for very specific and simple sequences, and trying to expose it as a 
higher-level thing is not going to work in general for anything even 
remotely complicated.

(The timing may also mean that you end up having to do random back-off 
etc, just to make sure _somebody_ makes progress. Ie it might not be a 
matter of "within ten cycles", but "you need to randomize the timing").

In other words, it's simply not an option to expose LL/SC as an interface. 
It would be VERY convenient to do, since cmpxchg can emulate ll/sc (the 
"ll" part is a normal load, the "sc" part is a "compare that the old value 
still matches, and store the new one if so"). But because you can't expose 
LL/SC anyway in any reasonably portable way, that just doesn't work.

So, you really do end up with three possibilities:

 - do things with TRULY PORTABLE interfaces. And like it or not, cmpxchg 
   is the closest thing you can get to that. It's trivial to do cmpxchg 
   using ll/sc (modulo the "random backoff part" if you need it, which is 
   still pretty simple, but no longer totally trivial), and architectures 
   that have neither ll/sc _nor_ a native cmpxchg can just go screw 
   themselves with spinlocks - they really aren't worth worrying about in 
   SMP. At some point you have to tell hardware designers that their 
   hardware just sucks.

 - have ugly conditional code in generic code. I personally think this is 
   a _much_ worse option in most cases.

 - have a much higher-level interface and make it _all_ architecture- 
   dependent (possibly with a "generic" version for sane architectures). 
   This works, but the more high-level it is, the more you end up having 
   the same thign written in many different ways, and nasty maintenance.

   So we generally set the bar pretty low. Things like semaphore locking 
   primitives are high-level enough already that we prefer to try to make 
   them use common lower-level interfaces (spinlocks, cmpxchg etc). 
   Something like kernel/workqueue.c is _way_ too high a level to do 
   arch-specific.

So right now, I think the "cmpxchg" or the "bitmask set" approach are the 
alternatives. Russell - LL/SC simply isn't on the table as an interface, 
whether you like it or not.

		Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:15                         ` Linus Torvalds
@ 2006-12-08 19:31                           ` Russell King
  2006-12-08 19:37                             ` Linus Torvalds
  2006-12-08 20:01                             ` Linus Torvalds
  0 siblings, 2 replies; 76+ messages in thread
From: Russell King @ 2006-12-08 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, David Howells, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 11:15:58AM -0800, Linus Torvalds wrote:
> On Fri, 8 Dec 2006, Christoph Lameter wrote:
> > 
> > As also shown in this thread: There are restrictions on what you can do 
> > between ll/sc
> 
> This, btw, is almost certainly true on ARM too.
> 
> There are three major reasons for restrictions on ll/sc:
> 
>  - bus-cycle induced things (eg variations of "you cannot do a store in 
>    between the ll and the sc, because it will touch the cache and clear 
>    the bit", where "the store" might be a load too, and "the cache" might
>    be just "the bus interface")

No such restriction on ARM.

>  - trap handling usually clears the internal lock bit too, which means 
>    that depending on the micro-architecture, even internal microtraps 
>    (like even just branch misprediction, but more commonly things like TLB 
>    misses etc) can cause a sc to always fail.

Also not true.  The architectural implementation is:

	ldrex: tags the _physical_ address + cpu number,
		transitions to exclusive access.

	strex: if in exclusive access state and matches the previous
		tagged physical address + cpu number pair, store
		succeeds.

This is typically implemented in hardware, and in the case of a SMP
system, external to the CPU cores themselves.  So all it's doing is
looking at the exclusive accesses.  It is not embedded into the CPU
core so that micro-architectural stuff affects it.

>  - timing. Livelock in particular.

That is a problem that we're facing, and the solution is rather simple.
You need to introduce a CPU-number specific number of cycles before
retrying the operation on failure.

> All of which means that _nobody_ can really do this reliably in C.

I utterly disagree.  I could code atomic_add() as:

extern void cpu_specific_delay(void);

static inline int atomic_add_return(int i, atomic_t *v)
{
	do {
		asm("ldrex %0, [%1]" : "=r" (val) : "r" (v));
		val += i;
		asm("strex %0, %1, [%2]" : "=r" (res) : "r" (val), "r" (v));
		if (res)
			cpu_specific_delay();
	} while (res);

	return val;
}

and actually we /are/ going to have to go down this path to break the
livelock problem, like it or not.  Ditto for the ARM bitops operations,
so we might as well have ARM at least implement a generic ll/sc thing.

Coding the cpu specific delays into every strex site is going to make
them far too heavy to put inline.

> So right now, I think the "cmpxchg" or the "bitmask set" approach are the 
> alternatives. Russell - LL/SC simply isn't on the table as an interface, 
> whether you like it or not.

Buggerit then.  cmpxchg sucks as an interface, and I am going to
continue to assert that.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:04                     ` Russell King
@ 2006-12-08 19:35                       ` Linus Torvalds
  2006-12-08 19:59                         ` Russell King
  0 siblings, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2006-12-08 19:35 UTC (permalink / raw)
  To: Russell King
  Cc: David Howells, Christoph Lameter, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Fri, 8 Dec 2006, Russell King wrote:
>
> Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> you can do anything you like until you hit the store exclusive.  If you
> haven't touched the location (with anything other than another load
> exclusive) and no other CPU has issued a load exclusive, your store
> exclusive succeeds.

Is that actually true?

Almost all LL/SC implementations have granularity rules, where "touch the 
location" is not a byte-granular thing, but actually ends up being 
something like "touch the same cachline".

They also often have _other_ rules like: "the cacheline has to stay in the 
L1 in exclusive state" etc. Which means that in a direct-mapped L1 cache, 
you can't even load anything that might be in the same way, because it 
would cause a cache eviction that invalidates the SC.

It's possible that ARM has really strong LL/SC, but quite frankly, that 
sounds unlikely. I've never heard of anybody ever _architecturally_ saying 
that they support that strong requirements, even if certain micro- 
architectures might actually support stronger semantics than the ones 
guaranteed by the architectural rules.

			Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:31                           ` Russell King
@ 2006-12-08 19:37                             ` Linus Torvalds
  2006-12-08 19:43                               ` Russell King
  2006-12-08 20:01                             ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2006-12-08 19:37 UTC (permalink / raw)
  To: Russell King
  Cc: Christoph Lameter, David Howells, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Fri, 8 Dec 2006, Russell King wrote:
> 
> I utterly disagree.  I could code atomic_add() as:

Sure. And Alpha could do that too. If you write the C code a specific way, 
you can make it work. That does NOT mean that you can expose it widely as 
a portable interface - it's still just a very _nonportable_ interface that 
you use internally within one architecture to implement other interfaces.

			Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:37                             ` Linus Torvalds
@ 2006-12-08 19:43                               ` Russell King
  0 siblings, 0 replies; 76+ messages in thread
From: Russell King @ 2006-12-08 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, David Howells, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 11:37:45AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Dec 2006, Russell King wrote:
> > 
> > I utterly disagree.  I could code atomic_add() as:
> 
> Sure. And Alpha could do that too. If you write the C code a specific way, 
> you can make it work. That does NOT mean that you can expose it widely as 
> a portable interface - it's still just a very _nonportable_ interface that 
> you use internally within one architecture to implement other interfaces.

However, nothing stops you wrapping the non-portable nature of ll/sc up
into the store part though.

If you can efficiently implement cmpxchg inside an ll/sc based portable
interface (yes you can) and you can implement problematical ll/sc
structures inside a cmpxchg() interface, you can do it either way around.
Only one way doesn't penalise broken ll/sc based implementations though.

That is the essence of my argument.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:35                       ` Linus Torvalds
@ 2006-12-08 19:59                         ` Russell King
  2006-12-08 20:34                           ` Linus Torvalds
  0 siblings, 1 reply; 76+ messages in thread
From: Russell King @ 2006-12-08 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Christoph Lameter, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

On Fri, Dec 08, 2006 at 11:35:39AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Dec 2006, Russell King wrote:
> >
> > Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> > you can do anything you like until you hit the store exclusive.  If you
> > haven't touched the location (with anything other than another load
> > exclusive) and no other CPU has issued a load exclusive, your store
> > exclusive succeeds.
> 
> Is that actually true?
> 
> Almost all LL/SC implementations have granularity rules, where "touch the 
> location" is not a byte-granular thing, but actually ends up being 
> something like "touch the same cachline".

ARM's implementation works by using some CPU core peripheral hardware to hold
two pieces of information:

 1. the physical address being accessed by the load exclusive instruction.
 2. the cpu number.

When an exclusive load is performed, this hardware "remembers" bits
[31:N] (7 >= N >= 2 *) of the physical address and the CPU number.

When an exclusive store is performed, the address and CPU number are
compared to the stored versions, and the store is only allowed to succeed
if they match.  Hence, it is independent of the actual cache state,
and is relatively cheap to implement.

The only instructions which affect the exclusive access state are the
exclusive access instructions themselves.  Hence, to the same memory
location with the following sequence:

	load exclusive
	load normal
	modify
	store normal
	store exclusive

the store exclusive _will_ succeed even though you've modified the value
in that memory location via standard loads/stores.

(* - while it is true that the physical address is stored,
implementations may themselves choose not to store between the
lower 7 to 2 bits inclusive - which incidentally means putting
two atomic_t's next together is probably a Bad Thing(tm)... and
rather sucks.  However, non-exclusive accesses within that address
region do not affect the exclusive access state.)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:31                           ` Russell King
  2006-12-08 19:37                             ` Linus Torvalds
@ 2006-12-08 20:01                             ` Linus Torvalds
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-08 20:01 UTC (permalink / raw)
  To: Russell King
  Cc: Christoph Lameter, David Howells, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Fri, 8 Dec 2006, Russell King wrote:
> 
> No such restriction on ARM.
> 
> Also not true.  The architectural implementation is:

I checked the ARM manuals, and quite fankly, they don't back you up.

They do not claim that the physical address tag is byte-granular, and in 
fact they make it pretty clear that the same tag is used for all the 
sizes, which implies that the tag granularity is NOT byte granular, but 
likely something else. 

Both the manuals I checked also say: "Other events might cause the tag to 
be cleared", without going into particular details other than saying that 
a region that is marked non-shared might still clear the tag on access by 
other CPU's - but they leave it open whether that's by design or not.

In other words, if there actually is an architectural guarantee that 
ldrex/strex are really as strong as you imply, it's not in the standard 
architecture manuals from ARM at least for the ARM11562 or the ARM1136.

So I suspect you're wrong, and that the ldrex/strex tags actually are not 
all that different from other archtiectures which tend to have cacheline 
granularities or more (I _think_ the original alpha granularity was the 
whole address space, and any cache traffic would clear it. That's _really_ 
pathetically weak, but hey, I might remember wrong, and it was the very 
first implementation. I doubt ARM is _that_ weak, but I doubt it's as 
strong as you claim).

			Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08 19:59                         ` Russell King
@ 2006-12-08 20:34                           ` Linus Torvalds
  0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2006-12-08 20:34 UTC (permalink / raw)
  To: Russell King
  Cc: David Howells, Christoph Lameter, Nick Piggin, akpm,
	linux-arm-kernel, linux-kernel, linux-arch



On Fri, 8 Dec 2006, Russell King wrote:
>> 
> The only instructions which affect the exclusive access state are the
> exclusive access instructions themselves.

Not according to the docs I found.

The ARM1136 manual explicitly states that any attempt to modify that 
address clears the tag (for shared memory regions, by _any_ CPU, and for 
nonshared regions by _that_ CPU).

And btw, that _has_ to be true, because otherwise the whole ldrex/strex 
sequence would be totally unusable as a way to do atomic bit operations on 
UP in the presense of interrupts (well, you could have a clrex instruction 
in the interrupt handler, but as far as I know you don't, so that seems to 
be a moot point - you only seem to do it in __switch_to).

In other words, I _really_ think you're wrong. 

So the very code sequence you quote MUST NOT WORK the way you claim it 
does. And not only that, since the granularity of the mark is not just for 
the bytes in question, but potentially apparently up to 128 bytes, any 
store even _close_ to the memory you had xclusive access to will break the 
exclusive access.

Really, Russell. Your stance makes no sense. It doesn't make any sense 
from a microarchitectural standpoint (it's not how you'd normally 
implement these things), but it ALSO makes no sense from the way you 
already use those instructions (as a way to protect against other 
processors - including your own - touching that word).

                    Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it
  2006-12-08  8:56               ` Russell King
  2006-12-08 16:06                 ` Christoph Lameter
  2006-12-08 16:56                 ` David Howells
@ 2006-12-08 22:33                 ` Nick Piggin
  2 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2006-12-08 22:33 UTC (permalink / raw)
  To: Russell King
  Cc: Christoph Lameter, David Howells, torvalds, akpm,
	linux-arm-kernel, linux-kernel, linux-arch

Russell King wrote:
> On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:
> 
>>Russell King wrote:
>>
>>>On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
>>
>>>>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
>>>>>architectures to produce optimal code.
>>>>>
>>>>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
>>>>>to produce optimal code and ll/sc non-optimal code.
>>>>>
>>>>>See my point?
>>>>
>>>>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
>>>>load_locked is not locked at all,
>>>
>>>
>>>Intentional - cmpxchg architectures don't generally have a load locked.
>>
>>Exactly, so it is wrong -- you can't implement that behaviour with
>>load + cmpxchg.
> 
> 
> I disagree.  I _have_ implemented the required behaviour.  I really
> don't understand your point saying that it is wrong.
> 
> 
>>>>and there can be interleaving writes
>>>>between the load and cmpxchg which do not cause the store_conditional
>>>>to fail.
>>>
>>>
>>>In which case the cmpxchg fails and we do the atomic operation again,
>>>in exactly the same way that we do the operation again if the 'sc'
>>>fails in the ll/sc case.
>>
>>Not if cmpxchg sees the same value, it won't fail, regardless of how
>>many writes have hit that memory address.
> 
> 
> Don't see anything wrong with that.  If that was a problem, atomic
> implementations using cmpxchg on x86 would be impossible.
> 
> I think you're trying to implement ll/sc semantics on CPUs without
> ll/sc which is exactly not what I'm trying to do.  I'd argue that's
> impossible.

Yes, I did think that from reading your emails. It is not a problem
as such, but it is important to be clear on semantics.

> I'm trying to suggest a better implementation for atomic ops rather
> than just bowing to this x86-centric "cmpxchg is the best, everyone
> must implement it" mentality.

Even if ARM is able to handle any arbitrary C code between the
"load locked" and store conditional API, other architectures can not
by definition.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it 
  2006-12-08 18:46                   ` Linus Torvalds
  2006-12-08 19:04                     ` Russell King
@ 2006-12-11 11:04                     ` David Howells
  1 sibling, 0 replies; 76+ messages in thread
From: David Howells @ 2006-12-11 11:04 UTC (permalink / raw)
  To: Russell King
  Cc: Linus Torvalds, David Howells, Christoph Lameter, Nick Piggin,
	akpm, linux-arm-kernel, linux-kernel, linux-arch

Russell King <rmk+lkml@arm.linux.org.uk> wrote:

> Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> you can do anything you like until you hit the store exclusive.

How come atomic_set() on arm6 is implemented as:

	static inline void atomic_set(atomic_t *v, int i)
	{
		unsigned long tmp;

		__asm__ __volatile__("@ atomic_set\n"
	"1:	ldrex	%0, [%1]\n"
	"	strex	%0, %2, [%1]\n"
	"	teq	%0, #0\n"
	"	bne	1b"
		: "=&r" (tmp)
		: "r" (&v->counter), "r" (i)
		: "cc");
	}

Why LDREX/STREX and not direct assignment?

David

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2006-12-11 11:04 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-06 16:43 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it David Howells
2006-12-06 17:21 ` Linus Torvalds
2006-12-06 18:56 ` Christoph Lameter
2006-12-06 19:00   ` Russell King
2006-12-06 19:16     ` Christoph Lameter
2006-12-06 19:28       ` Linus Torvalds
2006-12-06 19:58       ` Russell King
2006-12-06 21:36         ` Matthew Wilcox
2006-12-06 21:52           ` Christoph Lameter
2006-12-06 22:05             ` Matthew Wilcox
2006-12-06 22:15               ` Christoph Lameter
2006-12-07  0:37               ` Roman Zippel
2006-12-07  0:54                 ` Linus Torvalds
2006-12-07  1:05                   ` Roman Zippel
2006-12-07  1:18                     ` Linus Torvalds
2006-12-07  1:24                       ` Roman Zippel
2006-12-07  1:36                         ` Linus Torvalds
2006-12-07  1:44                           ` Matthew Wilcox
2006-12-07  2:09                             ` Douglas McNaught
2006-12-07  1:52                           ` Roman Zippel
2006-12-07  9:23                   ` Nick Piggin
2006-12-06 22:38             ` Linus Torvalds
2006-12-07  9:31         ` Nick Piggin
2006-12-07 13:20           ` Ivan Kokshaysky
2006-12-07 15:03           ` Russell King
2006-12-08  1:18             ` Nick Piggin
2006-12-08  8:56               ` Russell King
2006-12-08 16:06                 ` Christoph Lameter
2006-12-08 16:31                   ` Russell King
2006-12-08 16:43                     ` Christoph Lameter
2006-12-08 16:47                       ` Russell King
2006-12-08 16:53                         ` Christoph Lameter
2006-12-08 16:58                           ` Russell King
2006-12-08 16:56                 ` David Howells
2006-12-08 17:06                   ` Christoph Lameter
2006-12-08 17:18                     ` Russell King
2006-12-08 17:23                       ` Christoph Lameter
2006-12-08 19:15                         ` Linus Torvalds
2006-12-08 19:31                           ` Russell King
2006-12-08 19:37                             ` Linus Torvalds
2006-12-08 19:43                               ` Russell King
2006-12-08 20:01                             ` Linus Torvalds
2006-12-08 18:46                   ` Linus Torvalds
2006-12-08 19:04                     ` Russell King
2006-12-08 19:35                       ` Linus Torvalds
2006-12-08 19:59                         ` Russell King
2006-12-08 20:34                           ` Linus Torvalds
2006-12-11 11:04                     ` David Howells
2006-12-08 22:33                 ` Nick Piggin
2006-12-07 15:36           ` Linus Torvalds
2006-12-07 16:51           ` Ralf Baechle
2006-12-07  0:46       ` Ralf Baechle
2006-12-06 19:05   ` Linus Torvalds
2006-12-06 19:08     ` Al Viro
2006-12-06 19:25       ` Linus Torvalds
2006-12-06 19:29         ` Matthew Wilcox
2006-12-06 19:54           ` Linus Torvalds
2006-12-06 19:56             ` Linus Torvalds
2006-12-06 19:43         ` David Howells
2006-12-07  1:09       ` David Miller
2006-12-06 19:26     ` Matthew Wilcox
2006-12-06 19:29       ` Christoph Lameter
2006-12-06 19:36         ` Matthew Wilcox
2006-12-06 19:47           ` Christoph Lameter
2006-12-06 19:50             ` Matthew Wilcox
2006-12-06 20:11               ` Christoph Lameter
2006-12-06 20:17                 ` Matthew Wilcox
2006-12-06 19:34       ` Linus Torvalds
2006-12-06 19:41         ` Matthew Wilcox
2006-12-06 19:45       ` David Howells
2006-12-06 20:00     ` Russell King
2006-12-07 15:06     ` Russell King
2006-12-08 15:32       ` Russell King
2006-12-06 19:12 ` Lennert Buytenhek
2006-12-06 19:47 ` David Howells
2006-12-06 20:09   ` Lennert Buytenhek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).