LKML Archive on lore.kernel.org
* [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels
@ 2015-03-13 15:47 Mark Rutland
  2015-03-13 16:29 ` Christoph Lameter
  2015-03-17  1:09 ` [PATCH] " Joonsoo Kim
  0 siblings, 2 replies; 14+ messages in thread
From: Mark Rutland @ 2015-03-13 15:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Joonsoo Kim,
	Linus Torvalds, Pekka Enberg, Steve Capper

Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
removing preemption on/off") introduced an occasional hang for kernels
built with CONFIG_PREEMPT && !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

GCC 4.9 has been observed to hoist the load of c and c->tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload). On arm64 the generated assembly
looks like:

ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
ffffffc00016d3cc:       eb04003f        cmp     x1, x4
ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>

If the thread is preempted between the load of c->tid (into x1) and tid
(into x4), and an allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s->cpu_slab->tid wraps, which may be forever in the absence of
allocations on the same CPU.

The loop itself is somewhat pointless as the thread can be preempted at
any point after the loop, before the this_cpu_cmpxchg_double, and the
window for preemption within the loop is far smaller. Given that we
assume accessing c->tid is safe for the loop condition, and we retry
when the cmpxchg fails, we can get rid of the loop entirely and just
access c->tid via the raw_cpu_ptr for s->cpu_slab.

This patch ensures that we reload c->tid even when the compiler would
otherwise assume that it could cache the value, and removes the
redundant (and misleading) loop. Comments are updated to take these
changes into account.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steve Capper <steve.capper@linaro.org>
---
 mm/slub.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

Hi all,

I wasn't entirely sure if the read of c->tid in the loop condition was safe
given the general rules for this_cpu operations. I assume that a read on a
remote CPU is always safe, and hence we can just load tid via the raw_cpu_ptr
and leave it to the cmpxchg to detect when we get preempted. If the remote read
is not safe then as far as I can tell the original patch is broken in this
regard even in the absence of reordering by the compiler.

I've boot-tested this patch a number of times (with SMP and !SMP) on the
platform I noticed the lockup on, and everything seems stable. Do we have any
SL*B stress tests that could give this a better thrashing?

Otherwise any and all comments/review/testing would be appreciated.

Mark.

diff --git a/mm/slub.c b/mm/slub.c
index 6832c4e..2b9a75e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2437,19 +2437,13 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 		return NULL;
 redo:
 	/*
-	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
-	 * enabled. We may switch back and forth between cpus while
-	 * reading from one cpu area. That does not matter as long
-	 * as we end up on the original cpu again when doing the cmpxchg.
-	 *
-	 * We should guarantee that tid and kmem_cache are retrieved on
-	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
-	 * to check if it is matched or not.
+	 * Preemption is enabled. We may switch back and forth between cpus,
+	 * and other threads may access the current cpu's area. That does not
+	 * matter as long as when we reach the cmpxchg we end up on the original
+	 * cpu and tid hasn't changed. We'll retry until that happens.
 	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);
 
 	/*
 	 * Irqless object alloc/free algorithm used here depends on sequence
@@ -2710,15 +2704,13 @@ static __always_inline void slab_free(struct kmem_cache *s,
 
 redo:
 	/*
-	 * Determine the currently cpus per cpu slab.
-	 * The cpu may change afterward. However that does not matter since
-	 * data is retrieved via this pointer. If we are on the same cpu
-	 * during the cmpxchg then the free will succedd.
+	 * Preemption is enabled. We may switch back and forth between cpus,
+	 * and other threads may access the current cpu's area. That does not
+	 * matter as long as when we reach the cmpxchg we end up on the original
+	 * cpu and tid hasn't changed. We'll retry until that happens.
 	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);
 
 	/* Same with comment on barrier() in slab_alloc_node() */
 	barrier();
-- 
1.9.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-13 15:47 [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels Mark Rutland
@ 2015-03-13 16:29 ` Christoph Lameter
  2015-03-13 18:16   ` Mark Rutland
  2015-03-16 12:45   ` [PATCHv2] " Mark Rutland
  2015-03-17  1:09 ` [PATCH] " Joonsoo Kim
  1 sibling, 2 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-03-13 16:29 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, David Rientjes,
	Jesper Dangaard Brouer, Joonsoo Kim, Linus Torvalds,
	Pekka Enberg, Steve Capper

On Fri, 13 Mar 2015, Mark Rutland wrote:

>  	 */
> -	do {
> -		tid = this_cpu_read(s->cpu_slab->tid);
> -		c = raw_cpu_ptr(s->cpu_slab);
> -	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	c = raw_cpu_ptr(s->cpu_slab);
> +	tid = READ_ONCE(c->tid);
>

Ok that works for the !SMP case. What about SMP and PREEMPT now?

And yes code like this was deemed safe for years and the race condition is
very subtle and difficult to trigger (also given that PREEMPT is rarely
used these days).



* Re: [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-13 16:29 ` Christoph Lameter
@ 2015-03-13 18:16   ` Mark Rutland
  2015-03-13 18:27     ` Christoph Lameter
  2015-03-16 12:45   ` [PATCHv2] " Mark Rutland
  1 sibling, 1 reply; 14+ messages in thread
From: Mark Rutland @ 2015-03-13 18:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, David Rientjes,
	Jesper Dangaard Brouer, Joonsoo Kim, Linus Torvalds,
	Pekka Enberg, Steve Capper

On Fri, Mar 13, 2015 at 04:29:23PM +0000, Christoph Lameter wrote:
> On Fri, 13 Mar 2015, Mark Rutland wrote:
> 
> >  	 */
> > -	do {
> > -		tid = this_cpu_read(s->cpu_slab->tid);
> > -		c = raw_cpu_ptr(s->cpu_slab);
> > -	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	c = raw_cpu_ptr(s->cpu_slab);
> > +	tid = READ_ONCE(c->tid);
> >
> 
> Ok that works for the !SMP case. What about SMP and PREEMPT now?

From testing on boards I have access to, things seem fine so far with
SMP && PREEMPT. If we have any allocator stress tests I'm more than
happy to give them a go.

As I mentioned, it's not clear to me that the READ_ONCE(c->tid) is
safe (i.e. it is atomic and non-destructive). If READ_ONCE(c->tid) is
not safe then the code added in 9aabf810a67cd97e is similarly broken
given the access in the loop condition, in addition to the hoisting done
by the compiler.

> And yes code like this was deemed safe for years and the race condition is
> very subtle and difficult to trigger (also given that PREEMPT is rarely
> used these days).

The this_cpu_cmpxchg_double is the saving grace here: if c->tid is read
from a different CPU it will fail and we'll retry the whole thing.
That's exactly what the original patch relied on in the case that a
preemption occurred after the loop.

w.r.t. CONFIG_PREEMPT, git grep tells me otherwise:

[mark@leverpostej:~/src/linux]% git grep -w 'CONFIG_PREEMPT' -- arch/*/configs/* | wc -l
109
[mark@leverpostej:~/src/linux]% git grep -w 'CONFIG_PREEMPT is not set' -- arch/*/configs/* | wc -l 
2
[mark@leverpostej:~/src/linux]% git grep -w 'CONFIG_PREEMPT=y' -- arch/*/configs/* | wc -l
107
[mark@leverpostej:~/src/linux]% git grep -w 'CONFIG_PREEMPT=n' -- arch/*/configs/* | wc -l
0

Mark.


* Re: [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-13 18:16   ` Mark Rutland
@ 2015-03-13 18:27     ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-03-13 18:27 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, David Rientjes,
	Jesper Dangaard Brouer, Joonsoo Kim, Linus Torvalds,
	Pekka Enberg, Steve Capper

On Fri, 13 Mar 2015, Mark Rutland wrote:

> w.r.t. CONFIG_PREEMPT, git grep tells me otherwise:

Grep does not tell you what is deployed. It's more a reflection of what
the kernel devs think are reasonable configurations.


* [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-13 16:29 ` Christoph Lameter
  2015-03-13 18:16   ` Mark Rutland
@ 2015-03-16 12:45   ` Mark Rutland
  1 sibling, 0 replies; 14+ messages in thread
From: Mark Rutland @ 2015-03-16 12:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, David Rientjes,
	Jesper Dangaard Brouer, Joonsoo Kim, Linus Torvalds,
	Pekka Enberg, Steve Capper

On Fri, Mar 13, 2015 at 04:29:23PM +0000, Christoph Lameter wrote:
> On Fri, 13 Mar 2015, Mark Rutland wrote:
> 
> >  	 */
> > -	do {
> > -		tid = this_cpu_read(s->cpu_slab->tid);
> > -		c = raw_cpu_ptr(s->cpu_slab);
> > -	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	c = raw_cpu_ptr(s->cpu_slab);
> > +	tid = READ_ONCE(c->tid);
> >
> 
> Ok that works for the !SMP case. What about SMP and PREEMPT now?
> 
> And yes code like this was deemed safe for years and the race condition is
> very subtle and difficult to trigger (also given that PREEMPT is rarely
> used these days).

Do you mean if the READ_ONCE(c->tid) gives us a torn value that happens
to be a future value of c->tid?

Are you happy to retain the loop, but with the c->tid access replaced
with READ_ONCE(c->tid)?

If torn values are an issue for the raw access then the loop doesn't
guarantee that c and tid were read on the same CPU as the comment above
it implies. The cmpxchg saves us given the torn value would have to
match some currently active tid, and I guess the loop saves a pointless
cmpxchg when it does detect a mismatch.

Mark.


* Re: [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-13 15:47 [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels Mark Rutland
  2015-03-13 16:29 ` Christoph Lameter
@ 2015-03-17  1:09 ` Joonsoo Kim
  2015-03-17 12:00   ` Mark Rutland
  1 sibling, 1 reply; 14+ messages in thread
From: Joonsoo Kim @ 2015-03-17  1:09 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Linus Torvalds,
	Pekka Enberg, Steve Capper

Hello,

On Fri, Mar 13, 2015 at 03:47:12PM +0000, Mark Rutland wrote:
> Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
> removing preemption on/off") introduced an occasional hang for kernels
> built with CONFIG_PREEMPT && !CONFIG_SMP.
> 
> The problem is the following loop the patch introduced to
> slab_alloc_node and slab_free:
> 
> do {
>         tid = this_cpu_read(s->cpu_slab->tid);
>         c = raw_cpu_ptr(s->cpu_slab);
> } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
> GCC 4.9 has been observed to hoist the load of c and c->tid above the
> loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
> constant and does not force a reload). On arm64 the generated assembly
> looks like:
> 
> ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
> ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
> ffffffc00016d3cc:       eb04003f        cmp     x1, x4
> ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>
> 
> If the thread is preempted between the load of c->tid (into x1) and tid
> (into x4), and and allocation or free occurs in another thread (bumping
> the cpu_slab's tid), the thread will be stuck in the loop until
> s->cpu_slab->tid wraps, which may be forever in the absence of
> allocations on the same CPU.

Is there any method to guarantee refetching these in each loop?

> 
> The loop itself is somewhat pointless as the thread can be preempted at
> any point after the loop, before the this_cpu_cmpxchg_double, and the
> window for preemption within the loop is far smaller. Given that we
> assume accessing c->tid is safe for the loop condition, and we retry
> when the cmpxchg fails, we can get rid of the loop entirely and just
> access c->tid via the raw_cpu_ptr for s->cpu_slab.

Hmm... IIUC, the loop itself is not pointless. It guarantees that tid
and c (s->cpu_slab) are fetched on the same processor, and this is
needed for the algorithm's correctness.

Think about your code.

c = raw_cpu_ptr(s->cpu_slab);
tid = READ_ONCE(c->tid);

This doesn't guarantee that tid is fetched on the cpu where c is
fetched if preemption/migration happens between these operations.

If c->tid, c->freelist, c->page are fetched on the other cpu,
there is no ordering guarantee and c->freelist, c->page could be stale
value even if c->tid is recent one.

Think about following free case with your patch.

Assume cpu 0's initial state is as follows:
c->tid: 1, c->freelist: NULL, c->page: A

User X: try to free object X for page A
User X: fetch c (s->cpu_slab)

Preemption and migration happen...
The other allocation/free happens... so cpu 0's state is as follows:
c->tid: 3, c->freelist: NULL, c->page: B

User X: read c->tid: 3, c->freelist: NULL, c->page: A (stale value)

Because tid and freelist match the current ones, the free would
succeed, but the current c->page is B and the object is for A, so this
success is wrong.

Loop prevents this possibility.

Thanks.



* Re: [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-17  1:09 ` [PATCH] " Joonsoo Kim
@ 2015-03-17 12:00   ` Mark Rutland
  2015-03-17 12:15     ` [PATCHv2] " Mark Rutland
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Rutland @ 2015-03-17 12:00 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Linus Torvalds,
	Pekka Enberg, Steve Capper

Hi,

> On Fri, Mar 13, 2015 at 03:47:12PM +0000, Mark Rutland wrote:
> > Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
> > removing preemption on/off") introduced an occasional hang for kernels
> > built with CONFIG_PREEMPT && !CONFIG_SMP.
> > 
> > The problem is the following loop the patch introduced to
> > slab_alloc_node and slab_free:
> > 
> > do {
> >         tid = this_cpu_read(s->cpu_slab->tid);
> >         c = raw_cpu_ptr(s->cpu_slab);
> > } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > 
> > GCC 4.9 has been observed to hoist the load of c and c->tid above the
> > loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
> > constant and does not force a reload). On arm64 the generated assembly
> > looks like:
> > 
> > ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
> > ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
> > ffffffc00016d3cc:       eb04003f        cmp     x1, x4
> > ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>
> > 
> > If the thread is preempted between the load of c->tid (into x1) and tid
> > (into x4), and and allocation or free occurs in another thread (bumping
> > the cpu_slab's tid), the thread will be stuck in the loop until
> > s->cpu_slab->tid wraps, which may be forever in the absence of
> > allocations on the same CPU.
> 
> Is there any method to guarantee refetching these in each loop?

We can use READ_ONCE(c->tid), e.g.

	while (IS_ENABLED(CONFIG_PREEMPT) &&
	       unlikely(tid != READ_ONCE(c->tid)));

I will send a patch to that effect.

I previously thought that READ_ONCE wasn't guaranteed to be atomic, and
thought it could return torn values (even for a single load
instruction). I now understand that this is not the case, and a
READ_ONCE will be sufficient.

[...]

> If c->tid, c->freelist, c->page are fetched on the other cpu,
> there is no ordering guarantee and c->freelist, c->page could be stale
> value even if c->tid is recent one.

Ah. Good point.

> Think about following free case with your patch.
> 
> Assume cpu 0's initial state is as follows:
> c->tid: 1, c->freelist: NULL, c->page: A
> 
> User X: try to free object X for page A
> User X: fetch c (s->cpu_slab)
> 
> Preemption and migration happen...
> The other allocation/free happens... so cpu 0's state is as follows:
> c->tid: 3, c->freelist: NULL, c->page: B
> 
> User X: read c->tid: 3, c->freelist: NULL, c->page: A (stale value)
> 
> Because tid and freelist match the current ones, the free would
> succeed, but the current c->page is B and the object is for A, so this
> success is wrong.

Thanks for the example; it's extremely helpful!

Mark.


* [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-17 12:00   ` Mark Rutland
@ 2015-03-17 12:15     ` Mark Rutland
  2015-03-17 18:57       ` Christoph Lameter
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mark Rutland @ 2015-03-17 12:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Joonsoo Kim,
	Linus Torvalds, Pekka Enberg, Steve Capper

Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
removing preemption on/off") introduced an occasional hang for kernels
built with CONFIG_PREEMPT && !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

GCC 4.9 has been observed to hoist the load of c and c->tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload). On arm64 the generated assembly
looks like:

ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
ffffffc00016d3cc:       eb04003f        cmp     x1, x4
ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>

If the thread is preempted between the load of c->tid (into x1) and tid
(into x4), and an allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s->cpu_slab->tid wraps, which may be forever in the absence of
allocations/frees on the same CPU.

This patch changes the loop condition to access c->tid with READ_ONCE.
This ensures that the value is reloaded even when the compiler would
otherwise assume it could cache the value, and also ensures that the
load will not be torn.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steve Capper <steve.capper@linaro.org>
---
 mm/slub.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Since v1 [1]:
* Do not erroneously remove the loop

[1] lkml.kernel.org/r/1426261632-8911-1-git-send-email-mark.rutland@arm.com

diff --git a/mm/slub.c b/mm/slub.c
index 6832c4e..82c4737 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2449,7 +2449,8 @@ redo:
 	do {
 		tid = this_cpu_read(s->cpu_slab->tid);
 		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	} while (IS_ENABLED(CONFIG_PREEMPT) &&
+		 unlikely(tid != READ_ONCE(c->tid)));
 
 	/*
 	 * Irqless object alloc/free algorithm used here depends on sequence
@@ -2718,7 +2719,8 @@ redo:
 	do {
 		tid = this_cpu_read(s->cpu_slab->tid);
 		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	} while (IS_ENABLED(CONFIG_PREEMPT) &&
+		 unlikely(tid != READ_ONCE(c->tid)));
 
 	/* Same with comment on barrier() in slab_alloc_node() */
 	barrier();
-- 
1.9.1



* Re: [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-17 12:15     ` [PATCHv2] " Mark Rutland
@ 2015-03-17 18:57       ` Christoph Lameter
  2015-03-18  5:59       ` Joonsoo Kim
  2015-03-24 14:17       ` Ville Syrjälä
  2 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-03-17 18:57 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, David Rientjes,
	Jesper Dangaard Brouer, Joonsoo Kim, Linus Torvalds,
	Pekka Enberg, Steve Capper

On Tue, 17 Mar 2015, Mark Rutland wrote:

> Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
> removing preemption on/off") introduced an occasional hang for kernels
> built with CONFIG_PREEMPT && !CONFIG_SMP.

Acked-by: Christoph Lameter <cl@linux.com>



* Re: [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-17 12:15     ` [PATCHv2] " Mark Rutland
  2015-03-17 18:57       ` Christoph Lameter
@ 2015-03-18  5:59       ` Joonsoo Kim
  2015-03-18 15:21         ` Mark Rutland
  2015-03-24 14:17       ` Ville Syrjälä
  2 siblings, 1 reply; 14+ messages in thread
From: Joonsoo Kim @ 2015-03-18  5:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: LKML, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Joonsoo Kim,
	Linus Torvalds, Pekka Enberg, Steve Capper

2015-03-17 21:15 GMT+09:00 Mark Rutland <mark.rutland@arm.com>:
> Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
> removing preemption on/off") introduced an occasional hang for kernels
> built with CONFIG_PREEMPT && !CONFIG_SMP.
>
> The problem is the following loop the patch introduced to
> slab_alloc_node and slab_free:
>
> do {
>         tid = this_cpu_read(s->cpu_slab->tid);
>         c = raw_cpu_ptr(s->cpu_slab);
> } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>
> GCC 4.9 has been observed to hoist the load of c and c->tid above the
> loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
> constant and does not force a reload). On arm64 the generated assembly
> looks like:
>
> ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
> ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
> ffffffc00016d3cc:       eb04003f        cmp     x1, x4
> ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>
>
> If the thread is preempted between the load of c->tid (into x1) and tid
> (into x4), and an allocation or free occurs in another thread (bumping
> the cpu_slab's tid), the thread will be stuck in the loop until
> s->cpu_slab->tid wraps, which may be forever in the absence of
> allocations/frees on the same CPU.
>
> This patch changes the loop condition to access c->tid with READ_ONCE.
> This ensures that the value is reloaded even when the compiler would
> otherwise assume it could cache the value, and also ensures that the
> load will not be torn.
>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Steve Capper <steve.capper@linaro.org>
> ---
>  mm/slub.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> Since v1 [1]:
> * Do not erroneously remove the loop
>
> [1] lkml.kernel.org/r/1426261632-8911-1-git-send-email-mark.rutland@arm.com
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 6832c4e..82c4737 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2449,7 +2449,8 @@ redo:
>         do {
>                 tid = this_cpu_read(s->cpu_slab->tid);
>                 c = raw_cpu_ptr(s->cpu_slab);
> -       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +       } while (IS_ENABLED(CONFIG_PREEMPT) &&
> +                unlikely(tid != READ_ONCE(c->tid)));
>
>         /*
>          * Irqless object alloc/free algorithm used here depends on sequence
> @@ -2718,7 +2719,8 @@ redo:
>         do {
>                 tid = this_cpu_read(s->cpu_slab->tid);
>                 c = raw_cpu_ptr(s->cpu_slab);
> -       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +       } while (IS_ENABLED(CONFIG_PREEMPT) &&
> +                unlikely(tid != READ_ONCE(c->tid)));
>
>         /* Same with comment on barrier() in slab_alloc_node() */
>         barrier();
> --

Hello,

Could you show me the generated code again?

What we need to check is that the whole thing is redone in the loop.
The previously attached code seems to me to already refetch
c->tid in the loop, and this patch looks to only handle
refetching c->tid.
Will READ_ONCE(c->tid) trigger redoing 'tid = this_cpu_read(s->cpu_slab->tid)'?

Thanks.


* Re: [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-18  5:59       ` Joonsoo Kim
@ 2015-03-18 15:21         ` Mark Rutland
  2015-03-19 12:13           ` Joonsoo Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Rutland @ 2015-03-18 15:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: LKML, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Joonsoo Kim,
	Linus Torvalds, Pekka Enberg, Steve Capper

Hi,

> >         do {
> >                 tid = this_cpu_read(s->cpu_slab->tid);
> >                 c = raw_cpu_ptr(s->cpu_slab);
> > -       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +       } while (IS_ENABLED(CONFIG_PREEMPT) &&
> > +                unlikely(tid != READ_ONCE(c->tid)));

[...]

> Could you show me generated code again?

The code generated without this patch in !SMP && PREEMPT kernels is:

/* Hoisted load of c->tid */
ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
/* this_cpu_read(s->cpu_slab->tid)) -- buggy, see [1] */
ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
ffffffc00016d3cc:       eb04003f        cmp     x1, x4
ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>

The code generated with this patch in !SMP && PREEMPT kernels is:

/* this_cpu_read(s->cpu_slab->tid)) -- buggy, see [1] */
ffffffc00016d3c4:       f9400401        ldr     x1, [x0,#8]
/* load of c->tid */
ffffffc00016d3c8:       f9400404        ldr     x4, [x0,#8]
ffffffc00016d3cc:       eb04003f        cmp     x1, x4
ffffffc00016d3d0:       54ffffa1        b.ne    ffffffc00016d3c4 <slab_alloc_node.constprop.82+0x2c>

Note that with the patch the branch results in both loads being
performed again.

Given that in !SMP kernels we know that the loads _must_ happen on the
same CPU, I think we could go a bit further with the loop condition:

	while (IS_ENABLED(CONFIG_PREEMPT) &&
	       !IS_ENABLED(CONFIG_SMP) &&
	       unlikely(tid != READ_ONCE(c->tid)));

The barrier afterwards should be sufficient to order the load of the tid
against subsequent accesses to the other cpu_slab fields.

> What we need to check is that the whole thing is redone in the loop.
> The previously attached code seems to me to already refetch
> c->tid in the loop, and this patch looks to only handle
> refetching c->tid.

The refetch in the loop is this_cpu_read(s->cpu_slab->tid), not the load
of c->tid (which is hoisted above the loop).

> Will READ_ONCE(c->tid) trigger redoing 'tid = this_cpu_read(s->cpu_slab->tid)'?

I was under the impression that this_cpu operations would always result
in an access, much like the *_ONCE accessors, so we should always redo
the access for this_cpu_read(s->cpu_slab->tid). Is that not the case?

Mark.

[1] The arm64 this_cpu * operations are currently buggy. We generate the
    percpu address into a register, then perform the access with
    separate instructions (and could be preempted between the two).
    Steve Capper is currently fixing this.

    However, the hoisting of the c->tid load could happen regardless,
    whenever raw_cpu_ptr(c) can be evaluated at compile time.


* Re: [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-18 15:21         ` Mark Rutland
@ 2015-03-19 12:13           ` Joonsoo Kim
  2015-03-19 16:16             ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Joonsoo Kim @ 2015-03-19 12:13 UTC (permalink / raw)
  To: Mark Rutland
  Cc: LKML, Andrew Morton, Catalin Marinas, Christoph Lameter,
	David Rientjes, Jesper Dangaard Brouer, Joonsoo Kim,
	Linus Torvalds, Pekka Enberg, Steve Capper

2015-03-19 0:21 GMT+09:00 Mark Rutland <mark.rutland@arm.com>:
> Hi,
>
>> >         do {
>> >                 tid = this_cpu_read(s->cpu_slab->tid);
>> >                 c = raw_cpu_ptr(s->cpu_slab);
>> > -       } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>> > +       } while (IS_ENABLED(CONFIG_PREEMPT) &&
>> > +                unlikely(tid != READ_ONCE(c->tid)));
>
> [...]
>
>> Could you show me generated code again?
>
> The code generated without this patch in !SMP && PREEMPT kernels is:
>
> /* Hoisted load of c->tid */
> ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
> /* this_cpu_read(s->cpu_slab->tid)) -- buggy, see [1] */
> ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
> ffffffc00016d3cc:       eb04003f        cmp     x1, x4
> ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>
>
> The code generated with this patch in !SMP && PREEMPT kernels is:
>
> /* this_cpu_read(s->cpu_slab->tid)) -- buggy, see [1] */
> ffffffc00016d3c4:       f9400401        ldr     x1, [x0,#8]
> /* load of c->tid */
> ffffffc00016d3c8:       f9400404        ldr     x4, [x0,#8]
> ffffffc00016d3cc:       eb04003f        cmp     x1, x4
> ffffffc00016d3d0:       54ffffa1        b.ne    ffffffc00016d3c4 <slab_alloc_node.constprop.82+0x2c>
>
> Note that with the patch the branch results in both loads being
> performed again.
> Given that in !SMP kernels we know that the loads _must_ happen on the
> same CPU, I think we could go a bit further with the loop condition:
>
>         while (IS_ENABLED(CONFIG_PREEMPT) &&
>                !IS_ENABLED(CONFIG_SMP) &&
>                unlikely(tid != READ_ONCE(c->tid)));
>
> The barrier afterwards should be sufficient to order the load of the tid
> against subsequent accesses to the other cpu_slab fields.
>
>> What we need to check is that the whole sequence is redone in the
>> loop. The previously attached code seems to me to already refetch
>> c->tid in the loop, and this patch looks like it handles only the
>> refetch of c->tid.
>
> The refetch in the loop is this_cpu_read(s->cpu_slab->tid), not the load
> of c->tid (which is hoisted above the loop).

Okay. Now, I'm fine with your change.

>> Will READ_ONCE(c->tid) trigger redoing 'tid = this_cpu_read(s->cpu_slab->tid)'?
>
> I was under the impression that this_cpu operations would always result
> in an access, much like the *_ONCE accessors, so we should always redo
> the access for this_cpu_read(s->cpu_slab->tid). Is that not the case?

I'm not the expert on that operation. Christoph could answer it.

> Mark.
>
> [1] The arm64 this_cpu * operations are currently buggy. We generate the
>     percpu address into a register, then perform the access with
>     separate instructions (and could be preempted between the two).
>     Steve Capper is currently fixing this.
>
>     However, the hoisting of the c->tid load could happen regardless,
>     whenever raw_cpu_ptr(c) can be evaluated at compile time.

Okay.

Thanks.


* Re: [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-19 12:13           ` Joonsoo Kim
@ 2015-03-19 16:16             ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-03-19 16:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Mark Rutland, LKML, Andrew Morton, Catalin Marinas,
	David Rientjes, Jesper Dangaard Brouer, Joonsoo Kim,
	Linus Torvalds, Pekka Enberg, Steve Capper

On Thu, 19 Mar 2015, Joonsoo Kim wrote:

> > I was under the impression that this_cpu operations would always result
> > in an access, much like the *_ONCE accessors, so we should always redo
> > the access for this_cpu_read(s->cpu_slab->tid). Is that not the case?
>
> I'm not the expert on that operation. Christoph could answer it.

this_cpu_read() always generates a load via an asm instruction.


* Re: [PATCHv2] mm/slub: fix lockups on PREEMPT && !SMP kernels
  2015-03-17 12:15     ` [PATCHv2] " Mark Rutland
  2015-03-17 18:57       ` Christoph Lameter
  2015-03-18  5:59       ` Joonsoo Kim
@ 2015-03-24 14:17       ` Ville Syrjälä
  2 siblings, 0 replies; 14+ messages in thread
From: Ville Syrjälä @ 2015-03-24 14:17 UTC (permalink / raw)
  To: linux-kernel

Mark Rutland <mark.rutland@arm.com> writes:

> 
> Commit 9aabf810a67cd97e ("mm/slub: optimize alloc/free fastpath by
> removing preemption on/off") introduced an occasional hang for kernels
> built with CONFIG_PREEMPT && !CONFIG_SMP.
> 
> The problem is the following loop the patch introduced to
> slab_alloc_node and slab_free:
> 
> do {
>         tid = this_cpu_read(s->cpu_slab->tid);
>         c = raw_cpu_ptr(s->cpu_slab);
> } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> 
> GCC 4.9 has been observed to hoist the load of c and c->tid above the
> loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
> constant and does not force a reload). On arm64 the generated assembly
> looks like:
> 
> ffffffc00016d3c4:       f9400404        ldr     x4, [x0,#8]
> ffffffc00016d3c8:       f9400401        ldr     x1, [x0,#8]
> ffffffc00016d3cc:       eb04003f        cmp     x1, x4
> ffffffc00016d3d0:       54ffffc1        b.ne    ffffffc00016d3c8 <slab_alloc_node.constprop.82+0x30>

Just FYI I've hit this problem on x86. I think I used gcc 4.7 and 4.8, and
maybe also 4.6 earlier.

Here's the diff in asm after applying the fix:

     240a:      0f 85 ce 00 00 00       jne    24de <kmem_cache_free+0x10e>
     2410:      89 75 f0                mov    %esi,-0x10(%ebp)
     2413:      8b 07                   mov    (%edi),%eax
-    2415:      8b 50 04                mov    0x4(%eax),%edx
-    2418:      8b 48 04                mov    0x4(%eax),%ecx
-    241b:      39 d1                   cmp    %edx,%ecx
-    241d:      75 f9                   jne    2418 <kmem_cache_free+0x48>
+    2415:      8b 48 04                mov    0x4(%eax),%ecx
+    2418:      8b 50 04                mov    0x4(%eax),%edx
+    241b:      39 ca                   cmp    %ecx,%edx
+    241d:      75 f6                   jne    2415 <kmem_cache_free+0x45>
     241f:      8b 4d f0                mov    -0x10(%ebp),%ecx
     2422:      39 48 08                cmp    %ecx,0x8(%eax)
     2425:      0f 85 9a 00 00 00       jne    24c5 <kmem_cache_free+0xf5>

As 3.19+ is broken now, this should go into stable.
Cc: stable@vger.kernel.org



end of thread, other threads:[~2015-03-24 14:25 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-13 15:47 [PATCH] mm/slub: fix lockups on PREEMPT && !SMP kernels Mark Rutland
2015-03-13 16:29 ` Christoph Lameter
2015-03-13 18:16   ` Mark Rutland
2015-03-13 18:27     ` Christoph Lameter
2015-03-16 12:45   ` [PATCHv2] " Mark Rutland
2015-03-17  1:09 ` [PATCH] " Joonsoo Kim
2015-03-17 12:00   ` Mark Rutland
2015-03-17 12:15     ` [PATCHv2] " Mark Rutland
2015-03-17 18:57       ` Christoph Lameter
2015-03-18  5:59       ` Joonsoo Kim
2015-03-18 15:21         ` Mark Rutland
2015-03-19 12:13           ` Joonsoo Kim
2015-03-19 16:16             ` Christoph Lameter
2015-03-24 14:17       ` Ville Syrjälä
