LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
@ 2015-02-02 14:05 Oleg Nesterov
  2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-02 14:05 UTC (permalink / raw)
  To: Darren Hart, Peter Zijlstra, Thomas Gleixner
  Cc: Jerome Marchand, Larry Woodman, Mateusz Guzik, linux-kernel

Thomas, et al, could you please review?

The change looks trivial, but I simply can not understand this logic,
please help.



First of all, why exactly do we need this mm/PF_KTHREAD check added by
f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
kernel thread to be the owner as the changelog says. But why kthread is
worse than a random user-space task, say, /sbin/init?

IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter
if this (k)thread will exit or not. AFAICS, the only problem is that we can
boost the prio of this thread. Or I missed another problem?

I am asking because we need to backport some fixes, and I am trying to
convince myself that I actually understand what I am trying to do ;)




And another question. Lets forget about this ->mm check. I simply can not
understand this

	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN

logic in attach_to_pi_owner(). First of all, why do we need to retry if
PF_EXITING is set but PF_EXITPIDONE is not? Why we can not simply ignore
PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE is not set?

I must have missed something but this looks buggy, I do not see any
preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
forever in this case? (OK, ignoring RT throttling).

IOW. Could you explain why the patch below is wrong correctness-wise?
Lets ignore the fact it is obviously ugly and suboptimal...

Thanks,

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -684,13 +684,7 @@ void do_exit(long code)
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
 		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
+		 * XXX: ADD COMMENT
 		 */
 		tsk->flags |= PF_EXITPIDONE;
 		set_current_state(TASK_UNINTERRUPTIBLE);
@@ -699,8 +693,7 @@ void do_exit(long code)
 
 	exit_signals(tsk);  /* sets PF_EXITING */
 	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
+	 * XXX: anobody else needs this ?
 	 */
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
@@ -779,12 +772,6 @@ void do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
--- x/kernel/fork.c
+++ x/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk,
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -722,6 +722,7 @@ void exit_pi_state_list(struct task_stru
 	 * versus waiters unqueueing themselves:
 	 */
 	raw_spin_lock_irq(&curr->pi_lock);
+	curr->flags |= PF_EXITPIDONE;
 	while (!list_empty(head)) {
 
 		next = head->next;
@@ -912,17 +913,15 @@ static int attach_to_pi_owner(u32 uval, 
 	 * p->pi_lock:
 	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
 		/*
 		 * The task is on the way out. When PF_EXITPIDONE is
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
-		return ret;
+		return -ESRCH;
 	}
 
 	/*


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 14:05 [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads Oleg Nesterov
@ 2015-02-02 14:05 ` Oleg Nesterov
  2015-02-04 10:48   ` Peter Zijlstra
                     ` (2 more replies)
  2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
  2015-02-16 20:13 ` [PATCH 0/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition Oleg Nesterov
  2 siblings, 3 replies; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-02 14:05 UTC (permalink / raw)
  To: Darren Hart, Peter Zijlstra, Thomas Gleixner
  Cc: Jerome Marchand, Larry Woodman, Mateusz Guzik, linux-kernel

attach_to_pi_owner() checks p->mm to prevent attaching to kthreads and
this looks doubly wrong:

1. It should actually check PF_KTHREAD, kthread can do use_mm().

2. If this task is not kthread and it is actually the lock owner we can
   wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN.

   And note that this wrong EPERM is the likely case unless the exiting
   task is (auto)reaped quickly, we check ->mm before PF_EXITING.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/futex.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 63678b5..b101381 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -900,7 +900,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	if (!p)
 		return -ESRCH;
 
-	if (!p->mm) {
+	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
 		return -EPERM;
 	}
-- 
1.5.5.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 14:05 [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads Oleg Nesterov
  2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
@ 2015-02-02 15:11 ` Peter Zijlstra
  2015-02-02 15:13   ` Peter Zijlstra
                     ` (2 more replies)
  2015-02-16 20:13 ` [PATCH 0/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition Oleg Nesterov
  2 siblings, 3 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-02 15:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:

> First of all, why exactly do we need this mm/PF_KTHREAD check added by
> f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
> kernel thread to be the owner as the changelog says. But why kthread is
> worse than a random user-space task, say, /sbin/init?

As the changelog says, we _should_ equally disallow other userspace
tasks that do not share the futex value with us, its just that at the
time we could not come up with a sensible (and cheap) way of testing for
this.

> IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter
> if this (k)thread will exit or not. AFAICS, the only problem is that we can
> boost the prio of this thread. Or I missed another problem?

No that's it.

> I am asking because we need to backport some fixes, and I am trying to
> convince myself that I actually understand what I am trying to do ;)



> And another question. Lets forget about this ->mm check. I simply can not
> understand this
> 
> 	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN
> 
> logic in attach_to_pi_owner(). First of all, why do we need to retry if
> PF_EXITING is set but PF_EXITPIDONE is not? Why we can not simply ignore
> PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE is not set?
> 
> I must have missed something but this looks buggy, I do not see any
> preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
> preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
> forever in this case? (OK, ignoring RT throttling).

This is not something I've ever looked at before; 778e9a9c3e71
("pi-futex: fix exit races and locking problems") seems to suggest its
possible to get onto tsk->pi_state_list after exit_pi_state_list().

So while the below shows preemption points; those don't actually help
against RT tasks, a FIFO-99 task will always be more eligible to run
than most others.

So yes, I do like your proposal of putting PF_EXITPIDONE under the
->pi_lock section that handles exit_pi_state_list().

I further think we can remove the smp_mb(); raw_spin_unlock_wait() from
do_exit() -- this would offset the new unconditional ->pi_lock
acquisition in exit_pi_state_list(). The comment there suggests robust
futexes are involved but I cannot find any except the PI state muck
testing ->flags.

As for the recursive fault; I think the safer option is to set
EXITPIDONE and not register more PI states, as opposed to allowing more
and more states to be added. Yes we'll leak whatever currently is there,
but no point in allowing it to get worse.


do_exit()
{
	exit_signals(tsk); /* sets PF_EXITING */

	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	exit_mm() {
		mm_release() {
			exit_pi_state_list();
		}
	}

	tsk->flags |= PF_EXITPIDONE;
}

vs

futex_lock_pi()
{
retry:
	...

	ret = futex_lock_pi_atomic() {
		attach_to_pi_owner() {
			raw_spin_lock(&tsk->pi_lock);
			if (PF_EXITING) {
				ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
				raw_spin_unlock(&tsk->pi_lock);
				return ret;
			}
		}
	}
	if (ret) {
		switch(ret) {
		...

		case -EAGAIN:
			...
			cond_resched();
			goto retry;
		}
	}
}

vs

futex_requeue()
{
retry:
	...

	ret = futex_proxy_trylock_atomic() {
		ret = futex_lock_pi_atomic() {
			attach_to_pi_owner() {
				raw_spin_lock(&tsk->pi_lock);
				if (PF_EXITING) {
					ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
					raw_spin_unlock(&tsk->pi_lock);
					return ret;
				}
			}
		}
	}

	if (ret > 0) {
		ret = lookup_pi_state() {
			attach_to_pi_owner() {
				raw_spin_lock(&tsk->pi_lock);
				if (PF_EXITING) {
					ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
					raw_spin_unlock(&tsk->pi_lock);
					return ret;
				}
			}
		}
	}

	...
	switch(ret) {
		...
	case -EAGAIN:
		...
		cond_resched();
		goto retry;
	}
}

vs



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
@ 2015-02-02 15:13   ` Peter Zijlstra
  2015-02-02 15:14     ` Peter Zijlstra
  2015-02-02 16:20   ` Oleg Nesterov
  2015-02-03 20:09   ` Oleg Nesterov
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-02 15:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

(private noaw)

On Mon, Feb 02, 2015 at 04:11:59PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:
> 
> > First of all, why exactly do we need this mm/PF_KTHREAD check added by
> > f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
> > kernel thread to be the owner as the changelog says. But why kthread is
> > worse than a random user-space task, say, /sbin/init?
> 
> As the changelog says, we _should_ equally disallow other userspace
> tasks that do not share the futex value with us, its just that at the
> time we could not come up with a sensible (and cheap) way of testing for
> this.
> 
> > IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter
> > if this (k)thread will exit or not. AFAICS, the only problem is that we can
> > boost the prio of this thread. Or I missed another problem?
> 
> No that's it.

Prio leaks allow (local) DoS attacks. It allows an unpriv user to gain
FIFO and burn silly amounts of cycles.

We should really plug that hole entirely.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 15:13   ` Peter Zijlstra
@ 2015-02-02 15:14     ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-02 15:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On Mon, Feb 02, 2015 at 04:13:38PM +0100, Peter Zijlstra wrote:
> (private noaw)

OK, I suck.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
  2015-02-02 15:13   ` Peter Zijlstra
@ 2015-02-02 16:20   ` Oleg Nesterov
  2015-02-03 20:09   ` Oleg Nesterov
  2 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-02 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On 02/02, Peter Zijlstra wrote:
>
> On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:
>
> > IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter
> > if this (k)thread will exit or not. AFAICS, the only problem is that we can
> > boost the prio of this thread. Or I missed another problem?
>
> No that's it.

OK, thanks Peter. I was afraid I missed another reason for this check.

> > I must have missed something but this looks buggy, I do not see any
> > preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
> > preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
> > forever in this case? (OK, ignoring RT throttling).
>
> This is not something I've ever looked at before; 778e9a9c3e71
> ("pi-futex: fix exit races and locking problems") seems to suggest its
> possible to get onto tsk->pi_state_list after exit_pi_state_list().
>
> So while the below shows preemption points; those don't actually help
> against RT tasks, a FIFO-99 task will always be more eligible to run
> than most others.

Yes, yes, sorry, "not see any preemption point" looks confusing.

> So yes, I do like your proposal of putting PF_EXITPIDONE under the
> ->pi_lock section that handles exit_pi_state_list().
>
> I further think we can remove the smp_mb(); raw_spin_unlock_wait() from
> do_exit() -- this would offset the new unconditional ->pi_lock
> acquisition in exit_pi_state_list(). The comment there suggests robust
> futexes are involved but I cannot find any except the PI state muck
> testing ->flags.

Yes, probably nothing else needs to sync with PF_EXITING...

> As for the recursive fault; I think the safer option is to set
> EXITPIDONE and not register more PI states, as opposed to allowing more
> and more states to be added. Yes we'll leak whatever currently is there,
> but no point in allowing it to get worse.

Thanks! I'll try to think about the patch tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
  2015-02-02 15:13   ` Peter Zijlstra
  2015-02-02 16:20   ` Oleg Nesterov
@ 2015-02-03 20:09   ` Oleg Nesterov
  2015-02-04 11:12     ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-03 20:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

Peter,

I am getting more confused when I re-read your email today ;) see below.

Btw, do you agree with 1/1? Can you ack/nack it?

On 02/02, Peter Zijlstra wrote:
>
> On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:
>
> > And another question. Lets forget about this ->mm check. I simply can not
> > understand this
> >
> > 	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN
> >
> > I must have missed something but this looks buggy, I do not see any
> > preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
> > preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
> > forever in this case? (OK, ignoring RT throttling).
>
> So yes, I do like your proposal of putting PF_EXITPIDONE under the
> ->pi_lock section that handles exit_pi_state_list().

Probably I was not clear... Let try again just in case.

I believe that the whole "spin waiting for PF_EXITING -> PF_EXITPIDONE
transition" idea is simply wrong. See the test-case I sent.

I think that attach_to_pi_owner() should never check PF_EXITING and never
return -EAGAIN. It should either proceed and add pi_state to the list or
return -ESRCH if exit_pi_state_list() was called.

Do you agree?

Perhaps we can set PF_EXITPIDONE lockless and avoid the unconditional
lock(pi_lock) but this is minor.

The main problem is that I fail to understand why this logic was added
in the first place... To avoid the race with exit_robust_list() ? I do
not see why this is needed...

> As for the recursive fault; I think the safer option is to set
> EXITPIDONE and not register more PI states, as opposed to allowing more
> and more states to be added. Yes we'll leak whatever currently is there,
> but no point in allowing it to get worse.

Not sure I understand... If you mean recursive do_exit() then yes, I think
that we should simply set EXITPIDONE lockless in a best-effort manner, this
is what the current code does. Just the comment should be updated in any
case imo.

But mostly I was confused by the pseudo-code below. Heh, because I thought
that it describes the changes in kernel/futex.c you think we should do. Now
that I finally realized that it outlines the current code I am unconfused a
bit ;)

Oleg.

> do_exit()
> {
> 	exit_signals(tsk); /* sets PF_EXITING */
>
> 	smp_mb();
> 	raw_spin_unlock_wait(&tsk->pi_lock);
>
> 	exit_mm() {
> 		mm_release() {
> 			exit_pi_state_list();
> 		}
> 	}
>
> 	tsk->flags |= PF_EXITPIDONE;
> }
>
> vs
>
> futex_lock_pi()
> {
> retry:
> 	...
>
> 	ret = futex_lock_pi_atomic() {
> 		attach_to_pi_owner() {
> 			raw_spin_lock(&tsk->pi_lock);
> 			if (PF_EXITING) {
> 				ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
> 				raw_spin_unlock(&tsk->pi_lock);
> 				return ret;
> 			}
> 		}
> 	}
> 	if (ret) {
> 		switch(ret) {
> 		...
>
> 		case -EAGAIN:
> 			...
> 			cond_resched();
> 			goto retry;
> 		}
> 	}
> }
>
> vs
>
> futex_requeue()
> {
> retry:
> 	...
>
> 	ret = futex_proxy_trylock_atomic() {
> 		ret = futex_lock_pi_atomic() {
> 			attach_to_pi_owner() {
> 				raw_spin_lock(&tsk->pi_lock);
> 				if (PF_EXITING) {
> 					ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
> 					raw_spin_unlock(&tsk->pi_lock);
> 					return ret;
> 				}
> 			}
> 		}
> 	}
>
> 	if (ret > 0) {
> 		ret = lookup_pi_state() {
> 			attach_to_pi_owner() {
> 				raw_spin_lock(&tsk->pi_lock);
> 				if (PF_EXITING) {
> 					ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
> 					raw_spin_unlock(&tsk->pi_lock);
> 					return ret;
> 				}
> 			}
> 		}
> 	}
>
> 	...
> 	switch(ret) {
> 		...
> 	case -EAGAIN:
> 		...
> 		cond_resched();
> 		goto retry;
> 	}
> }
>
> vs
>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
@ 2015-02-04 10:48   ` Peter Zijlstra
  2015-02-14 18:01   ` Davidlohr Bueso
  2015-02-18 17:11   ` [tip:locking/core] locking/futex: Check " tip-bot for Oleg Nesterov
  2 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-04 10:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On Mon, Feb 02, 2015 at 03:05:36PM +0100, Oleg Nesterov wrote:
> attach_to_pi_owner() checks p->mm to prevent attaching to kthreads and
> this looks doubly wrong:
> 
> 1. It should actually check PF_KTHREAD, kthread can do use_mm().
> 
> 2. If this task is not kthread and it is actually the lock owner we can
>    wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN.
> 
>    And note that this wrong EPERM is the likely case unless the exiting
>    task is (auto)reaped quickly, we check ->mm before PF_EXITING.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-03 20:09   ` Oleg Nesterov
@ 2015-02-04 11:12     ` Peter Zijlstra
  2015-02-04 20:25       ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-04 11:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On Tue, Feb 03, 2015 at 09:09:16PM +0100, Oleg Nesterov wrote:
> Btw, do you agree with 1/1? Can you ack/nack it?

Done!

> On 02/02, Peter Zijlstra wrote:
> >
> > On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:
> >
> > > And another question. Lets forget about this ->mm check. I simply can not
> > > understand this
> > >
> > > 	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN
> > >
> > > I must have missed something but this looks buggy, I do not see any
> > > preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
> > > preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
> > > forever in this case? (OK, ignoring RT throttling).
> >
> > So yes, I do like your proposal of putting PF_EXITPIDONE under the
> > ->pi_lock section that handles exit_pi_state_list().
> 
> Probably I was not clear... Let try again just in case.
> 
> I believe that the whole "spin waiting for PF_EXITING -> PF_EXITPIDONE
> transition" idea is simply wrong. See the test-case I sent.
> 
> I think that attach_to_pi_owner() should never check PF_EXITING and never
> return -EAGAIN. It should either proceed and add pi_state to the list or
> return -ESRCH if exit_pi_state_list() was called.
> 
> Do you agree?

Yes.

> Perhaps we can set PF_EXITPIDONE lockless and avoid the unconditional
> lock(pi_lock) but this is minor.

Agreed, lets first fix things. We can optimize later.

> The main problem is that I fail to understand why this logic was added
> in the first place... To avoid the race with exit_robust_list() ? I do
> not see why this is needed...

exit_pi_state_list() I think, but 778e9a9c3e71 ("pi-futex: fix exit
races and locking problems") is a big and somewhat confusing patch.

I'm not quite sure why/how all that happened either, it was before I got
sucked into all this.

I'm not entire sure why we need two PF flags for this; once PF_EXITING
is set userspace is _dead_ and it doesn't make sense to keep adding
(futex) PI-state to the task.

> > As for the recursive fault; I think the safer option is to set
> > EXITPIDONE and not register more PI states, as opposed to allowing more
> > and more states to be added. Yes we'll leak whatever currently is there,
> > but no point in allowing it to get worse.
> 
> Not sure I understand... If you mean recursive do_exit() then yes, I think
> that we should simply set EXITPIDONE lockless in a best-effort manner, this
> is what the current code does. Just the comment should be updated in any
> case imo.

Yes, the "Fixing recursive fault..." branch, you had an XXX explain
comment there. I think we agree there.

> But mostly I was confused by the pseudo-code below. Heh, because I thought
> that it describes the changes in kernel/futex.c you think we should do. Now
> that I finally realized that it outlines the current code I am unconfused a
> bit ;)

Yes, it was an attempt to show what the current code does -- which is;
of itself; confusing enough.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-04 11:12     ` Peter Zijlstra
@ 2015-02-04 20:25       ` Oleg Nesterov
  2015-02-05 16:27         ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-04 20:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On 02/04, Peter Zijlstra wrote:
>
> On Tue, Feb 03, 2015 at 09:09:16PM +0100, Oleg Nesterov wrote:
> > Btw, do you agree with 1/1? Can you ack/nack it?
>
> Done!

Thanks ;)

> > I think that attach_to_pi_owner() should never check PF_EXITING and never
> > return -EAGAIN. It should either proceed and add pi_state to the list or
> > return -ESRCH if exit_pi_state_list() was called.
> >
> > Do you agree?
>
> Yes.

OK, great.

> > Perhaps we can set PF_EXITPIDONE lockless and avoid the unconditional
> > lock(pi_lock) but this is minor.
>
> Agreed, lets first fix things. We can optimize later.

Yes, agreed. and BTW the current list_empty(&tsk->pi_state_list) check
in mm_release() doesn't look right in theory. It seems that we need
another barrier before this check and list_empty_careful(). Nevermind,
this is only theoretical and we have another unlock_wait(pi_lock) in
do_exit().

> I'm not entire sure why we need two PF flags for this; once PF_EXITING
> is set userspace is _dead_ and it doesn't make sense to keep adding
> (futex) PI-state to the task.

This is what I _seem_ to understand: exit_robust_list(). Although I am
not sure this all is by design...

And this is the reason why I still can't finish the patch. Perhaps I am
totally confused, but I think there is yet another problem here.

Please forget about PF_EXIT.*. attach_to_pi_owner() returns -ESRCH if
futex_find_get_task() and even this looks wrong. Because handle_futex_death()
updates *uaddr lockless and does nothing if "pi". This means that the owner
of PI + robust mutex can go away (or just set PF_EXITPIDONE) and the caller
of futex_lock_pi() can miss unlock.

Peter, could you confirm that this problem does exist, or I missed something?

If yes. perhaps we need another get_futex_value_locked() before "return ESRCH",
or perhaps something like the (ugly) patch below. I'll try to think again...

Oleg.

--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -2815,12 +2815,20 @@ retry:
 		if (nval != uval)
 			goto retry;
 
-		/*
-		 * Wake robust non-PI futexes here. The wakeup of
-		 * PI futexes happens in exit_pi_state():
-		 */
-		if (!pi && (uval & FUTEX_WAITERS))
-			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
+		if (uval & FUTEX_WAITERS) {
+			if (pi) {
+				/*
+				 * Wake robust non-PI futexes here. The wakeup
+				 * of PI futexes happens in exit_pi_stale_list().
+				 * Sync with potential attachers to this list.
+				 */
+				get_futex_key(..., &key, ...);
+				hb = hash_futex(&key);
+				spin_unlock_wait(&hb->lock);
+			} else {
+				futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
+			}
+		}
 	}
 	return 0;
 }


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-04 20:25       ` Oleg Nesterov
@ 2015-02-05 16:27         ` Peter Zijlstra
  2015-02-05 18:10           ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-05 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On Wed, Feb 04, 2015 at 09:25:09PM +0100, Oleg Nesterov wrote:
> > I'm not entire sure why we need two PF flags for this; once PF_EXITING
> > is set userspace is _dead_ and it doesn't make sense to keep adding
> > (futex) PI-state to the task.
> 
> This is what I _seem_ to understand: exit_robust_list(). Although I am
> not sure this all is by design...
> 
> And this is the reason why I still can't finish the patch. Perhaps I am
> totally confused, but I think there is yet another problem here.
> 
> Please forget about PF_EXIT.*. attach_to_pi_owner() returns -ESRCH if
> futex_find_get_task() and even this looks wrong. 

You'll have to help me out a little here; where do we unhash the PIDs?
>From what I can find we set PF_EXITING _before_ unhashing ourselves.

In fact, from what I can tell we only unhash after calling both
exit_robust_list and exit_pi_state_list.

> Because handle_futex_death()
> updates *uaddr lockless and does nothing if "pi". This means that the owner
> of PI + robust mutex can go away (or just set PF_EXITPIDONE) and the caller
> of futex_lock_pi() can miss unlock.
> 
> Peter, could you confirm that this problem does exist, or I missed something?

So as long as we unhash _last_ I can't see this happening, we'll always
find the task, the robust list walk doesn't care about PI state.

The exit_pi_state_list() will serialize against any concurrent attach
that might be in progress -- and we nkow there won't be a new one since
we've set PF_EXITING. And kill all the PI owners stuff.

But please, if you suspect, share a little more detail on how you see
this happening, this is not code I've looked at in detail before.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-05 16:27         ` Peter Zijlstra
@ 2015-02-05 18:10           ` Oleg Nesterov
  2015-02-06 10:46             ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-05 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

Let me first say that I simply do not know if PI+robust futex is actually
supposed (or guaranteed) to work.

Documentation/pi-futex.txt says

	'robustness' and 'PI' are two orthogonal
	properties of futexes, and all four combinations are possible: futex,
	robust-futex, PI-futex, robust+PI-futex.

And exit_robust_list() checks bit 0 to detect the "PI" case, so I think
this should work.

However, this comment

	/*
	 * This task is holding PI mutexes at exit time => bad.
	 * Kernel cleans up PI-state, but userspace is likely hosed.
	 * (Robust-futex cleanup is separate and might save the day for userspace.)
	 */

above exit_pi_state_list() looks confusing. In fact it looks wrong if
PI+robust should work. Because handle_futex_death() seems to rely on
exit_pi_state_list.

Now, if it should work,

On 02/05, Peter Zijlstra wrote:
>
> So as long as we unhash _last_ I can't see this happening, we'll always
> find the task, the robust list walk doesn't care about PI state.

and it simply can't take care of PI state. ->pi_state can be NULL by
the time exit_robust_list() is called.

> But please, if you suspect, share a little more detail on how you see
> this happening, this is not code I've looked at in detail before.

Heh, I am reading it for the first time ;) So I can be easily wrong.

But afaics the race/problem is very simple. Suppose a task T locks a PI+robust
mutex and exits. I this case (I presume) sys_futex(uaddr, FUTEX_LOCK_PI)
from another task X must always succeed sooner or later. But

	- X takes queue_lock() and reads *uaddr == T->pid. Need to setup
	  pi_state and wait. FUTEX_WAITERS is set.

	- T exits and calls handle_futex_death(). This clears FUTEX_TID_MASK
	  and sets FUTEX_OWNER_DIED, without any lock.

	  T->pi_state_list is empty, exit_pi_state_list() does nothing.

	  T goes away or simply sets PF_EXITPIDONE (lets ignore PF_EXITING).

	- X calls attach_to_pi_owner() and futex_find_get_task() returns NULL,
	  or we detect PF_EXITPIDONE, this doesn't really matter.

	  What does matter (unless I missed something) is that -ESRCH is wrong
	  in this case. This mutex was unlocked. It is robust, so we should not
	  miss this unlock.

So I think that in this case we either need to recheck that *uaddr is still the
same (and turn -ESRCH into -EAGAIN otherwise), or change handle_futex_death() to
serialize with X so that it can proceed and attach pi_state.

No?

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-05 18:10           ` Oleg Nesterov
@ 2015-02-06 10:46             ` Peter Zijlstra
  2015-02-06 17:04               ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-06 10:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

On Thu, Feb 05, 2015 at 07:10:14PM +0100, Oleg Nesterov wrote:
> Let me first say that I simply do not know if PI+robust futex is actually
> supposed (or guaranteed) to work.

> Now, if it should work,

I 'think' it _should_ work. Afaict the glibc code sees this as a valid
combination.

> On 02/05, Peter Zijlstra wrote:
> >
> > So as long as we unhash _last_ I can't see this happening, we'll always
> > find the task, the robust list walk doesn't care about PI state.
> 
> and it simply can't take care of PI state. ->pi_state can be NULL by
> the time exit_robust_list() is called.
> 
> > But please, if you suspect, share a little more detail on how you see
> > this happening, this is not code I've looked at in detail before.
> 
> Heh, I am reading it for the first time ;) So I can be easily wrong.
> 
> But afaics the race/problem is very simple. Suppose a task T locks a PI+robust
> mutex and exits. I this case (I presume) sys_futex(uaddr, FUTEX_LOCK_PI)
> from another task X must always succeed sooner or later. But
> 
> 	- X takes queue_lock() and reads *uaddr == T->pid. Need to setup
> 	  pi_state and wait. FUTEX_WAITERS is set.
> 
> 	- T exits and calls handle_futex_death(). This clears FUTEX_TID_MASK
> 	  and sets FUTEX_OWNER_DIED, without any lock.
> 
> 	  T->pi_state_list is empty, exit_pi_state_list() does nothing.

Right, because T acquired the lock from userspace and there have not yet
been any waiters, so there's no pi state.

> 	  T goes away or simply sets PF_EXITPIDONE (lets ignore PF_EXITING).
> 
> 	- X calls attach_to_pi_owner() and futex_find_get_task() returns NULL,
> 	  or we detect PF_EXITPIDONE, this doesn't really matter.
> 
> 	  What does matter (unless I missed something) is that -ESRCH is wrong
> 	  in this case. This mutex was unlocked. It is robust, so we should not
> 	  miss this unlock.

Right,..

> So I think that in this case we either need to recheck that *uaddr is still the
> same (and turn -ESRCH into -EAGAIN otherwise), or change handle_futex_death() to
> serialize with X so that it can proceed and attach pi_state.
> 
> No?

I _think_ you're right, doing -ESRCH is wrong without first looking to
see if uval changed and gained an FUTEX_OWNER_DIED.

I don't think making handle_futex_death() wait on hb lock works because
of the -EAGAIN loop releasing that lock.

Now, I think Darren actually had a futex test suite; Darren can you add
a robust-pi test like the above to stress this?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-06 10:46             ` Peter Zijlstra
@ 2015-02-06 17:04               ` Oleg Nesterov
  2015-02-09 20:38                 ` Darren Hart
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-06 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel

Peter, I am spamming you again and again, but I didn't even start the
patches. It turns out I can do nothing until devconf.cz finishes next
week.

On 02/06, Peter Zijlstra wrote:
>
> On Thu, Feb 05, 2015 at 07:10:14PM +0100, Oleg Nesterov wrote:
>
> > So I think that in this case we either need to recheck that *uaddr is still the
> > same (and turn -ESRCH into -EAGAIN otherwise), or change handle_futex_death() to
> > serialize with X so that it can proceed and attach pi_state.
> >
> > No?
>
> I _think_ you're right, doing -ESRCH is wrong without first looking to
> see if uval changed and gained an FUTEX_OWNER_DIED.

OK, thanks.

> I don't think making handle_futex_death() wait on hb lock works because
> of the -EAGAIN loop releasing that lock.

I think this should work... EAGAIN loop will either notice the change in
*uaddr or it will attach to pi list successfully. But please ignore, even
if I am right I do not like this change too.



And there is another thing which looks like design bug to me. I understand
that it is too late and pointless to complain, probably we can't change the
current behaviour, but I simply can't resist...

Suppose that a task T takes a PI futex (non-robust) and exits. Another task
does futex(FUTEX_LOCK_PI).

Now. if futex() is called after T exits it returns -ESRCH, this is correct.
But if it is called before, it succeeds while (I think) it should not.
fixup_owner() treats pi_state->owner == NULL pretty much as "unlocked".

IOW,
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <assert.h>

	#define FUTEX_LOCK_PI	6

	int main(void)
	{
		int mutex, pid, err;

		pid = fork();
		if (!pid) {
			sleep(1);
			return 0;
		}

		mutex = pid;
		err = syscall(__NR_futex, &mutex, FUTEX_LOCK_PI, 0,0,0);
		printf("err=%d %x -> %x %m\n", err, pid, mutex);

		assert(wait(NULL) == pid);
		return 0;
	}

I don't understand why syscall(FUTEX_LOCK_PI) succeeds in this case.
To me it should fail with -ESRCH, this would be much more consistent
imho.

And this means that "PI" implies "robust" to some degree. OK, may be
this is fine. But if this is fine, why we can't do the same if, say,
futex_find_get_task() returns NULL ?

To me the rule should be simple. If the owner dies the next LOCK_PI
should succeed if and only if the futex was robust.

Or at least this should not depend on timing.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-06 17:04               ` Oleg Nesterov
@ 2015-02-09 20:38                 ` Darren Hart
  2015-02-10 11:14                   ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Darren Hart @ 2015-02-09 20:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, LKML, dvhart

Hi Oleg, Peter,

Wow! When it rains it pours.

I'm paging this all in, but unfortunately, like both of you, I'm
pretty new to the robust futexes side of this problem.

I will prepare the test that Peter suggested so we have something to
test with now as well as run for regressions over time.

On Fri, Feb 6, 2015 at 9:04 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Peter, I am spamming you again and again, but I didn't even start the
> patches. It turns out I can do nothing until devconf.cz finishes next
> week.
>
> On 02/06, Peter Zijlstra wrote:
>>
>> On Thu, Feb 05, 2015 at 07:10:14PM +0100, Oleg Nesterov wrote:
>>
>> > So I think that in this case we either need to recheck that *uaddr is still the
>> > same (and turn -ESRCH into -EAGAIN otherwise), or change handle_futex_death() to
>> > serialize with X so that it can proceed and attach pi_state.
>> >
>> > No?
>>
>> I _think_ you're right, doing -ESRCH is wrong without first looking to
>> see if uval changed and gained an FUTEX_OWNER_DIED.
>

Hrm, if we cleared the TID mask, and the pi chain is empty, can we not
clear the waiters? I'm sure there's at least one corner case that
complicates this further... I'll think on this more and get back to
you.

-- 
Darren Hart

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-09 20:38                 ` Darren Hart
@ 2015-02-10 11:14                   ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-10 11:14 UTC (permalink / raw)
  To: Darren Hart
  Cc: Peter Zijlstra, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, LKML, dvhart

Hi Darren,

On 02/09, Darren Hart wrote:
>
> I will prepare the test that Peter suggested so we have something to
> test with now as well as run for regressions over time.

Just in case, I already wrote the stupid test-case:

	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <sys/mman.h>
	#include <assert.h>

	#define FUTEX_LOCK_PI	6

	struct robust_list {
		struct robust_list *next;
	};

	struct robust_list_head {
		struct robust_list list;
		long futex_offset;
		struct robust_list *list_op_pending;
	};

	int main(void)
	{
		int *mutex = mmap(NULL, 4, PROT_READ|PROT_WRITE,
					MAP_ANONYMOUS|MAP_SHARED, -1,0);
		assert((void *)mutex != MAP_FAILED);

		for (;;) {
			int err, pid = fork();

			if (!pid) {
				struct robust_list_head head;

				head.list.next = &head.list;
				head.futex_offset = 0;
				head.list_op_pending = (void *)mutex + 1;

				assert(syscall(__NR_set_robust_list, &head, sizeof(head)) == 0);
				kill(getpid(), SIGSTOP);
				_exit(0);
			}

			assert(waitpid(-1, NULL, WSTOPPED) == pid);

			*mutex = pid;
			kill(pid, SIGKILL);
			err = syscall(__NR_futex, mutex, FUTEX_LOCK_PI, 0,0,0);
			assert(wait(NULL) == pid);

			if (err) {
				printf("err=%d %m\n", err);
				kill(0, SIGKILL);
			}
		}

		return 0;
	}

it needs ~20 secs to fail on my machine. Probably it can be improved.

> Hrm, if we cleared the TID mask, and the pi chain is empty, can we not
> clear the waiters?

In this case the waiter should take care, I guess.


OK. I'll try to make at least the 1st fix today (EXITING -> EXITPIDONE
livelock).

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
  2015-02-04 10:48   ` Peter Zijlstra
@ 2015-02-14 18:01   ` Davidlohr Bueso
  2015-02-14 20:57     ` Oleg Nesterov
  2015-02-18 17:11   ` [tip:locking/core] locking/futex: Check " tip-bot for Oleg Nesterov
  2 siblings, 1 reply; 25+ messages in thread
From: Davidlohr Bueso @ 2015-02-14 18:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Peter Zijlstra, Thomas Gleixner, Jerome Marchand,
	Larry Woodman, Mateusz Guzik, linux-kernel

On Mon, 2015-02-02 at 15:05 +0100, Oleg Nesterov wrote:
> attach_to_pi_owner() checks p->mm to prevent attaching to kthreads and
> this looks doubly wrong:
> 
> 1. It should actually check PF_KTHREAD, kthread can do use_mm().
> 
> 2. If this task is not kthread and it is actually the lock owner we can
>    wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN.
> 
>    And note that this wrong EPERM is the likely case unless the exiting
>    task is (auto)reaped quickly, we check ->mm before PF_EXITING.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/futex.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 63678b5..b101381 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -900,7 +900,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
>  	if (!p)
>  		return -ESRCH;
>  
> -	if (!p->mm) {
> +	if (unlikely(p->flags & PF_KTHREAD)) {
>  		put_task_struct(p);
>  		return -EPERM;
>  	}

Futexes aren't the only naughty checkers, a quick search shows that, at
least, the oom killer and proc have this same problem. Should we make
this generic and update accordingly? ie:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..b0d37d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1991,6 +1991,11 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+static inline bool task_is_kthread(struct task_struct *task)
+{
+	return task->flags & PF_KTHREAD;
+}
+
 /* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
  * __GFP_FS is also cleared as it implies __GFP_IO.
  */


Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-14 18:01   ` Davidlohr Bueso
@ 2015-02-14 20:57     ` Oleg Nesterov
  2015-02-14 21:15       ` Davidlohr Bueso
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-14 20:57 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Darren Hart, Peter Zijlstra, Thomas Gleixner, Jerome Marchand,
	Larry Woodman, Mateusz Guzik, linux-kernel

On 02/14, Davidlohr Bueso wrote:
>
> On Mon, 2015-02-02 at 15:05 +0100, Oleg Nesterov wrote:
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -900,7 +900,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
> >  	if (!p)
> >  		return -ESRCH;
> >
> > -	if (!p->mm) {
> > +	if (unlikely(p->flags & PF_KTHREAD)) {
> >  		put_task_struct(p);
> >  		return -EPERM;
> >  	}
>
> Futexes aren't the only naughty checkers,

Yes, this is the common mistake,

> a quick search shows that, at
> least, the oom killer and proc have this same problem.

oom looks fine, note the PF_KTHREAD check in oom_unkillable(). It check ->mm
for another reason, to figure out if this task has passed exit_mm() or not.

proc looks fine too at first glance... we do not care if this mm was adopted
via use_mm() or not.

> Should we make
> this generic and update accordingly? ie:
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..b0d37d6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1991,6 +1991,11 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
>  #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
>  #define used_math() tsk_used_math(current)
>
> +static inline bool task_is_kthread(struct task_struct *task)
> +{
> +	return task->flags & PF_KTHREAD;
> +}

Perhaps... but PF_KTHREAD looks self-documented too ;)

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-14 20:57     ` Oleg Nesterov
@ 2015-02-14 21:15       ` Davidlohr Bueso
  2015-02-14 21:54         ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Davidlohr Bueso @ 2015-02-14 21:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Peter Zijlstra, Thomas Gleixner, Jerome Marchand,
	Larry Woodman, Mateusz Guzik, linux-kernel

On Sat, 2015-02-14 at 21:57 +0100, Oleg Nesterov wrote:
> Perhaps... but PF_KTHREAD looks self-documented too ;)

Well the idea was to standardize how we check for a kthreads, as opposed
to improving readability or such. Just a suggestion.

Thanks.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-14 21:15       ` Davidlohr Bueso
@ 2015-02-14 21:54         ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-14 21:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Darren Hart, Peter Zijlstra, Thomas Gleixner, Jerome Marchand,
	Larry Woodman, Mateusz Guzik, linux-kernel

On 02/14, Davidlohr Bueso wrote:
>
> On Sat, 2015-02-14 at 21:57 +0100, Oleg Nesterov wrote:
> > Perhaps... but PF_KTHREAD looks self-documented too ;)
>
> Well the idea was to standardize how we check for a kthreads, as opposed
> to improving readability or such. Just a suggestion.

I understand, and of course I won't argue if you add a helper.

I meant that checking for a kthread is already standardized: flags & PF_KTREAD.
Not sure this needs a helper, but I again I won't argue.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition
  2015-02-02 14:05 [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads Oleg Nesterov
  2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
  2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
@ 2015-02-16 20:13 ` Oleg Nesterov
  2015-02-16 20:13   ` [PATCH 1/1] " Oleg Nesterov
  2 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-16 20:13 UTC (permalink / raw)
  To: Darren Hart, Peter Zijlstra, Thomas Gleixner
  Cc: Jerome Marchand, Larry Woodman, Mateusz Guzik, linux-kernel,
	Alexey Kuznetsov, Pavel Emelyanov

On 02/02, Oleg Nesterov wrote:
>
> And another question. Lets forget about this ->mm check. I simply can not
> understand this
>
>	ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN
>
> logic in attach_to_pi_owner(). First of all, why do we need to retry if
> PF_EXITING is set but PF_EXITPIDONE is not? Why we can not simply ignore
> PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE is not set?
>
> I must have missed something but this looks buggy, I do not see any
> preemption point
  ^^^^^^^^^^

I meant synchronization point, sorry for confusion.

> in this "retry" loop. Suppose that max_cpus=1 and rt_task()
> preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
> forever in this case? (OK, ignoring RT throttling).

Finally I forced myself to try to make the 1st patch ;) To remind, we have
more problems with robust+pi futexes, this needs another patch(es). Otherwise
we could (probably) even kill PF_EXITPIDONE.

Peter. I have no idea how to test it (except it obviously fixes the test-
case I sent before). IOW: please review.

And I still fail to understand why this PF_EXITING logic was added in the
first place. So I also have the problem with the changelog, it merely says
"don't do this because this is not needed".



On top of "check PF_KTHREAD rather than !p->mm to filter out kthreads" but
doesn't depend on it.

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition
  2015-02-16 20:13 ` [PATCH 0/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition Oleg Nesterov
@ 2015-02-16 20:13   ` Oleg Nesterov
  2015-02-27  9:52     ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Nesterov @ 2015-02-16 20:13 UTC (permalink / raw)
  To: Darren Hart, Peter Zijlstra, Thomas Gleixner
  Cc: Jerome Marchand, Larry Woodman, Mateusz Guzik, linux-kernel,
	Alexey Kuznetsov, Pavel Emelyanov

It is absolutely not clear why attach_to_pi_owner() returns -EAGAIN which
triggers "retry" if the lock owner is PF_EXITING but not PF_EXITPIDONE.
This burns CPU for no reason and this can even livelock if the rt_task()
caller preempts a PF_EXITING owner.

Remove the PF_EXITING check altogether. We do not care if it is exiting,
all we need to know is can we rely on exit_pi_state_list() or not. So we
also need to set PF_EXITPIDONE before we flush ->pi_state_list and call
exit_pi_state_list() unconditionally.

Perhaps we can add the fast-path list_empty() check in mm_release() back,
but lets fix the problem first. Besides, in theory this check is already
not correct, at least it should be list_empty_careful() to avoid the race
with free_pi_state() in progress.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c  |   22 +---------------------
 kernel/fork.c  |    3 +--
 kernel/futex.c |   40 ++++++++++------------------------------
 3 files changed, 12 insertions(+), 53 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 6806c55..bc969ed 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -683,27 +683,13 @@ void do_exit(long code)
 	 */
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
-		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
-		 */
+		/* Avoid the new additions to ->pi_state_list at least */
 		tsk->flags |= PF_EXITPIDONE;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
 	}
 
 	exit_signals(tsk);  /* sets PF_EXITING */
-	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
 
 	if (unlikely(in_atomic()))
 		pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -779,12 +765,6 @@ void do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..ec3208e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index b101381..c1104a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
 
 	if (!futex_cmpxchg_enabled)
 		return;
+
 	/*
-	 * We are a ZOMBIE and nobody can enqueue itself on
-	 * pi_state_list anymore, but we have to be careful
-	 * versus waiters unqueueing themselves:
+	 * attach_to_pi_owner() can no longer add the new entry. But
+	 * we have to be careful versus waiters unqueueing themselves.
 	 */
+	curr->flags |= PF_EXITPIDONE;
+
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 
@@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		return -EPERM;
 	}
 
-	/*
-	 * We need to look at the task state flags to figure out,
-	 * whether the task is exiting. To protect against the do_exit
-	 * change of the task flags, we do this protected by
-	 * p->pi_lock:
-	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
-		/*
-		 * The task is on the way out. When PF_EXITPIDONE is
-		 * set, we know that the task has finished the
-		 * cleanup:
-		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
+		/* exit_pi_state_list() was already called */
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
-		return ret;
+		return -ESRCH;
 	}
 
 	/*
@@ -1633,12 +1623,7 @@ retry_private:
 				goto retry;
 			goto out;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Owner is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			free_pi_state(pi_state);
 			pi_state = NULL;
 			double_unlock_hb(hb1, hb2);
@@ -2295,12 +2280,7 @@ retry_private:
 		case -EFAULT:
 			goto uaddr_faulted;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Task is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
-- 
1.5.5.1



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [tip:locking/core] locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads
  2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
  2015-02-04 10:48   ` Peter Zijlstra
  2015-02-14 18:01   ` Davidlohr Bueso
@ 2015-02-18 17:11   ` tip-bot for Oleg Nesterov
  2 siblings, 0 replies; 25+ messages in thread
From: tip-bot for Oleg Nesterov @ 2015-02-18 17:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: oleg, hpa, mguzik, linux-kernel, mingo, peterz, catalin.marinas,
	tglx, jmarchan, dave, darren, lwoodman, torvalds

Commit-ID:  a21294644623ee41034db60e93aaebed4db0e57b
Gitweb:     http://git.kernel.org/tip/a21294644623ee41034db60e93aaebed4db0e57b
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Mon, 2 Feb 2015 15:05:36 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 18 Feb 2015 16:57:09 +0100

locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads

attach_to_pi_owner() checks p->mm to prevent attaching to kthreads and
this looks doubly wrong:

1. It should actually check PF_KTHREAD, kthread can do use_mm().

2. If this task is not kthread and it is actually the lock owner we can
   wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN.

   And note that this wrong EPERM is the likely case unless the exiting
   task is (auto)reaped quickly, we check ->mm before PF_EXITING.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Link: http://lkml.kernel.org/r/20150202140536.GA26406@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/futex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4eeb63d..1f6d646 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -900,7 +900,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	if (!p)
 		return -ESRCH;
 
-	if (!p->mm) {
+	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
 		return -EPERM;
 	}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition
  2015-02-16 20:13   ` [PATCH 1/1] " Oleg Nesterov
@ 2015-02-27  9:52     ` Peter Zijlstra
  2015-02-27 11:54       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-27  9:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel, Alexey Kuznetsov, Pavel Emelyanov

On Mon, Feb 16, 2015 at 09:13:36PM +0100, Oleg Nesterov wrote:

> diff --git a/kernel/futex.c b/kernel/futex.c
> index b101381..c1104a8 100644
> +++ b/kernel/futex.c
> @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
>  
>  	if (!futex_cmpxchg_enabled)
>  		return;
> +
>  	/*
> +	 * attach_to_pi_owner() can no longer add the new entry. But
> +	 * we have to be careful versus waiters unqueueing themselves.
>  	 */
> +	curr->flags |= PF_EXITPIDONE;
> +
>  	raw_spin_lock_irq(&curr->pi_lock);
>  	while (!list_empty(head)) {
>  

Should we not set PF_EXITPIDONE _inside_ the pi_lock? To properly
serialize against the below check?

> @@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
>  		return -EPERM;
>  	}
>  
>  	raw_spin_lock_irq(&p->pi_lock);
> +	if (unlikely(p->flags & PF_EXITPIDONE)) {
> +		/* exit_pi_state_list() was already called */
>  		raw_spin_unlock_irq(&p->pi_lock);
>  		put_task_struct(p);
> +		return -ESRCH;
>  	}
>  
>  	/*



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition
  2015-02-27  9:52     ` Peter Zijlstra
@ 2015-02-27 11:54       ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-02-27 11:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Darren Hart, Thomas Gleixner, Jerome Marchand, Larry Woodman,
	Mateusz Guzik, linux-kernel, Alexey Kuznetsov, Pavel Emelyanov

On Fri, Feb 27, 2015 at 10:52:40AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 09:13:36PM +0100, Oleg Nesterov wrote:
> 
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index b101381..c1104a8 100644
> > +++ b/kernel/futex.c
> > @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
> >  
> >  	if (!futex_cmpxchg_enabled)
> >  		return;
> > +
> >  	/*
> > +	 * attach_to_pi_owner() can no longer add the new entry. But
> > +	 * we have to be careful versus waiters unqueueing themselves.
> >  	 */
> > +	curr->flags |= PF_EXITPIDONE;
> > +
> >  	raw_spin_lock_irq(&curr->pi_lock);
> >  	while (!list_empty(head)) {
> >  
> 
> Should we not set PF_EXITPIDONE _inside_ the pi_lock? To properly
> serialize against the below check?

It does not matter, n/m.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-02-27 11:54 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-02 14:05 [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads Oleg Nesterov
2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
2015-02-04 10:48   ` Peter Zijlstra
2015-02-14 18:01   ` Davidlohr Bueso
2015-02-14 20:57     ` Oleg Nesterov
2015-02-14 21:15       ` Davidlohr Bueso
2015-02-14 21:54         ` Oleg Nesterov
2015-02-18 17:11   ` [tip:locking/core] locking/futex: Check " tip-bot for Oleg Nesterov
2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
2015-02-02 15:13   ` Peter Zijlstra
2015-02-02 15:14     ` Peter Zijlstra
2015-02-02 16:20   ` Oleg Nesterov
2015-02-03 20:09   ` Oleg Nesterov
2015-02-04 11:12     ` Peter Zijlstra
2015-02-04 20:25       ` Oleg Nesterov
2015-02-05 16:27         ` Peter Zijlstra
2015-02-05 18:10           ` Oleg Nesterov
2015-02-06 10:46             ` Peter Zijlstra
2015-02-06 17:04               ` Oleg Nesterov
2015-02-09 20:38                 ` Darren Hart
2015-02-10 11:14                   ` Oleg Nesterov
2015-02-16 20:13 ` [PATCH 0/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition Oleg Nesterov
2015-02-16 20:13   ` [PATCH 1/1] " Oleg Nesterov
2015-02-27  9:52     ` Peter Zijlstra
2015-02-27 11:54       ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).