LKML Archive on lore.kernel.org
* [PATCH v2 0/4] irq_work: PREEMPT_RT bits
From: Sebastian Andrzej Siewior @ 2021-10-06 11:18 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner

This is v2 of the series. v1 can be found at
   https://lkml.kernel.org/r/20210927211919.310855-1-bigeasy@linutronix.de

Changelog:
v1…v2:
  - Drop "irq_work: Ensure that irq_work runs in in-IRQ context". This
    triggers if a CPU goes down.

  - #3 now uses per-CPU threads with the lowest RT priority instead of
    the timer softirq.

These are the PREEMPT_RT bits for irq_work. The IRQ_WORK_HARD_IRQ
annotations were already merged, except for one missed annotation.
Patch #3 introduces the required processing split between callbacks
which are really invoked from hard-irq context and those which are
processed in a per-CPU thread. Unless an item is marked "LAZY", the
per-CPU thread is woken from the irq_work hardirq.

This split is needed because quite a few of the callbacks acquire locks
which sleep on PREEMPT_RT and therefore must not be invoked in hardirq
context. We can not delay all of them since a few need to be invoked in
hardirq context in order to work properly (NOHZ, scheduler, …).

Sebastian




* [PATCH v2 1/4] sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
From: Sebastian Andrzej Siewior @ 2021-10-06 11:18 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira

The push-IPI logic for RT tasks expects to be invoked from hardirq
context. One reason is that an RT task on the remote CPU would block
softirq processing on PREEMPT_RT and so prevent the pulling / balancing
of the RT tasks as intended.

Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4e8698e62f075..3d0157bd4e144 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -526,7 +526,7 @@ static int init_rootdomain(struct root_domain *rd)
 #ifdef HAVE_RT_PUSH_IPI
 	rd->rto_cpu = -1;
 	raw_spin_lock_init(&rd->rto_lock);
-	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+	rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);
 #endif
 
 	rd->visit_gen = 0;
-- 
2.33.0



* [PATCH v2 2/4] irq_work: Allow irq_work_sync() to sleep if irq_work() has no IRQ support.
From: Sebastian Andrzej Siewior @ 2021-10-06 11:18 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

Raising irq_work triggers an interrupt instantly if the architecture
supports it. Otherwise the work will be processed on the next timer
tick. In the worst case irq_work_sync() could spin for up to a jiffy.

irq_work_sync() is usually used in tear-down context, which is fully
preemptible. Based on code review, irq_work_sync() is invoked from
preemptible context and there is at most one waiter at a time. This
qualifies it to use rcuwait for synchronisation.

Let irq_work_sync() synchronize with rcuwait if the architecture
processes irqwork via the timer tick.
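
As a rough sketch of the rcuwait pairing this relies on (illustrative
only, not part of the patch; "condition" is a placeholder and is
!irq_work_is_busy(work) in the hunks below):

	struct rcuwait irqwait;

	rcuwait_init(&irqwait);

	/* Waiter: preemptible context, at most one waiter at a time. */
	rcuwait_wait_event(&irqwait, condition, TASK_UNINTERRUPTIBLE);

	/* Waker: safe to call from IRQ context once the condition holds. */
	rcuwait_wake_up(&irqwait);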

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irq_work.h |  3 +++
 kernel/irq_work.c        | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index ec2a47a81e423..b48955e9c920e 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,7 @@
 #define _LINUX_IRQ_WORK_H
 
 #include <linux/smp_types.h>
+#include <linux/rcuwait.h>
 
 /*
  * An entry can be in one of four states:
@@ -16,11 +17,13 @@
 struct irq_work {
 	struct __call_single_node node;
 	void (*func)(struct irq_work *);
+	struct rcuwait irqwait;
 };
 
 #define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){	\
 	.node = { .u_flags = (_flags), },			\
 	.func = (_func),					\
+	.irqwait = __RCUWAIT_INITIALIZER(irqwait),		\
 }
 
 #define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index db8c248ebc8c8..e789beda8297d 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -160,6 +160,9 @@ void irq_work_single(void *arg)
 	 * else claimed it meanwhile.
 	 */
 	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
+
+	if (!arch_irq_work_has_interrupt())
+		rcuwait_wake_up(&work->irqwait);
 }
 
 static void irq_work_run_list(struct llist_head *list)
@@ -204,6 +207,13 @@ void irq_work_tick(void)
 void irq_work_sync(struct irq_work *work)
 {
 	lockdep_assert_irqs_enabled();
+	might_sleep();
+
+	if (!arch_irq_work_has_interrupt()) {
+		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
+				   TASK_UNINTERRUPTIBLE);
+		return;
+	}
 
 	while (irq_work_is_busy(work))
 		cpu_relax();
-- 
2.33.0



* [PATCH v2 3/4] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-10-06 11:18 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

The irq_work callback is invoked in hard IRQ context. By default all
callbacks are scheduled for invocation right away (if supported by the
architecture) except for the ones marked IRQ_WORK_LAZY, which are
delayed until the next timer tick.

A review of the callbacks shows that some of them acquire locks
(spinlock_t, rwlock_t) which are transformed into sleeping locks on
PREEMPT_RT and must not be acquired in hard IRQ context.
Changing these into locks which can be acquired in this context would
lead to other problems, such as increased latencies if everything in
the chain uses IRQ-off locks. It would also not solve all the issues:
one callback has been observed to invoke kref_put(), whose release
callback invokes kfree(), which can not be invoked in hardirq context.

Some callbacks are required to be invoked in hardirq context even on
PREEMPT_RT in order to work properly. This includes, for instance, the
NO_HZ callback, which needs to be able to observe the idle context.

The callbacks which are required to run in hardirq context have already
been marked. Use this information to split the callbacks onto two lists
on PREEMPT_RT:
- lazy_list
  Work items which are not marked with IRQ_WORK_HARD_IRQ will be added
  to this list. Callbacks on this list will be invoked from a per-CPU
  thread.
  The handler here may acquire sleeping locks such as spinlock_t and
  invoke kfree().

- raised_list
  Work items which are marked with IRQ_WORK_HARD_IRQ will be added to
  this list. They will be invoked in hardirq context and must not
  acquire any sleeping locks.

The wake-up of the per-CPU thread occurs from the irq_work handler in
hardirq context. The thread runs with the lowest RT priority to ensure
it runs before any SCHED_OTHER tasks do.
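
To illustrate the resulting split, a minimal sketch (not part of the
patch; the callback and function names are made up):

	static void needs_hardirq_cb(struct irq_work *w) { }
	static void may_sleep_cb(struct irq_work *w) { }

	static struct irq_work hard_work = IRQ_WORK_INIT_HARD(needs_hardirq_cb);
	static struct irq_work soft_work = IRQ_WORK_INIT(may_sleep_cb);

	static void example(void)
	{
		/* raised_list: invoked from hardirq on all configurations. */
		irq_work_queue(&hard_work);

		/*
		 * On PREEMPT_RT: lazy_list, invoked by the irq_work/N thread
		 * in preemptible context; may acquire spinlock_t and call
		 * kfree().
		 */
		irq_work_queue(&soft_work);
	}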

[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a
	  hard and soft variant. Collected fixes over time from Steven
	  Rostedt and Mike Galbraith. Move to per-CPU threads instead of
	  softirq as suggested by PeterZ.]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/irq_work.c | 117 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 104 insertions(+), 13 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index e789beda8297d..daa5d12522faa 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,11 +18,34 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/smpboot.h>
 #include <asm/processor.h>
 #include <linux/kasan.h>
 
 static DEFINE_PER_CPU(struct llist_head, raised_list);
 static DEFINE_PER_CPU(struct llist_head, lazy_list);
+static DEFINE_PER_CPU(struct task_struct *, irq_workd);
+
+static void wake_irq_workd(void)
+{
+	struct task_struct *tsk = __this_cpu_read(irq_workd);
+
+	if (!llist_empty(this_cpu_ptr(&lazy_list)) && tsk)
+		wake_up_process(tsk);
+}
+
+static void irq_work_wake(struct irq_work *entry)
+{
+	wake_irq_workd();
+}
+
+static DEFINE_PER_CPU(struct irq_work, irq_work_pending) =
+	IRQ_WORK_INIT_HARD(irq_work_wake);
+
+static int irq_workd_should_run(unsigned int cpu)
+{
+	return !llist_empty(this_cpu_ptr(&lazy_list));
+}
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -52,15 +75,29 @@ void __weak arch_irq_work_raise(void)
 /* Enqueue on current CPU, work must already be claimed and preempt disabled */
 static void __irq_work_queue_local(struct irq_work *work)
 {
+	struct llist_head *list;
+	bool rt_lazy_work = false;
+	bool lazy_work = false;
+	int work_flags;
+
+	work_flags = atomic_read(&work->node.a_flags);
+	if (work_flags & IRQ_WORK_LAZY)
+		lazy_work = true;
+	else if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		 !(work_flags & IRQ_WORK_HARD_IRQ))
+		rt_lazy_work = true;
+
+	if (lazy_work || rt_lazy_work)
+		list = this_cpu_ptr(&lazy_list);
+	else
+		list = this_cpu_ptr(&raised_list);
+
+	if (!llist_add(&work->node.llist, list))
+		return;
+
 	/* If the work is "lazy", handle it from next tick if any */
-	if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
-		    tick_nohz_tick_stopped())
-			arch_irq_work_raise();
-	} else {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
-			arch_irq_work_raise();
-	}
+	if (!lazy_work || tick_nohz_tick_stopped())
+		arch_irq_work_raise();
 }
 
 /* Enqueue the irq work @work on the current CPU */
@@ -104,7 +141,24 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		__smp_call_single_queue(cpu, &work->node.llist);
+
+		/*
+		 * On PREEMPT_RT the items which are not marked as
+		 * IRQ_WORK_HARD_IRQ are added to the lazy list and a HARD work
+		 * item is used on the remote CPU to wake the thread.
+		 */
+		if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ) &&
+		    llist_add(&work->node.llist, &per_cpu(lazy_list, cpu))) {
+			struct irq_work *wake_work;
+
+			wake_work = &per_cpu(irq_work_pending, cpu);
+			if (irq_work_claim(wake_work))
+				__smp_call_single_queue(cpu,
+							&wake_work->node.llist);
+		} else {
+			__smp_call_single_queue(cpu, &work->node.llist);
+		}
 	} else {
 		__irq_work_queue_local(work);
 	}
@@ -114,7 +168,6 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 #endif /* CONFIG_SMP */
 }
 
-
 bool irq_work_needs_cpu(void)
 {
 	struct llist_head *raised, *lazy;
@@ -170,7 +223,12 @@ static void irq_work_run_list(struct llist_head *list)
 	struct irq_work *work, *tmp;
 	struct llist_node *llnode;
 
-	BUG_ON(!irqs_disabled());
+	/*
+	 * On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
+	 * in a per-CPU thread in preemptible context. Only the items which are
+	 * marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
+	 */
+	BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT));
 
 	if (llist_empty(list))
 		return;
@@ -187,7 +245,10 @@ static void irq_work_run_list(struct llist_head *list)
 void irq_work_run(void)
 {
 	irq_work_run_list(this_cpu_ptr(&raised_list));
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else
+		wake_irq_workd();
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
 
@@ -197,7 +258,11 @@ void irq_work_tick(void)
 
 	if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
 		irq_work_run_list(raised);
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else
+		wake_irq_workd();
 }
 
 /*
@@ -219,3 +284,29 @@ void irq_work_sync(struct irq_work *work)
 		cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
+
+static void run_irq_workd(unsigned int cpu)
+{
+	irq_work_run_list(this_cpu_ptr(&lazy_list));
+}
+
+static void irq_workd_setup(unsigned int cpu)
+{
+	sched_set_fifo_low(current);
+}
+
+static struct smp_hotplug_thread irqwork_threads = {
+	.store                  = &irq_workd,
+	.setup			= irq_workd_setup,
+	.thread_should_run      = irq_workd_should_run,
+	.thread_fn              = run_irq_workd,
+	.thread_comm            = "irq_work/%u",
+};
+
+static __init int irq_work_init_threads(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		BUG_ON(smpboot_register_percpu_thread(&irqwork_threads));
+	return 0;
+}
+early_initcall(irq_work_init_threads);
-- 
2.33.0



* [PATCH v2 4/4] irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-10-06 11:18 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior

On PREEMPT_RT most irq_work items which are not marked IRQ_WORK_HARD_IRQ
are processed in a per-CPU thread in preemptible context. Avoid
spin-waiting for them: the task spinning in irq_work_sync() could run at
a higher priority than the irq_work/N thread and so never allow the
irq_work to complete.

Additionally, use rcuwait for !IRQ_WORK_HARD_IRQ irq_work items on
PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irq_work.h | 5 +++++
 kernel/irq_work.c        | 6 ++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index b48955e9c920e..8cd11a2232605 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -49,6 +49,11 @@ static inline bool irq_work_is_busy(struct irq_work *work)
 	return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
 }
 
+static inline bool irq_work_is_hard(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ;
+}
+
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
 
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index daa5d12522faa..1833f9e23f224 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -214,7 +214,8 @@ void irq_work_single(void *arg)
 	 */
 	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
 
-	if (!arch_irq_work_has_interrupt())
+	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
+	    !arch_irq_work_has_interrupt())
 		rcuwait_wake_up(&work->irqwait);
 }
 
@@ -274,7 +275,8 @@ void irq_work_sync(struct irq_work *work)
 	lockdep_assert_irqs_enabled();
 	might_sleep();
 
-	if (!arch_irq_work_has_interrupt()) {
+	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
+	    !arch_irq_work_has_interrupt()) {
 		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
 				   TASK_UNINTERRUPTIBLE);
 		return;
-- 
2.33.0



* Re: [PATCH v2 3/4] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-10-06 11:35 UTC
  To: linux-kernel; +Cc: Peter Zijlstra, Thomas Gleixner

On 2021-10-06 13:18:51 [+0200], To linux-kernel@vger.kernel.org wrote:
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index e789beda8297d..daa5d12522faa 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -18,11 +18,34 @@
>  #include <linux/cpu.h>
>  #include <linux/notifier.h>
>  #include <linux/smp.h>
> +#include <linux/smpboot.h>
>  #include <asm/processor.h>
>  #include <linux/kasan.h>
>  
>  static DEFINE_PER_CPU(struct llist_head, raised_list);
>  static DEFINE_PER_CPU(struct llist_head, lazy_list);
> +static DEFINE_PER_CPU(struct task_struct *, irq_workd);
> +
> +static void wake_irq_workd(void)
> +{
> +	struct task_struct *tsk = __this_cpu_read(irq_workd);
> +
> +	if (!llist_empty(this_cpu_ptr(&lazy_list)) && tsk)
> +		wake_up_process(tsk);
> +}

#ifdef CONFIG_SMP

> +static void irq_work_wake(struct irq_work *entry)
> +{
> +	wake_irq_workd();
> +}
> +
> +static DEFINE_PER_CPU(struct irq_work, irq_work_pending) =
> +	IRQ_WORK_INIT_HARD(irq_work_wake);

#endif

> +static int irq_workd_should_run(unsigned int cpu)
> +{
> +	return !llist_empty(this_cpu_ptr(&lazy_list));
> +}
>  
>  /*
>   * Claim the entry so that no one else will poke at it.

Sebastian


* Re: [PATCH v2 3/4] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: kernel test robot @ 2021-10-06 16:26 UTC
  To: Sebastian Andrzej Siewior, linux-kernel
  Cc: kbuild-all, Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior


Hi Sebastian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/master]
[also build test ERROR on linux/master tip/sched/core linus/master v5.15-rc3 next-20210922]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Sebastian-Andrzej-Siewior/irq_work-PREEMPT_RT-bits/20211006-192111
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git d4bfebd9ef497ee0afb498f6028a5074a6ccf307
config: powerpc-randconfig-r025-20211004 (attached as .config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/0906cdcac588712393e29dd4ce2be2057393c561
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Sebastian-Andrzej-Siewior/irq_work-PREEMPT_RT-bits/20211006-192111
        git checkout 0906cdcac588712393e29dd4ce2be2057393c561
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> powerpc-linux-ld: kernel/irq_work.o:(.discard+0x0): multiple definition of `__pcpu_unique_irq_work_pending'; arch/powerpc/kernel/time.o:(.discard+0x0): first defined here
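
(For context, and from memory rather than from this thread: powerpc
defines its own per-CPU variable of that name, roughly

	/* arch/powerpc/kernel/time.c */
	DEFINE_PER_CPU(u8, irq_work_pending);

and with CONFIG_DEBUG_FORCE_WEAK_PER_CPU, presumably set in this
randconfig, DEFINE_PER_CPU() emits a unique __pcpu_unique_<name>
symbol per variable name, hence the link-time clash. This is what
motivates the rename later in the thread.)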

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH v2 3/4] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: Peter Zijlstra @ 2021-10-07  8:50 UTC
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel, Thomas Gleixner

On Wed, Oct 06, 2021 at 01:18:51PM +0200, Sebastian Andrzej Siewior wrote:
> @@ -104,7 +141,24 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
>  	if (cpu != smp_processor_id()) {
>  		/* Arch remote IPI send/receive backend aren't NMI safe */
>  		WARN_ON_ONCE(in_nmi());
> -		__smp_call_single_queue(cpu, &work->node.llist);
> +
> +		/*
> +		 * On PREEMPT_RT the items which are not marked as
> +		 * IRQ_WORK_HARD_IRQ are added to the lazy list and a HARD work
> +		 * item is used on the remote CPU to wake the thread.
> +		 */
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ) &&
> +		    llist_add(&work->node.llist, &per_cpu(lazy_list, cpu))) {

So if the llist is not empty, we'll fail here and go to the else branch

> +			struct irq_work *wake_work;
> +
> +			wake_work = &per_cpu(irq_work_pending, cpu);
> +			if (irq_work_claim(wake_work))
> +				__smp_call_single_queue(cpu,
> +							&wake_work->node.llist);
> +		} else {

And do this,.. that seems wrong.

> +			__smp_call_single_queue(cpu, &work->node.llist);
> +		}
>  	} else {
>  		__irq_work_queue_local(work);
>  	}

How's this instead?

(work rename due to there already being a percpu variable of that same
name on a number of archs, per 0day)


Index: linux-2.6/kernel/irq_work.c
===================================================================
--- linux-2.6.orig/kernel/irq_work.c
+++ linux-2.6/kernel/irq_work.c
@@ -39,7 +39,7 @@ static void irq_work_wake(struct irq_wor
 	wake_irq_workd();
 }
 
-static DEFINE_PER_CPU(struct irq_work, irq_work_pending) =
+static DEFINE_PER_CPU(struct irq_work, irq_work_wakeup) =
 	IRQ_WORK_INIT_HARD(irq_work_wake);
 
 static int irq_workd_should_run(unsigned int cpu)
@@ -148,20 +148,21 @@ bool irq_work_queue_on(struct irq_work *
 		 * item is used on the remote CPU to wake the thread.
 		 */
 		if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
-		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ) &&
-		    llist_add(&work->node.llist, &per_cpu(lazy_list, cpu))) {
-			struct irq_work *wake_work;
-
-			wake_work = &per_cpu(irq_work_pending, cpu);
-			if (irq_work_claim(wake_work))
-				__smp_call_single_queue(cpu,
-							&wake_work->node.llist);
-		} else {
-			__smp_call_single_queue(cpu, &work->node.llist);
+		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
+
+			if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
+				goto out;
+
+			work = &per_cpu(irq_work_wakeup, cpu);
+			if (!irq_work_claim(work))
+				goto out;
 		}
+
+		__smp_call_single_queue(cpu, &work->node.llist);
 	} else {
 		__irq_work_queue_local(work);
 	}
+out:
 	preempt_enable();
 
 	return true;


* Re: [PATCH v2 3/4] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-10-07  9:08 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, Thomas Gleixner

On 2021-10-07 10:50:23 [+0200], Peter Zijlstra wrote:
> How's this instead?

Yeah. The last minute move-up of the llist_add() was…

> (work rename due to there already being a percpu variable of that same
> name on a number or archs, per 0day)

Let me grab that, add the #ifdef CONFIG_SMP and repost 3/4.

Sebastian


* [PATCH v3 3/4] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2021-10-07  9:26 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, Thomas Gleixner

The irq_work callback is invoked in hard IRQ context. By default all
callbacks are scheduled for invocation right away (if supported by the
architecture) except for the ones marked IRQ_WORK_LAZY, which are
delayed until the next timer tick.

A review of the callbacks shows that some of them acquire locks
(spinlock_t, rwlock_t) which are transformed into sleeping locks on
PREEMPT_RT and must not be acquired in hard IRQ context.
Changing these into locks which can be acquired in this context would
lead to other problems, such as increased latencies if everything in
the chain uses IRQ-off locks. It would also not solve all the issues:
one callback has been observed to invoke kref_put(), whose release
callback invokes kfree(), which can not be invoked in hardirq context.

Some callbacks are required to be invoked in hardirq context even on
PREEMPT_RT in order to work properly. This includes, for instance, the
NO_HZ callback, which needs to be able to observe the idle context.

The callbacks which are required to run in hardirq context have already
been marked. Use this information to split the callbacks onto two lists
on PREEMPT_RT:
- lazy_list
  Work items which are not marked with IRQ_WORK_HARD_IRQ will be added
  to this list. Callbacks on this list will be invoked from a per-CPU
  thread.
  The handler here may acquire sleeping locks such as spinlock_t and
  invoke kfree().

- raised_list
  Work items which are marked with IRQ_WORK_HARD_IRQ will be added to
  this list. They will be invoked in hardirq context and must not
  acquire any sleeping locks.

The wake-up of the per-CPU thread occurs from the irq_work handler in
hardirq context. The thread runs with the lowest RT priority to ensure
it runs before any SCHED_OTHER tasks do.

[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a
	  hard and soft variant. Collected fixes over time from Steven
	  Rostedt and Mike Galbraith. Move to per-CPU threads instead of
	  softirq as suggested by PeterZ.]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/irq_work.c | 118 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 106 insertions(+), 12 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index e789beda8297d..90b6b56f92e95 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,11 +18,36 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/smpboot.h>
 #include <asm/processor.h>
 #include <linux/kasan.h>
 
 static DEFINE_PER_CPU(struct llist_head, raised_list);
 static DEFINE_PER_CPU(struct llist_head, lazy_list);
+static DEFINE_PER_CPU(struct task_struct *, irq_workd);
+
+static void wake_irq_workd(void)
+{
+	struct task_struct *tsk = __this_cpu_read(irq_workd);
+
+	if (!llist_empty(this_cpu_ptr(&lazy_list)) && tsk)
+		wake_up_process(tsk);
+}
+
+#ifdef CONFIG_SMP
+static void irq_work_wake(struct irq_work *entry)
+{
+	wake_irq_workd();
+}
+
+static DEFINE_PER_CPU(struct irq_work, irq_work_wakeup) =
+	IRQ_WORK_INIT_HARD(irq_work_wake);
+#endif
+
+static int irq_workd_should_run(unsigned int cpu)
+{
+	return !llist_empty(this_cpu_ptr(&lazy_list));
+}
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -52,15 +77,29 @@ void __weak arch_irq_work_raise(void)
 /* Enqueue on current CPU, work must already be claimed and preempt disabled */
 static void __irq_work_queue_local(struct irq_work *work)
 {
+	struct llist_head *list;
+	bool rt_lazy_work = false;
+	bool lazy_work = false;
+	int work_flags;
+
+	work_flags = atomic_read(&work->node.a_flags);
+	if (work_flags & IRQ_WORK_LAZY)
+		lazy_work = true;
+	else if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		 !(work_flags & IRQ_WORK_HARD_IRQ))
+		rt_lazy_work = true;
+
+	if (lazy_work || rt_lazy_work)
+		list = this_cpu_ptr(&lazy_list);
+	else
+		list = this_cpu_ptr(&raised_list);
+
+	if (!llist_add(&work->node.llist, list))
+		return;
+
 	/* If the work is "lazy", handle it from next tick if any */
-	if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
-		    tick_nohz_tick_stopped())
-			arch_irq_work_raise();
-	} else {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
-			arch_irq_work_raise();
-	}
+	if (!lazy_work || tick_nohz_tick_stopped())
+		arch_irq_work_raise();
 }
 
 /* Enqueue the irq work @work on the current CPU */
@@ -104,17 +143,34 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
+
+		/*
+		 * On PREEMPT_RT the items which are not marked as
+		 * IRQ_WORK_HARD_IRQ are added to the lazy list and a HARD work
+		 * item is used on the remote CPU to wake the thread.
+		 */
+		if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
+
+			if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
+				goto out;
+
+			work = &per_cpu(irq_work_wakeup, cpu);
+			if (!irq_work_claim(work))
+				goto out;
+		}
+
 		__smp_call_single_queue(cpu, &work->node.llist);
 	} else {
 		__irq_work_queue_local(work);
 	}
+out:
 	preempt_enable();
 
 	return true;
 #endif /* CONFIG_SMP */
 }
 
-
 bool irq_work_needs_cpu(void)
 {
 	struct llist_head *raised, *lazy;
@@ -170,7 +226,12 @@ static void irq_work_run_list(struct llist_head *list)
 	struct irq_work *work, *tmp;
 	struct llist_node *llnode;
 
-	BUG_ON(!irqs_disabled());
+	/*
+	 * On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
+	 * in a per-CPU thread in preemptible context. Only the items which are
+	 * marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
+	 */
+	BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT));
 
 	if (llist_empty(list))
 		return;
@@ -187,7 +248,10 @@ static void irq_work_run_list(struct llist_head *list)
 void irq_work_run(void)
 {
 	irq_work_run_list(this_cpu_ptr(&raised_list));
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else
+		wake_irq_workd();
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
 
@@ -197,7 +261,11 @@ void irq_work_tick(void)
 
 	if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
 		irq_work_run_list(raised);
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else
+		wake_irq_workd();
 }
 
 /*
@@ -219,3 +287,29 @@ void irq_work_sync(struct irq_work *work)
 		cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
+
+static void run_irq_workd(unsigned int cpu)
+{
+	irq_work_run_list(this_cpu_ptr(&lazy_list));
+}
+
+static void irq_workd_setup(unsigned int cpu)
+{
+	sched_set_fifo_low(current);
+}
+
+static struct smp_hotplug_thread irqwork_threads = {
+	.store                  = &irq_workd,
+	.setup			= irq_workd_setup,
+	.thread_should_run      = irq_workd_should_run,
+	.thread_fn              = run_irq_workd,
+	.thread_comm            = "irq_work/%u",
+};
+
+static __init int irq_work_init_threads(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		BUG_ON(smpboot_register_percpu_thread(&irqwork_threads));
+	return 0;
+}
+early_initcall(irq_work_init_threads);
-- 
2.33.0



* Re: [PATCH v2 1/4] sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
From: Steven Rostedt @ 2021-10-13 22:13 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira

On Wed,  6 Oct 2021 13:18:49 +0200
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> The push-IPI logic for RT tasks expects to be invoked from hardirq
> context. One reason is that an RT task on the remote CPU would block
> softirq processing on PREEMPT_RT and so prevent the pulling / balancing
> of the RT tasks as intended.
> 
> Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/sched/topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 4e8698e62f075..3d0157bd4e144 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -526,7 +526,7 @@ static int init_rootdomain(struct root_domain *rd)
>  #ifdef HAVE_RT_PUSH_IPI
>  	rd->rto_cpu = -1;
>  	raw_spin_lock_init(&rd->rto_lock);
> -	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
> +	rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);

Should we not make an "init_irq_work_hard()" helper?
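
(A minimal sketch of such a helper, assuming it would simply mirror the
existing init_irq_work() wrapper around IRQ_WORK_INIT(); illustrative
only, not code from this thread:)

	static inline void
	init_irq_work_hard(struct irq_work *work, void (*func)(struct irq_work *))
	{
		*work = IRQ_WORK_INIT_HARD(func);
	}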

-- Steve

>  #endif
>  
>  	rd->visit_gen = 0;



* [tip: sched/core] irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
From: tip-bot2 for Sebastian Andrzej Siewior @ 2021-10-15  9:44 UTC
  To: linux-tip-commits
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     09089db79859cbccccd8df95b034f36f7027efa6
Gitweb:        https://git.kernel.org/tip/09089db79859cbccccd8df95b034f36f7027efa6
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Wed, 06 Oct 2021 13:18:52 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 15 Oct 2021 11:25:18 +02:00

irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT

On PREEMPT_RT most irq_work items which are not marked IRQ_WORK_HARD_IRQ
are processed in a per-CPU thread in preemptible context. Avoid
spin-waiting for them: the task spinning in irq_work_sync() could run at
a higher priority than the irq_work/N thread and so never allow the
irq_work to complete.

Additionally, use rcuwait for !IRQ_WORK_HARD_IRQ irq_work items on
PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-5-bigeasy@linutronix.de
---
 include/linux/irq_work.h | 5 +++++
 kernel/irq_work.c        | 6 ++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index b48955e..8cd11a2 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -49,6 +49,11 @@ static inline bool irq_work_is_busy(struct irq_work *work)
 	return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
 }
 
+static inline bool irq_work_is_hard(struct irq_work *work)
+{
+	return atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ;
+}
+
 bool irq_work_queue(struct irq_work *work);
 bool irq_work_queue_on(struct irq_work *work, int cpu);
 
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 90b6b56..f7df715 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -217,7 +217,8 @@ void irq_work_single(void *arg)
 	 */
 	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
 
-	if (!arch_irq_work_has_interrupt())
+	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
+	    !arch_irq_work_has_interrupt())
 		rcuwait_wake_up(&work->irqwait);
 }
 
@@ -277,7 +278,8 @@ void irq_work_sync(struct irq_work *work)
 	lockdep_assert_irqs_enabled();
 	might_sleep();
 
-	if (!arch_irq_work_has_interrupt()) {
+	if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
+	    !arch_irq_work_has_interrupt()) {
 		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
 				   TASK_UNINTERRUPTIBLE);
 		return;


* [tip: sched/core] irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
From: tip-bot2 for Sebastian Andrzej Siewior @ 2021-10-15  9:44 UTC
  To: linux-tip-commits
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b4c6f86ec2f648b5e6d4b04564fbc6d5351160a8
Gitweb:        https://git.kernel.org/tip/b4c6f86ec2f648b5e6d4b04564fbc6d5351160a8
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Thu, 07 Oct 2021 11:26:46 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 15 Oct 2021 11:25:17 +02:00

irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT

The irq_work callback is invoked in hard IRQ context. By default all
callbacks are scheduled for invocation right away (if supported by the
architecture) except for the ones marked IRQ_WORK_LAZY, which are
delayed until the next timer tick.

A review of the callbacks shows that some of them acquire locks
(spinlock_t, rwlock_t) which are transformed into sleeping locks on
PREEMPT_RT and must not be acquired in hard IRQ context.
Changing these into locks which can be acquired in this context would
lead to other problems, such as increased latencies if everything in
the chain uses IRQ-off locks. It would also not solve all the issues:
one callback has been observed to invoke kref_put(), whose release
callback invokes kfree(), which can not be invoked in hardirq context.

Some callbacks are required to be invoked in hardirq context even on
PREEMPT_RT in order to work properly. This includes, for instance, the
NO_HZ callback, which needs to be able to observe the idle context.

The callbacks which are required to run in hardirq context have already
been marked. Use this information to split the callbacks onto two lists
on PREEMPT_RT:
- lazy_list
  Work items which are not marked with IRQ_WORK_HARD_IRQ will be added
  to this list. Callbacks on this list will be invoked from a per-CPU
  thread.
  The handler here may acquire sleeping locks such as spinlock_t and
  invoke kfree().

- raised_list
  Work items which are marked with IRQ_WORK_HARD_IRQ will be added to
  this list. They will be invoked in hardirq context and must not
  acquire any sleeping locks.

The wake-up of the per-CPU thread occurs from the irq_work handler in
hardirq context. The thread runs with the lowest RT priority to ensure
it runs before any SCHED_OTHER tasks do.

[bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a
	  hard and soft variant. Collected fixes over time from Steven
	  Rostedt and Mike Galbraith. Move to per-CPU threads instead of
	  softirq as suggested by PeterZ.]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211007092646.uhshe3ut2wkrcfzv@linutronix.de
---
 kernel/irq_work.c | 118 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 106 insertions(+), 12 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index e789bed..90b6b56 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -18,11 +18,36 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/smp.h>
+#include <linux/smpboot.h>
 #include <asm/processor.h>
 #include <linux/kasan.h>
 
 static DEFINE_PER_CPU(struct llist_head, raised_list);
 static DEFINE_PER_CPU(struct llist_head, lazy_list);
+static DEFINE_PER_CPU(struct task_struct *, irq_workd);
+
+static void wake_irq_workd(void)
+{
+	struct task_struct *tsk = __this_cpu_read(irq_workd);
+
+	if (!llist_empty(this_cpu_ptr(&lazy_list)) && tsk)
+		wake_up_process(tsk);
+}
+
+#ifdef CONFIG_SMP
+static void irq_work_wake(struct irq_work *entry)
+{
+	wake_irq_workd();
+}
+
+static DEFINE_PER_CPU(struct irq_work, irq_work_wakeup) =
+	IRQ_WORK_INIT_HARD(irq_work_wake);
+#endif
+
+static int irq_workd_should_run(unsigned int cpu)
+{
+	return !llist_empty(this_cpu_ptr(&lazy_list));
+}
 
 /*
  * Claim the entry so that no one else will poke at it.
@@ -52,15 +77,29 @@ void __weak arch_irq_work_raise(void)
 /* Enqueue on current CPU, work must already be claimed and preempt disabled */
 static void __irq_work_queue_local(struct irq_work *work)
 {
+	struct llist_head *list;
+	bool rt_lazy_work = false;
+	bool lazy_work = false;
+	int work_flags;
+
+	work_flags = atomic_read(&work->node.a_flags);
+	if (work_flags & IRQ_WORK_LAZY)
+		lazy_work = true;
+	else if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		 !(work_flags & IRQ_WORK_HARD_IRQ))
+		rt_lazy_work = true;
+
+	if (lazy_work || rt_lazy_work)
+		list = this_cpu_ptr(&lazy_list);
+	else
+		list = this_cpu_ptr(&raised_list);
+
+	if (!llist_add(&work->node.llist, list))
+		return;
+
 	/* If the work is "lazy", handle it from next tick if any */
-	if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
-		    tick_nohz_tick_stopped())
-			arch_irq_work_raise();
-	} else {
-		if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
-			arch_irq_work_raise();
-	}
+	if (!lazy_work || tick_nohz_tick_stopped())
+		arch_irq_work_raise();
 }
 
 /* Enqueue the irq work @work on the current CPU */
@@ -104,17 +143,34 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
+
+		/*
+		 * On PREEMPT_RT the items which are not marked as
+		 * IRQ_WORK_HARD_IRQ are added to the lazy list and a HARD work
+		 * item is used on the remote CPU to wake the thread.
+		 */
+		if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+		    !(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
+
+			if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
+				goto out;
+
+			work = &per_cpu(irq_work_wakeup, cpu);
+			if (!irq_work_claim(work))
+				goto out;
+		}
+
 		__smp_call_single_queue(cpu, &work->node.llist);
 	} else {
 		__irq_work_queue_local(work);
 	}
+out:
 	preempt_enable();
 
 	return true;
 #endif /* CONFIG_SMP */
 }
 
-
 bool irq_work_needs_cpu(void)
 {
 	struct llist_head *raised, *lazy;
@@ -170,7 +226,12 @@ static void irq_work_run_list(struct llist_head *list)
 	struct irq_work *work, *tmp;
 	struct llist_node *llnode;
 
-	BUG_ON(!irqs_disabled());
+	/*
+	 * On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
+	 * in a per-CPU thread in preemptible context. Only the items which are
+	 * marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
+	 */
+	BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT));
 
 	if (llist_empty(list))
 		return;
@@ -187,7 +248,10 @@ static void irq_work_run_list(struct llist_head *list)
 void irq_work_run(void)
 {
 	irq_work_run_list(this_cpu_ptr(&raised_list));
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else
+		wake_irq_workd();
 }
 EXPORT_SYMBOL_GPL(irq_work_run);
 
@@ -197,7 +261,11 @@ void irq_work_tick(void)
 
 	if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
 		irq_work_run_list(raised);
-	irq_work_run_list(this_cpu_ptr(&lazy_list));
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		irq_work_run_list(this_cpu_ptr(&lazy_list));
+	else
+		wake_irq_workd();
 }
 
 /*
@@ -219,3 +287,29 @@ void irq_work_sync(struct irq_work *work)
 		cpu_relax();
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
+
+static void run_irq_workd(unsigned int cpu)
+{
+	irq_work_run_list(this_cpu_ptr(&lazy_list));
+}
+
+static void irq_workd_setup(unsigned int cpu)
+{
+	sched_set_fifo_low(current);
+}
+
+static struct smp_hotplug_thread irqwork_threads = {
+	.store                  = &irq_workd,
+	.setup			= irq_workd_setup,
+	.thread_should_run      = irq_workd_should_run,
+	.thread_fn              = run_irq_workd,
+	.thread_comm            = "irq_work/%u",
+};
+
+static __init int irq_work_init_threads(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		BUG_ON(smpboot_register_percpu_thread(&irqwork_threads));
+	return 0;
+}
+early_initcall(irq_work_init_threads);


* [tip: sched/core] irq_work: Allow irq_work_sync() to sleep if irq_work() has no IRQ support.
From: tip-bot2 for Sebastian Andrzej Siewior @ 2021-10-15  9:44 UTC
  To: linux-tip-commits
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     810979682ccc98dbd83f341c18a2e556c30a7164
Gitweb:        https://git.kernel.org/tip/810979682ccc98dbd83f341c18a2e556c30a7164
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Wed, 06 Oct 2021 13:18:50 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 15 Oct 2021 11:25:17 +02:00

irq_work: Allow irq_work_sync() to sleep if irq_work() has no IRQ support.

Raising irq_work triggers an interrupt instantly if the architecture
supports it. Otherwise the work will be processed on the next timer
tick. In the worst case irq_work_sync() could spin for up to a jiffy.

irq_work_sync() is usually used in tear-down context, which is fully
preemptible. Based on code review, irq_work_sync() is invoked from
preemptible context and there is at most one waiter at a time. This
qualifies it to use rcuwait for synchronisation.

Let irq_work_sync() synchronize with rcuwait if the architecture
processes irqwork via the timer tick.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-3-bigeasy@linutronix.de
---
 include/linux/irq_work.h |  3 +++
 kernel/irq_work.c        | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index ec2a47a..b48955e 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -3,6 +3,7 @@
 #define _LINUX_IRQ_WORK_H
 
 #include <linux/smp_types.h>
+#include <linux/rcuwait.h>
 
 /*
  * An entry can be in one of four states:
@@ -16,11 +17,13 @@
 struct irq_work {
 	struct __call_single_node node;
 	void (*func)(struct irq_work *);
+	struct rcuwait irqwait;
 };
 
 #define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){	\
 	.node = { .u_flags = (_flags), },			\
 	.func = (_func),					\
+	.irqwait = __RCUWAIT_INITIALIZER(irqwait),		\
 }
 
 #define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index db8c248..e789bed 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -160,6 +160,9 @@ void irq_work_single(void *arg)
 	 * else claimed it meanwhile.
 	 */
 	(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
+
+	if (!arch_irq_work_has_interrupt())
+		rcuwait_wake_up(&work->irqwait);
 }
 
 static void irq_work_run_list(struct llist_head *list)
@@ -204,6 +207,13 @@ void irq_work_tick(void)
 void irq_work_sync(struct irq_work *work)
 {
 	lockdep_assert_irqs_enabled();
+	might_sleep();
+
+	if (!arch_irq_work_has_interrupt()) {
+		rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
+				   TASK_UNINTERRUPTIBLE);
+		return;
+	}
 
 	while (irq_work_is_busy(work))
 		cpu_relax();


* [tip: sched/core] sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
From: tip-bot2 for Sebastian Andrzej Siewior @ 2021-10-15  9:44 UTC
  To: linux-tip-commits
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     da6ff09943491819e077b94c284bf0a6b751c9b8
Gitweb:        https://git.kernel.org/tip/da6ff09943491819e077b94c284bf0a6b751c9b8
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Wed, 06 Oct 2021 13:18:49 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 15 Oct 2021 11:25:16 +02:00

sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ

The push-IPI logic for RT tasks expects to be invoked from hardirq
context. One reason is that an RT task on the remote CPU would block
softirq processing on PREEMPT_RT and so prevent the pulling / balancing
of the RT tasks as intended.

Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c1729f9..e812467 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -526,7 +526,7 @@ static int init_rootdomain(struct root_domain *rd)
 #ifdef HAVE_RT_PUSH_IPI
 	rd->rto_cpu = -1;
 	raw_spin_lock_init(&rd->rto_lock);
-	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+	rd->rto_push_work = IRQ_WORK_INIT_HARD(rto_push_irq_work_func);
 #endif
 
 	rd->visit_gen = 0;

