LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/2] timer: Migrate running timers
@ 2015-03-31  6:55 Viresh Kumar
  2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Viresh Kumar @ 2015-03-31  6:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: linaro-kernel, linux-kernel, Viresh Kumar, Steven Miao

Hi Ingo/Thomas/Peter,

While queuing a timer, we try to migrate it to a non-idle core if the local core
is idle, but we don't try that if the timer is re-armed from its handler.

There were few unsolved problems due to which it was avoided until now. But
there are cases where solving these problems can be useful. When the timer is
always re-armed from its handler, it never migrates to other cores. And many a
times, it ends up waking an idle core to just service the timer, which could
have been handled by a non-idle core.

Peter suggested [1] few changes which can make that work and the first patch
does exactly that. The second one is a minor improvement, that replaces
'running_timer' pointer with 'busy'. That variable was required as part of a
sanity check during CPU hot-unplug operation. I was not sure if we should drop
this extra variable ('running_timer' or 'busy') and the sanity check.

Because we are using another bit from base pointer to keep track of running
status of timer, we get a warning on blackfin, as it doesn't respect
____cacheline_aligned [2].

kernel/time/timer.c: In function 'init_timers':
kernel/time/timer.c:1731:2: error: call to '__compiletime_assert_1731' declared
	with attribute error: BUILD_BUG_ON failed: __alignof__(struct tvec_base)
	& TIMER_FLAG_MASK

--
viresh

[1] https://lkml.org/lkml/2015/3/28/32
[2] https://lkml.org/lkml/2015/3/29/178

Cc: Steven Miao <realmz6@gmail.com>

Viresh Kumar (2):
  timer: Avoid waking up an idle-core by migrate running timer
  timer: Replace base-> 'running_timer' with 'busy'

 include/linux/timer.h |   3 +-
 kernel/time/timer.c   | 102 ++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 81 insertions(+), 24 deletions(-)

-- 
2.3.0.rc0.44.ga94655d


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-03-31  6:55 [PATCH 0/2] timer: Migrate running timers Viresh Kumar
@ 2015-03-31  6:55 ` Viresh Kumar
  2015-03-31 14:53   ` Peter Zijlstra
                     ` (2 more replies)
  2015-03-31  6:55 ` [PATCH 2/2] timer: Replace base-> 'running_timer' with 'busy' Viresh Kumar
  2015-03-31 15:01 ` [PATCH 0/2] timer: Migrate running timers Viresh Kumar
  2 siblings, 3 replies; 21+ messages in thread
From: Viresh Kumar @ 2015-03-31  6:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: linaro-kernel, linux-kernel, Viresh Kumar, Steven Miao

While queuing a timer, we try to migrate it to a non-idle core if the local core
is idle, but we don't try that if the timer is re-armed from its handler.

There were few unsolved problems due to which it was avoided until now. But
there are cases where solving these problems can be useful. When the timer is
always re-armed from its handler, it never migrates to other cores. And many a
times, it ends up waking an idle core to just service the timer, which could
have been handled by a non-idle core.

Problems to solve, if we allow such migrations:

- Serialization of timer with itself. Its possible that timer may fire on the
  new base, before the timer handler finishes on old base.

- del_timer_sync() can't detect that the timer's handler has not finished.

__mod_timer migrates the timer with following code:

	spin_lock(&old_base->lock);
	timer_set_base(timer, NULL);
	spin_unlock(&old_base->lock);

	spin_lock(&new_base->lock);
	timer_set_base(timer, new_base);
	spin_unlock(&new_base->lock);

Once the new_base->lock is released, del_timer_sync() can take the
new_base->lock and will get new_base->running_timer != timer. del_timer_sync()
will then remove the timer and return while timer's handler hasn't finished yet
on the old base.

To fix these issues, lets use another bit from base pointer to mark if a timer's
handler is currently running or not. This can be used to verify timer's state in
del_timer_sync().

Before running timer's handler we must ensure timer isn't running on any other
CPU. If it is, then process other expired timers first, if any, and then wait
until the handler finishes.

Cc: Steven Miao <realmz6@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/timer.h |  3 +-
 kernel/time/timer.c   | 93 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 76 insertions(+), 20 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..68bf09d69352 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE		0x1LU
 #define TIMER_IRQSAFE			0x2LU
+#define TIMER_RUNNING			0x4LU
 
-#define TIMER_FLAG_MASK			0x3LU
+#define TIMER_FLAG_MASK			0x7LU
 
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
 		.entry = { .prev = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..364644811485 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -105,6 +105,21 @@ static inline unsigned int tbase_get_irqsafe(struct tvec_base *base)
 	return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
 }
 
+static inline unsigned int timer_running(struct timer_list *timer)
+{
+	return ((unsigned int)(unsigned long)timer->base & TIMER_RUNNING);
+}
+
+static inline void timer_set_running(struct timer_list *timer)
+{
+	timer->base = (struct tvec_base *)((unsigned long)timer->base | TIMER_RUNNING);
+}
+
+static inline void timer_clear_running(struct timer_list *timer)
+{
+	timer->base = (struct tvec_base *)((unsigned long)timer->base & ~TIMER_RUNNING);
+}
+
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
 	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
@@ -781,21 +796,12 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 	new_base = per_cpu(tvec_bases, cpu);
 
 	if (base != new_base) {
-		/*
-		 * We are trying to schedule the timer on the local CPU.
-		 * However we can't change timer's base while it is running,
-		 * otherwise del_timer_sync() can't detect that the timer's
-		 * handler yet has not finished. This also guarantees that
-		 * the timer is serialized wrt itself.
-		 */
-		if (likely(base->running_timer != timer)) {
-			/* See the comment in lock_timer_base() */
-			timer_set_base(timer, NULL);
-			spin_unlock(&base->lock);
-			base = new_base;
-			spin_lock(&base->lock);
-			timer_set_base(timer, base);
-		}
+		/* See the comment in lock_timer_base() */
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
 	}
 
 	timer->expires = expires;
@@ -1016,7 +1022,7 @@ int try_to_del_timer_sync(struct timer_list *timer)
 
 	base = lock_timer_base(timer, &flags);
 
-	if (base->running_timer != timer) {
+	if (!timer_running(timer)) {
 		timer_stats_timer_clear_start_info(timer);
 		ret = detach_if_pending(timer, base, true);
 	}
@@ -1050,12 +1056,12 @@ EXPORT_SYMBOL(try_to_del_timer_sync);
  *    ----                             ----
  *                                   <SOFTIRQ>
  *                                   call_timer_fn();
- *                                     base->running_timer = mytimer;
+ *                                     timer_set_running(mytimer);
  *  spin_lock_irq(somelock);
  *                                     <IRQ>
  *                                        spin_lock(somelock);
  *  del_timer_sync(mytimer);
- *   while (base->running_timer == mytimer);
+ *   while (timer_running(mytimer));
  *
  * Now del_timer_sync() will never return and never release somelock.
  * The interrupt on the other CPU is waiting to grab somelock but
@@ -1189,12 +1195,41 @@ static inline void __run_timers(struct tvec_base *base)
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
 		list_replace_init(base->tv1.vec + index, head);
+
+again:
 		while (!list_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
 			bool irqsafe;
 
-			timer = list_first_entry(head, struct timer_list,entry);
+			timer = list_first_entry(head, struct timer_list, entry);
+
+			if (unlikely(timer_running(timer))) {
+				/*
+				 * The only way to get here is if the handler,
+				 * running on another base, re-queued itself on
+				 * this base, and the handler hasn't finished
+				 * yet.
+				 */
+
+				if (list_is_last(&timer->entry, head)) {
+					/*
+					 * Drop lock, so that TIMER_RUNNING can
+					 * be cleared on another base.
+					 */
+					spin_unlock(&base->lock);
+
+					while (timer_running(timer))
+						cpu_relax();
+
+					spin_lock(&base->lock);
+				} else {
+					/* Rotate the list and try someone else */
+					list_move_tail(&timer->entry, head);
+				}
+				goto again;
+			}
+
 			fn = timer->function;
 			data = timer->data;
 			irqsafe = tbase_get_irqsafe(timer->base);
@@ -1202,6 +1237,7 @@ static inline void __run_timers(struct tvec_base *base)
 			timer_stats_account_timer(timer);
 
 			base->running_timer = timer;
+			timer_set_running(timer);
 			detach_expired_timer(timer, base);
 
 			if (irqsafe) {
@@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
 				call_timer_fn(timer, fn, data);
 				spin_lock_irq(&base->lock);
 			}
+
+			/*
+			 * Handler running on this base, re-queued itself on
+			 * another base ?
+			 */
+			if (unlikely(timer->base != base)) {
+				unsigned long flags;
+				struct tvec_base *tbase;
+
+				spin_unlock(&base->lock);
+
+				tbase = lock_timer_base(timer, &flags);
+				timer_clear_running(timer);
+				spin_unlock(&tbase->lock);
+
+				spin_lock(&base->lock);
+			} else {
+				timer_clear_running(timer);
+			}
 		}
 	}
 	base->running_timer = NULL;
-- 
2.3.0.rc0.44.ga94655d


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/2] timer: Replace base-> 'running_timer' with 'busy'
  2015-03-31  6:55 [PATCH 0/2] timer: Migrate running timers Viresh Kumar
  2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
@ 2015-03-31  6:55 ` Viresh Kumar
  2015-03-31 15:01 ` [PATCH 0/2] timer: Migrate running timers Viresh Kumar
  2 siblings, 0 replies; 21+ messages in thread
From: Viresh Kumar @ 2015-03-31  6:55 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: linaro-kernel, linux-kernel, Viresh Kumar

We don't need to track the running timer for a cpu base anymore, but sill need
to check business of base for sanity checking during CPU hotplug.

Lets replace 'running_timer' with 'busy' for handle that efficiently.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/time/timer.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 364644811485..2db05206594b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -77,7 +77,7 @@ struct tvec_root {
 
 struct tvec_base {
 	spinlock_t lock;
-	struct timer_list *running_timer;
+	bool busy;
 	unsigned long timer_jiffies;
 	unsigned long next_timer;
 	unsigned long active_timers;
@@ -1180,6 +1180,8 @@ static inline void __run_timers(struct tvec_base *base)
 		spin_unlock_irq(&base->lock);
 		return;
 	}
+
+	base->busy = true;
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
 		struct list_head work_list;
 		struct list_head *head = &work_list;
@@ -1236,7 +1238,6 @@ static inline void __run_timers(struct tvec_base *base)
 
 			timer_stats_account_timer(timer);
 
-			base->running_timer = timer;
 			timer_set_running(timer);
 			detach_expired_timer(timer, base);
 
@@ -1270,7 +1271,7 @@ static inline void __run_timers(struct tvec_base *base)
 			}
 		}
 	}
-	base->running_timer = NULL;
+	base->busy = false;
 	spin_unlock_irq(&base->lock);
 }
 
@@ -1675,7 +1676,7 @@ static void migrate_timers(int cpu)
 	spin_lock_irq(&new_base->lock);
 	spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
-	BUG_ON(old_base->running_timer);
+	BUG_ON(old_base->busy);
 
 	for (i = 0; i < TVR_SIZE; i++)
 		migrate_timer_list(new_base, old_base->tv1.vec + i);
-- 
2.3.0.rc0.44.ga94655d


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
@ 2015-03-31 14:53   ` Peter Zijlstra
  2015-04-14 23:13   ` Thomas Gleixner
  2015-04-15 15:54   ` Thomas Gleixner
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-03-31 14:53 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Thomas Gleixner, linaro-kernel, linux-kernel, Steven Miao

On Tue, Mar 31, 2015 at 12:25:16PM +0530, Viresh Kumar wrote:
> Cc: Steven Miao <realmz6@gmail.com>


> +#define TIMER_FLAG_MASK			0x7LU

So Steven, this will break compilation on blackfin because that makes
____cacheline_aligned a NOP while we assume it will generate
__attribute__((__aligned__(SMP_CACHE_BYTES))) with the further
assumption that SMP_CACHE_BYTES >= 8.

Can you explain why blackfin chooses to break ____cacheline_align for
UP? We have the *_in_smp variants to distinguish these cases.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] timer: Migrate running timers
  2015-03-31  6:55 [PATCH 0/2] timer: Migrate running timers Viresh Kumar
  2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
  2015-03-31  6:55 ` [PATCH 2/2] timer: Replace base-> 'running_timer' with 'busy' Viresh Kumar
@ 2015-03-31 15:01 ` Viresh Kumar
  2 siblings, 0 replies; 21+ messages in thread
From: Viresh Kumar @ 2015-03-31 15:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner
  Cc: Linaro Kernel Mailman List, Linux Kernel Mailing List,
	Viresh Kumar, Steven Miao

On 31 March 2015 at 12:25, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> While queuing a timer, we try to migrate it to a non-idle core if the local core
> is idle, but we don't try that if the timer is re-armed from its handler.
>
> There were few unsolved problems due to which it was avoided until now. But
> there are cases where solving these problems can be useful. When the timer is
> always re-armed from its handler, it never migrates to other cores. And many a
> times, it ends up waking an idle core to just service the timer, which could
> have been handled by a non-idle core.
>
> Peter suggested [1] few changes which can make that work and the first patch
> does exactly that. The second one is a minor improvement, that replaces
> 'running_timer' pointer with 'busy'. That variable was required as part of a
> sanity check during CPU hot-unplug operation. I was not sure if we should drop
> this extra variable ('running_timer' or 'busy') and the sanity check.

Peter reminded me that I failed to tell if it really worked or not :)

So yes it worked. I tested this on a Dual-core ARM cortex-A15 board with 5
timers getting re-armed 100 times each from their handler.

Most of the time the remote CPU was idle (along with the local one) and
so migration didn't happen.

But as and when the local CPU was idle and remote one wasn't, timers got
successfully migrated.

My branches are also tested by Fengguang's build-bot, and in case of any
of wreckage on other machines, we will be informed :)

--
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
  2015-03-31 14:53   ` Peter Zijlstra
@ 2015-04-14 23:13   ` Thomas Gleixner
  2015-04-17  8:12     ` viresh kumar
  2015-04-15 15:54   ` Thomas Gleixner
  2 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2015-04-14 23:13 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Peter Zijlstra, linaro-kernel, linux-kernel, Steven Miao

On Tue, 31 Mar 2015, Viresh Kumar wrote:
> @@ -1189,12 +1195,41 @@ static inline void __run_timers(struct tvec_base *base)
>  			cascade(base, &base->tv5, INDEX(3));
>  		++base->timer_jiffies;
>  		list_replace_init(base->tv1.vec + index, head);
> +
> +again:
>  		while (!list_empty(head)) {
>  			void (*fn)(unsigned long);
>  			unsigned long data;
>  			bool irqsafe;
>  
> -			timer = list_first_entry(head, struct timer_list,entry);
> +			timer = list_first_entry(head, struct timer_list, entry);
> +
> +			if (unlikely(timer_running(timer))) {
> +				/*
> +				 * The only way to get here is if the handler,
> +				 * running on another base, re-queued itself on
> +				 * this base, and the handler hasn't finished
> +				 * yet.
> +				 */
> +
> +				if (list_is_last(&timer->entry, head)) {
> +					/*
> +					 * Drop lock, so that TIMER_RUNNING can
> +					 * be cleared on another base.
> +					 */
> +					spin_unlock(&base->lock);
> +
> +					while (timer_running(timer))
> +						cpu_relax();
> +
> +					spin_lock(&base->lock);
> +				} else {
> +					/* Rotate the list and try someone else */
> +					list_move_tail(&timer->entry, head);
> +				}

Can we please stick that timer into the next bucket and be done with it?

> +				goto again;

"continue;" is old school, right?

> +			}
> +
>  			fn = timer->function;
>  			data = timer->data;
>  			irqsafe = tbase_get_irqsafe(timer->base);
> @@ -1202,6 +1237,7 @@ static inline void __run_timers(struct tvec_base *base)
>  			timer_stats_account_timer(timer);
>  
>  			base->running_timer = timer;
> +			timer_set_running(timer);
>  			detach_expired_timer(timer, base);
>  
>  			if (irqsafe) {
> @@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
>  				call_timer_fn(timer, fn, data);
>  				spin_lock_irq(&base->lock);
>  			}
> +
> +			/*
> +			 * Handler running on this base, re-queued itself on
> +			 * another base ?
> +			 */
> +			if (unlikely(timer->base != base)) {
> +				unsigned long flags;
> +				struct tvec_base *tbase;
> +
> +				spin_unlock(&base->lock);
> +
> +				tbase = lock_timer_base(timer, &flags);
> +				timer_clear_running(timer);
> +				spin_unlock(&tbase->lock);
> +
> +				spin_lock(&base->lock);
> +			} else {
> +				timer_clear_running(timer);
> +			}

Aside of the above this is really horrible. Why not doing the obvious:

__mod_timer()

	if (base != newbase) {
	   	if (timer_running()) {
		   list_add(&base->migration_list);
		   goto out_unlock;
		}
		.....

__run_timers()

	while(!list_empty(head)) {
		....
	}

	if (unlikely(!list_empty(&base->migration_list)) {
		/* dequeue and requeue again */
	}

Simple, isn't it?

No new flags in the timer base, no concurrent expiry, no extra
conditionals in the expiry path. Just a single extra check at the end
of the softirq handler for this rare condition instead of imposing all
that nonsense to the hotpath.

We might even optimize that:

	  	if (timer_running()) {
		   list_add(&base->migration_list);
		   base->preferred_target = newbase;
		   goto out_unlock;
		}

	if (unlikely(!list_empty(&base->migration_list)) {
		/* dequeue and requeue again */
		while (!list_empty(&base->migration_list)) {
			dequeue_timer();
			newbase = base->preferred_target;
			unlock(base);
			lock(newbase);
			enqueue(newbase);
			unlock(newbase);
			lock(base);
		}
	}

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
  2015-03-31 14:53   ` Peter Zijlstra
  2015-04-14 23:13   ` Thomas Gleixner
@ 2015-04-15 15:54   ` Thomas Gleixner
  2 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2015-04-15 15:54 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Peter Zijlstra, linaro-kernel, linux-kernel, Steven Miao

On Tue, 31 Mar 2015, Viresh Kumar wrote:
> @@ -1213,6 +1249,25 @@ static inline void __run_timers(struct tvec_base *base)
>  				call_timer_fn(timer, fn, data);
>  				spin_lock_irq(&base->lock);
>  			}
> +
> +			/*
> +			 * Handler running on this base, re-queued itself on
> +			 * another base ?
> +			 */
> +			if (unlikely(timer->base != base)) {
> +				unsigned long flags;
> +				struct tvec_base *tbase;
> +
> +				spin_unlock(&base->lock);
> +
> +				tbase = lock_timer_base(timer, &flags);
> +				timer_clear_running(timer);
> +				spin_unlock(&tbase->lock);
> +
> +				spin_lock(&base->lock);
> +			} else {
> +				timer_clear_running(timer);
> +			}

And just for the record:

Dereferencing timer _AFTER_ the callback function is a big NONO. The
callback function is allowed to free the timer. See the comment in
call_timer_fn()

Oh well,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-14 23:13   ` Thomas Gleixner
@ 2015-04-17  8:12     ` viresh kumar
  2015-04-17  8:32       ` Ingo Molnar
  2015-04-21 21:32       ` Thomas Gleixner
  0 siblings, 2 replies; 21+ messages in thread
From: viresh kumar @ 2015-04-17  8:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, linaro-kernel, linux-kernel,
	Steven Miao, shashim


Hi Thomas,

On Wednesday 15 April 2015 04:43 AM, Thomas Gleixner wrote:
> No new flags in the timer base, no concurrent expiry, no extra
> conditionals in the expiry path. Just a single extra check at the end
> of the softirq handler for this rare condition instead of imposing all
> that nonsense to the hotpath.

Here is what I could get to based on your suggestions:

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..c412511d34b8 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -78,6 +78,8 @@ struct tvec_root {
 struct tvec_base {
        spinlock_t lock;
        struct timer_list *running_timer;
+       struct list_head migration_list;
+       struct tvec_base *preferred_target;
        unsigned long timer_jiffies;
        unsigned long next_timer;
        unsigned long active_timers;
@@ -785,10 +787,18 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
                 * We are trying to schedule the timer on the local CPU.
                 * However we can't change timer's base while it is running,
                 * otherwise del_timer_sync() can't detect that the timer's
-                * handler yet has not finished. This also guarantees that
-                * the timer is serialized wrt itself.
+                * handler yet has not finished.
+                *
+                * Move timer to migration_list which can be processed after all
+                * timers in current cycle are serviced.  This also guarantees
+                * that the timer is serialized wrt itself.
                 */
-               if (likely(base->running_timer != timer)) {
+               if (unlikely(base->running_timer == timer)) {
+                       timer->expires = expires;
+                       base->preferred_target = new_base;
+                       list_add_tail(&timer->entry, &base->migration_list);
+                       goto out_unlock;
+               } else {
                        /* See the comment in lock_timer_base() */
                        timer_set_base(timer, NULL);
                        spin_unlock(&base->lock);
@@ -1214,6 +1224,33 @@ static inline void __run_timers(struct tvec_base *base)
                                spin_lock_irq(&base->lock);
                        }
                }
+
+               /*
+                * Timer handler re-armed timer and we didn't wanted to add that
+                * on a idle local CPU. Its handler has finished now, lets
+                * enqueue it again.
+                */
+               if (unlikely(!list_empty(&base->migration_list))) {
+                       struct tvec_base *new_base = base->preferred_target;
+
+                       do {
+                               timer = list_first_entry(&base->migration_list,
+                                                        struct timer_list, entry);
+
+                               __list_del(timer->entry.prev, timer->entry.next);
+
+                               /* See the comment in lock_timer_base() */
+                               timer_set_base(timer, NULL);
+                               spin_unlock(&base->lock);
+
+                               spin_lock(&new_base->lock);
+                               timer_set_base(timer, new_base);
+                               internal_add_timer(new_base, timer);
+                               spin_unlock(&new_base->lock);
+
+                               spin_lock(&base->lock);
+                       } while (!list_empty(&base->migration_list));
+               }
        }
        base->running_timer = NULL;
        spin_unlock_irq(&base->lock);
@@ -1583,6 +1620,7 @@ static int init_timers_cpu(int cpu)
        for (j = 0; j < TVR_SIZE; j++)
                INIT_LIST_HEAD(base->tv1.vec + j);

+       INIT_LIST_HEAD(&base->migration_list);
        base->timer_jiffies = jiffies;
        base->next_timer = base->timer_jiffies;
        base->active_timers = 0;


Does this look any better ?

I tested this on Exynos (Dual ARM Cortex-A9) board, with ubuntu over it.
System was fairly idle and I did 'dmesg > /dev/null' on one of the CPUs
in an infinite loop, to get CPUs out of idle.


I have got few more concerns/diffs over this to further optimize things,
but kept them separate so that I can drop them if they aren't correct.

1.) Should the above list_empty(migration_list) block be added out of the

	while (time_after_eq(jiffies, base->timer_jiffies))

    block ? So that we check it only once per timer interrupt.

    Also ->running_timer is set to the last serviced timer and a
    del_timer_sync() might be waiting to remove it. But we continue with
    the migration list first, without clearing it first. Not sure if this
    is important at all..


2.) By the time we finish serving all pending timers, local CPU might not be
    idle anymore OR the target CPU may become idle.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c412511d34b8..d75d31e9dc49 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1233,6 +1233,14 @@ static inline void __run_timers(struct tvec_base *base)
                if (unlikely(!list_empty(&base->migration_list))) {
                        struct tvec_base *new_base = base->preferred_target;

+                       if (!idle_cpu(base->cpu)) {
+                               /* Local CPU not idle anymore */
+                               new_base = base;
+                       } else if (idle_cpu(new_base->cpu)) {
+                               /* Re-evaluate base, target CPU has gone idle */
+                               new_base = per_cpu(tvec_bases, get_nohz_timer_target(false));
+                       }
+
                        do {
                                timer = list_first_entry(&base->migration_list,
                                                         struct timer_list, entry);


The first if block is getting hit a lot of times in my tests. But
the second one has never been hit (Probably because of only two CPUs,
not sure).


2.) It might be better to update preferred_target every time we choose a
difference base. This may help us avoid calling get_nohz_timer_target()
in the second if block above.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d75d31e9dc49..558cd98abd87 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -783,6 +783,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
        new_base = per_cpu(tvec_bases, cpu);

        if (base != new_base) {
+               base->preferred_target = new_base;
+
                /*
                 * We are trying to schedule the timer on the local CPU.
                 * However we can't change timer's base while it is running,
@@ -795,7 +797,6 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
                 */
                if (unlikely(base->running_timer == timer)) {
                        timer->expires = expires;
-                       base->preferred_target = new_base;
                        list_add_tail(&timer->entry, &base->migration_list);
                        goto out_unlock;
                } else {



--
viresh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-17  8:12     ` viresh kumar
@ 2015-04-17  8:32       ` Ingo Molnar
  2015-04-21 21:32       ` Thomas Gleixner
  1 sibling, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2015-04-17  8:32 UTC (permalink / raw)
  To: viresh kumar
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, linaro-kernel,
	linux-kernel, Steven Miao, shashim


* viresh kumar <viresh.kumar@linaro.org> wrote:

> +               /*
> +                * Timer handler re-armed timer and we didn't wanted to add that
> +                * on a idle local CPU. Its handler has finished now, lets
> +                * enqueue it again.
> +                */

That single comment has 5 grammatical errors!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-17  8:12     ` viresh kumar
  2015-04-17  8:32       ` Ingo Molnar
@ 2015-04-21 21:32       ` Thomas Gleixner
  2015-04-21 21:54         ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2015-04-21 21:32 UTC (permalink / raw)
  To: viresh kumar
  Cc: Ingo Molnar, Peter Zijlstra, linaro-kernel, linux-kernel,
	Steven Miao, shashim

On Fri, 17 Apr 2015, viresh kumar wrote:
> Does this look any better ?

Yes, but:
 
> 1.) Should the above list_empty(migration_list) block be added out of the
> 
> 	while (time_after_eq(jiffies, base->timer_jiffies))
> 
>     block ? So that we check it only once per timer interrupt.

That's what I suggested.
 
>     Also ->running_timer is set to the last serviced timer and a
>     del_timer_sync() might be waiting to remove it. But we continue with
>     the migration list first, without clearing it first. Not sure if this
>     is important at all..

Of course it is and doing it outside of the loop solves that issue.
 
> 2.) By the time we finish serving all pending timers, local CPU might not be
>     idle anymore OR the target CPU may become idle.

That's another good reason to move that migration list outside of the
while loop.
 
> 2.) It might be better to update preferred_target every time we choose a
> difference base. This may help us avoid calling get_nohz_timer_target()
> in the second if block above.

Yuck no. You can do that in that migration code and not add more
pointless crap to the normal case.

Are you realizing that __mod_timer() is a massive hotpath for network
heavy workloads?

This stuff needs to be debloated and not mindlessly packed with more
corner case features.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-21 21:32       ` Thomas Gleixner
@ 2015-04-21 21:54         ` Eric Dumazet
  2015-04-22 15:29           ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-04-21 21:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: viresh kumar, Ingo Molnar, Peter Zijlstra, linaro-kernel,
	linux-kernel, Steven Miao, shashim

On Tue, 2015-04-21 at 23:32 +0200, Thomas Gleixner wrote:

> 
> Are you realizing that __mod_timer() is a massive hotpath for network
> heavy workloads?

BTW I was considering using mod_timer_pinned() from these networking
timers (ie sk_reset_timer())

get_nohz_timer_target() sounds cool for laptop users, but is one cause
for bad responses to DDOS, when the selected cpu gets stressed.

This is the reason I used mod_timer_pinned() in commit 789f558cfb3680ae
("tcp/dccp: get rid of central timewait timer")




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-21 21:54         ` Eric Dumazet
@ 2015-04-22 15:29           ` Peter Zijlstra
  2015-04-22 16:02             ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2015-04-22 15:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, viresh kumar, Ingo Molnar, linaro-kernel,
	linux-kernel, Steven Miao, shashim

On Tue, Apr 21, 2015 at 02:54:55PM -0700, Eric Dumazet wrote:
> On Tue, 2015-04-21 at 23:32 +0200, Thomas Gleixner wrote:
> 
> > 
> > Are you realizing that __mod_timer() is a massive hotpath for network
> > heavy workloads?
> 
> BTW I was considering using mod_timer_pinned() from these networking
> timers (ie sk_reset_timer())
> 
> get_nohz_timer_target() sounds cool for laptop users, but is one cause
> for bad responses to DDOS, when the selected cpu gets stressed.
> 
> This is the reason I used mod_timer_pinned() in commit 789f558cfb3680ae
> ("tcp/dccp: get rid of central timewait timer")

Hmm, that sounds unfortunate, this would wreck life for the power aware
laptop/tablet etc.. people.

There is already a sysctl to manage this, is that not enough to mitigate
this problem on the server side of things?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-22 15:29           ` Peter Zijlstra
@ 2015-04-22 16:02             ` Eric Dumazet
  2015-04-22 18:56               ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-04-22 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, viresh kumar, Ingo Molnar, linaro-kernel,
	linux-kernel, Steven Miao, shashim

On Wed, 2015-04-22 at 17:29 +0200, Peter Zijlstra wrote:

> Hmm, that sounds unfortunate, this would wreck life for the power aware
> laptop/tablet etc.. people.
> 
> There is already a sysctl to manage this, is that not enough to mitigate
> this problem on the server side of things?

The thing is : 99% of networking timers never fire.

But when they _do_ fire, and host is under attack, they all fire on
unrelated cpu and this one can not keep up.

Added latencies fire monitoring alerts.

Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
for a specific example of the problems that can be raised.

When we set a timer to fire in 10 seconds, knowing the _current_ idle
state for cpus is of no help.

Add to this that softirq processing is not considered as making current
cpu as non idle.

networking tried hard to use cpu affinities (and all techniques
described in Documentation/networking/scaling.txt),
but /proc/sys/kernel/timer_migration adds a fair overhead in many
workloads.

get_nohz_timer_target() has to touch 3 cache lines per cpu...

Its in the top 10 in "perf top" profiles on servers with 72 threads.

This /proc/sys/kernel/timer_migration should have been instead :

/proc/sys/kernel/timer_on_a_single_cpu_for_laptop_sake




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-22 16:02             ` Eric Dumazet
@ 2015-04-22 18:56               ` Thomas Gleixner
  2015-04-22 19:59                 ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2015-04-22 18:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel,
	linux-kernel, Steven Miao, shashim

On Wed, 22 Apr 2015, Eric Dumazet wrote:
> Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
> for a specific example of the problems that can be raised.

If you have a problem with the core timer code then it should be fixed
there and not worked around in some place which will ruin stuff for
power saving interested users. I'm so tired of this 'I fix it in my
sandbox' attitude, really. If the core code has a shortcoming we fix
it there right away because you are probably not the only one who runs
into that shortcoming. So if we don't fix it in the core we end up
with a metric ton of slightly different (or broken) workarounds which
affect the workload/system characteristics of other people.

Just for the record. Even the changelog of this commit is blatantly
wrong:

  "We can see that timers get migrated into a single cpu, presumably
   idle at the time timers are set up."

The timer migration moves timers to non idle cpus to leave the idle
ones alone for power saving sake.

I can see and understand the reason why you want to avoid that, but I
have to ask the question whether this pinning is the correct behaviour
under all workloads and system characteristics. If yes, then the patch
is the right answer, if no, then it is simply the wrong approach.

> but /proc/sys/kernel/timer_migration adds a fair overhead in many
> workloads.
> 
> get_nohz_timer_target() has to touch 3 cache lines per cpu...

And this is something we can fix and completely avoid if we think
about it. Looking at the code I have to admit that the out of line
call and the sysctl variable lookup is silly. But its not rocket
science to fix this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-22 18:56               ` Thomas Gleixner
@ 2015-04-22 19:59                 ` Eric Dumazet
  2015-04-22 21:56                   ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-04-22 19:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel,
	linux-kernel, Steven Miao, shashim

On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
> On Wed, 22 Apr 2015, Eric Dumazet wrote:
> > Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
> > for a specific example of the problems that can be raised.
> 
> If you have a problem with the core timer code then it should be fixed
> there and not worked around in some place which will ruin stuff for
> power saving interested users. I'm so tired of this 'I fix it in my
> sandbox' attitude, really. If the core code has a shortcoming we fix
> it there right away because you are probably not the only one who runs
> into that shortcoming. So if we don't fix it in the core we end up
> with a metric ton of slightly different (or broken) workarounds which
> affect the workload/system characteristics of other people.
> 
> Just for the record. Even the changelog of this commit is blatantly
> wrong:
> 
>   "We can see that timers get migrated into a single cpu, presumably
>    idle at the time timers are set up."

Spare me the obvious typo. A 'not' is missing.

> 
> The timer migration moves timers to non idle cpus to leave the idle
> ones alone for power saving sake.
> 
> I can see and understand the reason why you want to avoid that, but I
> have to ask the question whether this pinning is the correct behaviour
> under all workloads and system characteristics. If yes, then the patch
> is the right answer, if no, then it is simply the wrong approach.
> 
> > but /proc/sys/kernel/timer_migration adds a fair overhead in many
> > workloads.
> > 
> > get_nohz_timer_target() has to touch 3 cache lines per cpu...
> 
> And this is something we can fix and completely avoid if we think
> about it. Looking at the code I have to admit that the out of line
> call and the sysctl variable lookup is silly. But its not rocket
> science to fix this.

Awesome.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-22 19:59                 ` Eric Dumazet
@ 2015-04-22 21:56                   ` Thomas Gleixner
  2015-04-23  6:57                     ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2015-04-22 21:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel,
	linux-kernel, Steven Miao, shashim

On Wed, 22 Apr 2015, Eric Dumazet wrote:
> On Wed, 2015-04-22 at 20:56 +0200, Thomas Gleixner wrote:
> > On Wed, 22 Apr 2015, Eric Dumazet wrote:
> > > Check commit 4a8e320c929991c9480 ("net: sched: use pinned timers")
> > > for a specific example of the problems that can be raised.
> > 
> > If you have a problem with the core timer code then it should be fixed
> > there and not worked around in some place which will ruin stuff for
> > power saving interested users. I'm so tired of this 'I fix it in my
> > sandbox' attitude, really. If the core code has a shortcoming we fix
> > it there right away because you are probably not the only one who runs
> > into that shortcoming. So if we don't fix it in the core we end up
> > with a metric ton of slightly different (or broken) workarounds which
> > affect the workload/system characteristics of other people.
> > 
> > Just for the record. Even the changelog of this commit is blatantly
> > wrong:
> > 
> >   "We can see that timers get migrated into a single cpu, presumably
> >    idle at the time timers are set up."
> 
> Spare me the obvious typo. A 'not' is missing.

:)
 
> > 
> > The timer migration moves timers to non idle cpus to leave the idle
> > ones alone for power saving sake.
> > 
> > I can see and understand the reason why you want to avoid that, but I
> > have to ask the question whether this pinning is the correct behaviour
> > under all workloads and system characteristics. If yes, then the patch
> > is the right answer, if no, then it is simply the wrong approach.

I take your 'Awesome' below as a no then.

> > > but /proc/sys/kernel/timer_migration adds a fair overhead in many
> > > workloads.
> > > 
> > > get_nohz_timer_target() has to touch 3 cache lines per cpu...
> > 
> > And this is something we can fix and completely avoid if we think
> > about it. Looking at the code I have to admit that the out of line
> > call and the sysctl variable lookup is silly. But its not rocket
> > science to fix this.
> 
> Awesome.

Here you go. Completely untested, at least it compiles.

Thanks,

	tglx
---

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -343,14 +343,10 @@ extern int runqueue_is_locked(int cpu);
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
 extern void nohz_balance_enter_idle(int cpu);
 extern void set_cpu_sd_state_idle(void);
-extern int get_nohz_timer_target(int pinned);
+extern int get_nohz_timer_target(void);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 static inline void set_cpu_sd_state_idle(void) { }
-static inline int get_nohz_timer_target(int pinned)
-{
-	return smp_processor_id();
-}
 #endif
 
 /*
Index: linux/include/linux/sched/sysctl.h
===================================================================
--- linux.orig/include/linux/sched/sysctl.h
+++ linux/include/linux/sched/sysctl.h
@@ -57,24 +57,12 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
-extern unsigned int sysctl_timer_migration;
 extern unsigned int sysctl_sched_shares_window;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
 #endif
-#ifdef CONFIG_SCHED_DEBUG
-static inline unsigned int get_sysctl_timer_migration(void)
-{
-	return sysctl_timer_migration;
-}
-#else
-static inline unsigned int get_sysctl_timer_migration(void)
-{
-	return 1;
-}
-#endif
 
 /*
  *  control realtime throttling:
Index: linux/include/linux/timer.h
===================================================================
--- linux.orig/include/linux/timer.h
+++ linux/include/linux/timer.h
@@ -254,6 +254,16 @@ extern void run_local_timers(void);
 struct hrtimer;
 extern enum hrtimer_restart it_real_fn(struct hrtimer *);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+#include <linux/sysctl.h>
+
+extern unsigned int sysctl_timer_migration;
+int timer_migration_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos);
+extern struct static_key timer_migration_enabled;
+#endif
+
 unsigned long __round_jiffies(unsigned long j, int cpu);
 unsigned long __round_jiffies_relative(unsigned long j, int cpu);
 unsigned long round_jiffies(unsigned long j);
Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -593,13 +593,12 @@ void resched_cpu(int cpu)
  * selecting an idle cpu will add more delays to the timers than intended
  * (as that cpu's timer base may not be uptodate wrt jiffies etc).
  */
-int get_nohz_timer_target(int pinned)
+int get_nohz_timer_target(void)
 {
-	int cpu = smp_processor_id();
-	int i;
+	int i, cpu = smp_processor_id();
 	struct sched_domain *sd;
 
-	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
+	if (!idle_cpu(cpu))
 		return cpu;
 
 	rcu_read_lock();
@@ -7088,8 +7087,6 @@ void __init sched_init_smp(void)
 }
 #endif /* CONFIG_SMP */
 
-const_debug unsigned int sysctl_timer_migration = 1;
-
 int in_sched_functions(unsigned long addr)
 {
 	return in_lock_functions(addr) ||
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -349,15 +349,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "timer_migration",
-		.data		= &sysctl_timer_migration,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
-		.extra2		= &one,
-	},
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -1132,6 +1123,15 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+	{
+		.procname	= "timer_migration",
+		.data		= &sysctl_timer_migration,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= timer_migration_handler,
+	},
+#endif
 	{ }
 };
 
Index: linux/kernel/time/hrtimer.c
===================================================================
--- linux.orig/kernel/time/hrtimer.c
+++ linux/kernel/time/hrtimer.c
@@ -190,6 +190,23 @@ hrtimer_check_target(struct hrtimer *tim
 #endif
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+static inline struct hrtimer_cpu_base *get_target_base(int pinned)
+{
+	if (pinned)
+		return this_cpu_ptr(&hrtimer_bases);
+	if (static_key_true(&timer_migration_enabled))
+		return &per_cpu(hrtimer_bases, get_nohz_timer_target());
+	return this_cpu_ptr(&hrtimer_bases);
+}
+#else
+
+static inline struct hrtimer_cpu_base *get_target_base(int pinned)
+{
+	return this_cpu_ptr(&hrtimer_bases);
+}
+#endif
+
 /*
  * Switch the timer base to the current CPU when possible.
  */
@@ -197,14 +214,13 @@ static inline struct hrtimer_clock_base
 switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
 		    int pinned)
 {
+	struct hrtimer_cpu_base *new_cpu_base, *this_base;
 	struct hrtimer_clock_base *new_base;
-	struct hrtimer_cpu_base *new_cpu_base;
-	int this_cpu = smp_processor_id();
-	int cpu = get_nohz_timer_target(pinned);
 	int basenum = base->index;
 
+	this_base = this_cpu_ptr(&hrtimer_bases);
+	new_cpu_base = get_target_base(pinned);
 again:
-	new_cpu_base = &per_cpu(hrtimer_bases, cpu);
 	new_base = &new_cpu_base->clock_base[basenum];
 
 	if (base != new_base) {
@@ -225,17 +241,19 @@ again:
 		raw_spin_unlock(&base->cpu_base->lock);
 		raw_spin_lock(&new_base->cpu_base->lock);
 
-		if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
-			cpu = this_cpu;
+		if (new_cpu_base != this_base &&
+		    hrtimer_check_target(timer, new_base)) {
 			raw_spin_unlock(&new_base->cpu_base->lock);
 			raw_spin_lock(&base->cpu_base->lock);
+			new_cpu_base = this_base;
 			timer->base = base;
 			goto again;
 		}
 		timer->base = new_base;
 	} else {
-		if (cpu != this_cpu && hrtimer_check_target(timer, new_base)) {
-			cpu = this_cpu;
+		if (new_cpu_base != this_base &&
+		    hrtimer_check_target(timer, new_base)) {
+			new_cpu_base = this_base;
 			goto again;
 		}
 	}
Index: linux/kernel/time/timer.c
===================================================================
--- linux.orig/kernel/time/timer.c
+++ linux/kernel/time/timer.c
@@ -104,6 +104,49 @@ EXPORT_SYMBOL(boot_tvec_bases);
 
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+unsigned int sysctl_timer_migration = 1;
+struct static_key timer_migration_enabled = STATIC_KEY_INIT_TRUE;
+
+int timer_migration_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	static DEFINE_MUTEX(mutex);
+	bool keyon;
+	int ret;
+
+	mutex_lock(&mutex);
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (ret || !write)
+		goto unlock;
+
+	keyon = static_key_enabled(&timer_migration_enabled);
+	if (sysctl_timer_migration && !keyon )
+		static_key_slow_inc(&timer_migration_enabled);
+	else if (!sysctl_timer_migration && keyon)
+		static_key_slow_dec(&timer_migration_enabled);
+
+unlock:
+	mutex_unlock(&mutex);
+	return ret;
+}
+
+static inline struct tvec_base *get_target_base(int pinned)
+{
+	if (pinned)
+		return __this_cpu_read(tvec_bases);
+	if (static_key_true(&timer_migration_enabled))
+		return per_cpu(tvec_bases, get_nohz_timer_target());
+	return __this_cpu_read(tvec_bases);
+}
+#else
+static inline struct tvec_base *get_target_base(int pinned)
+{
+	return __this_cpu_read(tvec_bases);
+}
+#endif
+
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
 {
@@ -774,7 +817,7 @@ __mod_timer(struct timer_list *timer, un
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0 , cpu;
+	int ret = 0;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -787,8 +830,7 @@ __mod_timer(struct timer_list *timer, un
 
 	debug_activate(timer, expires);
 
-	cpu = get_nohz_timer_target(pinned);
-	new_base = per_cpu(tvec_bases, cpu);
+	new_base = get_target_base(pinned);
 
 	if (base != new_base) {
 		/*

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-22 21:56                   ` Thomas Gleixner
@ 2015-04-23  6:57                     ` Eric Dumazet
  2015-04-23 12:45                       ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-04-23  6:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel,
	linux-kernel, Steven Miao, shashim


On Wed, 2015-04-22 at 23:56 +0200, Thomas Gleixner wrote:

> -int get_nohz_timer_target(int pinned)
> +int get_nohz_timer_target(void)
>  {
> -	int cpu = smp_processor_id();
> -	int i;
> +	int i, cpu = smp_processor_id();
>  	struct sched_domain *sd;
>  
> -	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
> +	if (!idle_cpu(cpu))
>  		return cpu;

Maybe also test in_serving_softirq() ?

if (in_serving_softirq() || !idle_cpu(cpu))
	return cpu;

There is a fundamental problem with networking load : Many cpus appear
to be idle from scheduler perspective because no user/kernel task is running.

CPUs servicing NIC queues can be very busy handling thousands of packets
per second, yet have no user/kernel task running.

idle_cpu() return code is : this cpu is idle.    hmmmm, really ?

cpus are busy, *and* have to access alien data/locks to activate timers
that hardly fire anyway.

When idle_cpu() finally gives the right indication, it is too late :
ksoftirqd might be running on the wrong cpu. Innocent cpus, overwhelmed
by a sudden timer load and locked into a service loop.

This cannot resist to a DOS, and even with non malicious traffic, the
overhead is high.

Thanks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-23  6:57                     ` Eric Dumazet
@ 2015-04-23 12:45                       ` Thomas Gleixner
  2015-04-25 18:37                         ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2015-04-23 12:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel, LKML,
	Steven Miao, shashim, Frederic Weisbecker, Paul E. McKenney, cl

On Wed, 22 Apr 2015, Eric Dumazet wrote:
> On Wed, 2015-04-22 at 23:56 +0200, Thomas Gleixner wrote:
> 
> > -int get_nohz_timer_target(int pinned)
> > +int get_nohz_timer_target(void)
> >  {
> > -	int cpu = smp_processor_id();
> > -	int i;
> > +	int i, cpu = smp_processor_id();
> >  	struct sched_domain *sd;
> >  
> > -	if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu))
> > +	if (!idle_cpu(cpu))
> >  		return cpu;
> 
> Maybe also test in_serving_softirq() ?
> 
> if (in_serving_softirq() || !idle_cpu(cpu))
> 	return cpu;
> 
> There is a fundamental problem with networking load : Many cpus appear
> to be idle from scheduler perspective because no user/kernel task is running.
> 
> CPUs servicing NIC queues can be very busy handling thousands of packets
> per second, yet have no user/kernel task running.
> 
> idle_cpu() return code is : this cpu is idle.    hmmmm, really ?
> 
> cpus are busy, *and* have to access alien data/locks to activate timers
> that hardly fire anyway.
> 
> When idle_cpu() finally gives the right indication, it is too late :
> ksoftirqd might be running on the wrong cpu. Innocent cpus, overwhelmed
> by a sudden timer load and locked into a service loop.
> 
> This cannot resist to a DOS, and even with non malicious traffic, the
> overhead is high.

You definitely have a point from the high throughput networking
perspective.

Though in a power optimizing scenario with minimal network traffic
this might be the wrong decision. We have to gather data from the
power maniacs whether this matters or not. The FULL_NO_HZ camp might
be pretty unhappy about the above.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-23 12:45                       ` Thomas Gleixner
@ 2015-04-25 18:37                         ` Eric Dumazet
  2015-05-05 13:00                           ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-04-25 18:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel, LKML,
	Steven Miao, shashim, Frederic Weisbecker, Paul E. McKenney, cl

On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:

> You definitely have a point from the high throughput networking
> perspective.
> 
> Though in a power optimizing scenario with minimal network traffic
> this might be the wrong decision. We have to gather data from the
> power maniacs whether this matters or not. The FULL_NO_HZ camp might
> be pretty unhappy about the above.

Sure, I understand.


To make this clear, here the profile on a moderately loaded TCP server,
pushing ~20Gbits of data. Most of TCP output is ACK clock driven (thus
from softirq context).

(using regular sendmsg() system calls, that why the
get_nohz_timer_target() is 'only' second in the profile, but add the
find_next_bit() to it and this is very close being at first position)



   PerfTop:    4712 irqs/sec  kernel:96.7%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
-----------------------------------------------------------------------------------------
    10.16%  [kernel]          [k] copy_user_enhanced_fast_string            
     5.66%  [kernel]          [k] get_nohz_timer_target                     
     5.59%  [kernel]          [k] _raw_spin_lock                            
     2.53%  [kernel]          [k] __netif_receive_skb_core                  
     2.27%  [kernel]          [k] find_next_bit                             
     1.90%  [kernel]          [k] tcp_ack                                   

Maybe a reasonable heuristic would be to
change /proc/sys/kernel/timer_migration default to 0 on hosts with more
than 32 cpus.

profile with timer_migration = 0

   PerfTop:    3656 irqs/sec  kernel:94.3%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
-----------------------------------------------------------------------------------------
    13.95%  [kernel]          [k] copy_user_enhanced_fast_string            
     4.65%  [kernel]          [k] _raw_spin_lock                            
     2.57%  [kernel]          [k] __netif_receive_skb_core                  
     2.33%  [kernel]          [k] tcp_ack               

Thanks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-04-25 18:37                         ` Eric Dumazet
@ 2015-05-05 13:00                           ` Thomas Gleixner
  2015-05-06 16:33                             ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2015-05-05 13:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel, LKML,
	Steven Miao, shashim, Frederic Weisbecker, Paul E. McKenney, cl

On Sat, 25 Apr 2015, Eric Dumazet wrote:
> On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:
> 
> > You definitely have a point from the high throughput networking
> > perspective.
> > 
> > Though in a power optimizing scenario with minimal network traffic
> > this might be the wrong decision. We have to gather data from the
> > power maniacs whether this matters or not. The FULL_NO_HZ camp might
> > be pretty unhappy about the above.
> 
> Sure, I understand.
> 
> 
> To make this clear, here the profile on a moderately loaded TCP server,
> pushing ~20Gbits of data. Most of TCP output is ACK clock driven (thus
> from softirq context).
> 
> (using regular sendmsg() system calls, that why the
> get_nohz_timer_target() is 'only' second in the profile, but add the
> find_next_bit() to it and this is very close being at first position)
> 
> 
> 
>    PerfTop:    4712 irqs/sec  kernel:96.7%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
> -----------------------------------------------------------------------------------------
>     10.16%  [kernel]          [k] copy_user_enhanced_fast_string            
>      5.66%  [kernel]          [k] get_nohz_timer_target                     
>      5.59%  [kernel]          [k] _raw_spin_lock                            
>      2.53%  [kernel]          [k] __netif_receive_skb_core                  
>      2.27%  [kernel]          [k] find_next_bit                             
>      1.90%  [kernel]          [k] tcp_ack                                   
> 
> Maybe a reasonable heuristic would be to
> change /proc/sys/kernel/timer_migration default to 0 on hosts with more
> than 32 cpus.
> 
> profile with timer_migration = 0
> 
>    PerfTop:    3656 irqs/sec  kernel:94.3%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
> -----------------------------------------------------------------------------------------
>     13.95%  [kernel]          [k] copy_user_enhanced_fast_string            
>      4.65%  [kernel]          [k] _raw_spin_lock                            
>      2.57%  [kernel]          [k] __netif_receive_skb_core                  
>      2.33%  [kernel]          [k] tcp_ack               

Is that with the static key patch applied?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer
  2015-05-05 13:00                           ` Thomas Gleixner
@ 2015-05-06 16:33                             ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2015-05-06 16:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, viresh kumar, Ingo Molnar, linaro-kernel, LKML,
	Steven Miao, shashim, Frederic Weisbecker, Paul E. McKenney, cl

On Tue, 2015-05-05 at 15:00 +0200, Thomas Gleixner wrote:
> On Sat, 25 Apr 2015, Eric Dumazet wrote:
> > On Thu, 2015-04-23 at 14:45 +0200, Thomas Gleixner wrote:
> > 
> > > You definitely have a point from the high throughput networking
> > > perspective.
> > > 
> > > Though in a power optimizing scenario with minimal network traffic
> > > this might be the wrong decision. We have to gather data from the
> > > power maniacs whether this matters or not. The FULL_NO_HZ camp might
> > > be pretty unhappy about the above.
> > 
> > Sure, I understand.
> > 
> > 
> > To make this clear, here the profile on a moderately loaded TCP server,
> > pushing ~20Gbits of data. Most of TCP output is ACK clock driven (thus
> > from softirq context).
> > 
> > (using regular sendmsg() system calls, that why the
> > get_nohz_timer_target() is 'only' second in the profile, but add the
> > find_next_bit() to it and this is very close being at first position)
> > 
> > 
> > 
> >    PerfTop:    4712 irqs/sec  kernel:96.7%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
> > -----------------------------------------------------------------------------------------
> >     10.16%  [kernel]          [k] copy_user_enhanced_fast_string            
> >      5.66%  [kernel]          [k] get_nohz_timer_target                     
> >      5.59%  [kernel]          [k] _raw_spin_lock                            
> >      2.53%  [kernel]          [k] __netif_receive_skb_core                  
> >      2.27%  [kernel]          [k] find_next_bit                             
> >      1.90%  [kernel]          [k] tcp_ack                                   
> > 
> > Maybe a reasonable heuristic would be to
> > change /proc/sys/kernel/timer_migration default to 0 on hosts with more
> > than 32 cpus.
> > 
> > profile with timer_migration = 0
> > 
> >    PerfTop:    3656 irqs/sec  kernel:94.3%  exact:  0.0% [4000Hz cycles],  (all, 72 CPUs)
> > -----------------------------------------------------------------------------------------
> >     13.95%  [kernel]          [k] copy_user_enhanced_fast_string            
> >      4.65%  [kernel]          [k] _raw_spin_lock                            
> >      2.57%  [kernel]          [k] __netif_receive_skb_core                  
> >      2.33%  [kernel]          [k] tcp_ack               
> 
> Is that with the static key patch applied?

This was without.

I applied your patch on current linus tree, but for some reason my 72
cpu host is not liking the resulting kernel. I had to ask for a repair,
and this might take a while.

Note your kernel works correctly on other hosts, but with 48 or 32 cpus,
so this must be something unrelated.

I'll let you know when I get more interesting data.

Thanks




^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-05-06 16:33 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-31  6:55 [PATCH 0/2] timer: Migrate running timers Viresh Kumar
2015-03-31  6:55 ` [PATCH 1/2] timer: Avoid waking up an idle-core by migrate running timer Viresh Kumar
2015-03-31 14:53   ` Peter Zijlstra
2015-04-14 23:13   ` Thomas Gleixner
2015-04-17  8:12     ` viresh kumar
2015-04-17  8:32       ` Ingo Molnar
2015-04-21 21:32       ` Thomas Gleixner
2015-04-21 21:54         ` Eric Dumazet
2015-04-22 15:29           ` Peter Zijlstra
2015-04-22 16:02             ` Eric Dumazet
2015-04-22 18:56               ` Thomas Gleixner
2015-04-22 19:59                 ` Eric Dumazet
2015-04-22 21:56                   ` Thomas Gleixner
2015-04-23  6:57                     ` Eric Dumazet
2015-04-23 12:45                       ` Thomas Gleixner
2015-04-25 18:37                         ` Eric Dumazet
2015-05-05 13:00                           ` Thomas Gleixner
2015-05-06 16:33                             ` Eric Dumazet
2015-04-15 15:54   ` Thomas Gleixner
2015-03-31  6:55 ` [PATCH 2/2] timer: Replace base-> 'running_timer' with 'busy' Viresh Kumar
2015-03-31 15:01 ` [PATCH 0/2] timer: Migrate running timers Viresh Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).