LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Chris Wright <chrisw@sous-sol.org>
To: linux-kernel@vger.kernel.org, stable@kernel.org,
	Greg KH <gregkh@suse.de>, Chris Wright <chrisw@kernel.org>,
	Adrian Bunk <bunk@stusta.de>
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	torvalds@osdl.org, akpm@osdl.org, alan@lxorguk.ukuu.org.uk,
	Ingo Molnar <mingo@elte.hu>
Subject: [patch 19/50] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Date: Fri, 05 Jan 2007 18:28:12 -0800	[thread overview]
Message-ID: <20070106023229.297246000@sous-sol.org> (raw)
In-Reply-To: <20070106022753.334962000@sous-sol.org>

[-- Attachment #1: sched-fix-bad-missed-wakeups-in-the-i386-x86_64-ia64-acpi-and-apm-idle-code.patch --]
[-- Type: text/plain, Size: 8173 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

From: Ingo Molnar <mingo@elte.hu>

Fernando Lopez-Lezcano reported frequent scheduling latencies and audio 
xruns starting at the 2.6.18-rt kernel, and those problems persisted all 
until current -rt kernels. The latencies were serious and unjustified by 
system load, often in the milliseconds range.

After a patient and heroic multi-month effort of Fernando, where he 
tested dozens of kernels, tried various configs, boot options, 
test-patches of mine and provided latency traces of those incidents, the 
following 'smoking gun' trace was captured by him:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
  IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
  ...
  <idle>-0     1...1   11us!: default_idle (cpu_idle)
  ...
  <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
  ...
   <...>-5856  0D..2  618us : __switch_to (__schedule)
   <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
   <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
   <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
   <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)

what is visible in this trace is that CPU#1 ran try_to_wake_up() for 
PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task() 
for CPU#0. But it decided to not send an IPI that no CPU - due to 
TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set, 
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The 
result was a 600+ usecs latency and a missed wakeup!

the bug turned out to be an idle-wakeup bug introduced into the mainline 
kernel this summer via an optimization in the x86_64 tree:

    commit 495ab9c045e1b0e5c82951b762257fe1c9d81564
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Jun 26 13:59:11 2006 +0200

    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status

    During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

the problem is this type of change:

        if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
-               clear_thread_flag(TIF_POLLING_NRFLAG);
+               current_thread_info()->status &= ~TS_POLLING;
                smp_mb__after_clear_bit();
                while (!need_resched()) {
                        local_irq_disable();

this changes clear_thread_flag() to an explicit clearing of TS_POLLING. 
clear_thread_flag() is defined as:

        clear_bit(flag, &ti->flags);

and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:

  static inline void clear_bit(int nr, volatile unsigned long * addr)
  {
          __asm__ __volatile__( LOCK_PREFIX
                  "btrl %1,%0"

hence smp_mb__after_clear_bit() is defined as a simple compile barrier:

  #define smp_mb__after_clear_bit()       barrier()

but the explicit TS_POLLING clearing introduced by the patch:

+               current_thread_info()->status &= ~TS_POLLING;

is not an atomic op! So the clearing of the TS_POLLING bit is freely 
reorderable with the reading of the NEED_RESCHED bit - and both now 
reside in different memory addresses.

CPU idle wakeup very much depends on ordered memory ops, the clearing of 
the TS_POLLING flag must always be done before we test need_resched() 
and hit the idle instruction(s). [Symmetrically, the wakeup code needs 
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory 
ordering is paramount.]

Fernando's dual-core Athlon64 system has a sufficiently advanced memory 
ordering model so that it triggered this scenario very often.

( And it also turned out that the reason why these latencies never
  triggered on my testsystems is that i routinely use idle=poll, which
  was the only idle variant not affected by this bug. )

The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to 
act as an absolute barrier between the TS_POLLING write and the 
NEED_RESCHED read. This affects almost all idling methods (default, 
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
[chrisw: backport to 2.6.19.1]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 arch/i386/kernel/apm.c        |    6 +++++-
 arch/i386/kernel/process.c    |    7 ++++++-
 arch/ia64/kernel/process.c    |   10 ++++++++--
 arch/x86_64/kernel/process.c  |    6 +++++-
 drivers/acpi/processor_idle.c |   12 ++++++++++--
 5 files changed, 34 insertions(+), 7 deletions(-)

--- linux-2.6.19.1.orig/arch/i386/kernel/apm.c
+++ linux-2.6.19.1/arch/i386/kernel/apm.c
@@ -784,7 +784,11 @@ static int apm_do_idle(void)
 	polling = !!(current_thread_info()->status & TS_POLLING);
 	if (polling) {
 		current_thread_info()->status &= ~TS_POLLING;
-		smp_mb__after_clear_bit();
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
 	}
 	if (!need_resched()) {
 		idled = 1;
--- linux-2.6.19.1.orig/arch/i386/kernel/process.c
+++ linux-2.6.19.1/arch/i386/kernel/process.c
@@ -103,7 +103,12 @@ void default_idle(void)
 
 	if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
 		current_thread_info()->status &= ~TS_POLLING;
-		smp_mb__after_clear_bit();
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
+
 		while (!need_resched()) {
 			local_irq_disable();
 			if (!need_resched())
--- linux-2.6.19.1.orig/arch/ia64/kernel/process.c
+++ linux-2.6.19.1/arch/ia64/kernel/process.c
@@ -268,10 +268,16 @@ cpu_idle (void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		if (can_do_pal_halt)
+		if (can_do_pal_halt) {
 			current_thread_info()->status &= ~TS_POLLING;
-		else
+			/*
+			 * TS_POLLING-cleared state must be visible before we
+			 * test NEED_RESCHED:
+			 */
+			smp_mb();
+		} else {
 			current_thread_info()->status |= TS_POLLING;
+		}
 
 		if (!need_resched()) {
 			void (*idle)(void);
--- linux-2.6.19.1.orig/arch/x86_64/kernel/process.c
+++ linux-2.6.19.1/arch/x86_64/kernel/process.c
@@ -111,7 +111,11 @@ static void default_idle(void)
 	local_irq_enable();
 
 	current_thread_info()->status &= ~TS_POLLING;
-	smp_mb__after_clear_bit();
+	/*
+	 * TS_POLLING-cleared state must be visible before we
+	 * test NEED_RESCHED:
+	 */
+	smp_mb();
 	while (!need_resched()) {
 		local_irq_disable();
 		if (!need_resched())
--- linux-2.6.19.1.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.19.1/drivers/acpi/processor_idle.c
@@ -211,7 +211,11 @@ acpi_processor_power_activate(struct acp
 static void acpi_safe_halt(void)
 {
 	current_thread_info()->status &= ~TS_POLLING;
-	smp_mb__after_clear_bit();
+	/*
+	 * TS_POLLING-cleared state must be visible before we
+	 * test NEED_RESCHED:
+	 */
+	smp_mb();
 	if (!need_resched())
 		safe_halt();
 	current_thread_info()->status |= TS_POLLING;
@@ -345,7 +349,11 @@ static void acpi_processor_idle(void)
 	 */
 	if (cx->type == ACPI_STATE_C2 || cx->type == ACPI_STATE_C3) {
 		current_thread_info()->status &= ~TS_POLLING;
-		smp_mb__after_clear_bit();
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
 		if (need_resched()) {
 			current_thread_info()->status |= TS_POLLING;
 			local_irq_enable();

--

  parent reply	other threads:[~2007-01-06  2:31 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-06  2:27 [patch 00/50] -stable review Chris Wright
2007-01-06  2:27 ` [patch 01/50] dm-crypt: Select CRYPTO_CBC Chris Wright
2007-01-06  2:27 ` [patch 02/50] sha512: Fix sha384 block size Chris Wright
2007-01-06  2:27 ` [patch 03/50] read_zero_pagealigned() locking fix Chris Wright
2007-01-06  2:27 ` [patch 04/50] ieee80211softmac: Fix mutex_lock at exit of ieee80211_softmac_get_genie Chris Wright
2007-01-08 15:24   ` John W. Linville
2007-01-06  2:27 ` [patch 05/50] x86-64: Mark rdtsc as sync only for netburst, not for core2 Chris Wright
2007-01-06  2:27 ` [patch 06/50] bonding: incorrect bonding state reported via ioctl Chris Wright
2007-01-06  2:28 ` [patch 07/50] DVB: lgdt330x: fix signal / lock status detection bug Chris Wright
2007-01-06  2:28 ` [patch 08/50] V4L: Fix broken TUNER_LG_NTSC_TAPE radio support Chris Wright
2007-01-06  2:28 ` [patch 09/50] Revert "[PATCH] zd1211rw: Removed unneeded packed attributes" Chris Wright
2007-01-08 15:32   ` John W. Linville
2007-01-06  2:28 ` [patch 10/50] libata: handle 0xff status properly Chris Wright
2007-01-06  2:28 ` [patch 11/50] ieee1394: ohci1394: add PPC_PMAC platform code to driver probe Chris Wright
2007-01-06  2:28 ` [patch 12/50] kbuild: dont put temp files in source Chris Wright
2007-01-06  2:28 ` [patch 13/50] ARM: Add sys_*at syscalls Chris Wright
2007-01-06  2:28 ` [patch 14/50] sched: remove __cpuinitdata anotation to cpu_isolated_map Chris Wright
2007-01-06  2:28 ` [patch 15/50] IB/srp: Fix FMR mapping for 32-bit kernels and addresses above 4G Chris Wright
2007-01-06  2:28 ` [patch 16/50] SCSI: add missing cdb clearing in scsi_execute() Chris Wright
2007-01-06  2:28 ` [patch 17/50] Bluetooth: Add packet size checks for CAPI messages (CVE-2006-6106) Chris Wright
2007-01-06  2:28 ` [patch 18/50] i2c: fix broken ds1337 initialization Chris Wright
2007-01-06  2:28 ` Chris Wright [this message]
2007-01-06  2:28 ` [patch 20/50] Fix for shmem_truncate_range() BUG_ON() Chris Wright
2007-01-06  2:28 ` [patch 21/50] smc911x: fix netpoll compilation faliure Chris Wright
2007-01-06  2:28 ` [patch 22/50] fix aoe without scatter-gather [Bug 7662] Chris Wright
2007-01-06  2:28 ` [patch 23/50] UDP: Fix reversed logic in udp_get_port() Chris Wright
2007-01-06  2:28 ` [patch 24/50] cciss: fix XFER_READ/XFER_WRITE in do_cciss_request Chris Wright
2007-01-06  2:28 ` [patch 25/50] [stable] [stable patch] i386: CPU hotplug broken with 2GB VMSPLIT Chris Wright
2007-01-06  2:28 ` [patch 26/50] ramfs breaks without CONFIG_BLOCK Chris Wright
2007-01-06  2:28 ` [patch 27/50] Buglet in vmscan.c Chris Wright
2007-01-06  2:28 ` [patch 28/50] softmac: Fixed handling of deassociation from AP Chris Wright
2007-01-08 15:14   ` John W. Linville
2007-01-06  2:28 ` [patch 29/50] zd1211rw: Call ieee80211_rx in tasklet Chris Wright
2007-01-08 15:20   ` John W. Linville
2007-01-06  2:28 ` [patch 30/50] handle ext3 directory corruption better (CVE-2006-6053) Chris Wright
2007-01-06  2:28 ` [patch 31/50] corrupted cramfs filesystems cause kernel oops (CVE-2006-5823) Chris Wright
2007-01-06  2:28 ` [patch 32/50] ext2: skip pages past number of blocks in ext2_find_entry (CVE-2006-6054) Chris Wright
2007-01-06  2:28 ` [patch 33/50] PKTGEN: Fix module load/unload races Chris Wright
2007-01-06  2:28 ` [patch 34/50] SPARC64: Fix "mem=xxx" handling Chris Wright
2007-01-06  2:28 ` [patch 35/50] SPARC64: Handle ISA devices with no regs property Chris Wright
2007-01-06  2:28 ` [patch 36/50] NET: Dont export linux/random.h outside __KERNEL__ Chris Wright
2007-01-06  2:28 ` [patch 37/50] sparc32: add offset in pci_map_sg() Chris Wright
2007-01-06  2:28 ` [patch 38/50] VM: Fix nasty and subtle race in shared mmaped page writeback Chris Wright
2007-01-06  2:28 ` [patch 39/50] V4L: cx2341x: audio_properties is an u16, not u8 Chris Wright
2007-01-06  2:28 ` [patch 40/50] dvb-core: fix bug in CRC-32 checking on 64-bit systems Chris Wright
2007-01-06  2:28 ` [patch 41/50] V4L: cx88: Fix leadtek_eeprom tagging Chris Wright
2007-01-06  2:28 ` [patch 42/50] ebtables: dont compute gap before checking struct type Chris Wright
2007-01-06  2:28 ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Chris Wright
2007-01-06  2:28 ` [patch 44/50] SOUND: Sparc CS4231: Use 64 for period_bytes_min Chris Wright
2007-01-06 14:20   ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Jan Engelhardt
2007-01-06  2:28 ` [patch 45/50] IPV4/IPV6: Fix inet{,6} device initialization order Chris Wright
2007-01-06  2:28 ` [patch 46/50] asix: Fix typo for AX88772 PHY Selection Chris Wright
2007-01-06  2:28 ` [patch 47/50] NetLabel: correctly fill in unused CIPSOv4 level and category mappings Chris Wright
2007-01-06  2:28 ` [patch 48/50] connector: some fixes for ia64 unaligned access errors Chris Wright
2007-01-06  2:28 ` [patch 49/50] fix OOM killing of swapoff Chris Wright
2007-01-06  2:28 ` [patch 50/50] Fix incorrect user space access locking in mincore() (CVE-2006-4814) Chris Wright
2007-01-06  2:37 ` [patch 00/50] -stable review Chris Wright
2007-02-04 18:24 ` Michael Krufky
2007-02-05 21:43   ` [stable] " Greg KH
2007-02-05 22:12     ` Chris Wright
2007-02-05 22:18       ` Greg KH
2007-02-05 22:16     ` Michael Krufky
2007-02-05 22:33       ` Greg KH
2007-02-07 21:08         ` Michael Krufky
2007-02-09 19:41           ` Michael Krufky
2007-02-09 19:51             ` Greg KH
2007-02-20 23:12               ` Greg KH
2007-02-21  2:29                 ` Michael Krufky
2007-02-21 21:01                   ` [v4l-dvb-maintainer] " Michael Krufky
2007-02-26 14:35                     ` Adrian Bunk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070106023229.297246000@sous-sol.org \
    --to=chrisw@sous-sol.org \
    --cc=akpm@osdl.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=bunk@stusta.de \
    --cc=chrisw@kernel.org \
    --cc=chuckw@quantumlinux.com \
    --cc=davej@redhat.com \
    --cc=gregkh@suse.de \
    --cc=jmforbes@linuxtx.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mkrufky@linuxtv.org \
    --cc=rdunlap@xenotime.net \
    --cc=reviews@ml.cw.f00f.org \
    --cc=stable@kernel.org \
    --cc=torvalds@osdl.org \
    --cc=tytso@mit.edu \
    --cc=zwane@arm.linux.org.uk \
    --subject='Re: [patch 19/50] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).