LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Chris Wright <chrisw@sous-sol.org>
To: linux-kernel@vger.kernel.org, stable@kernel.org,
Greg KH <gregkh@suse.de>, Chris Wright <chrisw@kernel.org>,
Adrian Bunk <bunk@stusta.de>
Cc: Justin Forbes <jmforbes@linuxtx.org>,
Zwane Mwaikambo <zwane@arm.linux.org.uk>,
"Theodore Ts'o" <tytso@mit.edu>,
Randy Dunlap <rdunlap@xenotime.net>,
Dave Jones <davej@redhat.com>,
Chuck Wolber <chuckw@quantumlinux.com>,
Chris Wedgwood <reviews@ml.cw.f00f.org>,
Michael Krufky <mkrufky@linuxtv.org>,
torvalds@osdl.org, akpm@osdl.org, alan@lxorguk.ukuu.org.uk,
Ingo Molnar <mingo@elte.hu>
Subject: [patch 19/50] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Date: Fri, 05 Jan 2007 18:28:12 -0800 [thread overview]
Message-ID: <20070106023229.297246000@sous-sol.org> (raw)
In-Reply-To: <20070106022753.334962000@sous-sol.org>
[-- Attachment #1: sched-fix-bad-missed-wakeups-in-the-i386-x86_64-ia64-acpi-and-apm-idle-code.patch --]
[-- Type: text/plain, Size: 8173 bytes --]
-stable review patch. If anyone has any objections, please let us know.
------------------
From: Ingo Molnar <mingo@elte.hu>
Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
xruns starting at the 2.6.18-rt kernel, and those problems persisted all
until current -rt kernels. The latencies were serious and unjustified by
system load, often in the milliseconds range.
After a patient and heroic multi-month effort of Fernando, where he
tested dozens of kernels, tried various configs, boot options,
test-patches of mine and provided latency traces of those incidents, the
following 'smoking gun' trace was captured by him:
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
IRQ_19-1479 1D..1 0us : __trace_start_sched_wakeup (try_to_wake_up)
IRQ_19-1479 1D..1 0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
IRQ_19-1479 1D..1 0us : __trace_start_sched_wakeup (c01262ba 0 0)
IRQ_19-1479 1D..1 0us : resched_task (try_to_wake_up)
IRQ_19-1479 1D..1 0us : __spin_unlock_irqrestore (try_to_wake_up)
...
<idle>-0 1...1 11us!: default_idle (cpu_idle)
...
<idle>-0 0Dn.1 602us : smp_apic_timer_interrupt (c0103baf 1 0)
...
<...>-5856 0D..2 618us : __switch_to (__schedule)
<...>-5856 0D..2 618us : __schedule <<idle>-0> (20 162)
<...>-5856 0D..2 619us : __spin_unlock_irq (__schedule)
<...>-5856 0...1 619us : trace_stop_sched_switched (__schedule)
<...>-5856 0D..1 619us : trace_stop_sched_switched <<...>-5856> (37 0)
what is visible in this trace is that CPU#1 ran try_to_wake_up() for
PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task()
for CPU#0. But it decided to not send an IPI that no CPU - due to
TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set,
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
result was a 600+ usecs latency and a missed wakeup!
the bug turned out to be an idle-wakeup bug introduced into the mainline
kernel this summer via an optimization in the x86_64 tree:
commit 495ab9c045e1b0e5c82951b762257fe1c9d81564
Author: Andi Kleen <ak@suse.de>
Date: Mon Jun 26 13:59:11 2006 +0200
[PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.
the problem is this type of change:
if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
- clear_thread_flag(TIF_POLLING_NRFLAG);
+ current_thread_info()->status &= ~TS_POLLING;
smp_mb__after_clear_bit();
while (!need_resched()) {
local_irq_disable();
this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
clear_thread_flag() is defined as:
clear_bit(flag, &ti->flags);
and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:
static inline void clear_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__( LOCK_PREFIX
"btrl %1,%0"
hence smp_mb__after_clear_bit() is defined as a simple compile barrier:
#define smp_mb__after_clear_bit() barrier()
but the explicit TS_POLLING clearing introduced by the patch:
+ current_thread_info()->status &= ~TS_POLLING;
is not an atomic op! So the clearing of the TS_POLLING bit is freely
reorderable with the reading of the NEED_RESCHED bit - and both now
reside in different memory addresses.
CPU idle wakeup very much depends on ordered memory ops, the clearing of
the TS_POLLING flag must always be done before we test need_resched()
and hit the idle instruction(s). [Symmetrically, the wakeup code needs
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
ordering is paramount.]
Fernando's dual-core Athlon64 system has a sufficiently advanced memory
ordering model so that it triggered this scenario very often.
( And it also turned out that the reason why these latencies never
triggered on my testsystems is that i routinely use idle=poll, which
was the only idle variant not affected by this bug. )
The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
act as an absolute barrier between the TS_POLLING write and the
NEED_RESCHED read. This affects almost all idling methods (default,
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
[chrisw: backport to 2.6.19.1]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
arch/i386/kernel/apm.c | 6 +++++-
arch/i386/kernel/process.c | 7 ++++++-
arch/ia64/kernel/process.c | 10 ++++++++--
arch/x86_64/kernel/process.c | 6 +++++-
drivers/acpi/processor_idle.c | 12 ++++++++++--
5 files changed, 34 insertions(+), 7 deletions(-)
--- linux-2.6.19.1.orig/arch/i386/kernel/apm.c
+++ linux-2.6.19.1/arch/i386/kernel/apm.c
@@ -784,7 +784,11 @@ static int apm_do_idle(void)
polling = !!(current_thread_info()->status & TS_POLLING);
if (polling) {
current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
}
if (!need_resched()) {
idled = 1;
--- linux-2.6.19.1.orig/arch/i386/kernel/process.c
+++ linux-2.6.19.1/arch/i386/kernel/process.c
@@ -103,7 +103,12 @@ void default_idle(void)
if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
+
while (!need_resched()) {
local_irq_disable();
if (!need_resched())
--- linux-2.6.19.1.orig/arch/ia64/kernel/process.c
+++ linux-2.6.19.1/arch/ia64/kernel/process.c
@@ -268,10 +268,16 @@ cpu_idle (void)
/* endless idle loop with no priority at all */
while (1) {
- if (can_do_pal_halt)
+ if (can_do_pal_halt) {
current_thread_info()->status &= ~TS_POLLING;
- else
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
+ } else {
current_thread_info()->status |= TS_POLLING;
+ }
if (!need_resched()) {
void (*idle)(void);
--- linux-2.6.19.1.orig/arch/x86_64/kernel/process.c
+++ linux-2.6.19.1/arch/x86_64/kernel/process.c
@@ -111,7 +111,11 @@ static void default_idle(void)
local_irq_enable();
current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
while (!need_resched()) {
local_irq_disable();
if (!need_resched())
--- linux-2.6.19.1.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.19.1/drivers/acpi/processor_idle.c
@@ -211,7 +211,11 @@ acpi_processor_power_activate(struct acp
static void acpi_safe_halt(void)
{
current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
if (!need_resched())
safe_halt();
current_thread_info()->status |= TS_POLLING;
@@ -345,7 +349,11 @@ static void acpi_processor_idle(void)
*/
if (cx->type == ACPI_STATE_C2 || cx->type == ACPI_STATE_C3) {
current_thread_info()->status &= ~TS_POLLING;
- smp_mb__after_clear_bit();
+ /*
+ * TS_POLLING-cleared state must be visible before we
+ * test NEED_RESCHED:
+ */
+ smp_mb();
if (need_resched()) {
current_thread_info()->status |= TS_POLLING;
local_irq_enable();
--
next prev parent reply other threads:[~2007-01-06 2:31 UTC|newest]
Thread overview: 70+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-01-06 2:27 [patch 00/50] -stable review Chris Wright
2007-01-06 2:27 ` [patch 01/50] dm-crypt: Select CRYPTO_CBC Chris Wright
2007-01-06 2:27 ` [patch 02/50] sha512: Fix sha384 block size Chris Wright
2007-01-06 2:27 ` [patch 03/50] read_zero_pagealigned() locking fix Chris Wright
2007-01-06 2:27 ` [patch 04/50] ieee80211softmac: Fix mutex_lock at exit of ieee80211_softmac_get_genie Chris Wright
2007-01-08 15:24 ` John W. Linville
2007-01-06 2:27 ` [patch 05/50] x86-64: Mark rdtsc as sync only for netburst, not for core2 Chris Wright
2007-01-06 2:27 ` [patch 06/50] bonding: incorrect bonding state reported via ioctl Chris Wright
2007-01-06 2:28 ` [patch 07/50] DVB: lgdt330x: fix signal / lock status detection bug Chris Wright
2007-01-06 2:28 ` [patch 08/50] V4L: Fix broken TUNER_LG_NTSC_TAPE radio support Chris Wright
2007-01-06 2:28 ` [patch 09/50] Revert "[PATCH] zd1211rw: Removed unneeded packed attributes" Chris Wright
2007-01-08 15:32 ` John W. Linville
2007-01-06 2:28 ` [patch 10/50] libata: handle 0xff status properly Chris Wright
2007-01-06 2:28 ` [patch 11/50] ieee1394: ohci1394: add PPC_PMAC platform code to driver probe Chris Wright
2007-01-06 2:28 ` [patch 12/50] kbuild: dont put temp files in source Chris Wright
2007-01-06 2:28 ` [patch 13/50] ARM: Add sys_*at syscalls Chris Wright
2007-01-06 2:28 ` [patch 14/50] sched: remove __cpuinitdata anotation to cpu_isolated_map Chris Wright
2007-01-06 2:28 ` [patch 15/50] IB/srp: Fix FMR mapping for 32-bit kernels and addresses above 4G Chris Wright
2007-01-06 2:28 ` [patch 16/50] SCSI: add missing cdb clearing in scsi_execute() Chris Wright
2007-01-06 2:28 ` [patch 17/50] Bluetooth: Add packet size checks for CAPI messages (CVE-2006-6106) Chris Wright
2007-01-06 2:28 ` [patch 18/50] i2c: fix broken ds1337 initialization Chris Wright
2007-01-06 2:28 ` Chris Wright [this message]
2007-01-06 2:28 ` [patch 20/50] Fix for shmem_truncate_range() BUG_ON() Chris Wright
2007-01-06 2:28 ` [patch 21/50] smc911x: fix netpoll compilation faliure Chris Wright
2007-01-06 2:28 ` [patch 22/50] fix aoe without scatter-gather [Bug 7662] Chris Wright
2007-01-06 2:28 ` [patch 23/50] UDP: Fix reversed logic in udp_get_port() Chris Wright
2007-01-06 2:28 ` [patch 24/50] cciss: fix XFER_READ/XFER_WRITE in do_cciss_request Chris Wright
2007-01-06 2:28 ` [patch 25/50] [stable] [stable patch] i386: CPU hotplug broken with 2GB VMSPLIT Chris Wright
2007-01-06 2:28 ` [patch 26/50] ramfs breaks without CONFIG_BLOCK Chris Wright
2007-01-06 2:28 ` [patch 27/50] Buglet in vmscan.c Chris Wright
2007-01-06 2:28 ` [patch 28/50] softmac: Fixed handling of deassociation from AP Chris Wright
2007-01-08 15:14 ` John W. Linville
2007-01-06 2:28 ` [patch 29/50] zd1211rw: Call ieee80211_rx in tasklet Chris Wright
2007-01-08 15:20 ` John W. Linville
2007-01-06 2:28 ` [patch 30/50] handle ext3 directory corruption better (CVE-2006-6053) Chris Wright
2007-01-06 2:28 ` [patch 31/50] corrupted cramfs filesystems cause kernel oops (CVE-2006-5823) Chris Wright
2007-01-06 2:28 ` [patch 32/50] ext2: skip pages past number of blocks in ext2_find_entry (CVE-2006-6054) Chris Wright
2007-01-06 2:28 ` [patch 33/50] PKTGEN: Fix module load/unload races Chris Wright
2007-01-06 2:28 ` [patch 34/50] SPARC64: Fix "mem=xxx" handling Chris Wright
2007-01-06 2:28 ` [patch 35/50] SPARC64: Handle ISA devices with no regs property Chris Wright
2007-01-06 2:28 ` [patch 36/50] NET: Dont export linux/random.h outside __KERNEL__ Chris Wright
2007-01-06 2:28 ` [patch 37/50] sparc32: add offset in pci_map_sg() Chris Wright
2007-01-06 2:28 ` [patch 38/50] VM: Fix nasty and subtle race in shared mmaped page writeback Chris Wright
2007-01-06 2:28 ` [patch 39/50] V4L: cx2341x: audio_properties is an u16, not u8 Chris Wright
2007-01-06 2:28 ` [patch 40/50] dvb-core: fix bug in CRC-32 checking on 64-bit systems Chris Wright
2007-01-06 2:28 ` [patch 41/50] V4L: cx88: Fix leadtek_eeprom tagging Chris Wright
2007-01-06 2:28 ` [patch 42/50] ebtables: dont compute gap before checking struct type Chris Wright
2007-01-06 2:28 ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Chris Wright
2007-01-06 2:28 ` [patch 44/50] SOUND: Sparc CS4231: Use 64 for period_bytes_min Chris Wright
2007-01-06 14:20 ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Jan Engelhardt
2007-01-06 2:28 ` [patch 45/50] IPV4/IPV6: Fix inet{,6} device initialization order Chris Wright
2007-01-06 2:28 ` [patch 46/50] asix: Fix typo for AX88772 PHY Selection Chris Wright
2007-01-06 2:28 ` [patch 47/50] NetLabel: correctly fill in unused CIPSOv4 level and category mappings Chris Wright
2007-01-06 2:28 ` [patch 48/50] connector: some fixes for ia64 unaligned access errors Chris Wright
2007-01-06 2:28 ` [patch 49/50] fix OOM killing of swapoff Chris Wright
2007-01-06 2:28 ` [patch 50/50] Fix incorrect user space access locking in mincore() (CVE-2006-4814) Chris Wright
2007-01-06 2:37 ` [patch 00/50] -stable review Chris Wright
2007-02-04 18:24 ` Michael Krufky
2007-02-05 21:43 ` [stable] " Greg KH
2007-02-05 22:12 ` Chris Wright
2007-02-05 22:18 ` Greg KH
2007-02-05 22:16 ` Michael Krufky
2007-02-05 22:33 ` Greg KH
2007-02-07 21:08 ` Michael Krufky
2007-02-09 19:41 ` Michael Krufky
2007-02-09 19:51 ` Greg KH
2007-02-20 23:12 ` Greg KH
2007-02-21 2:29 ` Michael Krufky
2007-02-21 21:01 ` [v4l-dvb-maintainer] " Michael Krufky
2007-02-26 14:35 ` Adrian Bunk
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070106023229.297246000@sous-sol.org \
--to=chrisw@sous-sol.org \
--cc=akpm@osdl.org \
--cc=alan@lxorguk.ukuu.org.uk \
--cc=bunk@stusta.de \
--cc=chrisw@kernel.org \
--cc=chuckw@quantumlinux.com \
--cc=davej@redhat.com \
--cc=gregkh@suse.de \
--cc=jmforbes@linuxtx.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=mkrufky@linuxtv.org \
--cc=rdunlap@xenotime.net \
--cc=reviews@ml.cw.f00f.org \
--cc=stable@kernel.org \
--cc=torvalds@osdl.org \
--cc=tytso@mit.edu \
--cc=zwane@arm.linux.org.uk \
--subject='Re: [patch 19/50] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).