LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [BUG] long freezes on thinkpad t60
@ 2007-05-24 12:04 Miklos Szeredi
  2007-05-24 12:54 ` Ingo Molnar
  2007-05-24 22:08 ` Henrique de Moraes Holschuh
  0 siblings, 2 replies; 88+ messages in thread
From: Miklos Szeredi @ 2007-05-24 12:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, linux-acpi

On some strange workload involving strace and fuse I get ocasional
long periods (10-100s) of total unresponsiveness, not even SysRq-*
working.  Then the machine continues as normal.  Nothing in dmesg,
absolutely no indication about what is happening.

Tried nmi_watchdog=1, but then the machine locks up hard shortly after
boot.

Tried nmi_watchdog=1 acpi=off, then I can't reproduce the problem.

Tried 2.6.22-rc2 and 2.6.21, both with the same result.

.config and dmesg attached.

Any ideas?  Possibly something ACPI related?

Thanks,
Miklos

Linux version 2.6.22-rc2 (mszeredi@tucsk) (gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)) #3 SMP Tue May 22 17:55:42 CEST 2007
Command line: root=/dev/sda2
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
 BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fed0000 (usable)
 BIOS-e820: 000000003fed0000 - 000000003fedf000 (ACPI data)
 BIOS-e820: 000000003fedf000 - 000000003ff00000 (ACPI NVS)
 BIOS-e820: 000000003ff00000 - 0000000040000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
 BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
 BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 261840) 1 entries of 256 used
end_pfn_map = 1048576
DMI present.
ACPI: RSDP 000F67D0, 0024 (r2 LENOVO)
ACPI: XSDT 3FED1308, 008C (r1 LENOVO TP-79        2110  LTP        0)
ACPI: FACP 3FED1400, 00F4 (r3 LENOVO TP-79        2110 LNVO        1)
ACPI Warning (tbfadt-0434): Optional field "Gpe1Block" has zero address or length: 000000000000102C/0 [20070126]
ACPI: DSDT 3FED175E, D481 (r1 LENOVO TP-79        2110 MSFT  100000E)
ACPI: FACS 3FEF4000, 0040
ACPI: SSDT 3FED15B4, 01AA (r1 LENOVO TP-79        2110 MSFT  100000E)
ACPI: ECDT 3FEDEBDF, 0052 (r1 LENOVO TP-79        2110 LNVO        1)
ACPI: TCPA 3FEDEC31, 0032 (r2 LENOVO TP-79        2110 LNVO        1)
ACPI: APIC 3FEDEC63, 0068 (r1 LENOVO TP-79        2110 LNVO        1)
ACPI: MCFG 3FEDECCB, 003C (r1 LENOVO TP-79        2110 LNVO        1)
ACPI: HPET 3FEDED07, 0038 (r1 LENOVO TP-79        2110 LNVO        1)
ACPI: SLIC 3FEDEE62, 0176 (r1 LENOVO TP-79        2110  LTP        0)
ACPI: BOOT 3FEDEFD8, 0028 (r1 LENOVO TP-79        2110  LTP        1)
ACPI: SSDT 3FEF2655, 025F (r1 LENOVO TP-79        2110 INTL 20050513)
ACPI: SSDT 3FEF28B4, 00A6 (r1 LENOVO TP-79        2110 INTL 20050513)
ACPI: SSDT 3FEF295A, 04F7 (r1 LENOVO TP-79        2110 INTL 20050513)
ACPI: SSDT 3FEF2E51, 01D8 (r1 LENOVO TP-79        2110 INTL 20050513)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 261840) 1 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
early_node_map[2] active PFN ranges
    0:        0 ->      159
    0:      256 ->   261840
On node 0 totalpages: 261743
  DMA zone: 56 pages used for memmap
  DMA zone: 1115 pages reserved
  DMA zone: 2828 pages, LIFO batch:0
  DMA32 zone: 3523 pages used for memmap
  DMA32 zone: 254221 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 1, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
ACPI: HPET id: 0x8086a201 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
swsusp: Registered nosave memory region: 000000000009f000 - 00000000000a0000
swsusp: Registered nosave memory region: 00000000000a0000 - 00000000000d2000
swsusp: Registered nosave memory region: 00000000000d2000 - 00000000000d4000
swsusp: Registered nosave memory region: 00000000000d4000 - 00000000000dc000
swsusp: Registered nosave memory region: 00000000000dc000 - 0000000000100000
Allocating PCI resources starting at 50000000 (gap: 40000000:b0000000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
PERCPU: Allocating 32616 bytes of per cpu data
Built 1 zonelists.  Total pages: 257049
Kernel command line: root=/dev/sda2
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Extended CMOS year: 2000
Marking TSC unstable due to TSCs unsynchronized
time.c: Detected 1828.747 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
Checking aperture...
Memory: 1026060k/1047360k available (2254k kernel code, 20612k reserved, 1395k data, 184k init)
Calibrating delay using timer specific routine.. 3662.19 BogoMIPS (lpj=7324383)
Mount-cache hash table entries: 256
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring enabled (TM2)
SMP alternatives: switching to UP code
ACPI: Core revision 20070126
Using local APIC timer interrupts.
result 10390601
Detected 10.390 MHz APIC timer.
SMP alternatives: switching to SMP code
Booting processor 1/2 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 3657.64 BogoMIPS (lpj=7315290)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU1: Thermal monitoring enabled (TM2)
Intel(R) Core(TM)2 CPU         T5600  @ 1.83GHz stepping 06
Brought up 2 CPUs
migration_cost=28
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Interpreter enabled
ACPI: (supports S0 S3 S4 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
PCI quirk: region 1000-107f claimed by ICH6 ACPI/GPIO/TCO
PCI quirk: region 1180-11bf claimed by ICH6 GPIO
PCI: Transparent bridge - 0000:00:1e.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP2._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP3._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 *11)
ACPI: Power Resource [PUBS] (on)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp: PnP ACPI: found 12 devices
ACPI: ACPI bus type pnp unregistered
SCSI subsystem initialized
libata version 2.20 loaded.
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
BUG: at mm/slab.c:777 __find_general_cachep()

Call Trace:
 [<ffffffff8026fe8b>] __kmalloc+0x3e/0xbe
 [<ffffffff8021c337>] cache_k8_northbridges+0x7f/0xf0
 [<ffffffff805b2e7d>] gart_iommu_init+0x13/0x4f4
 [<ffffffff80227ce1>] __wake_up+0x38/0x4f
 [<ffffffff803ec306>] genl_rcv+0x0/0x59
 [<ffffffff803eadaf>] netlink_kernel_create+0x12c/0x156
 [<ffffffff8042fda3>] mutex_lock+0xd/0x1e
 [<ffffffff805aed1b>] pci_iommu_init+0x9/0x12
 [<ffffffff805ac600>] kernel_init+0x167/0x2d1
 [<ffffffff8020a3e8>] child_rip+0xa/0x12
 [<ffffffff802f4538>] acpi_ds_init_one_object+0x0/0x7c
 [<ffffffff805ac499>] kernel_init+0x0/0x2d1
 [<ffffffff8020a3de>] child_rip+0x0/0x12

PCI-GART: No AMD northbridge found.
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 64-bit timers, 14318180 Hz
pnp: 00:00: iomem range 0x0-0x9ffff could not be reserved
pnp: 00:00: iomem range 0xc0000-0xc3fff has been reserved
Time: hpet clocksource has been installed.
pnp: 00:00: iomem range 0xc4000-0xc7fff has been reserved
pnp: 00:00: iomem range 0xc8000-0xcbfff has been reserved
pnp: 00:02: iomem range 0xf0000000-0xf3ffffff could not be reserved
pnp: 00:02: iomem range 0xfed1c000-0xfed1ffff could not be reserved
pnp: 00:02: iomem range 0xfed14000-0xfed17fff could not be reserved
pnp: 00:02: iomem range 0xfed18000-0xfed18fff could not be reserved
PCI: Bridge: 0000:00:01.0
  IO window: 2000-2fff
  MEM window: ee100000-ee1fffff
  PREFETCH window: d8000000-dfffffff
PCI: Bridge: 0000:00:1c.0
  IO window: 3000-3fff
  MEM window: ee000000-ee0fffff
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:1c.1
  IO window: 4000-5fff
  MEM window: ec000000-edffffff
  PREFETCH window: e4000000-e40fffff
PCI: Bridge: 0000:00:1c.2
  IO window: 6000-7fff
  MEM window: e8000000-e9ffffff
  PREFETCH window: e4100000-e41fffff
PCI: Bridge: 0000:00:1c.3
  IO window: 8000-9fff
  MEM window: ea000000-ebffffff
  PREFETCH window: e4200000-e42fffff
PCI: Bus 22, cardbus bridge: 0000:15:00.0
  IO window: 0000a000-0000a0ff
  IO window: 0000a400-0000a4ff
  PREFETCH window: e0000000-e3ffffff
  MEM window: 50000000-53ffffff
PCI: Bridge: 0000:00:1e.0
  IO window: a000-dfff
  MEM window: e4300000-e7ffffff
  PREFETCH window: e0000000-e3ffffff
ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:01.0 to 64
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 20
PCI: Setting latency timer of device 0000:00:1c.0 to 64
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 21 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1c.1 to 64
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:1c.2 to 64
ACPI: PCI Interrupt 0000:00:1c.3[D] -> GSI 23 (level, low) -> IRQ 23
PCI: Setting latency timer of device 0000:00:1c.3 to 64
PCI: Enabling device 0000:00:1e.0 (0005 -> 0007)
PCI: Setting latency timer of device 0000:00:1e.0 to 64
ACPI: PCI Interrupt 0000:15:00.0[A] -> GSI 16 (level, low) -> IRQ 16
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 6, 262144 bytes)
TCP established hash table entries: 131072 (order: 9, 3145728 bytes)
TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
Simple Boot Flag at 0x35 set to 0x1
io scheduler noop registered
io scheduler cfq registered (default)
Boot video device is 0000:01:00.0
PCI: Setting latency timer of device 0000:00:01.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:01.0:pcie00]
PCI: Setting latency timer of device 0000:00:1c.0 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.0:pcie00]
Allocate Port Service[0000:00:1c.0:pcie02]
PCI: Setting latency timer of device 0000:00:1c.1 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.1:pcie00]
Allocate Port Service[0000:00:1c.1:pcie02]
PCI: Setting latency timer of device 0000:00:1c.2 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.2:pcie00]
Allocate Port Service[0000:00:1c.2:pcie02]
PCI: Setting latency timer of device 0000:00:1c.3 to 64
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:1c.3:pcie00]
Allocate Port Service[0000:00:1c.3:pcie02]
ACPI: AC Adapter [AC] (on-line)
ACPI: Battery Slot [BAT0] (battery absent)
input: Power Button (FF) as /class/input/input0
ACPI: Power Button (FF) [PWRF]
input: Lid Switch as /class/input/input1
ACPI: Lid Switch [LID]
input: Sleep Button (CM) as /class/input/input2
ACPI: Sleep Button (CM) [SLPB]
ACPI: SSDT 3FEF1D36, 0240 (r1  PmRef  Cpu0Ist      100 INTL 20050513)
ACPI: SSDT 3FEF1FFB, 065A (r1  PmRef  Cpu0Cst      100 INTL 20050513)
Monitor-Mwait will be used to enter C-1 state
Monitor-Mwait will be used to enter C-2 state
Monitor-Mwait will be used to enter C-3 state
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: SSDT 3FEF1C6E, 00C8 (r1  PmRef  Cpu1Ist      100 INTL 20050513)
ACPI: SSDT 3FEF1F76, 0085 (r1  PmRef  Cpu1Cst      100 INTL 20050513)
ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: Thermal Zone [THM0] (39 C)
ACPI: Thermal Zone [THM1] (39 C)
Real Time Clock Driver v1.12ac
hpet_resources: 0xfed00000 is busy
Linux agpgart interface v0.102 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
loop: module loaded
thinkpad_acpi: ThinkPad ACPI Extras v0.14
thinkpad_acpi: http://ibm-acpi.sf.net/
thinkpad_acpi: ThinkPad EC firmware 79HT50WW-1.07
tun: Universal TUN/TAP device driver, 1.6
tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH7: IDE controller at PCI slot 0000:00:1f.1
ACPI: PCI Interrupt 0000:00:1f.1[C] -> GSI 16 (level, low) -> IRQ 16
ICH7: chipset revision 2
ICH7: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0x1880-0x1887, BIOS settings: hda:DMA, hdb:pio
Probing IDE interface ide0...
hda: MATSHITADVD-RAM UJ-842, ATAPI CD/DVD-ROM drive
hda: selected mode 0x42
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hda: ATAPI 24X DVD-ROM DVD-R-RAM CD-R/RW drive, 2048kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
ahci 0000:00:1f.2: version 2.1
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 16 (level, low) -> IRQ 16
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 1.5 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq pm led clo pio slum part 
PCI: Setting latency timer of device 0000:00:1f.2 to 64
scsi0 : ahci
scsi1 : ahci
scsi2 : ahci
scsi3 : ahci
ata1: SATA max UDMA/133 cmd 0xffffc2000003a500 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 0
ata2: SATA max UDMA/133 cmd 0xffffc2000003a580 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 0
ata3: SATA max UDMA/133 cmd 0xffffc2000003a600 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 0
ata4: SATA max UDMA/133 cmd 0xffffc2000003a680 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 0
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156301488
ata1.00: ATA-7: HTS541080G9SA00, MB4IC60R, max UDMA/100
ata1.00: 156301488 sectors, multi 16: LBA48 
ata1.00: ata_hpa_resize 1: sectors = 156301488, hpa_sectors = 156301488
ata1.00: configured for UDMA/100
ata2: SATA link down (SStatus 0 SControl 0)
ata3: SATA link down (SStatus 0 SControl 0)
ata4: SATA link down (SStatus 0 SControl 0)
scsi 0:0:0:0: Direct-Access     ATA      HTS541080G9SA00  MB4I PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 < sda5 >
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
usbmon: debugfs is not available
ACPI: PCI Interrupt 0000:00:1d.7[D] -> GSI 19 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1d.7 to 64
ehci_hcd 0000:00:1d.7: EHCI Host Controller
ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:1d.7: debug port 1
PCI: cache line size of 32 is not supported by device 0000:00:1d.7
ehci_hcd 0000:00:1d.7: irq 19, io mem 0xee404000
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 8 ports detected
USB Universal Host Controller Interface driver v3.0
ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:1d.0 to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller
uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2
uhci_hcd 0000:00:1d.0: irq 16, io base 0x00001800
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 17 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1d.1 to 64
uhci_hcd 0000:00:1d.1: UHCI Host Controller
uhci_hcd 0000:00:1d.1: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:1d.1: irq 17, io base 0x00001820
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1d.2 to 64
uhci_hcd 0000:00:1d.2: UHCI Host Controller
uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 4
uhci_hcd 0000:00:1d.2: irq 18, io base 0x00001840
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
ACPI: PCI Interrupt 0000:00:1d.3[D] -> GSI 19 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:00:1d.3 to 64
uhci_hcd 0000:00:1d.3: UHCI Host Controller
uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5
uhci_hcd 0000:00:1d.3: irq 19, io base 0x00001860
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard as /class/input/input3
coretemp: This driver uses undocumented features of Core CPU. Temperature might be wrong!
usb 2-2: new low speed USB device using uhci_hcd and address 2
usb 2-2: configuration #1 chosen from 1 choice
Synaptics Touchpad, model: 1, fw: 6.2, id: 0x81a0b1, caps: 0xa04793/0x300000
serio: Synaptics pass-through port at isa0060/serio1/input0
usb 5-2: new full speed USB device using uhci_hcd and address 2
input: SynPS/2 Synaptics TouchPad as /class/input/input4
usb 5-2: configuration #1 chosen from 1 choice
input: Microsoft Basic Optical Mouse as /class/input/input5
input: USB HID v1.10 Mouse [Microsoft Basic Optical Mouse] on usb-0000:00:1d.0-2
usbcore: registered new interface driver usbhid
drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
Advanced Linux Sound Architecture Driver Version 1.0.14rc4 (Wed May 16 09:45:46 2007 UTC).
ACPI: PCI Interrupt 0000:00:1b.0[B] -> GSI 17 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1b.0 to 64
ALSA device list:
  #0: HDA Intel at 0xee400000 irq 17
Netfilter messages via NETLINK v0.30.
nf_conntrack version 0.5.0 (4091 buckets, 32728 max)
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
IBM TrackPoint firmware: 0x0e, buttons: 3/3
input: TPPS/2 IBM TrackPoint as /class/input/input6
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 184k freed
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2
Copyright (c) 1999-2006 Intel Corporation.
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:02:00.0 to 64
e1000: 0000:02:00.0: e1000_validate_option: Receive Interrupt Delay set to 32
e1000: 0000:02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:41:e3:2c:76
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
Adding 1542196k swap on /dev/sda1.  Priority:-1 extents:1 across:1542196k
EXT3 FS on sda2, internal journal
fuse init (API version 7.8)
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog: 10/100 speed: disabling TSO
IA-32 Microcode Update Driver: v1.14a <tigran@aivazian.fsnet.co.uk>


#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.22-rc2
# Tue May 22 16:51:07 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=18
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_HT=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_NR_CPUS=2
CONFIG_HOTPLUG_CPU=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_SECCOMP is not set
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
CONFIG_GENERIC_PENDING_IRQ=y

#
# Power management options
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_PM_STD_PARTITION=""
CONFIG_SUSPEND_SMP=y

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_ACPI_CPUFREQ=y

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
# CONFIG_PCI_MMCONFIG is not set
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_HT_IRQ is not set

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
# CONFIG_BINFMT_MISC is not set
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set

#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
# CONFIG_IPV6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=y
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
CONFIG_NF_CONNTRACK_ENABLED=y
CONFIG_NF_CONNTRACK=y
# CONFIG_NF_CT_ACCT is not set
# CONFIG_NF_CONNTRACK_MARK is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CT_PROTO_SCTP is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NETFILTER_XTABLES=y
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
# CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
CONFIG_NETFILTER_XT_MATCH_STATE=y
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV4=y
# CONFIG_NF_CONNTRACK_PROC_COMPAT is not set
# CONFIG_IP_NF_QUEUE is not set
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_IPRANGE is not set
# CONFIG_IP_NF_MATCH_TOS is not set
# CONFIG_IP_NF_MATCH_RECENT is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_TTL is not set
# CONFIG_IP_NF_MATCH_OWNER is not set
# CONFIG_IP_NF_MATCH_ADDRTYPE is not set
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
# CONFIG_IP_NF_TARGET_ULOG is not set
CONFIG_NF_NAT=y
CONFIG_NF_NAT_NEEDED=y
# CONFIG_IP_NF_TARGET_MASQUERADE is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_SAME is not set
# CONFIG_NF_NAT_SNMP_BASIC is not set
# CONFIG_NF_NAT_FTP is not set
# CONFIG_NF_NAT_IRC is not set
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
# CONFIG_NF_NAT_SIP is not set
CONFIG_IP_NF_MANGLE=y
# CONFIG_IP_NF_TARGET_TOS is not set
# CONFIG_IP_NF_TARGET_ECN is not set
# CONFIG_IP_NF_TARGET_TTL is not set
# CONFIG_IP_NF_TARGET_CLUSTERIP is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set

#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set

#
# Wireless
#
# CONFIG_CFG80211 is not set
CONFIG_WIRELESS_EXT=y
# CONFIG_MAC80211 is not set
# CONFIG_IEEE80211 is not set
# CONFIG_RFKILL is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set

#
# Connector - unified userspace <-> kernelspace linker
#
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y

#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
CONFIG_THINKPAD_ACPI=y
# CONFIG_THINKPAD_ACPI_DEBUG is not set
# CONFIG_THINKPAD_ACPI_DOCK is not set
CONFIG_THINKPAD_ACPI_BAY=y
# CONFIG_BLINK is not set
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set
CONFIG_BLK_DEV_IDEACPI=y
# CONFIG_IDE_TASK_IOCTL is not set
CONFIG_IDE_PROC_FS=y

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
# CONFIG_IDEPCI_SHARE_IRQ is not set
CONFIG_IDEPCI_PCIBUS_ORDER=y
# CONFIG_BLK_DEV_OFFBOARD is not set
# CONFIG_BLK_DEV_GENERIC is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_IT8213 is not set
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_BLK_DEV_TC86C001 is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_ESP_CORE is not set
# CONFIG_SCSI_SRP is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set

#
# I2O device support
#
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# Ethernet (10 or 100Mbit)
#
# CONFIG_NET_ETHERNET is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
CONFIG_E1000=m
# CONFIG_E1000_NAPI is not set
# CONFIG_E1000_DISABLE_PACKET_SPLIT is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_NETDEV_10000 is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
CONFIG_WLAN_80211=y
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_LIBERTAS_USB is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
# CONFIG_USB_USBNET is not set
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set
# CONFIG_WATCHDOG is not set
# CONFIG_HW_RANDOM is not set
# CONFIG_NVRAM is not set
CONFIG_RTC=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_DRM is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
# CONFIG_I2C is not set

#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_F71805F is not set
CONFIG_SENSORS_CORETEMP=y
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_SM501 is not set

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DVB_CORE is not set
# CONFIG_DAB is not set

#
# Graphics support
#
# CONFIG_BACKLIGHT_LCD_SUPPORT is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_PROGEAR is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set
# CONFIG_VGASTATE is not set
# CONFIG_FB is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=128
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y

#
# Sound
#
CONFIG_SOUND=y

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
# CONFIG_SND_SEQUENCER is not set
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
CONFIG_SND_PCM_OSS_PLUGINS=y
# CONFIG_SND_RTCTIMER is not set
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set

#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set

#
# System on Chip audio support
#
# CONFIG_SND_SOC is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set

#
# HID Devices
#
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
# CONFIG_USB_DEVICEFS is not set
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_EHCI_BIG_ENDIAN_MMIO is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
CONFIG_USB_MON=y

#
# USB port drivers
#

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set
# CONFIG_MMC is not set

#
# LED devices
#
# CONFIG_NEW_LEDS is not set

#
# LED drivers
#

#
# LED Triggers
#

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
# CONFIG_EDAC is not set

#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set

#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set

#
# DMA Clients
#

#
# DMA Devices
#

#
# Virtualization
#
# CONFIG_KVM is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set

#
# File systems
#
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=m
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
# CONFIG_ZISOFS is not set
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y

#
# Native Language Support
#
CONFIG_NLS=m
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=m
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set

#
# Distributed Lock Manager
#
# CONFIG_DLM is not set

#
# Instrumentation Support
#
# CONFIG_PROFILING is not set
# CONFIG_KPROBES is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_DEBUG_RODATA is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
# CONFIG_CRYPTO_MD5 is not set
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_AES is not set
CONFIG_CRYPTO_AES_X86_64=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_TEST is not set

#
# Hardware crypto devices
#

#
# Library routines
#
CONFIG_BITREVERSE=y
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_LIBCRC32C is not set
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 12:04 [BUG] long freezes on thinkpad t60 Miklos Szeredi
@ 2007-05-24 12:54 ` Ingo Molnar
  2007-05-24 14:03   ` Miklos Szeredi
  2007-05-24 22:08 ` Henrique de Moraes Holschuh
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-05-24 12:54 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-acpi, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1006 bytes --]


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> On some strange workload involving strace and fuse I get ocasional 
> long periods (10-100s) of total unresponsiveness, not even SysRq-* 
> working.  Then the machine continues as normal.  Nothing in dmesg, 
> absolutely no indication about what is happening.

> Any ideas?  Possibly something ACPI related?

how reproducable are these lockups - could you possibly trace it? If yes 
then please apply:

  http://www.tglx.de/private/tglx/ht-debug/tracer.diff

and run the attached trace-it-1sec.c thing in a loop:

  echo 1 > /proc/sys/kernel/mcount_enabled

  while true; do
     ./trace-it-1sec > trace-`date`.txt
  done

and wait for the lockup. Once it happens, please upload the trace*.txt 
file that contains the lockup, i guess we'll be able to tell you more 
about the nature of the lockup. (Perhaps increase the sleep(1) to 
sleep(5) to capture longer periods and to increase the odds that you 
catch the lockup while the utility is tracing.)

	Ingo

[-- Attachment #2: trace-it-1sec.c --]
[-- Type: text/plain, Size: 2306 bytes --]


/*
 * Copyright (C) 2005, Ingo Molnar <mingo@redhat.com>
 *
 * user-triggered tracing.
 *
 * The -rt kernel has a built-in kernel tracer, which will trace
 * all kernel function calls (and a couple of special events as well),
 * by using a build-time gcc feature that instruments all kernel
 * functions.
 * 
 * The tracer is highly automated for a number of latency tracing purposes,
 * but it can also be switched into 'user-triggered' mode, which is a
 * half-automatic tracing mode where userspace apps start and stop the
 * tracer. This file shows a dumb example how to turn user-triggered
 * tracing on, and how to start/stop tracing. Note that if you do
 * multiple start/stop sequences, the kernel will do a maximum search
 * over their latencies, and will keep the trace of the largest latency
 * in /proc/latency_trace. The maximums are also reported to the kernel
 * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
 *
 * For the tracer to be activated, turn on CONFIG_WAKEUP_TIMING and
 * CONFIG_LATENCY_TRACE in the .config, rebuild the kernel and boot
 * into it. Note that the tracer can have significant runtime overhead,
 * so you dont want to use it for performance testing :)
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
	int ret;

	if (getuid() != 0) {
		fprintf(stderr, "needs to run as root.\n");
		exit(1);
	}
	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
	if (ret) {
		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
		exit(1);
	}
	system("echo 1 > /proc/sys/kernel/trace_enabled");
//	system("echo 0 > /proc/sys/kernel/trace_freerunning");
	system("echo 0 > /proc/sys/kernel/trace_print_at_crash");
	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
	system("echo 0 > /proc/sys/kernel/trace_verbose");
	system("echo 0 > /proc/sys/kernel/preempt_max_latency");
	system("echo 0 > /proc/sys/kernel/preempt_thresh");
	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 1 > /proc/sys/kernel/wakeup_timing");
//	system("echo 1 > /proc/sys/kernel/mcount_enabled");

	prctl(0, 1); // start tracing
	sleep(1);
	prctl(0, 0); // stop tracing

	system("cat /proc/latency_trace");

	return 0;
}



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 12:54 ` Ingo Molnar
@ 2007-05-24 14:03   ` Miklos Szeredi
  2007-05-24 14:10     ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-05-24 14:03 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, linux-acpi, tglx

> > On some strange workload involving strace and fuse I get ocasional 
> > long periods (10-100s) of total unresponsiveness, not even SysRq-* 
> > working.  Then the machine continues as normal.  Nothing in dmesg, 
> > absolutely no indication about what is happening.
> 
> > Any ideas?  Possibly something ACPI related?
> 
> how reproducable are these lockups - could you possibly trace it? If yes 
> then please apply:
> 
>   http://www.tglx.de/private/tglx/ht-debug/tracer.diff

With this patch boot stops at segfaulting fsck.  I enabled all the new
config options, is that not a good idea?  Which one exactly do I need?

Thanks,
Miklos

PS. tracer.diff needed some hacking to make it apply/compile.

Index: linux-2.6.22-rc2/include/linux/linkage.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/linkage.h	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/linkage.h	2007-05-24 15:57:43.000000000 +0200
@@ -3,6 +3,8 @@
 
 #include <asm/linkage.h>
 
+#define notrace __attribute ((no_instrument_function))
+
 #ifdef __cplusplus
 #define CPP_ASMLINKAGE extern "C"
 #else
Index: linux-2.6.22-rc2/Documentation/stable_api_nonsense.txt
===================================================================
--- linux-2.6.22-rc2.orig/Documentation/stable_api_nonsense.txt	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/Documentation/stable_api_nonsense.txt	2007-05-24 15:57:42.000000000 +0200
@@ -62,6 +62,9 @@ consider the following facts about the L
       - different structures can contain different fields
       - Some functions may not be implemented at all, (i.e. some locks
 	compile away to nothing for non-SMP builds.)
+      - Parameter passing of variables from function to function can be
+	done in different ways (the CONFIG_REGPARM option controls
+	this.)
       - Memory within the kernel can be aligned in different ways,
 	depending on the build options.
   - Linux runs on a wide range of different processor architectures.
Index: linux-2.6.22-rc2/arch/i386/Kconfig
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/Kconfig	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/Kconfig	2007-05-24 15:57:42.000000000 +0200
@@ -764,6 +764,14 @@ config BOOT_IOREMAP
 	depends on (((X86_SUMMIT || X86_GENERICARCH) && NUMA) || (X86 && EFI))
 	default y
 
+#
+# function tracing might turn this off:
+#
+config REGPARM
+	bool
+	depends on !MCOUNT
+	default y
+
 config SECCOMP
 	bool "Enable seccomp to safely compute untrusted bytecode"
 	depends on PROC_FS
Index: linux-2.6.22-rc2/arch/i386/Makefile
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/Makefile	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/Makefile	2007-05-24 15:57:42.000000000 +0200
@@ -31,7 +31,7 @@ LDFLAGS_vmlinux := --emit-relocs
 endif
 CHECKFLAGS	+= -D__i386__
 
-CFLAGS += -pipe -msoft-float -mregparm=3 -freg-struct-return
+CFLAGS += -pipe -msoft-float
 
 # prevent gcc from keeping the stack 16 byte aligned
 CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2)
@@ -39,6 +39,8 @@ CFLAGS += $(call cc-option,-mpreferred-s
 # CPU-specific tuning. Anything which can be shared with UML should go here.
 include $(srctree)/arch/i386/Makefile.cpu
 
+cflags-$(CONFIG_REGPARM) += -mregparm=3 -freg-struct-return
+
 # temporary until string.h is fixed
 cflags-y += -ffreestanding
 
Index: linux-2.6.22-rc2/include/asm-i386/module.h
===================================================================
--- linux-2.6.22-rc2.orig/include/asm-i386/module.h	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/include/asm-i386/module.h	2007-05-24 15:57:43.000000000 +0200
@@ -64,12 +64,18 @@ struct mod_arch_specific
 #error unknown processor family
 #endif
 
+#ifdef CONFIG_REGPARM
+#define MODULE_REGPARM "REGPARM "
+#else
+#define MODULE_REGPARM ""
+#endif
+
 #ifdef CONFIG_4KSTACKS
 #define MODULE_STACKSIZE "4KSTACKS "
 #else
 #define MODULE_STACKSIZE ""
 #endif
 
-#define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_STACKSIZE
+#define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY MODULE_REGPARM MODULE_STACKSIZE
 
 #endif /* _ASM_I386_MODULE_H */
Index: linux-2.6.22-rc2/Makefile
===================================================================
--- linux-2.6.22-rc2.orig/Makefile	2007-05-22 16:25:01.000000000 +0200
+++ linux-2.6.22-rc2/Makefile	2007-05-24 15:57:42.000000000 +0200
@@ -490,10 +490,14 @@ endif
 
 include $(srctree)/arch/$(ARCH)/Makefile
 
-ifdef CONFIG_FRAME_POINTER
-CFLAGS		+= -fno-omit-frame-pointer $(call cc-option,-fno-optimize-sibling-calls,)
+ifdef CONFIG_MCOUNT
+CFLAGS                += -pg -fno-omit-frame-pointer $(call cc-option,-fno-optimize-sibling-calls,)
 else
-CFLAGS		+= -fomit-frame-pointer
+  ifdef CONFIG_FRAME_POINTER
+    CFLAGS		+= -fno-omit-frame-pointer $(call cc-option,-fno-optimize-sibling-calls,)
+  else
+    CFLAGS		+= -fomit-frame-pointer
+  endif
 endif
 
 ifdef CONFIG_DEBUG_INFO
Index: linux-2.6.22-rc2/arch/i386/lib/delay.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/lib/delay.c	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/lib/delay.c	2007-05-24 15:57:43.000000000 +0200
@@ -23,7 +23,7 @@
 #endif
 
 /* simple loop based delay: */
-static void delay_loop(unsigned long loops)
+static notrace void delay_loop(unsigned long loops)
 {
 	int d0;
 
@@ -38,7 +38,7 @@ static void delay_loop(unsigned long loo
 }
 
 /* TSC based delay: */
-static void delay_tsc(unsigned long loops)
+static notrace void delay_tsc(unsigned long loops)
 {
 	unsigned long bclock, now;
 
@@ -69,7 +69,7 @@ int read_current_timer(unsigned long *ti
 	return -1;
 }
 
-void __delay(unsigned long loops)
+void notrace __delay(unsigned long loops)
 {
 	delay_fn(loops);
 }
Index: linux-2.6.22-rc2/arch/x86_64/kernel/tsc.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/tsc.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/tsc.c	2007-05-24 15:57:42.000000000 +0200
@@ -176,13 +176,13 @@ __setup("notsc", notsc_setup);
 
 
 /* clock source code: */
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
 	cycle_t ret = (cycle_t)get_cycles_sync();
 	return ret;
 }
 
-static cycle_t __vsyscall_fn vread_tsc(void)
+static notrace cycle_t __vsyscall_fn vread_tsc(void)
 {
 	cycle_t ret = (cycle_t)get_cycles_sync();
 	return ret;
Index: linux-2.6.22-rc2/drivers/clocksource/acpi_pm.c
===================================================================
--- linux-2.6.22-rc2.orig/drivers/clocksource/acpi_pm.c	2007-05-22 16:25:11.000000000 +0200
+++ linux-2.6.22-rc2/drivers/clocksource/acpi_pm.c	2007-05-24 15:57:42.000000000 +0200
@@ -30,13 +30,13 @@
  */
 u32 pmtmr_ioport __read_mostly;
 
-static inline u32 read_pmtmr(void)
+static notrace inline u32 read_pmtmr(void)
 {
 	/* mask the output to 24 bits */
 	return inl(pmtmr_ioport) & ACPI_PM_MASK;
 }
 
-u32 acpi_pm_read_verified(void)
+u32 notrace acpi_pm_read_verified(void)
 {
 	u32 v1 = 0, v2 = 0, v3 = 0;
 
@@ -56,12 +56,12 @@ u32 acpi_pm_read_verified(void)
 	return v2;
 }
 
-static cycle_t acpi_pm_read_slow(void)
+static notrace cycle_t acpi_pm_read_slow(void)
 {
 	return (cycle_t)acpi_pm_read_verified();
 }
 
-static cycle_t acpi_pm_read(void)
+static notrace cycle_t acpi_pm_read(void)
 {
 	return (cycle_t)read_pmtmr();
 }
Index: linux-2.6.22-rc2/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.22-rc2.orig/fs/proc/proc_misc.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/fs/proc/proc_misc.c	2007-05-24 15:57:42.000000000 +0200
@@ -623,6 +623,20 @@ static int execdomains_read_proc(char *p
 	return proc_calc_metrics(page, start, off, count, eof, len);
 }
 
+#ifdef CONFIG_EVENT_TRACE
+extern struct seq_operations latency_trace_op;
+static int latency_trace_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &latency_trace_op);
+}
+static struct file_operations proc_latency_trace_operations = {
+	.open		= latency_trace_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+#endif
+
 #ifdef CONFIG_MAGIC_SYSRQ
 /*
  * writing 'C' to /proc/sysrq-trigger is like sysrq-C
@@ -716,6 +730,9 @@ void __init proc_misc_init(void)
 #ifdef CONFIG_SCHEDSTATS
 	create_seq_entry("schedstat", 0, &proc_schedstat_operations);
 #endif
+#ifdef CONFIG_EVENT_TRACE
+	create_seq_entry("latency_trace", 0, &proc_latency_trace_operations);
+#endif
 #ifdef CONFIG_PROC_KCORE
 	proc_root_kcore = create_proc_entry("kcore", S_IRUSR, NULL);
 	if (proc_root_kcore) {
Index: linux-2.6.22-rc2/include/linux/clocksource.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/clocksource.h	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/clocksource.h	2007-05-24 15:57:43.000000000 +0200
@@ -21,6 +21,9 @@
 typedef u64 cycle_t;
 struct clocksource;
 
+extern unsigned long preempt_max_latency;
+extern unsigned long preempt_thresh;
+
 /**
  * struct clocksource - hardware abstraction for a free running counter
  *	Provides mostly state-free accessors to the underlying hardware.
@@ -172,8 +175,20 @@ static inline cycle_t clocksource_read(s
  */
 static inline s64 cyc2ns(struct clocksource *cs, cycle_t cycles)
 {
-	u64 ret = (u64)cycles;
-	ret = (ret * cs->mult) >> cs->shift;
+	return ((u64)cycles * cs->mult) >> cs->shift;
+}
+
+/**
+ * ns2cyc - converts nanoseconds to clocksource cycles
+ * @cs:		Pointer to clocksource
+ * @nsecs:	Nanoseconds
+ */
+static inline cycles_t ns2cyc(struct clocksource *cs, u64 nsecs)
+{
+	cycles_t ret = nsecs << cs->shift;
+
+	do_div(ret, cs->mult + 1);
+
 	return ret;
 }
 
@@ -221,4 +236,8 @@ static inline void update_vsyscall(struc
 }
 #endif
 
+extern cycle_t get_monotonic_cycles(void);
+extern unsigned long cycles_to_usecs(cycle_t);
+extern cycle_t usecs_to_cycles(unsigned long);
+
 #endif /* _LINUX_CLOCKSOURCE_H */
Index: linux-2.6.22-rc2/include/linux/kernel.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/kernel.h	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/kernel.h	2007-05-24 15:57:43.000000000 +0200
@@ -156,6 +156,8 @@ asmlinkage int vprintk(const char *fmt, 
 	__attribute__ ((format (printf, 1, 0)));
 asmlinkage int printk(const char * fmt, ...)
 	__attribute__ ((format (printf, 1, 2)));
+extern void early_printk(const char *fmt, ...)
+	__attribute__ ((format (printf, 1, 2)));
 #else
 static inline int vprintk(const char *s, va_list args)
 	__attribute__ ((format (printf, 1, 0)));
Index: linux-2.6.22-rc2/include/linux/latency_hist.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc2/include/linux/latency_hist.h	2007-05-24 15:57:43.000000000 +0200
@@ -0,0 +1,32 @@
+/*
+ * kernel/latency_hist.h
+ *
+ * Add support for histograms of preemption-off latency and
+ * interrupt-off latency and wakeup latency, it depends on
+ * Real-Time Preemption Support.
+ *
+ *  Copyright (C) 2005 MontaVista Software, Inc.
+ *  Yi Yang <yyang@ch.mvista.com>
+ *
+ */
+#ifndef _LINUX_LATENCY_HIST_H_
+#define _LINUX_LATENCY_HIST_H_
+
+enum {
+        INTERRUPT_LATENCY = 0,
+        PREEMPT_LATENCY,
+        WAKEUP_LATENCY
+};
+
+#define MAX_ENTRY_NUM 10240
+#define LATENCY_TYPE_NUM 3
+
+#ifdef CONFIG_LATENCY_HIST
+extern void latency_hist(int latency_type, int cpu, unsigned long latency);
+# define latency_hist_flag 1
+#else
+# define latency_hist(a,b,c) do { (void)(cpu); } while (0)
+# define latency_hist_flag 0
+#endif /* CONFIG_LATENCY_HIST */
+
+#endif /* ifndef _LINUX_LATENCY_HIST_H_ */
Index: linux-2.6.22-rc2/include/linux/preempt.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/preempt.h	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/preempt.h	2007-05-24 15:57:43.000000000 +0200
@@ -9,12 +9,26 @@
 #include <linux/thread_info.h>
 #include <linux/linkage.h>
 
-#ifdef CONFIG_DEBUG_PREEMPT
-  extern void fastcall add_preempt_count(int val);
-  extern void fastcall sub_preempt_count(int val);
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_CRITICAL_TIMING)
+  extern void notrace add_preempt_count(unsigned int val);
+  extern void notrace sub_preempt_count(unsigned int val);
+  extern void notrace mask_preempt_count(unsigned int mask);
+  extern void notrace unmask_preempt_count(unsigned int mask);
 #else
 # define add_preempt_count(val)	do { preempt_count() += (val); } while (0)
 # define sub_preempt_count(val)	do { preempt_count() -= (val); } while (0)
+# define mask_preempt_count(mask) \
+		do { preempt_count() |= (mask); } while (0)
+# define unmask_preempt_count(mask) \
+		do { preempt_count() &= ~(mask); } while (0)
+#endif
+
+#ifdef CONFIG_CRITICAL_TIMING
+  extern void touch_critical_timing(void);
+  extern void stop_critical_timing(void);
+#else
+# define touch_critical_timing()	do { } while (0)
+# define stop_critical_timing()	do { } while (0)
 #endif
 
 #define inc_preempt_count() add_preempt_count(1)
Index: linux-2.6.22-rc2/include/linux/sched.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/sched.h	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/sched.h	2007-05-24 15:57:43.000000000 +0200
@@ -215,6 +215,7 @@ static inline void show_state(void)
 }
 
 extern void show_regs(struct pt_regs *);
+extern void irq_show_regs_callback(int cpu, struct pt_regs *regs);
 
 /*
  * TASK is a pointer to the task whose backtrace we want to see (or NULL for current
@@ -251,6 +252,105 @@ static inline void touch_all_softlockup_
 }
 #endif
 
+#if defined(CONFIG_PREEMPT_TRACE) || defined(CONFIG_EVENT_TRACE)
+  extern void print_traces(struct task_struct *task);
+#else
+# define print_traces(task)			do { } while (0)
+#endif
+
+#ifdef CONFIG_FRAME_POINTER
+# ifndef CONFIG_ARM
+#  define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
+#  define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1))
+#  define CALLER_ADDR2 ((unsigned long)__builtin_return_address(2))
+#  define CALLER_ADDR3 ((unsigned long)__builtin_return_address(3))
+#  define CALLER_ADDR4 ((unsigned long)__builtin_return_address(4))
+#  define CALLER_ADDR5 ((unsigned long)__builtin_return_address(5))
+# else
+   extern unsigned long arm_return_addr(int level);
+#  define CALLER_ADDR0 arm_return_addr(0)
+#  define CALLER_ADDR1 arm_return_addr(1)
+#  define CALLER_ADDR2 arm_return_addr(2)
+#  define CALLER_ADDR3 arm_return_addr(3)
+#  define CALLER_ADDR4 arm_return_addr(4)
+#  define CALLER_ADDR5 arm_return_addr(5)
+#endif
+#else
+# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0))
+# define CALLER_ADDR1 0UL
+# define CALLER_ADDR2 0UL
+# define CALLER_ADDR3 0UL
+# define CALLER_ADDR4 0UL
+# define CALLER_ADDR5 0UL
+#endif
+
+#ifdef CONFIG_MCOUNT
+  extern void notrace mcount(void);
+#else
+# define mcount() do { } while (0)
+#endif
+
+#ifdef CONFIG_EVENT_TRACE
+  extern int mcount_enabled, trace_enabled, trace_user_triggered,
+		trace_user_trigger_irq, trace_freerunning, trace_verbose,
+		trace_print_on_crash, trace_all_cpus, print_functions,
+		syscall_tracing, stackframe_tracing, trace_use_raw_cycles,
+		trace_all_runnable;
+  extern void notrace trace_special(unsigned long v1, unsigned long v2, unsigned long v3);
+  extern void notrace trace_special_pid(int pid, unsigned long v1, unsigned long v2);
+  extern void notrace trace_special_u64(unsigned long long v1, unsigned long v2);
+  extern void notrace trace_special_sym(void);
+  extern void stop_trace(void);
+# define start_trace() do { trace_enabled = 1; } while (0)
+  extern void print_last_trace(void);
+  extern void nmi_trace(unsigned long eip, unsigned long parent_eip,
+			unsigned long flags);
+  extern long user_trace_start(void);
+  extern long user_trace_stop(void);
+  extern void trace_cmdline(void);
+  extern void init_tracer(void);
+#else
+# define mcount_enabled				0
+# define trace_enabled				0
+# define syscall_tracing			0
+# define stackframe_tracing			0
+# define trace_user_triggered			0
+# define trace_freerunning			0
+# define trace_all_cpus				0
+# define trace_verbose				0
+# define trace_special(v1,v2,v3)		do { } while (0)
+# define trace_special_pid(pid,v1,v2)		do { } while (0)
+# define trace_special_u64(v1,v2)		do { } while (0)
+# define trace_special_sym()			do { } while (0)
+# define stop_trace()				do { } while (0)
+# define start_trace()				do { } while (0)
+# define print_last_trace()			do { } while (0)
+# define nmi_trace(eip, parent_eip, flags)	do { } while (0)
+# define user_trace_start()			do { } while (0)
+# define user_trace_stop()			do { } while (0)
+# define trace_cmdline()			do { } while (0)
+# define init_tracer()				do { } while (0)
+#endif
+
+#ifdef CONFIG_WAKEUP_TIMING
+  extern int wakeup_timing;
+  extern void __trace_start_sched_wakeup(struct task_struct *p);
+  extern void trace_stop_sched_switched(struct task_struct *p);
+  extern void trace_change_sched_cpu(struct task_struct *p, int new_cpu);
+#else
+# define wakeup_timing 0
+# define __trace_start_sched_wakeup(p)		do { } while (0)
+# define trace_stop_sched_switched(p)		do { } while (0)
+# define trace_change_sched_cpu(p, cpu)		do { } while (0)
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+  extern void notrace time_hardirqs_on(unsigned long a0, unsigned long a1);
+  extern void notrace time_hardirqs_off(unsigned long a0, unsigned long a1);
+#else
+# define time_hardirqs_on(a0, a1)		do { } while (0)
+# define time_hardirqs_off(a0, a1)		do { } while (0)
+#endif
 
 /* Attach to any functions which should be ignored in wchan output. */
 #define __sched		__attribute__((__section__(".sched.text")))
@@ -1014,6 +1114,13 @@ struct task_struct {
 	unsigned int lockdep_recursion;
 #endif
 
+#define MAX_PREEMPT_TRACE 16
+
+#ifdef CONFIG_PREEMPT_TRACE
+	unsigned long preempt_trace_eip[MAX_PREEMPT_TRACE];
+	unsigned long preempt_trace_parent_eip[MAX_PREEMPT_TRACE];
+#endif
+
 /* journalling filesystem info */
 	void *journal_info;
 
@@ -1636,6 +1743,7 @@ static inline unsigned int task_cpu(cons
 
 static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
+	trace_change_sched_cpu(p, cpu);
 	task_thread_info(p)->cpu = cpu;
 }
 
Index: linux-2.6.22-rc2/init/main.c
===================================================================
--- linux-2.6.22-rc2.orig/init/main.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/init/main.c	2007-05-24 15:57:42.000000000 +0200
@@ -576,6 +576,8 @@ asmlinkage void __init start_kernel(void
 	if (panic_later)
 		panic(panic_later, panic_param);
 
+	init_tracer();
+
 	lockdep_info();
 
 	/*
Index: linux-2.6.22-rc2/kernel/Makefile
===================================================================
--- linux-2.6.22-rc2.orig/kernel/Makefile	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/Makefile	2007-05-24 15:57:42.000000000 +0200
@@ -38,6 +38,11 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
+obj-$(CONFIG_DEBUG_PREEMPT) += latency_trace.o
+obj-$(CONFIG_WAKEUP_TIMING) += latency_trace.o
+obj-$(CONFIG_EVENT_TRACE) += latency_trace.o
+obj-$(CONFIG_CRITICAL_TIMING) += latency_trace.o
+obj-$(CONFIG_LATENCY_HIST) += latency_hist.o
 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
 obj-$(CONFIG_AUDITSYSCALL) += auditsc.o
 obj-$(CONFIG_KPROBES) += kprobes.o
Index: linux-2.6.22-rc2/kernel/fork.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/fork.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/fork.c	2007-05-24 15:57:42.000000000 +0200
@@ -990,7 +990,7 @@ static struct task_struct *copy_process(
 
 	rt_mutex_init_task(p);
 
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_LOCKDEP)
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
Index: linux-2.6.22-rc2/kernel/latency_hist.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc2/kernel/latency_hist.c	2007-05-24 15:57:42.000000000 +0200
@@ -0,0 +1,267 @@
+/*
+ * kernel/latency_hist.c
+ *
+ * Add support for histograms of preemption-off latency and
+ * interrupt-off latency and wakeup latency, it depends on
+ * Real-Time Preemption Support.
+ *
+ *  Copyright (C) 2005 MontaVista Software, Inc.
+ *  Yi Yang <yyang@ch.mvista.com>
+ *
+ */
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/percpu.h>
+#include <linux/latency_hist.h>
+#include <linux/calc64.h>
+#include <asm/atomic.h>
+
+typedef struct hist_data_struct {
+	atomic_t hist_mode; /* 0 log, 1 don't log */
+	unsigned long min_lat;
+	unsigned long avg_lat;
+	unsigned long max_lat;
+	unsigned long long beyond_hist_bound_samples;
+	unsigned long long accumulate_lat;
+	unsigned long long total_samples;
+	unsigned long long hist_array[MAX_ENTRY_NUM];
+} hist_data_t;
+
+static struct proc_dir_entry * latency_hist_root = NULL;
+static char * latency_hist_proc_dir_root = "latency_hist";
+
+static char * percpu_proc_name = "CPU";
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+static DEFINE_PER_CPU(hist_data_t, interrupt_off_hist);
+static char * interrupt_off_hist_proc_dir = "interrupt_off_latency";
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+static DEFINE_PER_CPU(hist_data_t, preempt_off_hist);
+static char * preempt_off_hist_proc_dir = "preempt_off_latency";
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+static DEFINE_PER_CPU(hist_data_t, wakeup_latency_hist);
+static char * wakeup_latency_hist_proc_dir = "wakeup_latency";
+#endif
+
+static struct proc_dir_entry *entry[LATENCY_TYPE_NUM][NR_CPUS];
+
+static inline u64 u64_div(u64 x, u64 y)
+{
+        do_div(x, y);
+        return x;
+}
+
+void latency_hist(int latency_type, int cpu, unsigned long latency)
+{
+	hist_data_t * my_hist;
+
+	if ((cpu < 0) || (cpu >= NR_CPUS) || (latency_type < INTERRUPT_LATENCY)
+			|| (latency_type > WAKEUP_LATENCY) || (latency < 0))
+		return;
+
+	switch(latency_type) {
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	case INTERRUPT_LATENCY:
+		my_hist = (hist_data_t *)&per_cpu(interrupt_off_hist, cpu);
+		break;
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	case PREEMPT_LATENCY:
+		my_hist = (hist_data_t *)&per_cpu(preempt_off_hist, cpu);
+		break;
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+	case WAKEUP_LATENCY:
+		my_hist = (hist_data_t *)&per_cpu(wakeup_latency_hist, cpu);
+		break;
+#endif
+	default:
+		return;
+	}
+
+	if (atomic_read(&my_hist->hist_mode) == 0)
+		return;
+
+	if (latency >= MAX_ENTRY_NUM)
+		my_hist->beyond_hist_bound_samples++;
+	else
+		my_hist->hist_array[latency]++;
+
+	if (latency < my_hist->min_lat)
+		my_hist->min_lat = latency;
+	else if (latency > my_hist->max_lat)
+		my_hist->max_lat = latency;
+
+	my_hist->total_samples++;
+	my_hist->accumulate_lat += latency;
+	my_hist->avg_lat = (unsigned long) u64_div(my_hist->accumulate_lat,
+						  my_hist->total_samples);
+	return;
+}
+
+static void *l_start(struct seq_file *m, loff_t * pos)
+{
+	loff_t *index_ptr = kmalloc(sizeof(loff_t), GFP_KERNEL);
+	loff_t index = *pos;
+	hist_data_t *my_hist = (hist_data_t *) m->private;
+
+	if (!index_ptr)
+		return NULL;
+
+	if (index == 0) {
+		atomic_dec(&my_hist->hist_mode);
+		seq_printf(m, "#Minimum latency: %lu microseconds.\n"
+			   "#Average latency: %lu microseconds.\n"
+			   "#Maximum latency: %lu microseconds.\n"
+			   "#Total samples: %llu\n"
+			   "#There are %llu samples greater or equal than %d microseconds\n"
+			   "#usecs\t%16s\n"
+			   , my_hist->min_lat
+			   , my_hist->avg_lat
+			   , my_hist->max_lat
+			   , my_hist->total_samples
+			   , my_hist->beyond_hist_bound_samples
+			   , MAX_ENTRY_NUM, "samples");
+	}
+	if (index >= MAX_ENTRY_NUM)
+		return NULL;
+
+	*index_ptr = index;
+	return index_ptr;
+}
+
+static void *l_next(struct seq_file *m, void *p, loff_t * pos)
+{
+	loff_t *index_ptr = p;
+	hist_data_t *my_hist = (hist_data_t *) m->private;
+
+	if (++*pos >= MAX_ENTRY_NUM) {
+		atomic_inc(&my_hist->hist_mode);
+		return NULL;
+	}
+	*index_ptr = *pos;
+	return index_ptr;
+}
+
+static void l_stop(struct seq_file *m, void *p)
+{
+	kfree(p);
+}
+
+static int l_show(struct seq_file *m, void *p)
+{
+	int index = *(loff_t *) p;
+	hist_data_t *my_hist = (hist_data_t *) m->private;
+
+	seq_printf(m, "%5d\t%16llu\n", index, my_hist->hist_array[index]);
+	return 0;
+}
+
+static struct seq_operations latency_hist_seq_op = {
+	.start = l_start,
+	.next  = l_next,
+	.stop  = l_stop,
+	.show  = l_show
+};
+
+static int latency_hist_seq_open(struct inode *inode, struct file *file)
+{
+	struct proc_dir_entry *entry_ptr = NULL;
+	int ret, i, j, break_flags = 0;
+	struct seq_file *seq;
+
+	entry_ptr = PDE(file->f_dentry->d_inode);
+	for (i = 0; i < LATENCY_TYPE_NUM; i++) {
+		for (j = 0; j < NR_CPUS; j++) {
+			if (entry[i][j] == NULL)
+				continue;
+			if (entry_ptr->low_ino == entry[i][j]->low_ino) {
+				break_flags = 1;
+				break;
+			}
+		}
+		if (break_flags == 1)
+			break;
+	}
+	ret = seq_open(file, &latency_hist_seq_op);
+	if (break_flags == 1) {
+		seq = (struct seq_file *)file->private_data;
+		seq->private = entry[i][j]->data;
+	}
+	return ret;
+}
+
+static struct file_operations latency_hist_seq_fops = {
+	.open = latency_hist_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static __init int latency_hist_init(void)
+{
+	struct proc_dir_entry *tmp_parent_proc_dir;
+	int i = 0, len = 0;
+	hist_data_t *my_hist;
+	char procname[64];
+
+	latency_hist_root = proc_mkdir(latency_hist_proc_dir_root, NULL);
+
+
+#ifdef CONFIG_INTERRUPT_OFF_HIST
+	tmp_parent_proc_dir = proc_mkdir(interrupt_off_hist_proc_dir, latency_hist_root);
+	for (i = 0; i < NR_CPUS; i++) {
+		len = sprintf(procname, "%s%d", percpu_proc_name, i);
+		procname[len] = '\0';
+		entry[INTERRUPT_LATENCY][i] =
+			create_proc_entry(procname, 0, tmp_parent_proc_dir);
+		entry[INTERRUPT_LATENCY][i]->data = (void *)&per_cpu(interrupt_off_hist, i);
+		entry[INTERRUPT_LATENCY][i]->proc_fops = &latency_hist_seq_fops;
+		my_hist = (hist_data_t *) entry[INTERRUPT_LATENCY][i]->data;
+		atomic_set(&my_hist->hist_mode,1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+#endif
+
+#ifdef CONFIG_PREEMPT_OFF_HIST
+	tmp_parent_proc_dir = proc_mkdir(preempt_off_hist_proc_dir, latency_hist_root);
+	for (i = 0; i < NR_CPUS; i++) {
+		len = sprintf(procname, "%s%d", percpu_proc_name, i);
+		procname[len] = '\0';
+		entry[PREEMPT_LATENCY][i] =
+			create_proc_entry(procname, 0, tmp_parent_proc_dir);
+		entry[PREEMPT_LATENCY][i]->data = (void *)&per_cpu(preempt_off_hist, i);
+		entry[PREEMPT_LATENCY][i]->proc_fops = &latency_hist_seq_fops;
+		my_hist = (hist_data_t *) entry[PREEMPT_LATENCY][i]->data;
+		atomic_set(&my_hist->hist_mode,1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+#endif
+
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+	tmp_parent_proc_dir = proc_mkdir(wakeup_latency_hist_proc_dir, latency_hist_root);
+	for (i = 0; i < NR_CPUS; i++) {
+		len = sprintf(procname, "%s%d", percpu_proc_name, i);
+		procname[len] = '\0';
+		entry[WAKEUP_LATENCY][i] =
+			create_proc_entry(procname, 0, tmp_parent_proc_dir);
+		entry[WAKEUP_LATENCY][i]->data = (void *)&per_cpu(wakeup_latency_hist, i);
+		entry[WAKEUP_LATENCY][i]->proc_fops = &latency_hist_seq_fops;
+		my_hist = (hist_data_t *) entry[WAKEUP_LATENCY][i]->data;
+		atomic_set(&my_hist->hist_mode,1);
+		my_hist->min_lat = 0xFFFFFFFFUL;
+	}
+#endif
+	return 0;
+
+}
+
+__initcall(latency_hist_init);
+
Index: linux-2.6.22-rc2/kernel/latency_trace.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc2/kernel/latency_trace.c	2007-05-24 15:57:42.000000000 +0200
@@ -0,0 +1,2763 @@
+/*
+ *  kernel/latency_trace.c
+ *
+ *  Copyright (C) 2004-2006 Ingo Molnar
+ *  Copyright (C) 2004 William Lee Irwin III
+ */
+#include <linux/mm.h>
+#include <linux/nmi.h>
+#include <linux/rtc.h>
+#include <linux/sched.h>
+#include <linux/percpu.h>
+
+#include <linux/module.h>
+#include <linux/profile.h>
+#include <linux/bootmem.h>
+#include <linux/version.h>
+#include <linux/notifier.h>
+#include <linux/kallsyms.h>
+#include <linux/seq_file.h>
+#include <linux/interrupt.h>
+#include <linux/clocksource.h>
+#include <linux/proc_fs.h>
+#include <linux/latency_hist.h>
+#include <linux/utsrelease.h>
+#include <asm/uaccess.h>
+#include <asm/unistd.h>
+#include <asm/rtc.h>
+#include <asm/asm-offsets.h>
+#include <linux/stacktrace.h>
+
+#ifndef DEFINE_RAW_SPINLOCK
+# define DEFINE_RAW_SPINLOCK		DEFINE_SPINLOCK
+#endif
+
+#ifndef RAW_SPIN_LOCK_UNLOCKED
+# define RAW_SPIN_LOCK_UNLOCKED		SPIN_LOCK_UNLOCKED
+#endif
+
+int trace_use_raw_cycles = 0;
+
+#define __raw_spinlock_t raw_spinlock_t
+#define need_resched_delayed() 0
+
+#ifdef CONFIG_EVENT_TRACE
+/*
+ * Convert raw cycles to usecs.
+ * Note: this is not the 'clocksource cycles' value, it's the raw
+ * cycle counter cycles. We use GTOD to timestamp latency start/end
+ * points, but the trace entries inbetween are timestamped with
+ * get_cycles().
+ */
+static unsigned long notrace cycles_to_us(cycle_t delta)
+{
+	if (!trace_use_raw_cycles)
+		return cycles_to_usecs(delta);
+#ifdef CONFIG_X86
+	do_div(delta, cpu_khz/1000+1);
+#elif defined(CONFIG_PPC)
+	delta = mulhwu(tb_to_us, delta);
+#elif defined(CONFIG_ARM)
+	delta = mach_cycles_to_usecs(delta);
+#else
+	#error Implement cycles_to_usecs.
+#endif
+
+	return (unsigned long) delta;
+}
+#endif
+
+static notrace inline cycle_t now(void)
+{
+	if (trace_use_raw_cycles)
+		return get_cycles();
+	return get_monotonic_cycles();
+}
+
+#ifndef irqs_off
+# define irqs_off()			0
+#endif
+
+#ifndef DEBUG_WARN_ON
+static inline int DEBUG_WARN_ON(int cond)
+{
+	WARN_ON(cond);
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+# ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+#  define irqs_off_preempt_count() preempt_count()
+# else
+#  define irqs_off_preempt_count() 0
+# endif
+#endif
+
+#ifdef CONFIG_WAKEUP_TIMING
+struct sch_struct {
+	__raw_spinlock_t trace_lock;
+	struct task_struct *task;
+	int cpu;
+	struct cpu_trace *tr;
+} ____cacheline_aligned_in_smp;
+
+static __cacheline_aligned_in_smp struct sch_struct sch =
+		{ trace_lock: __RAW_SPIN_LOCK_UNLOCKED };
+
+int wakeup_timing = 1;
+#endif
+
+/*
+ * Track maximum latencies and save the trace:
+ */
+
+/*
+ * trace_stop_sched_switched must not be called with runqueue locks held!
+ */
+static __cacheline_aligned_in_smp DECLARE_MUTEX(max_mutex);
+
+/*
+ * Sequence count - we record it when starting a measurement and
+ * skip the latency if the sequence has changed - some other section
+ * did a maximum and could disturb our measurement with serial console
+ * printouts, etc. Truly coinciding maximum latencies should be rare
+ * and what happens together happens separately as well, so this doesnt
+ * decrease the validity of the maximum found:
+ */
+static __cacheline_aligned_in_smp unsigned long max_sequence;
+
+enum trace_type
+{
+	__TRACE_FIRST_TYPE = 0,
+
+	TRACE_FN,
+	TRACE_SPECIAL,
+	TRACE_SPECIAL_PID,
+	TRACE_SPECIAL_U64,
+	TRACE_SPECIAL_SYM,
+	TRACE_CMDLINE,
+	TRACE_SYSCALL,
+	TRACE_SYSRET,
+
+	__TRACE_LAST_TYPE
+};
+
+enum trace_flag_type
+{
+	TRACE_FLAG_IRQS_OFF		= 0x01,
+	TRACE_FLAG_NEED_RESCHED		= 0x02,
+	TRACE_FLAG_NEED_RESCHED_DELAYED	= 0x04,
+	TRACE_FLAG_HARDIRQ		= 0x08,
+	TRACE_FLAG_SOFTIRQ		= 0x10,
+	TRACE_FLAG_IRQS_HARD_OFF	= 0x20,
+};
+
+/*
+ * Maximum preemption latency measured. Initialize to maximum,
+ * we clear it after bootup.
+ */
+#ifdef CONFIG_LATENCY_HIST
+unsigned long preempt_max_latency = (cycle_t)0UL;
+#else
+unsigned long preempt_max_latency = (cycle_t)ULONG_MAX;
+#endif
+
+unsigned long preempt_thresh;
+
+/*
+ * Should this new latency be reported/recorded?
+ */
+static int report_latency(cycle_t delta)
+{
+	if (latency_hist_flag && !trace_user_triggered)
+		return 1;
+
+	if (preempt_thresh) {
+		if (delta < preempt_thresh)
+			return 0;
+	} else {
+		if (delta <= preempt_max_latency)
+			return 0;
+	}
+	return 1;
+}
+
+#ifdef CONFIG_EVENT_TRACE
+
+/*
+ * Number of per-CPU trace entries:
+ */
+#define MAX_TRACE (65536UL*16UL)
+
+#define CMDLINE_BYTES 16
+
+/*
+ * 32 bytes on 32-bit platforms:
+ */
+struct trace_entry {
+	char type;
+	char cpu;
+	char flags;
+	char preempt_count; // assumes PREEMPT_MASK is 8 bits or less
+	int pid;
+	cycle_t timestamp;
+	union {
+		struct {
+			unsigned long eip;
+			unsigned long parent_eip;
+		} fn;
+		struct {
+			unsigned long eip;
+			unsigned long v1, v2, v3;
+		} special;
+		struct {
+			unsigned char str[CMDLINE_BYTES];
+		} cmdline;
+		struct {
+			unsigned long nr; // highest bit: compat call
+			unsigned long p1, p2, p3;
+		} syscall;
+		struct {
+			unsigned long ret;
+		} sysret;
+		struct {
+			unsigned long __pad3[4];
+		} pad;
+	} u;
+} __attribute__((packed));
+
+#endif
+
+struct cpu_trace {
+	atomic_t disabled;
+	unsigned long trace_idx;
+	cycle_t preempt_timestamp;
+	unsigned long critical_start, critical_end;
+	unsigned long critical_sequence;
+	atomic_t underrun;
+	atomic_t overrun;
+	int early_warning;
+	int latency_type;
+	int cpu;
+
+#ifdef CONFIG_EVENT_TRACE
+	struct trace_entry *trace;
+	char comm[CMDLINE_BYTES];
+	pid_t pid;
+	unsigned long uid;
+	unsigned long nice;
+	unsigned long policy;
+	unsigned long rt_priority;
+	unsigned long saved_latency;
+#endif
+#ifdef CONFIG_DEBUG_STACKOVERFLOW
+	unsigned long stack_check;
+#endif
+} ____cacheline_aligned_in_smp;
+
+static struct cpu_trace cpu_traces[NR_CPUS] ____cacheline_aligned_in_smp =
+{ [0 ... NR_CPUS-1] = {
+#ifdef CONFIG_DEBUG_STACKOVERFLOW
+ .stack_check = 1
+#endif
+ } };
+
+#ifdef CONFIG_EVENT_TRACE
+
+int trace_enabled = 0;
+int syscall_tracing = 1;
+int stackframe_tracing = 0;
+int mcount_enabled = 0;
+int trace_freerunning = 0;
+int trace_print_on_crash = 0;
+int trace_verbose = 0;
+int trace_all_cpus = 0;
+int print_functions = 0;
+int trace_all_runnable = 0;
+
+/*
+ * user-triggered via gettimeofday(0,1)/gettimeofday(0,0)
+ */
+int trace_user_triggered = 0;
+int trace_user_trigger_irq = -1;
+
+void trace_start_ht_debug(void)
+{
+	trace_all_cpus = 0;
+	trace_freerunning = 1;
+	trace_user_triggered = 1;
+	mcount_enabled = 1;
+	trace_enabled = 1;
+	user_trace_start();
+}
+
+struct saved_trace_struct {
+	int cpu;
+	cycle_t first_timestamp, last_timestamp;
+	struct cpu_trace traces[NR_CPUS];
+} ____cacheline_aligned_in_smp;
+
+/*
+ * The current worst-case trace:
+ */
+static struct saved_trace_struct max_tr;
+
+/*
+ * /proc/latency_trace atomicity:
+ */
+static DECLARE_MUTEX(out_mutex);
+
+static struct saved_trace_struct out_tr;
+
+static void notrace printk_name(unsigned long eip)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	sym_name = kallsyms_lookup(eip, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		printk("%s+%#lx/%#lx", sym_name, offset, size);
+	else
+		printk("<%08lx>", eip);
+}
+
+#ifdef CONFIG_DEBUG_STACKOVERFLOW
+
+#ifndef STACK_WARN
+# define STACK_WARN (THREAD_SIZE/8)
+#endif
+
+#define MIN_STACK_NEEDED (sizeof(struct thread_info) + STACK_WARN)
+#define MAX_STACK (THREAD_SIZE - sizeof(struct thread_info))
+
+#if (defined(__i386__) || defined(__x86_64__)) && defined(CONFIG_FRAME_POINTER)
+# define PRINT_EXACT_STACKFRAME
+#endif
+
+#ifdef PRINT_EXACT_STACKFRAME
+static unsigned long *worst_stack_bp;
+#endif
+static DEFINE_RAW_SPINLOCK(worst_stack_lock);
+unsigned long worst_stack_left = THREAD_SIZE;
+static unsigned long worst_stack_printed = THREAD_SIZE;
+static char worst_stack_comm[TASK_COMM_LEN+1];
+static int worst_stack_pid;
+static unsigned long worst_stack_sp;
+static char worst_stack[THREAD_SIZE];
+
+static notrace void fill_worst_stack(unsigned long stack_left)
+{
+	unsigned long flags;
+
+	/*
+	 * On x64, we must not read the PDA during early bootup:
+	 */
+#ifdef CONFIG_X86_64
+	if (system_state == SYSTEM_BOOTING)
+		return;
+#endif
+	spin_lock_irqsave(&worst_stack_lock, flags);
+	if (likely(stack_left < worst_stack_left)) {
+		worst_stack_left = stack_left;
+		memcpy(worst_stack, current_thread_info(), THREAD_SIZE);
+		worst_stack_sp = (unsigned long)&stack_left;
+		memcpy(worst_stack_comm, current->comm, TASK_COMM_LEN);
+		worst_stack_pid = current->pid;
+#ifdef PRINT_EXACT_STACKFRAME
+# ifdef __i386__
+		asm ("mov %%ebp, %0\n" :"=g"(worst_stack_bp));
+# elif defined(__x86_64__)
+		asm ("mov %%rbp, %0\n" :"=g"(worst_stack_bp));
+# else
+#  error Poke the author of above asm code lines !
+# endif
+#endif
+	}
+	spin_unlock_irqrestore(&worst_stack_lock, flags);
+}
+
+#ifdef PRINT_EXACT_STACKFRAME
+
+/*
+ * This takes a BP offset to point the BP back into the saved stack,
+ * the original stack might be long gone (but the stackframe within
+ * the saved copy still contains references to it).
+ */
+#define CONVERT_TO_SAVED_STACK(bp) \
+	((void *)worst_stack + ((unsigned long)bp & (THREAD_SIZE-1)))
+
+static void show_stackframe(void)
+{
+	unsigned long addr, frame_size, *bp, *prev_bp, sum = 0;
+
+	bp = CONVERT_TO_SAVED_STACK(worst_stack_bp);
+
+	while (bp[0]) {
+		addr = bp[1];
+		if (!kernel_text_address(addr))
+			break;
+
+		prev_bp = bp;
+		bp = CONVERT_TO_SAVED_STACK((unsigned long *)bp[0]);
+
+		frame_size = (bp - prev_bp) * sizeof(long);
+
+		if (frame_size < THREAD_SIZE) {
+			printk("{ %4ld} ", frame_size);
+			sum += frame_size;
+		} else
+			printk("{=%4ld} ", sum);
+
+		printk("[<%08lx>] ", addr);
+		printk_name(addr);
+		printk("\n");
+	}
+}
+
+#else
+
+static inline int valid_stack_ptr(void *p)
+{
+	return  p > (void *)worst_stack &&
+                p < (void *)worst_stack + THREAD_SIZE - 3;
+}
+
+static void show_stackframe(void)
+{
+	unsigned long prev_frame, addr;
+	unsigned long *stack;
+
+	prev_frame = (unsigned long)(worst_stack +
+					(worst_stack_sp & (THREAD_SIZE-1)));
+	stack = (unsigned long *)prev_frame;
+
+	while (valid_stack_ptr(stack)) {
+		addr = *stack++;
+		if (__kernel_text_address(addr)) {
+			printk("(%4ld) ", (unsigned long)stack - prev_frame);
+			printk("[<%08lx>] ", addr);
+			print_symbol("%s\n", addr);
+			prev_frame = (unsigned long)stack;
+		}
+		if ((char *)stack >= worst_stack + THREAD_SIZE)
+			break;
+	}
+}
+
+#endif
+
+static notrace void __print_worst_stack(void)
+{
+	unsigned long fill_ratio;
+	printk("----------------------------->\n");
+	printk("| new stack fill maximum: %s/%d, %ld bytes (out of %ld bytes).\n",
+		worst_stack_comm, worst_stack_pid,
+		MAX_STACK-worst_stack_left, (long)MAX_STACK);
+	fill_ratio = (MAX_STACK-worst_stack_left)*100/(long)MAX_STACK;
+	printk("| Stack fill ratio: %02ld%%", fill_ratio);
+	if (fill_ratio >= 90)
+		printk(" - BUG: that's quite high, please report this!\n");
+	else
+		printk(" - that's still OK, no need to report this.\n");
+	printk("------------|\n");
+
+	show_stackframe();
+	printk("<---------------------------\n\n");
+}
+
+static notrace void print_worst_stack(void)
+{
+	unsigned long flags;
+
+	if (irqs_disabled() || preempt_count())
+		return;
+
+	spin_lock_irqsave(&worst_stack_lock, flags);
+	if (worst_stack_printed == worst_stack_left) {
+		spin_unlock_irqrestore(&worst_stack_lock, flags);
+		return;
+	}
+	worst_stack_printed = worst_stack_left;
+	spin_unlock_irqrestore(&worst_stack_lock, flags);
+
+	__print_worst_stack();
+}
+
+static notrace void debug_stackoverflow(struct cpu_trace *tr)
+{
+	long stack_left;
+
+	if (unlikely(tr->stack_check <= 0))
+		return;
+	atomic_inc(&tr->disabled);
+
+	/* Debugging check for stack overflow: is there less than 1KB free? */
+#ifdef __i386__
+	__asm__ __volatile__("and %%esp,%0" :
+				"=r" (stack_left) : "0" (THREAD_SIZE - 1));
+#elif defined(__x86_64__)
+	__asm__ __volatile__("and %%rsp,%0" :
+				"=r" (stack_left) : "0" (THREAD_SIZE - 1));
+#else
+# error Poke the author of above asm code lines !
+#endif
+	if (unlikely(stack_left < MIN_STACK_NEEDED)) {
+		tr->stack_check = 0;
+		printk(KERN_ALERT "BUG: stack overflow: only %ld bytes left! [%08lx...(%08lx-%08lx)]\n",
+			stack_left - sizeof(struct thread_info),
+			(long)&stack_left,
+			(long)current_thread_info(),
+			(long)current_thread_info() + THREAD_SIZE);
+		fill_worst_stack(stack_left);
+		__print_worst_stack();
+		goto out;
+	}
+	if (unlikely(stack_left < worst_stack_left)) {
+		tr->stack_check--;
+		fill_worst_stack(stack_left);
+		print_worst_stack();
+		tr->stack_check++;
+	} else
+		if (worst_stack_printed != worst_stack_left) {
+			tr->stack_check--;
+			print_worst_stack();
+			tr->stack_check++;
+		}
+out:
+	atomic_dec(&tr->disabled);
+}
+
+#endif
+
+#ifdef CONFIG_EARLY_PRINTK
+static void notrace early_printk_name(unsigned long eip)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	sym_name = kallsyms_lookup(eip, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		early_printk("%s <%08lx>", sym_name, eip);
+	else
+		early_printk("<%08lx>", eip);
+}
+
+static __raw_spinlock_t early_print_lock = __RAW_SPIN_LOCK_UNLOCKED;
+
+static void notrace early_print_entry(struct trace_entry *entry)
+{
+	int hardirq, softirq;
+
+	__raw_spin_lock(&early_print_lock);
+	early_printk("%-5d ", entry->pid);
+
+	early_printk("%d%c%c",
+		entry->cpu,
+		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+		(entry->flags & TRACE_FLAG_IRQS_HARD_OFF) ? 'D' : '.',
+		(entry->flags & TRACE_FLAG_NEED_RESCHED_DELAYED) ? 'n' :
+ 		((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.'));
+
+	hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+	softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+	if (hardirq && softirq)
+		early_printk("H");
+	else {
+		if (hardirq)
+			early_printk("h");
+		else {
+			if (softirq)
+				early_printk("s");
+			else
+				early_printk(".");
+		}
+	}
+
+	early_printk(":%d: ", entry->preempt_count);
+
+	if (entry->type == TRACE_FN) {
+		early_printk_name(entry->u.fn.eip);
+		early_printk("  <= (");
+		early_printk_name(entry->u.fn.parent_eip);
+		early_printk(")\n");
+	} else {
+		/* special entries: */
+		early_printk_name(entry->u.special.eip);
+		early_printk(": <%08lx> <%08lx> <%08lx>\n",
+			entry->u.special.v1,
+			entry->u.special.v2,
+			entry->u.special.v3);
+	}
+	__raw_spin_unlock(&early_print_lock);
+}
+#else
+#  define early_print_entry(x) do { } while(0)
+#endif
+
+static void notrace
+____trace(int cpu, enum trace_type type, struct cpu_trace *tr,
+	  unsigned long eip, unsigned long parent_eip,
+	  unsigned long v1, unsigned long v2, unsigned long v3,
+	  unsigned long flags)
+{
+	struct trace_entry *entry;
+	unsigned long idx, idx_next;
+	cycle_t timestamp;
+	u32 pc;
+
+#ifdef CONFIG_DEBUG_PREEMPT
+//	WARN_ON(!atomic_read(&tr->disabled));
+#endif
+	if (!tr->critical_start && !trace_user_triggered && !trace_all_cpus &&
+	    !trace_print_on_crash && !print_functions)
+		goto out;
+	/*
+	 * Allocate the next index. Make sure an NMI (or interrupt)
+	 * has not taken it away. Potentially redo the timestamp as
+	 * well to make sure the trace timestamps are in chronologic
+	 * order.
+	 */
+again:
+	idx = tr->trace_idx;
+	idx_next = idx + 1;
+	timestamp = now();
+
+	if (unlikely((trace_freerunning || print_functions || atomic_read(&tr->underrun)) &&
+		     (idx_next >= MAX_TRACE) && !atomic_read(&tr->overrun))) {
+		atomic_inc(&tr->underrun);
+		idx_next = 0;
+	}
+	if (unlikely(idx >= MAX_TRACE)) {
+		atomic_inc(&tr->overrun);
+		goto out;
+	}
+#ifdef __HAVE_ARCH_CMPXCHG
+	if (unlikely(cmpxchg(&tr->trace_idx, idx, idx_next) != idx)) {
+		if (idx_next == 0)
+			atomic_dec(&tr->underrun);
+		goto again;
+	}
+#else
+# ifdef CONFIG_SMP
+#  error CMPXCHG missing
+# else
+	/* No worry, we are protected by the atomic_incr(&tr->disabled)
+	 * in __trace further down
+	 */
+	tr->trace_idx = idx_next;
+# endif
+#endif
+	if (unlikely(idx_next != 0 && atomic_read(&tr->underrun)))
+		atomic_inc(&tr->underrun);
+
+	pc = preempt_count();
+
+	if (unlikely(!tr->trace))
+		goto out;
+	entry = tr->trace + idx;
+	entry->type = type;
+#ifdef CONFIG_SMP
+	entry->cpu = cpu;
+#endif
+	entry->flags = (irqs_off() ? TRACE_FLAG_IRQS_OFF : 0) |
+		(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_HARD_OFF : 0)|
+		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
+		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
+		(need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) |
+		(need_resched_delayed() ? TRACE_FLAG_NEED_RESCHED_DELAYED : 0);
+	entry->preempt_count = pc & 0xff;
+	entry->pid = current->pid;
+	entry->timestamp = timestamp;
+
+	switch (type) {
+	case TRACE_FN:
+		entry->u.fn.eip = eip;
+		entry->u.fn.parent_eip = parent_eip;
+		if (unlikely(print_functions && !in_interrupt()))
+			early_print_entry(entry);
+		break;
+	case TRACE_SPECIAL:
+	case TRACE_SPECIAL_PID:
+	case TRACE_SPECIAL_U64:
+	case TRACE_SPECIAL_SYM:
+		entry->u.special.eip = eip;
+		entry->u.special.v1 = v1;
+		entry->u.special.v2 = v2;
+		entry->u.special.v3 = v3;
+		if (unlikely(print_functions && !in_interrupt()))
+			early_print_entry(entry);
+		break;
+	case TRACE_SYSCALL:
+		entry->u.syscall.nr = eip;
+		entry->u.syscall.p1 = v1;
+		entry->u.syscall.p2 = v2;
+		entry->u.syscall.p3 = v3;
+		break;
+	case TRACE_SYSRET:
+		entry->u.sysret.ret = eip;
+		break;
+	case TRACE_CMDLINE:
+		memcpy(entry->u.cmdline.str, current->comm, CMDLINE_BYTES);
+		break;
+	default:
+		break;
+	}
+out:
+	;
+}
+
+static inline void notrace
+___trace(enum trace_type type, unsigned long eip, unsigned long parent_eip,
+		unsigned long v1, unsigned long v2,
+			unsigned long v3)
+{
+	struct cpu_trace *tr;
+	unsigned long flags;
+	int cpu;
+
+	if (unlikely(trace_enabled <= 0))
+		return;
+
+#if defined(CONFIG_DEBUG_STACKOVERFLOW) && defined(CONFIG_X86)
+	debug_stackoverflow(cpu_traces + raw_smp_processor_id());
+#endif
+
+	raw_local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	/*
+	 * Trace on the CPU where the current highest-prio task
+	 * is waiting to become runnable:
+	 */
+#ifdef CONFIG_WAKEUP_TIMING
+	if (wakeup_timing && !trace_all_cpus && !trace_print_on_crash &&
+	    !print_functions) {
+		if (!sch.tr || cpu != sch.cpu)
+			goto out;
+		tr = sch.tr;
+	} else
+		tr = cpu_traces + cpu;
+#else
+	tr = cpu_traces + cpu;
+#endif
+	atomic_inc(&tr->disabled);
+	if (likely(atomic_read(&tr->disabled) == 1)) {
+//#define DEBUG_STACK_POISON
+#ifdef DEBUG_STACK_POISON
+		char stack;
+
+		memset(&stack - 128, 0x34, 128);
+#endif
+		____trace(cpu, type, tr, eip, parent_eip, v1, v2, v3, flags);
+	}
+	atomic_dec(&tr->disabled);
+#ifdef CONFIG_WAKEUP_TIMING
+out:
+#endif
+	raw_local_irq_restore(flags);
+}
+
+/*
+ * Special, ad-hoc tracepoints:
+ */
+void notrace trace_special(unsigned long v1, unsigned long v2, unsigned long v3)
+{
+	___trace(TRACE_SPECIAL, CALLER_ADDR0, 0, v1, v2, v3);
+}
+
+EXPORT_SYMBOL(trace_special);
+
+void notrace trace_special_pid(int pid, unsigned long v1, unsigned long v2)
+{
+	___trace(TRACE_SPECIAL_PID, CALLER_ADDR0, 0, pid, v1, v2);
+}
+
+EXPORT_SYMBOL(trace_special_pid);
+
+void notrace trace_special_u64(unsigned long long v1, unsigned long v2)
+{
+	___trace(TRACE_SPECIAL_U64, CALLER_ADDR0, 0,
+		 (unsigned long) (v1 >> 32), (unsigned long) (v1 & 0xFFFFFFFF),
+		 v2);
+}
+
+EXPORT_SYMBOL(trace_special_u64);
+
+void notrace trace_special_sym(void)
+{
+#define STACK_ENTRIES 8
+	unsigned long entries[STACK_ENTRIES];
+	struct stack_trace trace;
+
+	if (!trace_enabled)
+		return;
+
+	if (!stackframe_tracing)
+		return 	___trace(TRACE_SPECIAL, CALLER_ADDR0, 0, CALLER_ADDR1, 0, 0);
+
+	trace.entries = entries;
+	trace.skip = 3;
+#if 0
+	trace.all_contexts = 1;
+#endif
+	trace.max_entries = STACK_ENTRIES;
+	trace.nr_entries = 0;
+
+#if 0
+	save_stack_trace(&trace, NULL);
+#else
+	save_stack_trace(&trace);
+#endif
+	/*
+	 * clear out the rest:
+	 */
+	while (trace.nr_entries < trace.max_entries)
+		entries[trace.nr_entries++] = 0;
+
+	___trace(TRACE_SPECIAL_SYM, entries[0], 0,
+					entries[1], entries[2], entries[3]);
+	___trace(TRACE_SPECIAL_SYM, entries[4], 0,
+					entries[5], entries[6], entries[7]);
+}
+
+EXPORT_SYMBOL(trace_special_sym);
+
+/*
+ * Non-inlined function:
+ */
+void notrace __trace(unsigned long eip, unsigned long parent_eip)
+{
+	___trace(TRACE_FN, eip, parent_eip, 0, 0, 0);
+}
+
+#ifdef CONFIG_MCOUNT
+
+extern void mcount(void);
+
+EXPORT_SYMBOL(mcount);
+
+void notrace __mcount(void)
+{
+	___trace(TRACE_FN, CALLER_ADDR1, CALLER_ADDR2, 0, 0, 0);
+}
+
+#endif
+
+void notrace
+sys_call(unsigned long nr, unsigned long p1, unsigned long p2, unsigned long p3)
+{
+	if (syscall_tracing)
+		___trace(TRACE_SYSCALL, nr, 0, p1, p2, p3);
+}
+
+#if defined(CONFIG_COMPAT) && defined(CONFIG_X86)
+
+void notrace
+sys_ia32_call(unsigned long nr, unsigned long p1, unsigned long p2,
+	      unsigned long p3)
+{
+	if (syscall_tracing)
+		___trace(TRACE_SYSCALL, nr | 0x80000000, 0, p1, p2, p3);
+}
+
+#endif
+
+void notrace sys_ret(unsigned long ret)
+{
+	if (syscall_tracing)
+		___trace(TRACE_SYSRET, ret, 0, 0, 0, 0);
+}
+
+static void notrace print_name(struct seq_file *m, unsigned long eip)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	/*
+	 * Special trace values:
+	 */
+	if (((long)eip < 10000L) && ((long)eip > -10000L)) {
+		seq_printf(m, "(%5ld)", eip);
+		return;
+	}
+	sym_name = kallsyms_lookup(eip, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		seq_puts(m, sym_name);
+	else
+		seq_printf(m, "<%08lx>", eip);
+}
+
+static void notrace print_name_offset(struct seq_file *m, unsigned long eip)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	sym_name = kallsyms_lookup(eip, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		seq_printf(m, "%s+%#lx/%#lx <%08lx>",
+					sym_name, offset, size, eip);
+	else
+		seq_printf(m, "<%08lx>", eip);
+}
+
+static unsigned long out_sequence = -1;
+
+static int pid_to_cmdline_array[PID_MAX_DEFAULT+1];
+
+static void notrace _trace_cmdline(int cpu, struct cpu_trace *tr)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+	____trace(cpu, TRACE_CMDLINE, tr, 0, 0, 0, 0, 0, flags);
+}
+
+void notrace trace_cmdline(void)
+{
+	___trace(TRACE_CMDLINE, 0, 0, 0, 0, 0);
+}
+
+static void construct_pid_to_cmdline(struct cpu_trace *tr)
+{
+	unsigned int i, j, entries, pid;
+
+	if (tr->critical_sequence == out_sequence)
+		return;
+	out_sequence = tr->critical_sequence;
+
+	memset(pid_to_cmdline_array, -1, sizeof(int) * (PID_MAX_DEFAULT + 1));
+
+	if (!tr->trace)
+		return;
+
+	entries = min(tr->trace_idx, MAX_TRACE);
+
+	for (i = 0; i < entries; i++) {
+		struct trace_entry *entry = tr->trace + i;
+
+		if (entry->type != TRACE_CMDLINE)
+			continue;
+		pid = entry->pid;
+		if (pid < PID_MAX_DEFAULT) {
+			pid_to_cmdline_array[pid] = i;
+			/*
+			 * Replace space with underline - makes it easier
+			 * to process for tools:
+			 */
+			for (j = 0; j < CMDLINE_BYTES; j++)
+				if (entry->u.cmdline.str[j] == ' ')
+					entry->u.cmdline.str[j] = '_';
+		}
+	}
+}
+
+char *pid_to_cmdline(unsigned long pid)
+{
+	struct cpu_trace *tr = out_tr.traces + 0;
+	char *cmdline = "<...>";
+	int idx;
+
+	pid = min(pid, (unsigned long)PID_MAX_DEFAULT);
+	if (!pid)
+		return "<idle>";
+
+	if (pid_to_cmdline_array[pid] != -1) {
+		idx = pid_to_cmdline_array[pid];
+		if (tr->trace[idx].type == TRACE_CMDLINE)
+			cmdline = tr->trace[idx].u.cmdline.str;
+	}
+	return cmdline;
+}
+
+static void copy_trace(struct cpu_trace *save, struct cpu_trace *tr, int reorder)
+{
+	if (!save->trace || !tr->trace)
+		return;
+	/* free-running needs reordering */
+	if (reorder && atomic_read(&tr->underrun)) {
+		int i, idx, idx0 = tr->trace_idx;
+
+		for (i = 0; i < MAX_TRACE; i++) {
+			idx = (idx0 + i) % MAX_TRACE;
+			save->trace[i] = tr->trace[idx];
+		}
+		save->trace_idx = MAX_TRACE;
+	} else {
+		save->trace_idx = tr->trace_idx;
+
+		memcpy(save->trace, tr->trace,
+			min(save->trace_idx, MAX_TRACE) *
+					sizeof(struct trace_entry));
+	}
+	save->underrun = tr->underrun;
+	save->overrun = tr->overrun;
+}
+
+
+struct block_idx {
+	int idx[NR_CPUS];
+};
+
+/*
+ * return the trace entry (position) of the smallest-timestamp
+ * one (that is still in the valid idx range):
+ */
+static int min_idx(struct block_idx *bidx)
+{
+	cycle_t min_stamp = (cycle_t) -1;
+	struct trace_entry *entry;
+	int cpu, min_cpu = -1, idx;
+
+	for_each_online_cpu(cpu) {
+		idx = bidx->idx[cpu];
+		if (idx >= min(max_tr.traces[cpu].trace_idx, MAX_TRACE))
+			continue;
+		if (idx > MAX_TRACE*NR_CPUS) {
+			printk("huh: idx (%d) > %ld*%d!\n", idx, MAX_TRACE,
+				NR_CPUS);
+			WARN_ON(1);
+			break;
+		}
+		entry = max_tr.traces[cpu].trace + bidx->idx[cpu];
+		if (entry->timestamp < min_stamp) {
+			min_cpu = cpu;
+			min_stamp = entry->timestamp;
+		}
+	}
+
+	return min_cpu;
+}
+
+/*
+ * This code is called to construct an output trace from
+ * the maximum trace. Having separate traces serves both
+ * atomicity (a new max might be saved while we are busy
+ * accessing /proc/latency_trace) and it is also used to
+ * delay the (expensive) sorting of the output trace by
+ * timestamps, in the trace_all_cpus case.
+ */
+static void update_out_trace(void)
+{
+	struct trace_entry *out_entry, *entry, *tmp;
+	cycle_t stamp, first_stamp, last_stamp;
+	struct block_idx bidx = { { 0, }, };
+	struct cpu_trace *tmp_max, *tmp_out;
+	int cpu, sum, entries, underrun_sum, overrun_sum;
+
+	/*
+	 * For out_tr we only have the first array's trace entries
+	 * allocated - and they have are larger on SMP to make room
+	 * for all trace entries from all CPUs.
+	 */
+	tmp_out = out_tr.traces + 0;
+	tmp_max = max_tr.traces + max_tr.cpu;
+	/*
+	 * Easier to copy this way. Note: the trace buffer is private
+	 * to the output buffer, so preserve it:
+	 */
+	copy_trace(tmp_out, tmp_max, 0);
+	tmp = tmp_out->trace;
+	*tmp_out = *tmp_max;
+	tmp_out->trace = tmp;
+
+	out_tr.cpu = max_tr.cpu;
+
+	if (!tmp_out->trace)
+		return;
+
+	out_entry = tmp_out->trace + 0;
+
+	if (!trace_all_cpus) {
+		entries = min(tmp_out->trace_idx, MAX_TRACE);
+		if (!entries)
+			return;
+		out_tr.first_timestamp = tmp_out->trace[0].timestamp;
+		out_tr.last_timestamp = tmp_out->trace[entries-1].timestamp;
+		return;
+	}
+	/*
+	 * Find the range of timestamps that are fully traced in
+	 * all CPU traces. (since CPU traces can cover a variable
+	 * range of time, we have to find the best range.)
+	 */
+	first_stamp = 0;
+	for_each_online_cpu(cpu) {
+		tmp_max = max_tr.traces + cpu;
+		stamp = tmp_max->trace[0].timestamp;
+		if (stamp > first_stamp)
+			first_stamp = stamp;
+	}
+	/*
+	 * Save the timestamp range:
+	 */
+	tmp_max = max_tr.traces + max_tr.cpu;
+	entries = min(tmp_max->trace_idx, MAX_TRACE);
+	/*
+	 * No saved trace yet?
+	 */
+	if (!entries) {
+		out_tr.traces[0].trace_idx = 0;
+		return;
+	}
+
+	last_stamp = tmp_max->trace[entries-1].timestamp;
+
+	if (last_stamp < first_stamp) {
+		WARN_ON(1);
+
+		for_each_online_cpu(cpu) {
+			tmp_max = max_tr.traces + cpu;
+			entries = min(tmp_max->trace_idx, MAX_TRACE);
+			printk("CPU%d: %016Lx (%016Lx) ... #%d (%016Lx) %016Lx\n",
+				cpu,
+				tmp_max->trace[0].timestamp,
+				tmp_max->trace[1].timestamp,
+				entries,
+				tmp_max->trace[entries-2].timestamp,
+				tmp_max->trace[entries-1].timestamp);
+		}
+		tmp_max = max_tr.traces + max_tr.cpu;
+		entries = min(tmp_max->trace_idx, MAX_TRACE);
+
+		printk("CPU%d entries: %d\n", max_tr.cpu, entries);
+		printk("first stamp: %016Lx\n", first_stamp);
+		printk(" last stamp: %016Lx\n", first_stamp);
+	}
+
+#if 0
+	printk("first_stamp: %Ld [%016Lx]\n", first_stamp, first_stamp);
+	printk(" last_stamp: %Ld [%016Lx]\n", last_stamp, last_stamp);
+	printk("   +1 stamp: %Ld [%016Lx]\n",
+		tmp_max->trace[entries].timestamp,
+		tmp_max->trace[entries].timestamp);
+	printk("   +2 stamp: %Ld [%016Lx]\n",
+		tmp_max->trace[entries+1].timestamp,
+		tmp_max->trace[entries+1].timestamp);
+	printk("      delta: %Ld\n", last_stamp-first_stamp);
+	printk("    entries: %d\n", entries);
+#endif
+
+	out_tr.first_timestamp = first_stamp;
+	out_tr.last_timestamp = last_stamp;
+
+	/*
+	 * Fetch trace entries one by one, in increasing timestamp
+	 * order. Start at first_stamp, stop at last_stamp:
+	 */
+	sum = 0;
+	for (;;) {
+		cpu = min_idx(&bidx);
+		if (cpu == -1)
+			break;
+		entry = max_tr.traces[cpu].trace + bidx.idx[cpu];
+		if (entry->timestamp > last_stamp)
+			break;
+
+		bidx.idx[cpu]++;
+		if (entry->timestamp < first_stamp)
+			continue;
+		*out_entry = *entry;
+		out_entry++;
+		sum++;
+		if (sum > MAX_TRACE*NR_CPUS) {
+			printk("huh: sum (%d) > %ld*%d!\n", sum, MAX_TRACE,
+				NR_CPUS);
+			WARN_ON(1);
+			break;
+		}
+	}
+
+	sum = 0;
+	underrun_sum = 0;
+	overrun_sum = 0;
+	for_each_online_cpu(cpu) {
+		sum += max_tr.traces[cpu].trace_idx;
+		underrun_sum += atomic_read(&max_tr.traces[cpu].underrun);
+		overrun_sum += atomic_read(&max_tr.traces[cpu].overrun);
+	}
+	tmp_out->trace_idx = sum;
+	atomic_set(&tmp_out->underrun, underrun_sum);
+	atomic_set(&tmp_out->overrun, overrun_sum);
+}
+
+static void notrace print_help_header(struct seq_file *m)
+{
+	seq_puts(m, "                 _------=> CPU#            \n");
+	seq_puts(m, "                / _-----=> irqs-off        \n");
+	seq_puts(m, "               | / _----=> need-resched    \n");
+	seq_puts(m, "               || / _---=> hardirq/softirq \n");
+	seq_puts(m, "               ||| / _--=> preempt-depth   \n");
+	seq_puts(m, "               |||| /                      \n");
+	seq_puts(m, "               |||||     delay             \n");
+	seq_puts(m, "   cmd     pid ||||| time  |   caller      \n");
+	seq_puts(m, "      \\   /    |||||   \\   |   /           \n");
+}
+
+static void * notrace l_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+	unsigned long entries;
+	struct cpu_trace *tr = out_tr.traces + 0;
+
+	down(&out_mutex);
+	/*
+	 * if the file is being read newly, update the output trace:
+	 */
+	if (!n) {
+		// TODO: use the sequence counter here to optimize
+		down(&max_mutex);
+		update_out_trace();
+		up(&max_mutex);
+#if 0
+		if (!tr->trace_idx) {
+			up(&out_mutex);
+			return NULL;
+		}
+#endif
+		construct_pid_to_cmdline(tr);
+	}
+	entries = min(tr->trace_idx, MAX_TRACE);
+
+	if (!n) {
+		seq_printf(m, "preemption latency trace v1.1.5 on %s\n",
+			   UTS_RELEASE);
+		seq_puts(m, "--------------------------------------------------------------------\n");
+		seq_printf(m, " latency: %lu us, #%lu/%lu, CPU#%d | (M:%s VP:%d, KP:%d, SP:%d HP:%d",
+			cycles_to_usecs(tr->saved_latency),
+			entries,
+			(entries + atomic_read(&tr->underrun) +
+			 atomic_read(&tr->overrun)),
+			out_tr.cpu,
+#if defined(CONFIG_PREEMPT_NONE)
+			"server",
+#elif defined(CONFIG_PREEMPT_VOLUNTARY)
+			"desktop",
+#elif defined(CONFIG_PREEMPT_DESKTOP)
+			"preempt",
+#else
+			"rt",
+#endif
+			0, 0,
+#ifdef CONFIG_PREEMPT_SOFTIRQS
+			softirq_preemption
+#else
+			0
+#endif
+			,
+#ifdef CONFIG_PREEMPT_HARDIRQS
+ hardirq_preemption
+#else
+			0
+#endif
+		);
+#ifdef CONFIG_SMP
+		seq_printf(m, " #P:%d)\n", num_online_cpus());
+#else
+		seq_puts(m, ")\n");
+#endif
+		seq_puts(m, "    -----------------\n");
+		seq_printf(m, "    | task: %.16s-%d (uid:%ld nice:%ld policy:%ld rt_prio:%ld)\n",
+			tr->comm, tr->pid, tr->uid, tr->nice,
+			tr->policy, tr->rt_priority);
+		seq_puts(m, "    -----------------\n");
+		if (trace_user_triggered) {
+			seq_puts(m, " => started at: ");
+			print_name_offset(m, tr->critical_start);
+			seq_puts(m, "\n => ended at:   ");
+			print_name_offset(m, tr->critical_end);
+			seq_puts(m, "\n");
+		}
+		seq_puts(m, "\n");
+
+		if (!trace_verbose)
+			print_help_header(m);
+	}
+	if (n >= entries || !tr->trace)
+		return NULL;
+
+	return tr->trace + n;
+}
+
+static void * notrace l_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct cpu_trace *tr = out_tr.traces;
+	unsigned long entries = min(tr->trace_idx, MAX_TRACE);
+
+	WARN_ON(!tr->trace);
+
+	if (++*pos >= entries) {
+		if (*pos == entries)
+			seq_puts(m, "\n\nvim:ft=help\n");
+		return NULL;
+	}
+	return tr->trace + *pos;
+}
+
+static void notrace l_stop(struct seq_file *m, void *p)
+{
+	up(&out_mutex);
+}
+
+static void print_timestamp(struct seq_file *m, unsigned long abs_usecs,
+						unsigned long rel_usecs)
+{
+	seq_printf(m, " %4ldus", abs_usecs);
+	if (rel_usecs > 100)
+		seq_puts(m, "!: ");
+	else if (rel_usecs > 1)
+		seq_puts(m, "+: ");
+	else
+		seq_puts(m, " : ");
+}
+
+static void
+print_timestamp_short(struct seq_file *m, unsigned long abs_usecs,
+			unsigned long rel_usecs)
+{
+	seq_printf(m, " %4ldus", abs_usecs);
+	if (rel_usecs > 100)
+		seq_putc(m, '!');
+	else if (rel_usecs > 1)
+		seq_putc(m, '+');
+	else
+		seq_putc(m, ' ');
+}
+
+static void
+print_generic(struct seq_file *m, struct trace_entry *entry)
+{
+	int hardirq, softirq;
+
+	seq_printf(m, "%8.8s-%-5d ", pid_to_cmdline(entry->pid), entry->pid);
+	seq_printf(m, "%d", entry->cpu);
+	seq_printf(m, "%c%c",
+		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+		(entry->flags & TRACE_FLAG_IRQS_HARD_OFF) ? 'D' : '.',
+		(entry->flags & TRACE_FLAG_NEED_RESCHED_DELAYED) ? 'n' :
+ 		((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.'));
+
+	hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+	softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+	if (hardirq && softirq)
+		seq_putc(m, 'H');
+	else {
+		if (hardirq)
+			seq_putc(m, 'h');
+		else {
+			if (softirq)
+				seq_putc(m, 's');
+			else
+				seq_putc(m, '.');
+		}
+	}
+
+	if (entry->preempt_count)
+		seq_printf(m, "%x", entry->preempt_count);
+	else
+		seq_puts(m, ".");
+}
+
+
+static int notrace l_show_fn(struct seq_file *m, unsigned long trace_idx,
+		struct trace_entry *entry, struct trace_entry *entry0,
+		struct trace_entry *next_entry)
+{
+	unsigned long abs_usecs, rel_usecs;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	if (trace_verbose) {
+		seq_printf(m, "%16s %5d %d %d %08x %08lx [%016Lx] %ld.%03ldms (+%ld.%03ldms): ",
+			pid_to_cmdline(entry->pid),
+			entry->pid, entry->cpu, entry->flags,
+			entry->preempt_count, trace_idx,
+			entry->timestamp, abs_usecs/1000,
+			abs_usecs % 1000, rel_usecs/1000, rel_usecs % 1000);
+		print_name_offset(m, entry->u.fn.eip);
+		seq_puts(m, " (");
+		print_name_offset(m, entry->u.fn.parent_eip);
+		seq_puts(m, ")\n");
+	} else {
+		print_generic(m, entry);
+		print_timestamp(m, abs_usecs, rel_usecs);
+		print_name(m, entry->u.fn.eip);
+		seq_puts(m, " (");
+		print_name(m, entry->u.fn.parent_eip);
+		seq_puts(m, ")\n");
+	}
+	return 0;
+}
+
+static int notrace l_show_special(struct seq_file *m, unsigned long trace_idx,
+		struct trace_entry *entry, struct trace_entry *entry0,
+		struct trace_entry *next_entry, int mode64)
+{
+	unsigned long abs_usecs, rel_usecs;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	print_generic(m, entry);
+	print_timestamp(m, abs_usecs, rel_usecs);
+	if (trace_verbose)
+		print_name_offset(m, entry->u.special.eip);
+	else
+		print_name(m, entry->u.special.eip);
+
+	if (!mode64) {
+		/*
+		 * For convenience, print small numbers in decimal:
+		 */
+		if (abs((int)entry->u.special.v1) < 10000)
+			seq_printf(m, " (%5ld ", entry->u.special.v1);
+		else
+			seq_printf(m, " (%lx ", entry->u.special.v1);
+		if (abs((int)entry->u.special.v2) < 10000)
+			seq_printf(m, "%5ld ", entry->u.special.v2);
+		else
+			seq_printf(m, "%lx ", entry->u.special.v2);
+		if (abs((int)entry->u.special.v3) < 10000)
+			seq_printf(m, "%5ld)\n", entry->u.special.v3);
+		else
+			seq_printf(m, "%lx)\n", entry->u.special.v3);
+	} else {
+		seq_printf(m, " (%13Ld %ld)\n",
+			   ((u64)entry->u.special.v1 << 32)
+			   + (u64)entry->u.special.v2, entry->u.special.v3);
+	}
+	return 0;
+}
+
+static int notrace
+l_show_special_pid(struct seq_file *m, unsigned long trace_idx,
+		struct trace_entry *entry, struct trace_entry *entry0,
+		struct trace_entry *next_entry)
+{
+	unsigned long abs_usecs, rel_usecs;
+	unsigned int pid;
+
+	pid = entry->u.special.v1;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	print_generic(m, entry);
+	print_timestamp(m, abs_usecs, rel_usecs);
+	if (trace_verbose)
+		print_name_offset(m, entry->u.special.eip);
+	else
+		print_name(m, entry->u.special.eip);
+	seq_printf(m, " <%.8s-%d> (%ld %ld)\n",
+		pid_to_cmdline(pid), pid,
+		entry->u.special.v2, entry->u.special.v3);
+
+	return 0;
+}
+
+static int notrace
+l_show_special_sym(struct seq_file *m, unsigned long trace_idx,
+		   struct trace_entry *entry, struct trace_entry *entry0,
+		   struct trace_entry *next_entry, int mode64)
+{
+	unsigned long abs_usecs, rel_usecs;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	print_generic(m, entry);
+	print_timestamp(m, abs_usecs, rel_usecs);
+	if (trace_verbose)
+		print_name_offset(m, entry->u.special.eip);
+	else
+		print_name(m, entry->u.special.eip);
+
+	seq_puts(m, "()<-");
+	print_name(m, entry->u.special.v1);
+	seq_puts(m, "()<-");
+	print_name(m, entry->u.special.v2);
+	seq_puts(m, "()<-");
+	print_name(m, entry->u.special.v3);
+	seq_puts(m, "()\n");
+
+	return 0;
+}
+
+
+static int notrace l_show_cmdline(struct seq_file *m, unsigned long trace_idx,
+		struct trace_entry *entry, struct trace_entry *entry0,
+		struct trace_entry *next_entry)
+{
+	unsigned long abs_usecs, rel_usecs;
+
+	if (!trace_verbose)
+		return 0;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	seq_printf(m,
+		"[ => %16s ] %ld.%03ldms (+%ld.%03ldms)\n",
+			entry->u.cmdline.str,
+			abs_usecs/1000, abs_usecs % 1000,
+			rel_usecs/1000, rel_usecs % 1000);
+
+	return 0;
+}
+
+extern unsigned long sys_call_table[NR_syscalls];
+
+#if defined(CONFIG_COMPAT) && defined(CONFIG_X86)
+extern unsigned long ia32_sys_call_table[], ia32_syscall_end[];
+#define IA32_NR_syscalls (ia32_syscall_end - ia32_sys_call_table)
+#endif
+
+static int notrace l_show_syscall(struct seq_file *m, unsigned long trace_idx,
+		struct trace_entry *entry, struct trace_entry *entry0,
+		struct trace_entry *next_entry)
+{
+	unsigned long abs_usecs, rel_usecs;
+	unsigned long nr;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	print_generic(m, entry);
+	print_timestamp_short(m, abs_usecs, rel_usecs);
+
+	seq_puts(m, "> ");
+	nr = entry->u.syscall.nr;
+#if defined(CONFIG_COMPAT) && defined(CONFIG_X86)
+	if (nr & 0x80000000) {
+		nr &= ~0x80000000;
+		if (nr < IA32_NR_syscalls)
+			print_name(m, ia32_sys_call_table[nr]);
+		else
+			seq_printf(m, "<badsys(%lu)>", nr);
+	} else
+#endif
+	if (nr < NR_syscalls)
+		print_name(m, sys_call_table[nr]);
+	else
+		seq_printf(m, "<badsys(%lu)>", nr);
+
+#ifdef CONFIG_64BIT
+	seq_printf(m, " (%016lx %016lx %016lx)\n",
+		entry->u.syscall.p1, entry->u.syscall.p2, entry->u.syscall.p3);
+#else
+	seq_printf(m, " (%08lx %08lx %08lx)\n",
+		entry->u.syscall.p1, entry->u.syscall.p2, entry->u.syscall.p3);
+#endif
+
+	return 0;
+}
+
+static int notrace l_show_sysret(struct seq_file *m, unsigned long trace_idx,
+		struct trace_entry *entry, struct trace_entry *entry0,
+		struct trace_entry *next_entry)
+{
+	unsigned long abs_usecs, rel_usecs;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+	rel_usecs = cycles_to_us(next_entry->timestamp - entry->timestamp);
+
+	print_generic(m, entry);
+	print_timestamp_short(m, abs_usecs, rel_usecs);
+
+	seq_printf(m, "< (%ld)\n", entry->u.sysret.ret);
+
+	return 0;
+}
+
+
+static int notrace l_show(struct seq_file *m, void *p)
+{
+	struct cpu_trace *tr = out_tr.traces;
+	struct trace_entry *entry, *entry0, *next_entry;
+	unsigned long trace_idx;
+
+	cond_resched();
+	entry = p;
+	if (entry->timestamp < out_tr.first_timestamp)
+		return 0;
+	if (entry->timestamp > out_tr.last_timestamp)
+		return 0;
+
+	entry0 = tr->trace;
+	trace_idx = entry - entry0;
+
+	if (trace_idx + 1 < tr->trace_idx)
+		next_entry = entry + 1;
+	else
+		next_entry = entry;
+
+	if (trace_verbose)
+		seq_printf(m, "(T%d/#%ld) ", entry->type, trace_idx);
+
+	switch (entry->type) {
+		case TRACE_FN:
+			l_show_fn(m, trace_idx, entry, entry0, next_entry);
+			break;
+		case TRACE_SPECIAL:
+			l_show_special(m, trace_idx, entry, entry0, next_entry, 0);
+			break;
+		case TRACE_SPECIAL_PID:
+			l_show_special_pid(m, trace_idx, entry, entry0, next_entry);
+			break;
+		case TRACE_SPECIAL_U64:
+			l_show_special(m, trace_idx, entry, entry0, next_entry, 1);
+			break;
+		case TRACE_SPECIAL_SYM:
+			l_show_special_sym(m, trace_idx, entry, entry0,
+					   next_entry, 1);
+			break;
+		case TRACE_CMDLINE:
+			l_show_cmdline(m, trace_idx, entry, entry0, next_entry);
+			break;
+		case TRACE_SYSCALL:
+			l_show_syscall(m, trace_idx, entry, entry0, next_entry);
+			break;
+		case TRACE_SYSRET:
+			l_show_sysret(m, trace_idx, entry, entry0, next_entry);
+			break;
+		default:
+			seq_printf(m, "unknown trace type %d\n", entry->type);
+	}
+	return 0;
+}
+
+struct seq_operations latency_trace_op = {
+	.start	= l_start,
+	.next	= l_next,
+	.stop	= l_stop,
+	.show	= l_show
+};
+
+/*
+ * Copy the new maximum trace into the separate maximum-trace
+ * structure. (this way the maximum trace is permanently saved,
+ * for later retrieval via /proc/latency_trace)
+ */
+static void update_max_tr(struct cpu_trace *tr)
+{
+	struct cpu_trace *save;
+	int cpu, all_cpus = 0;
+
+#ifdef CONFIG_PREEMPT
+	WARN_ON(!preempt_count() && !irqs_disabled());
+#endif
+
+	max_tr.cpu = tr->cpu;
+	save = max_tr.traces + tr->cpu;
+
+	if ((wakeup_timing || trace_user_triggered || trace_print_on_crash ||
+	     print_functions) && trace_all_cpus) {
+		all_cpus = 1;
+		for_each_online_cpu(cpu)
+			atomic_inc(&cpu_traces[cpu].disabled);
+	}
+
+	save->saved_latency = preempt_max_latency;
+	save->preempt_timestamp = tr->preempt_timestamp;
+	save->critical_start = tr->critical_start;
+	save->critical_end = tr->critical_end;
+	save->critical_sequence = tr->critical_sequence;
+
+	memcpy(save->comm, current->comm, CMDLINE_BYTES);
+	save->pid = current->pid;
+	save->uid = current->uid;
+	save->nice = current->static_prio - 20 - MAX_RT_PRIO;
+	save->policy = current->policy;
+	save->rt_priority = current->rt_priority;
+
+	if (all_cpus) {
+		for_each_online_cpu(cpu) {
+			copy_trace(max_tr.traces + cpu, cpu_traces + cpu, 1);
+			atomic_dec(&cpu_traces[cpu].disabled);
+		}
+	} else
+		copy_trace(save, tr, 1);
+}
+
+#else /* !EVENT_TRACE */
+
+static inline void notrace
+____trace(int cpu, enum trace_type type, struct cpu_trace *tr,
+	  unsigned long eip, unsigned long parent_eip,
+	  unsigned long v1, unsigned long v2, unsigned long v3,
+	  unsigned long flags)
+{
+}
+
+static inline void notrace
+___trace(enum trace_type type, unsigned long eip, unsigned long parent_eip,
+		unsigned long v1, unsigned long v2,
+			unsigned long v3)
+{
+}
+
+static inline void notrace __trace(unsigned long eip, unsigned long parent_eip)
+{
+}
+
+static inline void update_max_tr(struct cpu_trace *tr)
+{
+}
+
+static inline void notrace _trace_cmdline(int cpu, struct cpu_trace *tr)
+{
+}
+
+#endif
+
+static int setup_preempt_thresh(char *s)
+{
+	int thresh;
+
+	get_option(&s, &thresh);
+	if (thresh > 0) {
+		preempt_thresh = usecs_to_cycles(thresh);
+		printk("Preemption threshold = %u us\n", thresh);
+	}
+	return 1;
+}
+__setup("preempt_thresh=", setup_preempt_thresh);
+
+static inline void notrace reset_trace_idx(int cpu, struct cpu_trace *tr)
+{
+	if (trace_all_cpus)
+		for_each_online_cpu(cpu) {
+			tr = cpu_traces + cpu;
+			tr->trace_idx = 0;
+			atomic_set(&tr->underrun, 0);
+			atomic_set(&tr->overrun, 0);
+		}
+	else{
+		tr->trace_idx = 0;
+		atomic_set(&tr->underrun, 0);
+		atomic_set(&tr->overrun, 0);
+	}
+}
+
+#ifdef CONFIG_CRITICAL_TIMING
+
+static void notrace
+check_critical_timing(int cpu, struct cpu_trace *tr, unsigned long parent_eip)
+{
+	unsigned long latency, t0, t1;
+	cycle_t T0, T1, T2, delta;
+	unsigned long flags;
+
+	if (trace_user_triggered)
+		return;
+	/*
+	 * usecs conversion is slow so we try to delay the conversion
+	 * as long as possible:
+	 */
+	T0 = tr->preempt_timestamp;
+	T1 = get_monotonic_cycles();
+	delta = T1-T0;
+
+	local_save_flags(flags);
+
+	if (!report_latency(delta))
+		goto out;
+
+	____trace(cpu, TRACE_FN, tr, CALLER_ADDR0, parent_eip, 0, 0, 0, flags);
+	/*
+	 * Update the timestamp, because the trace entry above
+	 * might change it (it can only get larger so the latency
+	 * is fair to be reported):
+	 */
+	T2 = get_monotonic_cycles();
+
+	delta = T2-T0;
+
+	latency = cycles_to_usecs(delta);
+	latency_hist(tr->latency_type, cpu, latency);
+
+	if (latency_hist_flag) {
+		if (preempt_max_latency >= delta)
+			goto out;
+	}
+
+	if (tr->critical_sequence != max_sequence || down_trylock(&max_mutex))
+		goto out;
+
+#ifndef CONFIG_CRITICAL_LATENCY_HIST
+	if (!preempt_thresh && preempt_max_latency > delta) {
+		printk("bug: updating %016Lx > %016Lx?\n",
+			preempt_max_latency, delta);
+		printk("  [%016Lx %016Lx %016Lx]\n", T0, T1, T2);
+	}
+#endif
+
+	preempt_max_latency = delta;
+	t0 = cycles_to_usecs(T0);
+	t1 = cycles_to_usecs(T1);
+
+	tr->critical_end = parent_eip;
+
+	update_max_tr(tr);
+
+#ifndef CONFIG_CRITICAL_LATENCY_HIST
+	if (preempt_thresh)
+		printk("(%16s-%-5d|#%d): %lu us critical section "
+			"violates %lu us threshold.\n"
+			" => started at timestamp %lu: ",
+				current->comm, current->pid,
+				raw_smp_processor_id(),
+				latency, cycles_to_usecs(preempt_thresh), t0);
+	else
+		printk("(%16s-%-5d|#%d): new %lu us maximum-latency "
+			"critical section.\n => started at timestamp %lu: ",
+				current->comm, current->pid,
+				raw_smp_processor_id(),
+				latency, t0);
+
+	print_symbol("<%s>\n", tr->critical_start);
+	printk(" =>   ended at timestamp %lu: ", t1);
+	print_symbol("<%s>\n", tr->critical_end);
+	dump_stack();
+	t1 = cycles_to_usecs(get_monotonic_cycles());
+	printk(" =>   dump-end timestamp %lu\n\n", t1);
+#endif
+
+	max_sequence++;
+
+	up(&max_mutex);
+
+out:
+	tr->critical_sequence = max_sequence;
+	tr->preempt_timestamp = get_monotonic_cycles();
+	tr->early_warning = 0;
+	reset_trace_idx(cpu, tr);
+	_trace_cmdline(cpu, tr);
+	____trace(cpu, TRACE_FN, tr, CALLER_ADDR0, parent_eip, 0, 0, 0, flags);
+}
+
+void notrace touch_critical_timing(void)
+{
+	int cpu = raw_smp_processor_id();
+	struct cpu_trace *tr = cpu_traces + cpu;
+
+	if (!tr->critical_start || atomic_read(&tr->disabled) ||
+			trace_user_triggered || wakeup_timing)
+		return;
+
+	if (preempt_count() > 0 && tr->critical_start) {
+		atomic_inc(&tr->disabled);
+		check_critical_timing(cpu, tr, CALLER_ADDR0);
+		tr->critical_start = CALLER_ADDR0;
+		tr->critical_sequence = max_sequence;
+		atomic_dec(&tr->disabled);
+	}
+}
+EXPORT_SYMBOL(touch_critical_timing);
+
+void notrace stop_critical_timing(void)
+{
+	struct cpu_trace *tr = cpu_traces + raw_smp_processor_id();
+
+	tr->critical_start = 0;
+}
+EXPORT_SYMBOL(stop_critical_timing);
+
+static inline void notrace
+__start_critical_timing(unsigned long eip, unsigned long parent_eip,
+			int latency_type)
+{
+	int cpu = raw_smp_processor_id();
+	struct cpu_trace *tr = cpu_traces + cpu;
+	unsigned long flags;
+
+	if (tr->critical_start || atomic_read(&tr->disabled) ||
+			trace_user_triggered || wakeup_timing)
+		return;
+
+	atomic_inc(&tr->disabled);
+
+	tr->critical_sequence = max_sequence;
+	tr->preempt_timestamp = get_monotonic_cycles();
+	tr->critical_start = eip;
+	reset_trace_idx(cpu, tr);
+	tr->latency_type = latency_type;
+	_trace_cmdline(cpu, tr);
+
+	local_save_flags(flags);
+	____trace(cpu, TRACE_FN, tr, eip, parent_eip, 0, 0, 0, flags);
+
+	atomic_dec(&tr->disabled);
+}
+
+static inline void notrace
+__stop_critical_timing(unsigned long eip, unsigned long parent_eip)
+{
+	int cpu = raw_smp_processor_id();
+	struct cpu_trace *tr = cpu_traces + cpu;
+	unsigned long flags;
+
+	if (!tr->critical_start || atomic_read(&tr->disabled) ||
+			trace_user_triggered || wakeup_timing)
+		return;
+
+	atomic_inc(&tr->disabled);
+	local_save_flags(flags);
+	____trace(cpu, TRACE_FN, tr, eip, parent_eip, 0, 0, 0, flags);
+	check_critical_timing(cpu, tr, eip);
+	tr->critical_start = 0;
+	atomic_dec(&tr->disabled);
+}
+
+#endif
+
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+
+#ifdef CONFIG_LOCKDEP
+
+void notrace time_hardirqs_on(unsigned long a0, unsigned long a1)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (!irqs_off_preempt_count() && irqs_disabled_flags(flags))
+		__stop_critical_timing(a0, a1);
+}
+
+void notrace time_hardirqs_off(unsigned long a0, unsigned long a1)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (!irqs_off_preempt_count() && irqs_disabled_flags(flags))
+		__start_critical_timing(a0, a1, INTERRUPT_LATENCY);
+}
+
+#else /* !CONFIG_LOCKDEP */
+
+/*
+ * Dummy:
+ */
+
+void early_boot_irqs_off(void)
+{
+}
+
+void early_boot_irqs_on(void)
+{
+}
+
+void trace_softirqs_on(unsigned long ip)
+{
+}
+
+void trace_softirqs_off(unsigned long ip)
+{
+}
+
+inline void print_irqtrace_events(struct task_struct *curr)
+{
+}
+
+/*
+ * We are only interested in hardirq on/off events:
+ */
+void notrace trace_hardirqs_on(void)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (!irqs_off_preempt_count() && irqs_disabled_flags(flags))
+		__stop_critical_timing(CALLER_ADDR0, 0 /* CALLER_ADDR1 */);
+}
+
+EXPORT_SYMBOL(trace_hardirqs_on);
+
+void notrace trace_hardirqs_off(void)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	if (!irqs_off_preempt_count() && irqs_disabled_flags(flags))
+		__start_critical_timing(CALLER_ADDR0, 0 /* CALLER_ADDR1 */,
+					INTERRUPT_LATENCY);
+}
+
+EXPORT_SYMBOL(trace_hardirqs_off);
+
+#endif /* !CONFIG_LOCKDEP */
+
+#endif /* CONFIG_CRITICAL_IRQSOFF_TIMING */
+
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_CRITICAL_TIMING)
+
+static inline unsigned long get_parent_eip(void)
+{
+	unsigned long parent_eip = CALLER_ADDR1;
+
+	if (in_lock_functions(parent_eip)) {
+		parent_eip = CALLER_ADDR2;
+		if (in_lock_functions(parent_eip))
+			parent_eip = CALLER_ADDR3;
+	}
+
+	return parent_eip;
+}
+
+void notrace add_preempt_count(unsigned int val)
+{
+	unsigned long eip = CALLER_ADDR0;
+	unsigned long parent_eip = get_parent_eip();
+
+#ifdef CONFIG_DEBUG_PREEMPT
+	/*
+	 * Underflow?
+	 */
+	if (DEBUG_WARN_ON(((int)preempt_count() < 0)))
+		return;
+	/*
+	 * Spinlock count overflowing soon?
+	 */
+	if (DEBUG_WARN_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK-10))
+		return;
+#endif
+
+	preempt_count() += val;
+#ifdef CONFIG_PREEMPT_TRACE
+	if (val <= 10) {
+		unsigned int idx = preempt_count() & PREEMPT_MASK;
+		if (idx < MAX_PREEMPT_TRACE) {
+			current->preempt_trace_eip[idx] = eip;
+			current->preempt_trace_parent_eip[idx] = parent_eip;
+		}
+	}
+#endif
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+	{
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+		unsigned long flags;
+
+		local_save_flags(flags);
+
+		if (!irqs_disabled_flags(flags))
+#endif
+			if (preempt_count() == val)
+				__start_critical_timing(eip, parent_eip,
+							PREEMPT_LATENCY);
+	}
+#endif
+	(void)eip, (void)parent_eip;
+}
+EXPORT_SYMBOL(add_preempt_count);
+
+void notrace sub_preempt_count(unsigned int val)
+{
+#ifdef CONFIG_DEBUG_PREEMPT
+	/*
+	 * Underflow?
+	 */
+	if (DEBUG_WARN_ON(unlikely(val > preempt_count())))
+		return;
+	/*
+	 * Is the spinlock portion underflowing?
+	 */
+	if (DEBUG_WARN_ON((val < PREEMPT_MASK) &&
+			  !(preempt_count() & PREEMPT_MASK)))
+		return;
+#endif
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+	{
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+		unsigned long flags;
+
+		local_save_flags(flags);
+
+		if (!irqs_disabled_flags(flags))
+#endif
+			if (preempt_count() == val)
+				__stop_critical_timing(CALLER_ADDR0,
+						       CALLER_ADDR1);
+	}
+#endif
+	preempt_count() -= val;
+}
+
+EXPORT_SYMBOL(sub_preempt_count);
+
+void notrace mask_preempt_count(unsigned int mask)
+{
+	unsigned long eip = CALLER_ADDR0;
+	unsigned long parent_eip = get_parent_eip();
+
+	preempt_count() |= mask;
+
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+	{
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+		unsigned long flags;
+
+		local_save_flags(flags);
+
+		if (!irqs_disabled_flags(flags))
+#endif
+			if (preempt_count() == mask)
+				__start_critical_timing(eip, parent_eip,
+							PREEMPT_LATENCY);
+	}
+#endif
+	(void) eip, (void) parent_eip;
+}
+EXPORT_SYMBOL(mask_preempt_count);
+
+void notrace unmask_preempt_count(unsigned int mask)
+{
+#ifdef CONFIG_CRITICAL_PREEMPT_TIMING
+	{
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+		unsigned long flags;
+
+		local_save_flags(flags);
+
+		if (!irqs_disabled_flags(flags))
+#endif
+			if (preempt_count() == mask)
+				__stop_critical_timing(CALLER_ADDR0,
+						       CALLER_ADDR1);
+	}
+#endif
+	preempt_count() &= ~mask;
+}
+EXPORT_SYMBOL(unmask_preempt_count);
+
+
+#endif
+
+/*
+ * Wakeup latency timing/tracing. We get upcalls from the scheduler
+ * when a task is being woken up and we time/trace it until it gets
+ * to a CPU - or an even-higher-prio task supercedes it. (in that
+ * case we throw away the currently traced task - we dont try to
+ * handle nesting, that simplifies things significantly)
+ */
+#ifdef CONFIG_WAKEUP_TIMING
+
+static void notrace
+check_wakeup_timing(struct cpu_trace *tr, unsigned long parent_eip,
+		    unsigned long *flags)
+{
+	int cpu = raw_smp_processor_id();
+	unsigned long latency, t0, t1;
+	cycle_t T0, T1, delta;
+
+	if (trace_user_triggered)
+		return;
+
+	atomic_inc(&tr->disabled);
+	if (atomic_read(&tr->disabled) != 1)
+		goto out;
+
+	T0 = tr->preempt_timestamp;
+	T1 = get_monotonic_cycles();
+	/*
+	 * Any wraparound or time warp and we are out:
+	 */
+	if (T0 > T1)
+		goto out;
+	delta = T1-T0;
+
+	if (!report_latency(delta))
+		goto out;
+
+	____trace(smp_processor_id(), TRACE_FN, tr, CALLER_ADDR0, parent_eip,
+		  0, 0, 0, *flags);
+
+	latency = cycles_to_usecs(delta);
+	latency_hist(tr->latency_type, cpu, latency);
+
+	if (latency_hist_flag) {
+		if (preempt_max_latency >= delta)
+			goto out;
+	}
+
+	if (tr->critical_sequence != max_sequence || down_trylock(&max_mutex))
+		goto out;
+
+#ifndef CONFIG_WAKEUP_LATENCY_HIST
+	if (!preempt_thresh && preempt_max_latency > delta) {
+		printk("bug2: updating %016lx > %016Lx?\n",
+			preempt_max_latency, delta);
+		printk("  [%016Lx %016Lx]\n", T0, T1);
+	}
+#endif
+
+	preempt_max_latency = delta;
+	t0 = cycles_to_usecs(T0);
+	t1 = cycles_to_usecs(T1);
+	tr->critical_end = parent_eip;
+
+	update_max_tr(tr);
+
+	atomic_dec(&tr->disabled);
+	__raw_spin_unlock(&sch.trace_lock);
+	local_irq_restore(*flags);
+
+#ifndef CONFIG_WAKEUP_LATENCY_HIST
+	if (preempt_thresh)
+		printk("(%16s-%-5d|#%d): %lu us wakeup latency "
+			"violates %lu us threshold.\n",
+				current->comm, current->pid,
+				raw_smp_processor_id(), latency,
+				cycles_to_usecs(preempt_thresh));
+	else
+		printk("(%16s-%-5d|#%d): new %lu us maximum-latency "
+			"wakeup.\n", current->comm, current->pid,
+				raw_smp_processor_id(), latency);
+#endif
+
+	max_sequence++;
+
+	up(&max_mutex);
+
+	return;
+
+out:
+	atomic_dec(&tr->disabled);
+	__raw_spin_unlock(&sch.trace_lock);
+	local_irq_restore(*flags);
+}
+
+/*
+ * Start wakeup latency tracing - called with the runqueue held
+ * and interrupts disabled:
+ */
+void __trace_start_sched_wakeup(struct task_struct *p)
+{
+	struct cpu_trace *tr;
+	int cpu;
+
+	if (trace_user_triggered || !wakeup_timing) {
+		trace_special_pid(p->pid, p->prio, -1);
+		return;
+	}
+
+	__raw_spin_lock(&sch.trace_lock);
+	if (sch.task && (sch.task->prio <= p->prio))
+		goto out_unlock;
+
+	/*
+	 * New highest-prio task just woke up - start tracing:
+	 */
+	sch.task = p;
+	cpu = task_cpu(p);
+	sch.cpu = cpu;
+	/*
+	 * We keep using this CPU's trace buffer even if the task
+	 * gets migrated to another CPU. Tracing only happens on
+	 * the CPU that 'owns' the highest-prio task so it's
+	 * fundamentally single-threaded.
+	 */
+	sch.tr = tr = cpu_traces + cpu;
+	reset_trace_idx(cpu, tr);
+
+//	if (!atomic_read(&tr->disabled)) {
+		atomic_inc(&tr->disabled);
+		tr->critical_sequence = max_sequence;
+		tr->preempt_timestamp = get_monotonic_cycles();
+		tr->latency_type = WAKEUP_LATENCY;
+		tr->critical_start = CALLER_ADDR0;
+		_trace_cmdline(raw_smp_processor_id(), tr);
+		atomic_dec(&tr->disabled);
+//	}
+
+	mcount();
+	trace_special_pid(p->pid, p->prio, cpu);
+	trace_special_sym();
+out_unlock:
+	__raw_spin_unlock(&sch.trace_lock);
+}
+
+void trace_stop_sched_switched(struct task_struct *p)
+{
+	struct cpu_trace *tr;
+	unsigned long flags;
+
+	if (trace_user_triggered || !wakeup_timing)
+		return;
+
+	local_irq_save(flags);
+	__raw_spin_lock(&sch.trace_lock);
+	if (p == sch.task) {
+		trace_special_pid(p->pid, p->prio, task_cpu(p));
+
+		sch.task = NULL;
+		tr = sch.tr;
+		sch.tr = NULL;
+		WARN_ON(!tr);
+		/* auto-unlocks the spinlock: */
+		check_wakeup_timing(tr, CALLER_ADDR0, &flags);
+	} else {
+		if (sch.task)
+			trace_special_pid(sch.task->pid, sch.task->prio,
+					  p->prio);
+		if (sch.task && (sch.task->prio >= p->prio))
+			sch.task = NULL;
+		__raw_spin_unlock(&sch.trace_lock);
+	}
+	local_irq_restore(flags);
+}
+
+void trace_change_sched_cpu(struct task_struct *p, int new_cpu)
+{
+	unsigned long flags;
+
+	if (!wakeup_timing)
+		return;
+
+	trace_special_pid(p->pid, task_cpu(p), new_cpu);
+	trace_special_sym();
+	local_irq_save(flags);
+	__raw_spin_lock(&sch.trace_lock);
+	if (p == sch.task && task_cpu(p) != new_cpu) {
+		sch.cpu = new_cpu;
+		trace_special(task_cpu(p), new_cpu, 0);
+	}
+	__raw_spin_unlock(&sch.trace_lock);
+	local_irq_restore(flags);
+}
+
+#endif
+
+#ifdef CONFIG_EVENT_TRACE
+
+long user_trace_start(void)
+{
+	struct cpu_trace *tr;
+	unsigned long flags;
+	int cpu;
+
+	if (!trace_user_triggered || trace_print_on_crash || print_functions)
+		return -EINVAL;
+
+	/*
+	 * If the user has not yet reset the max latency after
+	 * bootup then we assume that this was the intention
+	 * (we wont get any tracing done otherwise):
+	 */
+	if (preempt_max_latency == (cycle_t)ULONG_MAX)
+		preempt_max_latency = 0;
+
+	/*
+	 * user_trace_start() might be called from hardirq
+	 * context, if trace_user_triggered_irq is set, so
+	 * be careful about locking:
+	 */
+	if (preempt_count() || irqs_disabled()) {
+		if (down_trylock(&max_mutex))
+			return -EAGAIN;
+	} else
+		down(&max_mutex);
+
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	tr = cpu_traces + cpu;
+
+#ifdef CONFIG_WAKEUP_TIMING
+	if (wakeup_timing) {
+		__raw_spin_lock(&sch.trace_lock);
+		sch.task = current;
+		sch.cpu = cpu;
+		sch.tr = tr;
+		__raw_spin_unlock(&sch.trace_lock);
+	}
+#endif
+	reset_trace_idx(cpu, tr);
+
+	tr->critical_sequence = max_sequence;
+	tr->preempt_timestamp = get_monotonic_cycles();
+	tr->critical_start = CALLER_ADDR0;
+	_trace_cmdline(cpu, tr);
+	mcount();
+
+	WARN_ON(!irqs_disabled());
+	local_irq_restore(flags);
+
+	up(&max_mutex);
+
+	return 0;
+}
+
+EXPORT_SYMBOL_GPL(user_trace_start);
+
+long user_trace_stop(void)
+{
+	unsigned long latency = 0, flags;
+	struct cpu_trace *tr;
+	cycle_t delta;
+
+	if (!trace_user_triggered || trace_print_on_crash || print_functions)
+		return -EINVAL;
+
+	local_irq_save(flags);
+	mcount();
+
+#ifdef CONFIG_WAKEUP_TIMING
+	if (wakeup_timing) {
+		struct task_struct *t;
+
+		__raw_spin_lock(&sch.trace_lock);
+#if 0
+		t = sch.task;
+		if (current != t) {
+			__raw_spin_unlock(&sch.trace_lock);
+			local_irq_restore(flags);
+			printk("wrong stop: curr: %s/%d[%d] => %p\n",
+			       current->comm, current->pid,
+			       task_thread_info(current)->cpu, t);
+			if (t)
+				printk("wrong stop: curr: %s/%d[%d]\n",
+				       t->comm, t->pid, task_thread_info(t)->cpu);
+			return -EINVAL;
+		}
+#endif
+		sch.task = NULL;
+		tr = sch.tr;
+		sch.tr = NULL;
+		__raw_spin_unlock(&sch.trace_lock);
+	} else
+#endif
+		tr = cpu_traces + smp_processor_id();
+
+	atomic_inc(&tr->disabled);
+	if (tr->preempt_timestamp) {
+		cycle_t T0, T1;
+		unsigned long long tmp0;
+
+		T0 = tr->preempt_timestamp;
+		T1 = get_monotonic_cycles();
+		tmp0 = preempt_max_latency;
+		if (T1 < T0)
+			T0 = T1;
+		delta = T1 - T0;
+		if (!report_latency(delta))
+			goto out;
+		if (tr->critical_sequence != max_sequence ||
+						down_trylock(&max_mutex))
+			goto out;
+
+		WARN_ON(!preempt_thresh && preempt_max_latency > delta);
+
+		preempt_max_latency = delta;
+		update_max_tr(tr);
+
+		latency = cycles_to_usecs(delta);
+
+		max_sequence++;
+		up(&max_mutex);
+out:
+		tr->preempt_timestamp = 0;
+	}
+	atomic_dec(&tr->disabled);
+	local_irq_restore(flags);
+
+	if (latency) {
+		if (preempt_thresh)
+			printk("(%16s-%-5d|#%d): %lu us user-latency "
+				"violates %lu us threshold.\n",
+					current->comm, current->pid,
+					raw_smp_processor_id(), latency,
+					cycles_to_usecs(preempt_thresh));
+		else
+			printk("(%16s-%-5d|#%d): new %lu us user-latency.\n",
+				current->comm, current->pid,
+					raw_smp_processor_id(), latency);
+	}
+
+	return 0;
+}
+
+EXPORT_SYMBOL(user_trace_stop);
+
+static int trace_print_cpu = -1;
+
+void notrace stop_trace(void)
+{
+	if (trace_print_on_crash && trace_print_cpu == -1) {
+		trace_enabled = -1;
+		trace_print_cpu = raw_smp_processor_id();
+	}
+}
+
+EXPORT_SYMBOL(stop_trace);
+
+static void print_entry(struct trace_entry *entry, struct trace_entry *entry0)
+{
+	unsigned long abs_usecs;
+	int hardirq, softirq;
+
+	abs_usecs = cycles_to_us(entry->timestamp - entry0->timestamp);
+
+	printk("%-5d ", entry->pid);
+
+	printk("%d%c%c",
+		entry->cpu,
+		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+		(entry->flags & TRACE_FLAG_IRQS_HARD_OFF) ? 'D' : '.',
+		(entry->flags & TRACE_FLAG_NEED_RESCHED_DELAYED) ? 'n' :
+ 		((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.'));
+
+	hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+	softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+	if (hardirq && softirq)
+		printk("H");
+	else {
+		if (hardirq)
+			printk("h");
+		else {
+			if (softirq)
+				printk("s");
+			else
+				printk(".");
+		}
+	}
+
+	if (entry->preempt_count)
+		printk(":%x ", entry->preempt_count);
+	else
+		printk(":. ");
+
+	printk("%ld.%03ldms: ", abs_usecs/1000, abs_usecs % 1000);
+
+	switch (entry->type) {
+	case TRACE_FN:
+		printk_name(entry->u.fn.eip);
+		printk("  <= (");
+		printk_name(entry->u.fn.parent_eip);
+		printk(")\n");
+		break;
+	case TRACE_SPECIAL:
+		printk(" special: %lx %lx %lx\n",
+		       entry->u.special.v1, entry->u.special.v2,
+		       entry->u.special.v3);
+		break;
+	case TRACE_SPECIAL_U64:
+		printk("  spec64: %lx%08lx %lx\n",
+		       entry->u.special.v1, entry->u.special.v2,
+		       entry->u.special.v3);
+		break;
+	}
+}
+
+/*
+ * Print the current trace at crash time.
+ *
+ * We print it backwards, so that the newest (most interesting) entries
+ * are printed first.
+ */
+void print_last_trace(void)
+{
+	unsigned int idx0, idx, i, cpu;
+	struct cpu_trace *tr;
+	struct trace_entry *entry0, *entry;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	if (trace_enabled != -1 || trace_print_cpu != cpu ||
+						!trace_print_on_crash) {
+		if (trace_print_on_crash)
+			printk("skipping trace printing on CPU#%d != %d\n",
+				cpu, trace_print_cpu);
+		preempt_enable();
+		return;
+	}
+
+	trace_print_on_crash = 0;
+
+	tr = cpu_traces + cpu;
+	if (!tr->trace)
+		goto out;
+
+	printk("Last %ld trace entries:\n", MAX_TRACE);
+	idx0 = tr->trace_idx;
+	printk("curr idx: %d\n", idx0);
+	if (idx0 >= MAX_TRACE)
+		idx0 = 0;
+	idx = idx0;
+	entry0 = tr->trace + idx0;
+
+	for (i = 0; i < MAX_TRACE; i++) {
+		if (idx == 0)
+			idx = MAX_TRACE-1;
+		else
+			idx--;
+		entry = tr->trace + idx;
+		switch (entry->type) {
+		case TRACE_FN:
+		case TRACE_SPECIAL:
+		case TRACE_SPECIAL_U64:
+			print_entry(entry, entry0);
+			break;
+		}
+	}
+	printk("printed %ld entries\n", MAX_TRACE);
+out:
+	preempt_enable();
+}
+
+#ifdef CONFIG_SMP
+/*
+ * On SMP, try to 'peek' on other CPU's traces and record them
+ * in this CPU's trace. This way we get a rough idea about what's
+ * going on there, without the overhead of global tracing.
+ *
+ * (no need to make this PER_CPU, we bounce it around anyway.)
+ */
+unsigned long nmi_eips[NR_CPUS];
+unsigned long nmi_flags[NR_CPUS];
+
+void notrace nmi_trace(unsigned long eip, unsigned long parent_eip,
+			unsigned long flags)
+{
+	int cpu, this_cpu = smp_processor_id();
+
+	__trace(eip, parent_eip);
+
+	nmi_eips[this_cpu] = parent_eip;
+	nmi_flags[this_cpu] = flags;
+	for (cpu = 0; cpu < NR_CPUS; cpu++)
+		if (cpu_online(cpu) && cpu != this_cpu) {
+			__trace(eip, nmi_eips[cpu]);
+			__trace(eip, nmi_flags[cpu]);
+		}
+}
+#else
+/*
+ * On UP, NMI tracing is quite simple:
+ */
+void notrace nmi_trace(unsigned long eip, unsigned long parent_eip,
+			unsigned long flags)
+{
+	__trace(eip, parent_eip);
+}
+#endif
+
+#endif
+
+#ifdef CONFIG_PREEMPT_TRACE
+
+static void print_preempt_trace(struct task_struct *task)
+{
+	unsigned int count = task_thread_info(task)->preempt_count;
+	unsigned int i, lim = count & PREEMPT_MASK;
+	if (lim >= MAX_PREEMPT_TRACE)
+		lim = MAX_PREEMPT_TRACE-1;
+	printk("---------------------------\n");
+	printk("| preempt count: %08x ]\n", count);
+	printk("| %d-level deep critical section nesting:\n", lim);
+	printk("----------------------------------------\n");
+	for (i = 1; i <= lim; i++) {
+		printk(".. [<%08lx>] .... ", task->preempt_trace_eip[i]);
+		print_symbol("%s\n", task->preempt_trace_eip[i]);
+		printk(".....[<%08lx>] ..   ( <= ",
+				task->preempt_trace_parent_eip[i]);
+		print_symbol("%s)\n", task->preempt_trace_parent_eip[i]);
+	}
+	printk("\n");
+}
+
+#endif
+
+#if defined(CONFIG_PREEMPT_TRACE) || defined(CONFIG_EVENT_TRACE)
+void print_traces(struct task_struct *task)
+{
+	if (!task)
+		task = current;
+
+#ifdef CONFIG_PREEMPT_TRACE
+	print_preempt_trace(task);
+#endif
+#ifdef CONFIG_EVENT_TRACE
+	print_last_trace();
+#endif
+}
+#endif
+
+#ifdef CONFIG_EVENT_TRACE
+/*
+ * Allocate all the per-CPU trace buffers and the
+ * save-maximum/save-output staging buffers:
+ */
+void __init init_tracer(void)
+{
+	unsigned long size, total_size = 0;
+	struct trace_entry *array;
+	struct cpu_trace *tr;
+	int cpu;
+
+	printk("num_possible_cpus(): %d\n", num_possible_cpus());
+
+	size = sizeof(struct trace_entry)*MAX_TRACE;
+
+	for_each_possible_cpu(cpu) {
+		tr = cpu_traces + cpu;
+		array = alloc_bootmem(size);
+		if (!array) {
+			printk(KERN_ERR
+			"CPU#%d: failed to allocate %ld bytes trace buffer!\n",
+				cpu, size);
+		} else {
+			printk(KERN_INFO
+			"CPU#%d: allocated %ld bytes trace buffer.\n",
+				cpu, size);
+			total_size += size;
+		}
+		tr->cpu = cpu;
+		tr->trace = array;
+
+		array = alloc_bootmem(size);
+		if (!array) {
+			printk(KERN_ERR
+			"CPU#%d: failed to allocate %ld bytes max-trace buffer!\n",
+				cpu, size);
+		} else {
+			printk(KERN_INFO
+			"CPU#%d: allocated %ld bytes max-trace buffer.\n",
+				cpu, size);
+			total_size += size;
+		}
+		max_tr.traces[cpu].trace = array;
+	}
+
+	/*
+	 * The output trace buffer is a special one that only has
+	 * trace entries for the first cpu-trace structure:
+	 */
+	size = sizeof(struct trace_entry)*MAX_TRACE*num_possible_cpus();
+	array = alloc_bootmem(size);
+	if (!array) {
+		printk(KERN_ERR
+			"failed to allocate %ld bytes out-trace buffer!\n",
+			size);
+	} else {
+		printk(KERN_INFO "allocated %ld bytes out-trace buffer.\n",
+			size);
+		total_size += size;
+	}
+	out_tr.traces[0].trace = array;
+	printk(KERN_INFO
+		"tracer: a total of %ld bytes allocated.\n",
+		total_size);
+}
+#endif
Index: linux-2.6.22-rc2/kernel/lockdep.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/lockdep.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/lockdep.c	2007-05-24 15:57:42.000000000 +0200
@@ -166,14 +166,14 @@ static struct list_head chainhash_table[
 	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
 	(key2))
 
-void lockdep_off(void)
+void notrace lockdep_off(void)
 {
 	current->lockdep_recursion++;
 }
 
 EXPORT_SYMBOL(lockdep_off);
 
-void lockdep_on(void)
+void notrace lockdep_on(void)
 {
 	current->lockdep_recursion--;
 }
@@ -706,7 +706,7 @@ find_usage_forwards(struct lock_class *s
  * Return 1 otherwise and keep <backwards_match> unchanged.
  * Return 0 on error.
  */
-static noinline int
+static noinline notrace int
 find_usage_backwards(struct lock_class *source, unsigned int depth)
 {
 	struct lock_list *entry;
@@ -1386,7 +1386,7 @@ cache_hit:
  * We are building curr_chain_key incrementally, so double-check
  * it from scratch, to make sure that it's done correctly:
  */
-static void check_chain_key(struct task_struct *curr)
+static void notrace check_chain_key(struct task_struct *curr)
 {
 #ifdef CONFIG_DEBUG_LOCKDEP
 	struct held_lock *hlock, *prev_hlock = NULL;
@@ -1573,8 +1573,8 @@ valid_state(struct task_struct *curr, st
 /*
  * Mark a lock with a usage bit, and validate the state transition:
  */
-static int mark_lock(struct task_struct *curr, struct held_lock *this,
-		     enum lock_usage_bit new_bit)
+static int notrace mark_lock(struct task_struct *curr, struct held_lock *this,
+			     enum lock_usage_bit new_bit)
 {
 	unsigned int new_mask = 1 << new_bit, ret = 1;
 
@@ -1781,6 +1781,7 @@ static int mark_lock(struct task_struct 
 	 * We must printk outside of the graph_lock:
 	 */
 	if (ret == 2) {
+		user_trace_stop();
 		printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
 		print_lock(this);
 		print_irqtrace_events(curr);
@@ -1794,7 +1795,7 @@ static int mark_lock(struct task_struct 
 /*
  * Mark all held locks with a usage bit:
  */
-static int
+static int notrace
 mark_held_locks(struct task_struct *curr, int hardirq)
 {
 	enum lock_usage_bit usage_bit;
@@ -1841,7 +1842,7 @@ void early_boot_irqs_on(void)
 /*
  * Hardirqs will be enabled:
  */
-void trace_hardirqs_on(void)
+void notrace trace_hardirqs_on(void)
 {
 	struct task_struct *curr = current;
 	unsigned long ip;
@@ -1882,6 +1883,9 @@ void trace_hardirqs_on(void)
 	curr->hardirq_enable_ip = ip;
 	curr->hardirq_enable_event = ++curr->irq_events;
 	debug_atomic_inc(&hardirqs_on_events);
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+	time_hardirqs_on(CALLER_ADDR0, 0 /* CALLER_ADDR1 */);
+#endif
 }
 
 EXPORT_SYMBOL(trace_hardirqs_on);
@@ -1889,7 +1893,7 @@ EXPORT_SYMBOL(trace_hardirqs_on);
 /*
  * Hardirqs were disabled:
  */
-void trace_hardirqs_off(void)
+void notrace trace_hardirqs_off(void)
 {
 	struct task_struct *curr = current;
 
@@ -1907,6 +1911,9 @@ void trace_hardirqs_off(void)
 		curr->hardirq_disable_ip = _RET_IP_;
 		curr->hardirq_disable_event = ++curr->irq_events;
 		debug_atomic_inc(&hardirqs_off_events);
+#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING
+		time_hardirqs_off(CALLER_ADDR0, 0 /* CALLER_ADDR1 */);
+#endif
 	} else
 		debug_atomic_inc(&redundant_hardirqs_off);
 }
@@ -2404,7 +2411,7 @@ __lock_release(struct lockdep_map *lock,
 /*
  * Check whether we follow the irq-flags state precisely:
  */
-static void check_flags(unsigned long flags)
+static notrace void check_flags(unsigned long flags)
 {
 #if defined(CONFIG_DEBUG_LOCKDEP) && defined(CONFIG_TRACE_IRQFLAGS)
 	if (!debug_locks)
@@ -2436,8 +2443,9 @@ static void check_flags(unsigned long fl
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
  */
-void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
-		  int trylock, int read, int check, unsigned long ip)
+void notrace
+lock_acquire(struct lockdep_map *lock, unsigned int subclass,
+	     int trylock, int read, int check, unsigned long ip)
 {
 	unsigned long flags;
 
@@ -2445,9 +2453,9 @@ void lock_acquire(struct lockdep_map *lo
 		return;
 
 	raw_local_irq_save(flags);
+	current->lockdep_recursion = 1;
 	check_flags(flags);
 
-	current->lockdep_recursion = 1;
 	__lock_acquire(lock, subclass, trylock, read, check,
 		       irqs_disabled_flags(flags), ip);
 	current->lockdep_recursion = 0;
@@ -2456,7 +2464,8 @@ void lock_acquire(struct lockdep_map *lo
 
 EXPORT_SYMBOL_GPL(lock_acquire);
 
-void lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
+void notrace
+lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
 {
 	unsigned long flags;
 
@@ -2464,8 +2473,8 @@ void lock_release(struct lockdep_map *lo
 		return;
 
 	raw_local_irq_save(flags);
-	check_flags(flags);
 	current->lockdep_recursion = 1;
+	check_flags(flags);
 	__lock_release(lock, nested, ip);
 	current->lockdep_recursion = 0;
 	raw_local_irq_restore(flags);
Index: linux-2.6.22-rc2/kernel/panic.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/panic.c	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/kernel/panic.c	2007-05-24 15:57:42.000000000 +0200
@@ -66,6 +66,8 @@ NORET_TYPE void panic(const char * fmt, 
         unsigned long caller = (unsigned long) __builtin_return_address(0);
 #endif
 
+	stop_trace();
+
 	/*
 	 * It's possible to come here directly from a panic-assertion and not
 	 * have preempt disabled. Some functions called from here want
Index: linux-2.6.22-rc2/kernel/printk.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/printk.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/printk.c	2007-05-24 15:57:42.000000000 +0200
@@ -324,12 +324,29 @@ static void __call_console_drivers(unsig
 {
 	struct console *con;
 
+	touch_critical_timing();
 	for (con = console_drivers; con; con = con->next) {
 		if ((con->flags & CON_ENABLED) && con->write &&
 				(cpu_online(smp_processor_id()) ||
-				(con->flags & CON_ANYTIME)))
+				(con->flags & CON_ANYTIME))) {
+			/*
+			 * Disable tracing of printk details - it just
+			 * clobbers the trace output with lots of
+			 * repetitive lines (especially if console is
+			 * on a serial line):
+			 */
+#ifdef CONFIG_EVENT_TRACE
+			int trace_save = trace_enabled;
+
+			trace_enabled = 0;
+			con->write(con, &LOG_BUF(start), end - start);
+			trace_enabled = trace_save;
+#else
 			con->write(con, &LOG_BUF(start), end - start);
+#endif
+		}
 	}
+	touch_critical_timing();
 }
 
 static int __read_mostly ignore_loglevel;
Index: linux-2.6.22-rc2/kernel/sched.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/sched.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/sched.c	2007-05-24 15:57:42.000000000 +0200
@@ -76,6 +76,10 @@ unsigned long long __attribute__((weak))
 #define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)
 #define TASK_NICE(p)		PRIO_TO_NICE((p)->static_prio)
 
+#define __PRIO(prio) \
+	((prio) <= 99 ? 199 - (prio) : (prio) - 120)
+
+#define PRIO(p) __PRIO((p)->prio)
 /*
  * 'User priority' is the nice value converted to something we
  * can work with better when scaling various scheduler parameters,
@@ -1043,6 +1047,12 @@ static void deactivate_task(struct task_
 	p->array = NULL;
 }
 
+static inline void trace_start_sched_wakeup(struct task_struct *p, struct rq *rq)
+{
+	if (p != rq->curr)
+		__trace_start_sched_wakeup(p);
+}
+
 /*
  * resched_task - mark a task 'to be rescheduled now'.
  *
@@ -1060,6 +1070,8 @@ static void resched_task(struct task_str
 {
 	int cpu;
 
+	trace_start_sched_wakeup(p, task_rq(p));
+
 	assert_spin_locked(&task_rq(p)->lock);
 
 	if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
@@ -1090,6 +1102,8 @@ static void resched_cpu(int cpu)
 #else
 static inline void resched_task(struct task_struct *p)
 {
+	trace_start_sched_wakeup(p, task_rq(p));
+
 	assert_spin_locked(&task_rq(p)->lock);
 	set_tsk_need_resched(p);
 }
@@ -1615,14 +1629,19 @@ out:
 
 int fastcall wake_up_process(struct task_struct *p)
 {
-	return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
+	int ret = try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
 				 TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+	mcount();
+	return ret;
 }
 EXPORT_SYMBOL(wake_up_process);
 
 int fastcall wake_up_state(struct task_struct *p, unsigned int state)
 {
-	return try_to_wake_up(p, state, 0);
+	int ret = try_to_wake_up(p, state, 0);
+
+	mcount();
+	return ret;
 }
 
 static void task_running_tick(struct rq *rq, struct task_struct *p);
@@ -1860,6 +1879,7 @@ static inline void finish_task_switch(st
 	prev_state = prev->state;
 	finish_arch_switch(prev);
 	finish_lock_switch(rq, prev);
+	trace_stop_sched_switched(current);
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -1930,6 +1950,8 @@ context_switch(struct rq *rq, struct tas
 	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
 #endif
 
+	trace_cmdline();
+
 	/* Here we just switch the register state and the stack. */
 	switch_to(prev, next, prev);
 
@@ -3461,41 +3483,39 @@ void scheduler_tick(void)
 #endif
 }
 
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_DEBUG_PREEMPT)
+#if defined(CONFIG_EVENT_TRACE) && defined(CONFIG_DEBUG_RT_MUTEXES)
 
-void fastcall add_preempt_count(int val)
+static void trace_array(struct prio_array *array)
 {
-	/*
-	 * Underflow?
-	 */
-	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
-		return;
-	preempt_count() += val;
-	/*
-	 * Spinlock count overflowing soon?
-	 */
-	DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
-				PREEMPT_MASK - 10);
+	int i;
+	struct task_struct *p;
+	struct list_head *head, *tmp;
+
+	for (i = 0; i < MAX_RT_PRIO; i++) {
+		head = array->queue + i;
+		if (list_empty(head)) {
+			WARN_ON(test_bit(i, array->bitmap));
+			continue;
+		}
+		WARN_ON(!test_bit(i, array->bitmap));
+		list_for_each(tmp, head) {
+			p = list_entry(tmp, struct task_struct, run_list);
+			trace_special_pid(p->pid, p->prio, PRIO(p));
+		}
+	}
 }
-EXPORT_SYMBOL(add_preempt_count);
 
-void fastcall sub_preempt_count(int val)
+static inline void trace_all_runnable_tasks(struct rq *rq)
 {
-	/*
-	 * Underflow?
-	 */
-	if (DEBUG_LOCKS_WARN_ON(val > preempt_count()))
-		return;
-	/*
-	 * Is the spinlock portion underflowing?
-	 */
-	if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) &&
-			!(preempt_count() & PREEMPT_MASK)))
-		return;
+	if (trace_enabled)
+		trace_array(rq->active);
+}
+
+#else
 
-	preempt_count() -= val;
+static inline void trace_all_runnable_tasks(struct rq *rq)
+{
 }
-EXPORT_SYMBOL(sub_preempt_count);
 
 #endif
 
@@ -3641,6 +3661,8 @@ switch_tasks:
 		prev->sleep_avg = 0;
 	prev->timestamp = prev->last_ran = now;
 
+	trace_all_runnable_tasks(rq);
+
 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
 		next->timestamp = next->last_ran = now;
@@ -3651,14 +3673,17 @@ switch_tasks:
 		prepare_task_switch(rq, next);
 		prev = context_switch(rq, prev, next);
 		barrier();
+		trace_special_pid(prev->pid, PRIO(prev), PRIO(current));
 		/*
 		 * this_rq must be evaluated again because prev may have moved
 		 * CPUs since it called schedule(), thus the 'rq' on its stack
 		 * frame will be invalid.
 		 */
 		finish_task_switch(this_rq(), prev);
-	} else
+	} else {
 		spin_unlock_irq(&rq->lock);
+		trace_stop_sched_switched(next);
+	}
 
 	prev = current;
 	if (unlikely(reacquire_kernel_lock(prev) < 0))
@@ -4108,6 +4133,7 @@ void rt_mutex_setprio(struct task_struct
 		} else if (TASK_PREEMPTS_CURR(p, rq))
 			resched_task(rq->curr);
 	}
+
 	task_rq_unlock(rq, &flags);
 }
 
@@ -7055,6 +7081,7 @@ void __might_sleep(char *file, int line)
 		if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
 			return;
 		prev_jiffy = jiffies;
+		stop_trace();
 		printk(KERN_ERR "BUG: sleeping function called from invalid"
 				" context at %s:%d\n", file, line);
 		printk("in_atomic():%d, irqs_disabled():%d\n",
Index: linux-2.6.22-rc2/kernel/softlockup.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/softlockup.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/softlockup.c	2007-05-24 15:57:42.000000000 +0200
@@ -100,6 +100,7 @@ void softlockup_tick(void)
 	if (now > (touch_timestamp + 10)) {
 		per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
+		stop_trace();
 		spin_lock(&print_lock);
 		printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
 			this_cpu);
Index: linux-2.6.22-rc2/kernel/sysctl.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/sysctl.c	2007-05-22 16:59:43.000000000 +0200
+++ linux-2.6.22-rc2/kernel/sysctl.c	2007-05-24 15:57:42.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/utsname.h>
 #include <linux/capability.h>
 #include <linux/smp_lock.h>
+#include <linux/clocksource.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/kobject.h>
@@ -43,6 +44,7 @@
 #include <linux/limits.h>
 #include <linux/dcache.h>
 #include <linux/syscalls.h>
+#include <linux/profile.h>
 #include <linux/nfs_fs.h>
 #include <linux/acpi.h>
 
@@ -215,6 +217,132 @@ static ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+#ifdef CONFIG_WAKEUP_TIMING
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "wakeup_timing",
+		.data		= &wakeup_timing,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
+#if defined(CONFIG_WAKEUP_TIMING) || defined(CONFIG_EVENT_TRACE)
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "preempt_max_latency",
+		.data		= &preempt_max_latency,
+		.maxlen		= sizeof(preempt_max_latency),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "preempt_thresh",
+		.data		= &preempt_thresh,
+		.maxlen		= sizeof(preempt_thresh),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+#endif
+#ifdef CONFIG_EVENT_TRACE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_enabled",
+		.data		= &trace_enabled,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "syscall_tracing",
+		.data		= &syscall_tracing,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "stackframe_tracing",
+		.data		= &stackframe_tracing,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "mcount_enabled",
+		.data		= &mcount_enabled,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_user_triggered",
+		.data		= &trace_user_triggered,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_user_trigger_irq",
+		.data		= &trace_user_trigger_irq,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_freerunning",
+		.data		= &trace_freerunning,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_print_on_crash",
+		.data		= &trace_print_on_crash,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_verbose",
+		.data		= &trace_verbose,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_all_cpus",
+		.data		= &trace_all_cpus,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_use_raw_cycles",
+		.data		= &trace_use_raw_cycles,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "trace_all_runnable",
+		.data		= &trace_all_runnable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 	{
 		.ctl_name	= KERN_CORE_USES_PID,
 		.procname	= "core_uses_pid",
Index: linux-2.6.22-rc2/lib/Kconfig.debug
===================================================================
--- linux-2.6.22-rc2.orig/lib/Kconfig.debug	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/lib/Kconfig.debug	2007-05-24 15:57:43.000000000 +0200
@@ -296,6 +296,192 @@ config STACKTRACE
 	depends on DEBUG_KERNEL
 	depends on STACKTRACE_SUPPORT
 
+config PREEMPT_TRACE
+	bool
+	default y
+	depends on DEBUG_PREEMPT
+
+config EVENT_TRACE
+	bool "Kernel event tracing"
+	default n
+	depends on GENERIC_TIME
+	select FRAME_POINTER
+	select STACKTRACE
+	help
+	  This option enables a kernel tracing mechanism that will track
+	  certain kernel events such as system call entry and return,
+	  IRQ entry, context-switching, etc.
+
+	  Run the scripts/trace-it utility on a kernel with this option
+	  enabled for sample output.
+
+config FUNCTION_TRACE
+	bool "Kernel function call tracing"
+	default n
+	depends on !REORDER
+	select EVENT_TRACE
+	help
+	  This option enables a kernel tracing mechanism that will track
+	  precise function-call granularity kernel execution. Sample
+	  output:
+
+           pcscd-1772  0D..2 6867us : deactivate_task <pcscd-1772> (-2 1)
+           pcscd-1772  0D..2 6867us : dequeue_task (deactivate_task)
+          <idle>-0     0D..2 6870us : __switch_to (__schedule)
+          <idle>-0     0D..2 6871us : __schedule <pcscd-1772> (-2 20)
+          <idle>-0     0D..2 6871us : __lock_acquire (lock_acquire)
+          <idle>-0     0D..2 6872us : __spin_unlock_irq (__schedule)
+
+	  Run the scripts/trace-it sample utility on a kernel with this
+	  option enabled to capture 1 second worth of events.
+
+	  (Note that kernel size and overhead increases noticeably
+	  with this option enabled.)
+
+config WAKEUP_TIMING
+	bool "Wakeup latency timing"
+	depends on GENERIC_TIME
+	help
+	  This option measures the time spent from a highprio thread being
+	  woken up to it getting scheduled on a CPU, with microsecond
+	  accuracy.
+
+	  The default measurement method is a maximum search, which is
+	  disabled by default and can be runtime (re-)started via:
+
+	      echo 0 > /proc/sys/kernel/preempt_max_latency
+
+config LATENCY_TRACE
+	bool "Latency tracing"
+	default n
+	depends on LATENCY_TIMING && !REORDER && GENERIC_TIME
+	select FRAME_POINTER
+	select FUNCTION_TRACE
+	help
+	  When this option is enabled then the last maximum latency timing
+	  event's full trace can be found in /proc/latency_trace, in a
+	  human-readable (or rather as some would say, in a
+	  kernel-developer-readable) form.
+
+	  (Note that kernel size and overhead increases noticeably
+	  with this option enabled.)
+
+config CRITICAL_PREEMPT_TIMING
+	bool "Non-preemptible critical section latency timing"
+	default n
+	depends on PREEMPT
+	depends on GENERIC_TIME
+	help
+	  This option measures the time spent in preempt-off critical
+	  sections, with microsecond accuracy.
+
+	  The default measurement method is a maximum search, which is
+	  disabled by default and can be runtime (re-)started via:
+
+	      echo 0 > /proc/sys/kernel/preempt_max_latency
+
+	  (Note that kernel size and overhead increases with this option
+	  enabled. This option and the irqs-off timing option can be
+	  used together or separately.)
+
+config CRITICAL_IRQSOFF_TIMING
+	bool "Interrupts-off critical section latency timing"
+	default n
+	depends on GENERIC_TIME
+	select TRACE_IRQFLAGS
+	help
+	  This option measures the time spent in irqs-off critical
+	  sections, with microsecond accuracy.
+
+	  The default measurement method is a maximum search, which is
+	  disabled by default and can be runtime (re-)started via:
+
+	      echo 0 > /proc/sys/kernel/preempt_max_latency
+
+	  (Note that kernel size and overhead increases with this option
+	  enabled. This option and the preempt-off timing option can be
+	  used together or separately.)
+
+config WAKEUP_LATENCY_HIST
+	bool "wakeup latency histogram"
+	default n
+	depends on WAKEUP_TIMING
+	help
+	  This option logs all the wakeup latency timing to a big histogram
+	  bucket, in the meanwhile, it also dummies up printk produced by
+	  wakeup latency timing.
+
+	  The wakeup latency timing histogram can be viewed via:
+
+	      cat /proc/latency_hist/wakeup_latency/CPU*
+
+	  (Note: * presents CPU ID.)
+
+config PREEMPT_OFF_HIST
+        bool "non-preemptible critical section latency histogram"
+        default n
+        depends on CRITICAL_PREEMPT_TIMING
+        help
+          This option logs all the non-preemptible critical section latency
+	  timing to a big histogram bucket, in the meanwhile, it also
+	  dummies up printk produced by non-preemptible critical section
+	  latency timing.
+
+          The non-preemptible critical section latency timing histogram can
+	  be viewed via:
+
+              cat /proc/latency_hist/preempt_off_latency/CPU*
+
+          (Note: * presents CPU ID.)
+
+config INTERRUPT_OFF_HIST
+        bool "interrupts-off critical section latency histogram"
+        default n
+        depends on CRITICAL_IRQSOFF_TIMING
+        help
+          This option logs all the interrupts-off critical section latency
+          timing to a big histogram bucket, in the meanwhile, it also
+          dummies up printk produced by interrupts-off critical section
+          latency timing.
+
+          The interrupts-off critical section latency timing histogram can
+          be viewed via:
+
+              cat /proc/latency_hist/interrupt_off_latency/CPU*
+
+          (Note: * presents CPU ID.)
+
+config CRITICAL_TIMING
+	bool
+	default y
+	depends on CRITICAL_PREEMPT_TIMING || CRITICAL_IRQSOFF_TIMING
+
+config DEBUG_TRACE_IRQFLAGS
+	bool
+	default y
+	depends on CRITICAL_IRQSOFF_TIMING
+
+config LATENCY_TIMING
+	bool
+	default y
+	depends on WAKEUP_TIMING || CRITICAL_TIMING
+	select SYSCTL
+
+config CRITICAL_LATENCY_HIST
+	bool
+	default y
+	depends on PREEMPT_OFF_HIST || INTERRUPT_OFF_HIST
+
+config LATENCY_HIST
+	bool
+	default y
+	depends on WAKEUP_LATENCY_HIST || CRITICAL_LATENCY_HIST
+
+config MCOUNT
+	bool
+	depends on FUNCTION_TRACE
+	default y
+
 config DEBUG_KOBJECT
 	bool "kobject debugging"
 	depends on DEBUG_KERNEL
Index: linux-2.6.22-rc2/lib/debug_locks.c
===================================================================
--- linux-2.6.22-rc2.orig/lib/debug_locks.c	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/lib/debug_locks.c	2007-05-24 15:57:43.000000000 +0200
@@ -36,7 +36,16 @@ int debug_locks_silent;
 int debug_locks_off(void)
 {
 	if (xchg(&debug_locks, 0)) {
+#if 0
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+		if (spin_is_locked(&current->pi_lock))
+			spin_unlock(&current->pi_lock);
+#endif
+#endif
 		if (!debug_locks_silent) {
+			stop_trace();
+			user_trace_stop();
+			printk("stopped custom tracer.\n");
 			console_verbose();
 			return 1;
 		}
Index: linux-2.6.22-rc2/scripts/Makefile
===================================================================
--- linux-2.6.22-rc2.orig/scripts/Makefile	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/scripts/Makefile	2007-05-24 15:57:43.000000000 +0200
@@ -7,6 +7,7 @@
 # conmakehash:   Create chartable
 # conmakehash:	 Create arrays for initializing the kernel console tables
 
+hostprogs-$(CONFIG_EVENT_TRACE)  += trace-it
 hostprogs-$(CONFIG_KALLSYMS)     += kallsyms
 hostprogs-$(CONFIG_LOGO)         += pnmtologo
 hostprogs-$(CONFIG_VT)           += conmakehash
Index: linux-2.6.22-rc2/scripts/trace-it.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc2/scripts/trace-it.c	2007-05-24 15:57:43.000000000 +0200
@@ -0,0 +1,79 @@
+
+/*
+ * Copyright (C) 2005, Ingo Molnar <mingo@redhat.com>
+ *
+ * user-triggered tracing.
+ *
+ * The -rt kernel has a built-in kernel tracer, which will trace
+ * all kernel function calls (and a couple of special events as well),
+ * by using a build-time gcc feature that instruments all kernel
+ * functions.
+ *
+ * The tracer is highly automated for a number of latency tracing purposes,
+ * but it can also be switched into 'user-triggered' mode, which is a
+ * half-automatic tracing mode where userspace apps start and stop the
+ * tracer. This file shows a dumb example how to turn user-triggered
+ * tracing on, and how to start/stop tracing. Note that if you do
+ * multiple start/stop sequences, the kernel will do a maximum search
+ * over their latencies, and will keep the trace of the largest latency
+ * in /proc/latency_trace. The maximums are also reported to the kernel
+ * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
+ *
+ * For the tracer to be activated, turn on CONFIG_EVENT_TRACING
+ * in the .config, rebuild the kernel and boot into it. The trace will
+ * get _alot_ more verbose if you also turn on CONFIG_FUNCTION_TRACING,
+ * every kernel function call will be put into the trace. Note that
+ * CONFIG_FUNCTION_TRACING has significant runtime overhead, so you dont
+ * want to use it for performance testing :)
+ */
+
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <linux/unistd.h>
+
+int main (int argc, char **argv)
+{
+	int ret;
+
+	if (getuid() != 0) {
+		fprintf(stderr, "needs to run as root.\n");
+		exit(1);
+	}
+	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
+	if (ret) {
+		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
+		exit(1);
+	}
+	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
+	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 0 > /proc/sys/kernel/wakeup_timing");
+	system("echo 1 > /proc/sys/kernel/trace_enabled");
+	system("echo 1 > /proc/sys/kernel/mcount_enabled");
+	system("echo 0 > /proc/sys/kernel/trace_freerunning");
+	system("echo 0 > /proc/sys/kernel/trace_print_on_crash");
+	system("echo 0 > /proc/sys/kernel/trace_verbose");
+	system("echo 0 > /proc/sys/kernel/preempt_thresh 2>/dev/null");
+	system("echo 0 > /proc/sys/kernel/preempt_max_latency 2>/dev/null");
+
+	// start tracing
+	if (prctl(0, 1)) {
+		fprintf(stderr, "trace-it: couldnt start tracing!\n");
+		return 1;
+	}
+	usleep(10000000);
+	if (prctl(0, 0)) {
+		fprintf(stderr, "trace-it: couldnt stop tracing!\n");
+		return 1;
+	}
+
+	system("echo 0 > /proc/sys/kernel/trace_user_triggered");
+	system("echo 0 > /proc/sys/kernel/trace_enabled");
+	system("cat /proc/latency_trace");
+
+	return 0;
+}
+
+
Index: linux-2.6.22-rc2/arch/i386/boot/compressed/Makefile
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/boot/compressed/Makefile	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/boot/compressed/Makefile	2007-05-24 15:57:42.000000000 +0200
@@ -9,6 +9,7 @@ targets		:= vmlinux vmlinux.bin vmlinux.
 EXTRA_AFLAGS	:= -traditional
 
 LDFLAGS_vmlinux := -T
+CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2  -fno-strict-aliasing
 CFLAGS_misc.o += -fPIC
 hostprogs-y	:= relocs
 
Index: linux-2.6.22-rc2/arch/i386/kernel/Makefile
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/Makefile	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/Makefile	2007-05-24 15:57:42.000000000 +0200
@@ -21,6 +21,7 @@ obj-$(CONFIG_APM)		+= apm.o
 obj-$(CONFIG_X86_SMP)		+= smp.o smpboot.o tsc_sync.o
 obj-$(CONFIG_SMP)		+= smpcommon.o
 obj-$(CONFIG_X86_TRAMPOLINE)	+= trampoline.o
+obj-$(CONFIG_MCOUNT)		+= mcount-wrapper.o
 obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC)	+= apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC)	+= io_apic.o
Index: linux-2.6.22-rc2/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/apic.c	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/apic.c	2007-05-24 15:57:43.000000000 +0200
@@ -577,6 +577,8 @@ void fastcall smp_apic_timer_interrupt(s
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
+	trace_special(regs->eip, 1, 0);
+
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
 	 * because timer handling can be slow.
Index: linux-2.6.22-rc2/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/entry.S	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/entry.S	2007-05-24 15:57:42.000000000 +0200
@@ -329,6 +329,11 @@ sysenter_past_esp:
 	pushl %eax
 	CFI_ADJUST_CFA_OFFSET 4
 	SAVE_ALL
+#ifdef CONFIG_EVENT_TRACE
+	pushl %edx; pushl %ecx; pushl %ebx; pushl %eax
+	call sys_call
+	popl %eax; popl %ebx; popl %ecx; popl %edx
+#endif
 	GET_THREAD_INFO(%ebp)
 
 	/* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
@@ -343,6 +348,11 @@ sysenter_past_esp:
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx
 	jne syscall_exit_work
+#ifdef CONFIG_EVENT_TRACE
+	pushl %eax
+	call sys_ret
+	popl %eax
+#endif
 /* if something modifies registers it must also disable sysexit */
 	movl PT_EIP(%esp), %edx
 	movl PT_OLDESP(%esp), %ecx
@@ -366,6 +376,11 @@ ENTRY(system_call)
 	pushl %eax			# save orig_eax
 	CFI_ADJUST_CFA_OFFSET 4
 	SAVE_ALL
+#ifdef CONFIG_EVENT_TRACE
+	pushl %edx; pushl %ecx; pushl %ebx; pushl %eax
+	call sys_call
+	popl %eax; popl %ebx; popl %ecx; popl %edx
+#endif
 	GET_THREAD_INFO(%ebp)
 	testl $TF_MASK,PT_EFLAGS(%esp)
 	jz no_singlestep
Index: linux-2.6.22-rc2/arch/i386/kernel/hpet.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/hpet.c	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/hpet.c	2007-05-24 15:57:42.000000000 +0200
@@ -205,7 +205,7 @@ static int hpet_next_event(unsigned long
 /*
  * Clock source related code
  */
-static cycle_t read_hpet(void)
+static cycle_t notrace read_hpet(void)
 {
 	return (cycle_t)hpet_readl(HPET_COUNTER);
 }
Index: linux-2.6.22-rc2/arch/i386/kernel/irq.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/irq.c	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/irq.c	2007-05-24 15:57:42.000000000 +0200
@@ -68,7 +68,7 @@ static union irq_ctx *softirq_ctx[NR_CPU
  * SMP cross-CPU interrupts have their own specific
  * handlers).
  */
-fastcall unsigned int do_IRQ(struct pt_regs *regs)
+fastcall notrace unsigned int do_IRQ(struct pt_regs *regs)
 {	
 	struct pt_regs *old_regs;
 	/* high bit used in ret_from_ code */
@@ -87,6 +87,11 @@ fastcall unsigned int do_IRQ(struct pt_r
 
 	old_regs = set_irq_regs(regs);
 	irq_enter();
+#ifdef CONFIG_EVENT_TRACE
+	if (irq == trace_user_trigger_irq)
+		user_trace_start();
+#endif
+	trace_special(regs->eip, irq, 0);
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	/* Debugging check for stack overflow: is there less than 1KB free? */
 	{
Index: linux-2.6.22-rc2/arch/i386/kernel/mcount-wrapper.S
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc2/arch/i386/kernel/mcount-wrapper.S	2007-05-24 15:57:42.000000000 +0200
@@ -0,0 +1,27 @@
+/*
+ *  linux/arch/i386/mcount-wrapper.S
+ *
+ *  Copyright (C) 2004 Ingo Molnar
+ */
+
+.globl mcount
+mcount:
+
+	cmpl $0, mcount_enabled
+	jz out
+
+	push %ebp
+	mov %esp, %ebp
+	pushl %eax
+	pushl %ecx
+	pushl %edx
+
+	call __mcount
+
+	popl %edx
+	popl %ecx
+	popl %eax
+	popl %ebp
+out:
+	ret
+
Index: linux-2.6.22-rc2/arch/i386/kernel/traps.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/traps.c	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/traps.c	2007-05-24 15:57:42.000000000 +0200
@@ -136,7 +136,7 @@ static inline unsigned long print_contex
 
 #define MSG(msg) ops->warning(data, msg)
 
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
+void notrace dump_trace(struct task_struct *task, struct pt_regs *regs,
 	        unsigned long *stack,
 		struct stacktrace_ops *ops, void *data)
 {
@@ -222,6 +222,7 @@ show_trace_log_lvl(struct task_struct *t
 {
 	dump_trace(task, regs, stack, &print_trace_ops, log_lvl);
 	printk("%s =======================\n", log_lvl);
+	print_traces(task);
 }
 
 void show_trace(struct task_struct *task, struct pt_regs *regs,
Index: linux-2.6.22-rc2/arch/i386/kernel/tsc.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/kernel/tsc.c	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/kernel/tsc.c	2007-05-24 15:57:42.000000000 +0200
@@ -255,7 +255,7 @@ core_initcall(cpufreq_tsc);
 
 static unsigned long current_tsc_khz = 0;
 
-static cycle_t read_tsc(void)
+static notrace cycle_t read_tsc(void)
 {
 	cycle_t ret;
 
Index: linux-2.6.22-rc2/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/mm/fault.c	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/mm/fault.c	2007-05-24 15:57:42.000000000 +0200
@@ -483,6 +483,7 @@ bad_area_nosemaphore:
 		nr = (address - idt_descr.address) >> 3;
 
 		if (nr == 6) {
+		stop_trace();
 			do_invalid_op(regs, 0);
 			return;
 		}
Index: linux-2.6.22-rc2/arch/i386/mm/init.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/mm/init.c	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/mm/init.c	2007-05-24 15:57:42.000000000 +0200
@@ -193,7 +193,7 @@ static inline int page_kills_ppro(unsign
 	return 0;
 }
 
-int page_is_ram(unsigned long pagenr)
+int notrace page_is_ram(unsigned long pagenr)
 {
 	int i;
 	unsigned long addr, end;
Index: linux-2.6.22-rc2/include/asm-i386/processor.h
===================================================================
--- linux-2.6.22-rc2.orig/include/asm-i386/processor.h	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/include/asm-i386/processor.h	2007-05-24 15:57:43.000000000 +0200
@@ -602,7 +602,9 @@ static inline void load_esp0(struct tss_
  * clear %ecx since some cpus (Cyrix MII) do not set or clear %ecx
  * resulting in stale register contents being returned.
  */
-static inline void cpuid(unsigned int op, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx)
+static inline void
+cpuid(unsigned int op, unsigned int *eax, unsigned int *ebx,
+      unsigned int *ecx, unsigned int *edx)
 {
 	*eax = op;
 	*ecx = 0;
@@ -610,8 +612,9 @@ static inline void cpuid(unsigned int op
 }
 
 /* Some CPUID calls want 'count' to be placed in ecx */
-static inline void cpuid_count(int op, int count, int *eax, int *ebx, int *ecx,
-			       int *edx)
+static inline void
+cpuid_count(int op, int count, unsigned int *eax, unsigned int *ebx,
+	    unsigned int *ecx, unsigned int *edx)
 {
 	*eax = op;
 	*ecx = count;
Index: linux-2.6.22-rc2/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/ia32/ia32entry.S	2007-05-22 16:25:09.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/ia32/ia32entry.S	2007-05-24 15:57:42.000000000 +0200
@@ -120,7 +120,9 @@ sysenter_do_call:	
 	cmpl	$(IA32_NR_syscalls-1),%eax
 	ja	ia32_badsys
 	IA32_ARG_FIXUP 1
+	TRACE_SYS_IA32_CALL
 	call	*ia32_sys_call_table(,%rax,8)
+	TRACE_SYS_RET
 	movq	%rax,RAX-ARGOFFSET(%rsp)
 	GET_THREAD_INFO(%r10)
 	cli
@@ -229,7 +231,9 @@ cstar_do_call:	
 	cmpl $IA32_NR_syscalls-1,%eax
 	ja  ia32_badsys
 	IA32_ARG_FIXUP 1
+	TRACE_SYS_IA32_CALL
 	call *ia32_sys_call_table(,%rax,8)
+	TRACE_SYS_RET
 	movq %rax,RAX-ARGOFFSET(%rsp)
 	GET_THREAD_INFO(%r10)
 	cli
@@ -323,8 +327,10 @@ ia32_do_syscall:	
 	cmpl $(IA32_NR_syscalls-1),%eax
 	ja  ia32_badsys
 	IA32_ARG_FIXUP
+	TRACE_SYS_IA32_CALL
 	call *ia32_sys_call_table(,%rax,8) # xxx: rip relative
 ia32_sysret:
+	TRACE_SYS_RET
 	movq %rax,RAX-ARGOFFSET(%rsp)
 	jmp int_ret_from_sys_call 
 
@@ -394,7 +400,7 @@ END(ia32_ptregs_common)
 
 	.section .rodata,"a"
 	.align 8
-ia32_sys_call_table:
+ENTRY(ia32_sys_call_table)
 	.quad sys_restart_syscall
 	.quad sys_exit
 	.quad stub32_fork
@@ -719,4 +725,7 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+#ifdef CONFIG_EVENT_TRACE
+ .globl ia32_syscall_end
+#endif
 ia32_syscall_end:
Index: linux-2.6.22-rc2/arch/x86_64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/entry.S	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/entry.S	2007-05-24 15:57:42.000000000 +0200
@@ -53,6 +53,47 @@
 
 	.code64
 
+#ifdef CONFIG_EVENT_TRACE
+
+ENTRY(mcount)
+	cmpl $0, mcount_enabled
+	jz out
+
+	push %rbp
+	mov %rsp,%rbp
+
+	push %r11
+	push %r10
+	push %r9
+	push %r8
+	push %rdi
+	push %rsi
+	push %rdx
+	push %rcx
+	push %rax
+
+	mov 0x0(%rbp),%rax
+	mov 0x8(%rbp),%rdi
+	mov 0x8(%rax),%rsi
+
+	call   __trace
+
+	pop %rax
+	pop %rcx
+	pop %rdx
+	pop %rsi
+	pop %rdi
+	pop %r8
+	pop %r9
+	pop %r10
+	pop %r11
+
+	pop %rbp
+out:
+	ret
+
+#endif
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif	
@@ -234,7 +275,9 @@ ENTRY(system_call)
 	cmpq $__NR_syscall_max,%rax
 	ja badsys
 	movq %r10,%rcx
+	TRACE_SYS_CALL
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
+	TRACE_SYS_RET
 	movq %rax,RAX-ARGOFFSET(%rsp)
 /*
  * Syscall return path ending with SYSRET (fast path)
@@ -316,7 +359,9 @@ tracesys:			 
 	cmova %rcx,%rax
 	ja  1f
 	movq %r10,%rcx	/* fixup for C */
+	TRACE_SYS_CALL
 	call *sys_call_table(,%rax,8)
+ 	TRACE_SYS_RET
 1:	movq %rax,RAX-ARGOFFSET(%rsp)
 	/* Use IRET because user could have changed frame */
 		
Index: linux-2.6.22-rc2/arch/x86_64/kernel/head64.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/head64.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/head64.c	2007-05-24 15:57:42.000000000 +0200
@@ -10,6 +10,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/percpu.h>
+#include <linux/sched.h>
 
 #include <asm/processor.h>
 #include <asm/proto.h>
@@ -58,7 +59,7 @@ static void __init copy_bootdata(char *r
 	memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
 }
 
-void __init x86_64_start_kernel(char * real_mode_data)
+void __init notrace x86_64_start_kernel(char * real_mode_data)
 {
 	int i;
 
Index: linux-2.6.22-rc2/arch/x86_64/kernel/irq.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/irq.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/irq.c	2007-05-24 15:57:42.000000000 +0200
@@ -115,6 +115,12 @@ asmlinkage unsigned int do_IRQ(struct pt
 	irq_enter();
 	irq = __get_cpu_var(vector_irq)[vector];
 
+#ifdef CONFIG_EVENT_TRACE
+	if (irq == trace_user_trigger_irq)
+		user_trace_start();
+#endif
+	trace_special(regs->rip, irq, 0);
+
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
 	stack_overflow_check(regs);
 #endif
Index: linux-2.6.22-rc2/arch/x86_64/kernel/setup64.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/setup64.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/setup64.c	2007-05-24 15:57:42.000000000 +0200
@@ -114,7 +114,7 @@ void __init setup_per_cpu_areas(void)
 	}
 } 
 
-void pda_init(int cpu)
+void notrace pda_init(int cpu)
 { 
 	struct x8664_pda *pda = cpu_pda(cpu);
 
@@ -188,7 +188,7 @@ unsigned long kernel_eflags;
  * 'CPU state barrier', nothing should get across.
  * A lot of state is already set up in PDA init.
  */
-void __cpuinit cpu_init (void)
+void __cpuinit notrace cpu_init (void)
 {
 	int cpu = stack_smp_processor_id();
 	struct tss_struct *t = &per_cpu(init_tss, cpu);
Index: linux-2.6.22-rc2/arch/x86_64/kernel/smpboot.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/smpboot.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/smpboot.c	2007-05-24 15:57:42.000000000 +0200
@@ -318,7 +318,7 @@ static inline void set_cpu_sibling_map(i
 /*
  * Setup code on secondary processor (after comming out of the trampoline)
  */
-void __cpuinit start_secondary(void)
+void __cpuinit notrace start_secondary(void)
 {
 	/*
 	 * Dont put anything before smp_callin(), SMP
Index: linux-2.6.22-rc2/arch/x86_64/kernel/traps.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/traps.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/traps.c	2007-05-24 15:57:42.000000000 +0200
@@ -346,6 +346,7 @@ show_trace(struct task_struct *tsk, stru
 	printk("\nCall Trace:\n");
 	dump_trace(tsk, regs, stack, &print_trace_ops, NULL);
 	printk("\n");
+	print_traces(tsk);
 }
 
 static void
Index: linux-2.6.22-rc2/arch/x86_64/kernel/vsyscall.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/vsyscall.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/vsyscall.c	2007-05-24 15:57:42.000000000 +0200
@@ -43,7 +43,7 @@
 #include <asm/desc.h>
 #include <asm/topology.h>
 
-#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr)))
+#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) notrace
 #define __syscall_clobber "r11","rcx","memory"
 #define __pa_vsymbol(x)			\
 	({unsigned long v;  		\
Index: linux-2.6.22-rc2/include/asm-x86_64/calling.h
===================================================================
--- linux-2.6.22-rc2.orig/include/asm-x86_64/calling.h	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/include/asm-x86_64/calling.h	2007-05-24 15:57:43.000000000 +0200
@@ -160,3 +160,53 @@
 	.macro icebp
 	.byte 0xf1
 	.endm
+
+/*
+ * latency-tracing helpers:
+ */
+
+	.macro TRACE_SYS_CALL
+
+#ifdef CONFIG_EVENT_TRACE
+	SAVE_ARGS
+
+	mov     %rdx, %rcx
+	mov     %rsi, %rdx
+	mov     %rdi, %rsi
+	mov     %rax, %rdi
+
+	call sys_call
+
+	RESTORE_ARGS
+#endif
+	.endm
+
+
+	.macro TRACE_SYS_IA32_CALL
+
+#ifdef CONFIG_EVENT_TRACE
+	SAVE_ARGS
+
+	mov     %rdx, %rcx
+	mov     %rsi, %rdx
+	mov     %rdi, %rsi
+	mov     %rax, %rdi
+
+	call sys_ia32_call
+
+	RESTORE_ARGS
+#endif
+	.endm
+
+	.macro TRACE_SYS_RET
+
+#ifdef CONFIG_EVENT_TRACE
+	SAVE_ARGS
+
+	mov     %rax, %rdi
+
+	call sys_ret
+
+	RESTORE_ARGS
+#endif
+	.endm
Index: linux-2.6.22-rc2/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22-rc2.orig/include/asm-x86_64/unistd.h	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/include/asm-x86_64/unistd.h	2007-05-24 15:57:43.000000000 +0200
@@ -11,6 +11,8 @@
  * Note: holes are not allowed.
  */
 
+#define NR_syscalls (__NR_syscall_max+1)
+
 /* at least 8 syscall per cacheline */
 #define __NR_read                                0
 __SYSCALL(__NR_read, sys_read)
Index: linux-2.6.22-rc2/include/linux/prctl.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/prctl.h	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/prctl.h	2007-05-24 15:57:43.000000000 +0200
@@ -3,6 +3,7 @@
 
 /* Values to pass as first argument to prctl() */
 
+#define PR_SET_TRACING    0  /* Second arg is tracing on/off */
 #define PR_SET_PDEATHSIG  1  /* Second arg is a signal */
 #define PR_GET_PDEATHSIG  2  /* Second arg is a ptr to return the signal */
 
Index: linux-2.6.22-rc2/kernel/sys.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/sys.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/sys.c	2007-05-24 15:57:42.000000000 +0200
@@ -2149,6 +2149,14 @@ asmlinkage long sys_prctl(int option, un
 {
 	long error;
 
+#ifdef CONFIG_EVENT_TRACE
+	if (option == PR_SET_TRACING) {
+		if (arg2)
+			return user_trace_start();
+		return user_trace_stop();
+	}
+#endif
+
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
 	if (error)
 		return error;
Index: linux-2.6.22-rc2/kernel/time/timekeeping.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/time/timekeeping.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/time/timekeeping.c	2007-05-24 15:57:42.000000000 +0200
@@ -195,6 +195,34 @@ static void change_clocksource(void)
 	printk(KERN_INFO "Time: %s clocksource has been installed.\n",
 	       clock->name);
 }
+
+cycle_t notrace get_monotonic_cycles(void)
+{
+	cycle_t cycle_now, cycle_delta;
+
+	/* read clocksource: */
+	cycle_now = clocksource_read(clock);
+
+	/* calculate the delta since the last update_wall_time: */
+	cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
+
+	return clock->cycle_last + cycle_delta;
+}
+
+unsigned long notrace cycles_to_usecs(cycle_t cycles)
+{
+	u64 ret = cyc2ns(clock, cycles);
+
+	do_div(ret, 1000);
+
+	return ret;
+}
+
+cycle_t notrace usecs_to_cycles(unsigned long usecs)
+{
+	return ns2cyc(clock, (u64)usecs * 1000);
+}
+
 #else
 static inline void change_clocksource(void) { }
 #endif
Index: linux-2.6.22-rc2/arch/x86_64/kernel/stacktrace.c
===================================================================
--- linux-2.6.22-rc2.orig/arch/x86_64/kernel/stacktrace.c	2007-05-22 16:25:10.000000000 +0200
+++ linux-2.6.22-rc2/arch/x86_64/kernel/stacktrace.c	2007-05-24 15:57:42.000000000 +0200
@@ -24,7 +24,7 @@ static int save_stack_stack(void *data, 
 	return -1;
 }
 
-static void save_stack_address(void *data, unsigned long addr)
+static void notrace save_stack_address(void *data, unsigned long addr)
 {
 	struct stack_trace *trace = (struct stack_trace *)data;
 	if (trace->skip > 0) {
Index: linux-2.6.22-rc2/kernel/hrtimer.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/hrtimer.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/hrtimer.c	2007-05-24 15:57:42.000000000 +0200
@@ -553,6 +553,8 @@ static inline int hrtimer_enqueue_reprog
 	return 0;
 }
 
+void trace_start_ht_debug(void);
+
 /*
  * Switch to high resolution mode
  */
@@ -576,6 +578,9 @@ static int hrtimer_switch_to_hres(void)
 
 	tick_setup_sched_timer();
 
+	if (!cpu)
+		trace_start_ht_debug();
+
 	/* "Retrigger" the interrupt to get things going */
 	retrigger_next_event(NULL);
 	local_irq_restore(flags);
Index: linux-2.6.22-rc2/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.22-rc2.orig/kernel/time/tick-sched.c	2007-05-22 16:25:19.000000000 +0200
+++ linux-2.6.22-rc2/kernel/time/tick-sched.c	2007-05-24 15:57:42.000000000 +0200
@@ -167,9 +167,21 @@ void tick_nohz_stop_sched_tick(void)
 		goto end;
 
 	cpu = smp_processor_id();
-	if (unlikely(local_softirq_pending()))
-		printk(KERN_ERR "NOHZ: local_softirq_pending %02x\n",
-		       local_softirq_pending());
+	if (unlikely(local_softirq_pending())) {
+		static int ratelimit;
+
+		if (ratelimit < 10) {
+			if (!cpu) {
+				trace_special(0, 0, 0);
+				user_trace_stop();
+				ratelimit = 10;
+			}
+			printk(KERN_ERR
+			       "NOHZ: local_softirq_pending %02x on CPU %d\n",
+			       local_softirq_pending(), cpu);
+			ratelimit++;
+		}
+	}
 
 	now = ktime_get();
 	/*

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 14:03   ` Miklos Szeredi
@ 2007-05-24 14:10     ` Ingo Molnar
  2007-05-24 14:28       ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-05-24 14:10 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-acpi, tglx


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> > how reproducable are these lockups - could you possibly trace it? If 
> > yes then please apply:
> > 
> >   http://www.tglx.de/private/tglx/ht-debug/tracer.diff
> 
> With this patch boot stops at segfaulting fsck.  I enabled all the new 
> config options, is that not a good idea?  Which one exactly do I need?

hm, you should only need these:

 CONFIG_EVENT_TRACE=y
 CONFIG_FUNCTION_TRACE=y
 # CONFIG_WAKEUP_TIMING is not set
 # CONFIG_CRITICAL_IRQSOFF_TIMING is not set
 CONFIG_MCOUNT=y

does it boot with these?

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 14:10     ` Ingo Molnar
@ 2007-05-24 14:28       ` Miklos Szeredi
  2007-05-24 14:42         ` Ingo Molnar
  2007-05-24 14:44         ` Ingo Molnar
  0 siblings, 2 replies; 88+ messages in thread
From: Miklos Szeredi @ 2007-05-24 14:28 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, linux-acpi, tglx

> > > how reproducable are these lockups - could you possibly trace it? If 
> > > yes then please apply:
> > > 
> > >   http://www.tglx.de/private/tglx/ht-debug/tracer.diff
> > 
> > With this patch boot stops at segfaulting fsck.  I enabled all the new 
> > config options, is that not a good idea?  Which one exactly do I need?
> 
> hm, you should only need these:
> 
>  CONFIG_EVENT_TRACE=y
>  CONFIG_FUNCTION_TRACE=y
>  # CONFIG_WAKEUP_TIMING is not set
>  # CONFIG_CRITICAL_IRQSOFF_TIMING is not set
>  CONFIG_MCOUNT=y
> 
> does it boot with these?

Nope.  Same segfault.  If I try to continue manually with 'init 5',
then init segfaults as well :(

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 14:28       ` Miklos Szeredi
@ 2007-05-24 14:42         ` Ingo Molnar
  2007-05-24 14:44         ` Ingo Molnar
  1 sibling, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-05-24 14:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-acpi, tglx


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> > hm, you should only need these:
> > 
> >  CONFIG_EVENT_TRACE=y
> >  CONFIG_FUNCTION_TRACE=y
> >  # CONFIG_WAKEUP_TIMING is not set
> >  # CONFIG_CRITICAL_IRQSOFF_TIMING is not set
> >  CONFIG_MCOUNT=y
> > 
> > does it boot with these?
> 
> Nope.  Same segfault.  If I try to continue manually with 'init 5', 
> then init segfaults as well :(

does it go away if you turn off CONFIG_FUNCTION_TRACE? (that will make 
the trace a lot less verbose, but still informative)

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 14:28       ` Miklos Szeredi
  2007-05-24 14:42         ` Ingo Molnar
@ 2007-05-24 14:44         ` Ingo Molnar
  2007-05-24 17:09           ` Miklos Szeredi
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-05-24 14:44 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-acpi, tglx


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> >  CONFIG_EVENT_TRACE=y
> >  CONFIG_FUNCTION_TRACE=y
> >  # CONFIG_WAKEUP_TIMING is not set
> >  # CONFIG_CRITICAL_IRQSOFF_TIMING is not set
> >  CONFIG_MCOUNT=y
> > 
> > does it boot with these?
> 
> Nope.  Same segfault.  If I try to continue manually with 'init 5', 
> then init segfaults as well :(

could you just try v2.6.21 plus the -rt patch, which has the tracer 
built-in? That's a combination that should work well. You can pick it up 
from:

   http://people.redhat.com/mingo/realtime-preempt/

same config options as above. If you dont turn on PREEMPT_RT you'll get 
an almost-vanilla kernel.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 14:44         ` Ingo Molnar
@ 2007-05-24 17:09           ` Miklos Szeredi
  2007-05-24 21:01             ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-05-24 17:09 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, linux-acpi, tglx

> could you just try v2.6.21 plus the -rt patch, which has the tracer 
> built-in? That's a combination that should work well. You can pick it up 
> from:
> 
>    http://people.redhat.com/mingo/realtime-preempt/
> 
> same config options as above. If you dont turn on PREEMPT_RT you'll get 
> an almost-vanilla kernel.

2.6.22-rc2, only EVENT_TRACE - boots, can't rerpoduce
2.6.21-vanila - can reproduce
2.6.21-rt7, trace options off - can reproduce
2.6.21-rt7, trace options on - can't reproduce

Possibly something timing related, that's altered by the trace code.
I tried the trace kernel without starting the trace app, but still no
bug.

Any other ideas?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 17:09           ` Miklos Szeredi
@ 2007-05-24 21:01             ` Ingo Molnar
  2007-05-25  9:51               ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-05-24 21:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, linux-acpi, tglx


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> 2.6.22-rc2, only EVENT_TRACE - boots, can't rerpoduce
> 2.6.21-vanila - can reproduce
> 2.6.21-rt7, trace options off - can reproduce
> 2.6.21-rt7, trace options on - can't reproduce
> 
> Possibly something timing related, that's altered by the trace code. I 
> tried the trace kernel without starting the trace app, but still no 
> bug.
> 
> Any other ideas?

perhaps try 2.6.21-rt7 with EVENT_TRACE on (the other trace options off) 
- does that hide the bug too?

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 12:04 [BUG] long freezes on thinkpad t60 Miklos Szeredi
  2007-05-24 12:54 ` Ingo Molnar
@ 2007-05-24 22:08 ` Henrique de Moraes Holschuh
  2007-05-24 22:13   ` Kok, Auke
  1 sibling, 1 reply; 88+ messages in thread
From: Henrique de Moraes Holschuh @ 2007-05-24 22:08 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-kernel, mingo, linux-acpi

On Thu, 24 May 2007, Miklos Szeredi wrote:
> Tried nmi_watchdog=1, but then the machine locks up hard shortly after
> boot.

NMIs in some thinkpads are bad trouble, they lock up the blasted IBM/Lenovo
SMBIOS if they happen to hit it while it is servicing a SMI, and thinkpads
do SMIs like crazy.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 22:08 ` Henrique de Moraes Holschuh
@ 2007-05-24 22:13   ` Kok, Auke
  2007-05-25  6:58     ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Kok, Auke @ 2007-05-24 22:13 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Miklos Szeredi, linux-kernel, mingo, linux-acpi

Henrique de Moraes Holschuh wrote:
> On Thu, 24 May 2007, Miklos Szeredi wrote:
>> Tried nmi_watchdog=1, but then the machine locks up hard shortly after
>> boot.
> 
> NMIs in some thinkpads are bad trouble, they lock up the blasted IBM/Lenovo
> SMBIOS if they happen to hit it while it is servicing a SMI, and thinkpads
> do SMIs like crazy.

there's also a L1 ASPM issue with the e1000 chipset in the way (for T60/X60 
only). Make sure you are using 2.6.21 or newer. See netdev archives for more on 
that.

Auke

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 22:13   ` Kok, Auke
@ 2007-05-25  6:58     ` Ingo Molnar
  0 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-05-25  6:58 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Henrique de Moraes Holschuh, Miklos Szeredi, linux-kernel, linux-acpi


* Kok, Auke <auke-jan.h.kok@intel.com> wrote:

> Henrique de Moraes Holschuh wrote:
> >On Thu, 24 May 2007, Miklos Szeredi wrote:
> >>Tried nmi_watchdog=1, but then the machine locks up hard shortly after
> >>boot.
> >
> >NMIs in some thinkpads are bad trouble, they lock up the blasted IBM/Lenovo
> >SMBIOS if they happen to hit it while it is servicing a SMI, and thinkpads
> >do SMIs like crazy.
> 
> there's also a L1 ASPM issue with the e1000 chipset in the way (for 
> T60/X60 only). Make sure you are using 2.6.21 or newer. See netdev 
> archives for more on that.

Miklos is using latest -git and he has nmi_watchdog disabled - still the 
long pauses persist. (i've never seen that on my t60)

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-24 21:01             ` Ingo Molnar
@ 2007-05-25  9:51               ` Miklos Szeredi
  2007-06-14 16:04                 ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-05-25  9:51 UTC (permalink / raw)
  To: mingo; +Cc: chris, linux-kernel, linux-acpi, tglx

> > 2.6.22-rc2, only EVENT_TRACE - boots, can't rerpoduce
> > 2.6.21-vanila - can reproduce
> > 2.6.21-rt7, trace options off - can reproduce
> > 2.6.21-rt7, trace options on - can't reproduce
> > 
> > Possibly something timing related, that's altered by the trace code. I 
> > tried the trace kernel without starting the trace app, but still no 
> > bug.
> > 
> > Any other ideas?
> 
> perhaps try 2.6.21-rt7 with EVENT_TRACE on (the other trace options off) 
> - does that hide the bug too?

The option itself doesn't hide the bug this time, I got one freeze
pretty quickly.  But after starting the trace-it-1sec loop, I couldn't
get it any more.

Normally I get a freeze after 3-5 minutes of testing, but with
trace-it-1sec there's still nothing after 30 minutes.

If I stop trace-it, and do "echo 0 > /proc/sys/kernel/trace_enabled",
I get the freeze again. It's a perfect heisenbug.

This issue came up when I was testing a userspace fuse bug, and now
the reporter of that bug (added to CC) says that he also sometimes
experienced a hard lockup during testing but ignored it up to now.  So
we may yet get some info from him.

It could be something fuse related, but it's very hard to imagine how
fuse could trigger such a low-level problem.

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-05-25  9:51               ` Miklos Szeredi
@ 2007-06-14 16:04                 ` Miklos Szeredi
  2007-06-15 21:25                   ` Chuck Ebbert
  2007-06-16 10:37                   ` Ingo Molnar
  0 siblings, 2 replies; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-14 16:04 UTC (permalink / raw)
  To: mingo; +Cc: chris, linux-kernel, tglx

I've got some more info about this bug.  It is gathered with
nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of
calling die_nmi() just prints a line and calls show_registers().

This makes the machine actually survive the NMI tracing.  The attached
traces are gathered over about an hour of stressing.  An mp3 player is
also going on continually, and I can hear a couple of seconds of
"looping" quite often, but it gets as far as the NMI trace only
rarely.  AFAICS only the last pair shows a trace for both CPUs during
the same "freeze".

I've put some effort into understanding what's going on, but I'm not
familiar with how interrupts work and that sort of thing.

The pattern that emerges is that on CPU0 we have an interrupt, which
is trying to acquire the rq lock, but can't.

On CPU1 we have strace which is doing wait_task_inactive(), which sort
of spins acquiring and releasing the rq lock.  I've checked some of
the traces and it is just before acquiring the rq lock, or just after
releasing it, but is not actually holding it.

So is it possible that wait_task_inactive() could be starving the
other waiters of the rq spinlock?  Any ideas?

Thanks,
Miklos

NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: fuse e1000
Pid: 4625, comm: strace Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff80227ce5>]  [<ffffffff80227ce5>] task_rq_lock+0x14/0x6f
RSP: 0018:ffff81001cf17ed8  EFLAGS: 00000046
RAX: 0000000000000246 RBX: ffff81001c9da540 RCX: ffff81003fd1e5e8
RDX: 0000000000000001 RSI: ffff81001cf17f10 RDI: ffff81001c9da540
RBP: ffff81001cf17ef8 R08: 0000000000000003 R09: 0000000000000000
R10: 00007fffbcd7c018 R11: 0000000000000246 R12: ffff81001c9da540
R13: ffff81001cf17f10 R14: ffff81001c9da540 R15: 00007fffbcd7d44c
FS:  00002b05ee28a6f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaac02f000 CR3: 000000001ce52000 CR4: 00000000000006e0
Process strace (pid: 4625, threadinfo ffff81001cf16000, task ffff810037d35860)
Stack:  ffff81001c9da540 ffff81001c9da540 00007fffbcd7c018 0000000000000078
 ffff81001cf17f28 ffffffff8022c252 ffff810037d35860 0000000000000246
 ffff81001c9da540 0000000000000000 0000000000000000 ffffffff802365e3
Call Trace:
 [<ffffffff8022c252>] wait_task_inactive+0x1a/0x5f
 [<ffffffff802365e3>] ptrace_check_attach+0xaf/0xb6
 [<ffffffff8023664c>] sys_ptrace+0x62/0xa2
 [<ffffffff8020954e>] system_call+0x7e/0x83


Code: 49 89 45 00 49 8b 46 08 48 c7 c3 40 a6 5e 80 49 89 dc 8b 40 
NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: fuse e1000
Pid: 4625, comm: strace Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043bb1b>]  [<ffffffff8043bb1b>] _spin_unlock_irqrestore+0x8/0x9
RSP: 0018:ffff81001cf17f00  EFLAGS: 00000246
RAX: 0000000000000001 RBX: ffff81001c9dabe0 RCX: ffff81003fdabee8
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff810001e0e640
RBP: ffff81001cf17f28 R08: 0000000000000003 R09: 0000000000000000
R10: 00007fffbcd7c018 R11: 0000000000000246 R12: ffff81001c9dabe0
R13: 00007fffbcd7c018 R14: 0000000000000078 R15: 00007fffbcd7d44c
FS:  00002b05ee28a6f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b9e0d9c4000 CR3: 000000001ce52000 CR4: 00000000000006e0
Process strace (pid: 4625, threadinfo ffff81001cf16000, task ffff810037d35860)
Stack:  ffffffff8022c27c ffff810037d35860 0000000000000246 ffff81001c9dabe0
 0000000000000000 0000000000000000 ffffffff802365e3 ffff81001c9dabe0
 ffff81001c9dabe0 0000000000000003 ffffffff8023664c 00002b05ee28a690
Call Trace:
 [<ffffffff8022c27c>] wait_task_inactive+0x44/0x5f
 [<ffffffff802365e3>] ptrace_check_attach+0xaf/0xb6
 [<ffffffff8023664c>] sys_ptrace+0x62/0xa2
 [<ffffffff8020954e>] system_call+0x7e/0x83


Code: c3 c7 07 01 00 00 00 fb c3 f0 83 2f 01 79 05 e8 51 41 ea ff 
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: fuse e1000
Pid: 4663, comm: fusexmp_fh Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043ba34>]  [<ffffffff8043ba34>] _spin_lock+0xa/0xf
RSP: 0018:ffffffff805f2d98  EFLAGS: 00000046
RAX: ffffffff805a2000 RBX: ffffffff805ea640 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff805f2e00 RDI: ffff810001e0e640
RBP: ffffffff805f2dc0 R08: ffff81003fe2ff08 R09: ffffffff803cac69
R10: 0000000000000282 R11: 0000000000000000 R12: ffff810001e0e640
R13: ffffffff805f2e00 R14: ffff81003fe00e00 R15: 0000000000000000
FS:  0000000042804940(0063) GS:ffffffff805a2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b9e0dc00000 CR3: 000000001ce65000 CR4: 00000000000006e0
Process fusexmp_fh (pid: 4663, threadinfo ffff81001ca86000, task ffff81003f1cb520)
Stack:  ffffffff80227d0e 0000000000000003 0000000000000001 ffff81003fe00e00
 ffff810001fd8e58 ffffffff805f2e30 ffffffff80229c58 ffffffff805f2ec0
 0000000000000000 0000001100000011 ffffffff803cac69 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff80227d0e>] task_rq_lock+0x3d/0x6f
 [<ffffffff80229c58>] try_to_wake_up+0x24/0x362
 [<ffffffff803cac69>] azx_interrupt+0x76/0xc3
 [<ffffffff8024071d>] autoremove_wake_function+0x9/0x2e
 [<ffffffff802276e3>] __wake_up_common+0x3e/0x68
 [<ffffffff80227c7d>] __wake_up+0x38/0x4f
 [<ffffffff8023d621>] __queue_work+0x23/0x33
 [<ffffffff802f422b>] cursor_timer_handler+0x0/0x2c
 [<ffffffff8023d68d>] queue_work+0x37/0x40
 [<ffffffff802f423f>] cursor_timer_handler+0x14/0x2c
 [<ffffffff80236e2d>] run_timer_softirq+0x130/0x19f
 [<ffffffff80233d7e>] __do_softirq+0x55/0xc3
 [<ffffffff8020a6dc>] call_softirq+0x1c/0x28
 [<ffffffff8020c05b>] do_softirq+0x2c/0x7d
 [<ffffffff8020c162>] do_IRQ+0xb6/0xd4
 [<ffffffff80209a61>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff802386ac>] ptrace_stop+0x170/0x172
 [<ffffffff80210d79>] sched_clock+0x0/0x1a
 [<ffffffff80439a36>] __sched_text_start+0xce/0x76b
 [<ffffffff80240714>] autoremove_wake_function+0x0/0x2e
 [<ffffffff802386ac>] ptrace_stop+0x170/0x172
 [<ffffffff802398cf>] ptrace_notify+0x71/0x96
 [<ffffffff8020c479>] syscall_trace+0x26/0x5d
 [<ffffffff80273550>] sys_read+0x60/0x6e
 [<ffffffff8020974b>] int_very_careful+0x35/0x3e


Code: 7e f9 eb f2 c3 f0 81 2f 00 00 00 01 74 05 e8 19 42 ea ff c3 
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: fuse e1000
Pid: 4628, comm: fusexmp_fh Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043bae1>]  [<ffffffff8043bae1>] _spin_lock_irq+0xb/0x10
RSP: 0018:ffff81001c861dd8  EFLAGS: 00000046
RAX: 000000000000000a RBX: 0000000000000000 RCX: 00000000000000fa
RDX: 0000000000000000 RSI: ffff81001cf64b20 RDI: ffff810001e0e640
RBP: ffff81001c861e80 R08: ffff8100373ff410 R09: ffff81001c861ea0
R10: 0000000000000008 R11: ffff81001c861bc8 R12: 0000000000604030
R13: 00000000408008e0 R14: ffff810001e0e640 R15: 0000000000021000
FS:  0000000040800940(0063) GS:ffffffff805a2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b9e107a7000 CR3: 000000001ce65000 CR4: 00000000000006e0
Process fusexmp_fh (pid: 4628, threadinfo ffff81001c860000, task ffff81001cf64b20)
Stack:  ffffffff80439ab2 0000000000000011 0000000000040004 0000000000001214
 000000000000000a ffff81001cf64b20 0000000000001154 000000f385ef0145
 0000000000001565 0000000000000000 0000000000000000 ffff81001cf64b20
Call Trace:
 [<ffffffff80439ab2>] __sched_text_start+0x14a/0x76b
 [<ffffffff80240714>] autoremove_wake_function+0x0/0x2e
 [<ffffffff802386ac>] ptrace_stop+0x170/0x172
 [<ffffffff802398cf>] ptrace_notify+0x71/0x96
 [<ffffffff8020c479>] syscall_trace+0x26/0x5d
 [<ffffffff80273550>] sys_read+0x60/0x6e
 [<ffffffff8020974b>] int_very_careful+0x35/0x3e


Code: 7e f9 eb f2 c3 53 48 89 fb e8 4b 84 df ff f0 ff 0b 79 09 f3 
NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: fuse e1000
Pid: 4805, comm: strace Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff80227ce5>]  [<ffffffff80227ce5>] task_rq_lock+0x14/0x6f
RSP: 0018:ffff81003b477ed8  EFLAGS: 00000046
RAX: 0000000000000246 RBX: ffff81000afbd6a0 RCX: ffff81003f420f28
RDX: 0000000000000001 RSI: ffff81003b477f10 RDI: ffff81000afbd6a0
RBP: ffff81003b477ef8 R08: 0000000000000003 R09: 0000000000000000
R10: 00007fff64f0c038 R11: 0000000000000246 R12: ffff81000afbd6a0
R13: ffff81003b477f10 R14: ffff81000afbd6a0 R15: 00007fff64f0d46c
FS:  00002b55460fa6f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b27b874c000 CR3: 0000000037140000 CR4: 00000000000006e0
Process strace (pid: 4805, threadinfo ffff81003b476000, task ffff810037d35860)
Stack:  ffff81000afbd6a0 ffff81000afbd6a0 00007fff64f0c038 0000000000000078
 ffff81003b477f28 ffffffff8022c252 ffff810037d35860 0000000000000246
 ffff81000afbd6a0 0000000000000000 0000000000000000 ffffffff802365e3
Call Trace:
 [<ffffffff8022c252>] wait_task_inactive+0x1a/0x5f
 [<ffffffff802365e3>] ptrace_check_attach+0xaf/0xb6
 [<ffffffff8023664c>] sys_ptrace+0x62/0xa2
 [<ffffffff8020954e>] system_call+0x7e/0x83


Code: 49 89 45 00 49 8b 46 08 48 c7 c3 40 a6 5e 80 49 89 dc 8b 40 
NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: fuse e1000
Pid: 4929, comm: strace Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff80227ce5>]  [<ffffffff80227ce5>] task_rq_lock+0x14/0x6f
RSP: 0018:ffff810008b11ed8  EFLAGS: 00000046
RAX: 0000000000000246 RBX: ffff810036b65620 RCX: ffff81003f613ca8
RDX: 0000000000000001 RSI: ffff810008b11f10 RDI: ffff810036b65620
RBP: ffff810008b11ef8 R08: 0000000000000003 R09: 0000000000000000
R10: 00007ffff9c16a78 R11: 0000000000000246 R12: ffff810036b65620
R13: ffff810008b11f10 R14: ffff810036b65620 R15: 00007ffff9c17eac
FS:  00002b87b13f16f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ac368d9c000 CR3: 0000000005a4a000 CR4: 00000000000006e0
Process strace (pid: 4929, threadinfo ffff810008b10000, task ffff810037d35860)
Stack:  ffff810036b65620 ffff810036b65620 00007ffff9c16a78 0000000000000078
 ffff810008b11f28 ffffffff8022c252 ffff810037d35860 0000000000000246
 ffff810036b65620 0000000000000000 0000000000000000 ffffffff802365e3
Call Trace:
 [<ffffffff8022c252>] wait_task_inactive+0x1a/0x5f
 [<ffffffff802365e3>] ptrace_check_attach+0xaf/0xb6
 [<ffffffff8023664c>] sys_ptrace+0x62/0xa2
 [<ffffffff8020954e>] system_call+0x7e/0x83


Code: 49 89 45 00 49 8b 46 08 48 c7 c3 40 a6 5e 80 49 89 dc 8b 40 
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: fuse e1000
Pid: 4936, comm: fusexmp_fh Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043ba34>]  [<ffffffff8043ba34>] _spin_lock+0xa/0xf
RSP: 0018:ffffffff805f2d20  EFLAGS: 00000046
RAX: ffffffff805a2000 RBX: ffffffff805ea640 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff805f2d88 RDI: ffff810001e0e640
RBP: ffffffff805f2d48 R08: ffff810016fc5bf0 R09: ffff81003f595fe0
R10: ffff81003f122200 R11: ffffffff803625b7 R12: ffff810001e0e640
R13: ffffffff805f2d88 R14: ffff81001cfab760 R15: 0000000000000000
FS:  0000000042003940(0063) GS:ffffffff805a2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001cbc038 CR3: 0000000003cc0000 CR4: 00000000000006e0
Process fusexmp_fh (pid: 4936, threadinfo ffff8100120a6000, task ffff81001cf64480)
Stack:  ffffffff80227d0e 0000000000000003 0000000000000001 ffff81001cfab760
 ffff810034577f68 ffffffff805f2db8 ffffffff80229c58 0000000000000001
 0000000000000282 0000000000000003 ffffffff805f2db0 ffffffff80227c7d
Call Trace:
 <IRQ>  [<ffffffff80227d0e>] task_rq_lock+0x3d/0x6f
 [<ffffffff80229c58>] try_to_wake_up+0x24/0x362
 [<ffffffff80227c7d>] __wake_up+0x38/0x4f
 [<ffffffff80237103>] lock_timer_base+0x26/0x4b
 [<ffffffff802276e3>] __wake_up_common+0x3e/0x68
 [<ffffffff80227c7d>] __wake_up+0x38/0x4f
 [<ffffffff802d3e4d>] blk_run_queue+0x28/0x73
 [<ffffffff803b69e3>] snd_timer_user_tinterrupt+0x13c/0x147
 [<ffffffff803b5104>] snd_timer_interrupt+0x264/0x2d4
 [<ffffffff803bf0d7>] snd_pcm_period_elapsed+0x21a/0x283
 [<ffffffff803cac61>] azx_interrupt+0x6e/0xc3
 [<ffffffff80252710>] handle_IRQ_event+0x25/0x53
 [<ffffffff80233d7e>] __do_softirq+0x55/0xc3
 [<ffffffff80253bbf>] handle_fasteoi_irq+0x94/0xd1
 [<ffffffff8020a6dc>] call_softirq+0x1c/0x28
 [<ffffffff8020c118>] do_IRQ+0x6c/0xd4
 [<ffffffff80209a61>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff802385af>] ptrace_stop+0x73/0x172
 [<ffffffff802398cf>] ptrace_notify+0x71/0x96
 [<ffffffff8020c479>] syscall_trace+0x26/0x5d
 [<ffffffff80273550>] sys_read+0x60/0x6e
 [<ffffffff8020974b>] int_very_careful+0x35/0x3e


Code: 7e f9 eb f2 c3 f0 81 2f 00 00 00 01 74 05 e8 19 42 ea ff c3 
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: fuse e1000
Pid: 4933, comm: fusexmp_fh Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043ba34>]  [<ffffffff8043ba34>] _spin_lock+0xa/0xf
RSP: 0018:ffffffff805f2bd0  EFLAGS: 00000046
RAX: ffffffff805a2000 RBX: ffffffff805ea640 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff805f2c38 RDI: ffff810001e0e640
RBP: ffffffff805f2bf8 R08: ffff81003fe2ff08 R09: ffffffff805ea640
R10: ffff810081824000 R11: ffffffff8021a502 R12: ffff810001e0e640
R13: ffffffff805f2c38 R14: ffff81003fe00e00 R15: 0000000000000000
FS:  0000000041001940(0063) GS:ffffffff805a2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaab0190718 CR3: 0000000003cc0000 CR4: 00000000000006e0
Process fusexmp_fh (pid: 4933, threadinfo ffff81000ae20000, task ffff81001cf64b20)
Stack:  ffffffff80227d0e 0000000000000003 0000000000000001 ffff81003fe00e00
 ffff810001fd8e58 ffffffff805f2c68 ffffffff80229c58 0000000000000282
 000000003f580500 0000000000000286 ffff810017f5a240 ffff810017f5a29c
Call Trace:
 <IRQ>  [<ffffffff80227d0e>] task_rq_lock+0x3d/0x6f
 [<ffffffff80229c58>] try_to_wake_up+0x24/0x362
 [<ffffffff803618f9>] scsi_delete_timer+0xd/0x25
 [<ffffffff8024071d>] autoremove_wake_function+0x9/0x2e
 [<ffffffff802276e3>] __wake_up_common+0x3e/0x68
 [<ffffffff80227c7d>] __wake_up+0x38/0x4f
 [<ffffffff8023d621>] __queue_work+0x23/0x33
 [<ffffffff8023d68d>] queue_work+0x37/0x40
 [<ffffffff8039b980>] input_event+0x422/0x44a
 [<ffffffff803a01a5>] atkbd_interrupt+0x248/0x503
 [<ffffffff8020c162>] do_IRQ+0xb6/0xd4
 [<ffffffff8039882a>] serio_interrupt+0x37/0x6f
 [<ffffffff80399473>] i8042_interrupt+0x1f4/0x20a
 [<ffffffff80252710>] handle_IRQ_event+0x25/0x53
 [<ffffffff80253ce0>] handle_edge_irq+0xe4/0x128
 [<ffffffff8020c118>] do_IRQ+0x6c/0xd4
 [<ffffffff80209a61>] ret_from_intr+0x0/0xa
 [<ffffffff8021a502>] flat_send_IPI_mask+0x0/0x4c
 [<ffffffff80233d73>] __do_softirq+0x4a/0xc3
 [<ffffffff8020a6dc>] call_softirq+0x1c/0x28
 [<ffffffff8020c05b>] do_softirq+0x2c/0x7d
 [<ffffffff8020c162>] do_IRQ+0xb6/0xd4
 [<ffffffff80209a61>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff802386ac>] ptrace_stop+0x170/0x172
 [<ffffffff804399f1>] __sched_text_start+0x89/0x76b
 [<ffffffff80240714>] autoremove_wake_function+0x0/0x2e
 [<ffffffff802386ac>] ptrace_stop+0x170/0x172
 [<ffffffff802398cf>] ptrace_notify+0x71/0x96
 [<ffffffff8020c479>] syscall_trace+0x26/0x5d
 [<ffffffff80273550>] sys_read+0x60/0x6e
 [<ffffffff8020974b>] int_very_careful+0x35/0x3e


Code: 7e f9 eb f2 c3 f0 81 2f 00 00 00 01 74 05 e8 19 42 ea ff c3 
NMI Watchdog detected LOCKUP on CPU 0
CPU 0 
Modules linked in: fuse e1000
Pid: 4936, comm: fusexmp_fh Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043bae1>]  [<ffffffff8043bae1>] _spin_lock_irq+0xb/0x10
RSP: 0018:ffff8100120a7dd8  EFLAGS: 00000046
RAX: 000000000000000a RBX: 0000000000000000 RCX: 00000000000000fa
RDX: 0000000000000000 RSI: ffff81001cf64480 RDI: ffff810001e0e640
RBP: ffff8100120a7e80 R08: ffff8100373ff410 R09: ffff8100120a7ea0
R10: 0000000000000005 R11: ffff8100120a7bc8 R12: 0000000000604030
R13: 00000000420038e0 R14: ffff810001e0e640 R15: 0000000000021000
FS:  0000000042003940(0063) GS:ffffffff805a2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaab046d978 CR3: 0000000003cc0000 CR4: 00000000000006e0
Process fusexmp_fh (pid: 4936, threadinfo ffff8100120a6000, task ffff81001cf64480)
Stack:  ffffffff80439ab2 0000000000000011 0000000000040004 0000000000001348
 000000000000000a ffff81001cf64480 00000000000007d4 000002b1f9fb8559
 000000000000910f 0000000000000005 ffff8100120a7ea0 0000000000000000
Call Trace:
 [<ffffffff80439ab2>] __sched_text_start+0x14a/0x76b
 [<ffffffff802386ac>] ptrace_stop+0x170/0x172
 [<ffffffff802398cf>] ptrace_notify+0x71/0x96
 [<ffffffff8020c479>] syscall_trace+0x26/0x5d
 [<ffffffff80273550>] sys_read+0x60/0x6e
 [<ffffffff8020974b>] int_very_careful+0x35/0x3e


Code: 7e f9 eb f2 c3 53 48 89 fb e8 4b 84 df ff f0 ff 0b 79 09 f3 
NMI Watchdog detected LOCKUP on CPU 0
NMI Watchdog detected LOCKUP on CPU 1
CPU 0 
Modules linked in: fuse e1000
Pid: 4937, comm: fusexmp_fh Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043ba34>]  [<ffffffff8043ba34>] _spin_lock+0xa/0xf
RSP: 0018:ffffffff805f2bd0  EFLAGS: 00000046
RAX: ffffffff805a2000 RBX: ffffffff805ea640 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff805f2c38 RDI: ffff810001e0e640
RBP: ffffffff805f2bf8 R08: ffff81003fe2ff08 R09: ffffffff805ea640
R10: ffff810081824000 R11: ffffffff8021a502 R12: ffff810001e0e640
R13: ffffffff805f2c38 R14: ffff81003fe00e00 R15: 0000000000000000
FS:  0000000042804940(0063) GS:ffffffff805a2000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ac36f0b2000 CR3: 0000000003cc0000 CR4: 00000000000006e0
Process fusexmp_fh (pid: 4937, threadinfo ffff8100255a8000, task ffff810036b65620)
Stack:  ffffffff80227d0e 0000000000000003 0000000000000001 ffff81003fe00e00
 ffff810001fd8e58 ffffffff805f2c68 ffffffff80229c58 ffffffff805f2c58
 0000000000000001 0000000000000086 0000000042803fc0 0000000000000092
Call Trace:
 <IRQ>  [<ffffffff80227d0e>] task_rq_lock+0x3d/0x6f
 [<ffffffff80229c58>] try_to_wake_up+0x24/0x362
 [<ffffffff803b69e3>] snd_timer_user_tinterrupt+0x13c/0x147
 [<ffffffff8024071d>] autoremove_wake_function+0x9/0x2e
 [<ffffffff802276e3>] __wake_up_common+0x3e/0x68
 [<ffffffff80227c7d>] __wake_up+0x38/0x4f
 [<ffffffff8023d621>] __queue_work+0x23/0x33
 [<ffffffff8023d68d>] queue_work+0x37/0x40
 [<ffffffff8039b980>] input_event+0x422/0x44a
 [<ffffffff803a01a5>] atkbd_interrupt+0x248/0x503
 [<ffffffff8020c162>] do_IRQ+0xb6/0xd4
 [<ffffffff8039882a>] serio_interrupt+0x37/0x6f
 [<ffffffff80399473>] i8042_interrupt+0x1f4/0x20a
 [<ffffffff80252710>] handle_IRQ_event+0x25/0x53
 [<ffffffff80253ce0>] handle_edge_irq+0xe4/0x128
 [<ffffffff8020c118>] do_IRQ+0x6c/0xd4
 [<ffffffff80209a61>] ret_from_intr+0x0/0xa
 [<ffffffff8021a502>] flat_send_IPI_mask+0x0/0x4c
 [<ffffffff80233d73>] __do_softirq+0x4a/0xc3
 [<ffffffff8020a6dc>] call_softirq+0x1c/0x28
 [<ffffffff8020c05b>] do_softirq+0x2c/0x7d
 [<ffffffff8020c162>] do_IRQ+0xb6/0xd4
 [<ffffffff80209a61>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff802385af>] ptrace_stop+0x73/0x172
 [<ffffffff802398cf>] ptrace_notify+0x71/0x96
 [<ffffffff8020c479>] syscall_trace+0x26/0x5d
 [<ffffffff8020974b>] int_very_careful+0x35/0x3e


Code: 7e f9 eb f2 c3 f0 81 2f 00 00 00 01 74 05 e8 19 42 ea ff c3 
CPU 1 
Modules linked in: fuse e1000
Pid: 4929, comm: strace Not tainted 2.6.22-rc4 #10
RIP: 0010:[<ffffffff8043bb1b>]  [<ffffffff8043bb1b>] _spin_unlock_irqrestore+0x8/0x9
RSP: 0018:ffff810008b11f00  EFLAGS: 00000246
RAX: 0000000000000001 RBX: ffff810036b65620 RCX: ffff81003f613ca8
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff810001e0e640
RBP: ffff810008b11f28 R08: 0000000000000003 R09: 0000000000000000
R10: 00007ffff9c16a78 R11: 0000000000000246 R12: ffff810036b65620
R13: 00007ffff9c16a78 R14: 0000000000000078 R15: 00007ffff9c17eac
FS:  00002b87b13f16f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002adb2c3cf000 CR3: 0000000005a4a000 CR4: 00000000000006e0
Process strace (pid: 4929, threadinfo ffff810008b10000, task ffff810037d35860)
Stack:  ffffffff8022c27c ffff810037d35860 0000000000000246 ffff810036b65620
 0000000000000000 0000000000000000 ffffffff802365e3 ffff810036b65620
 ffff810036b65620 0000000000000003 ffffffff8023664c 00002b87b13f1690
Call Trace:
 [<ffffffff8022c27c>] wait_task_inactive+0x44/0x5f
 [<ffffffff802365e3>] ptrace_check_attach+0xaf/0xb6
 [<ffffffff8023664c>] sys_ptrace+0x62/0xa2
 [<ffffffff8020954e>] system_call+0x7e/0x83


Code: c3 c7 07 01 00 00 00 fb c3 f0 83 2f 01 79 05 e8 51 41 ea ff 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-14 16:04                 ` Miklos Szeredi
@ 2007-06-15 21:25                   ` Chuck Ebbert
  2007-06-16 10:37                   ` Ingo Molnar
  1 sibling, 0 replies; 88+ messages in thread
From: Chuck Ebbert @ 2007-06-15 21:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: mingo, chris, linux-kernel, tglx

On 06/14/2007 12:04 PM, Miklos Szeredi wrote:
> I've got some more info about this bug.  It is gathered with
> nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of
> calling die_nmi() just prints a line and calls show_registers().
> 
> This makes the machine actually survive the NMI tracing.  The attached
> traces are gathered over about an hour of stressing.  An mp3 player is
> also going on continually, and I can hear a couple of seconds of
> "looping" quite often, but it gets as far as the NMI trace only
> rarely.  AFAICS only the last pair shows a trace for both CPUs during
> the same "freeze".
> 
> I've put some effort into understanding what's going on, but I'm not
> familiar with how interrupts work and that sort of thing.
> 
> The pattern that emerges is that on CPU0 we have an interrupt, which
> is trying to acquire the rq lock, but can't.
> 
> On CPU1 we have strace which is doing wait_task_inactive(), which sort
> of spins acquiring and releasing the rq lock.  I've checked some of
> the traces and it is just before acquiring the rq lock, or just after
> releasing it, but is not actually holding it.
> 
> So is it possible that wait_task_inactive() could be starving the
> other waiters of the rq spinlock?  Any ideas?

Spinlocks aren't fair, so this kind of problem is always a possibility.
I think maybe we need another kind of unlock that gives another processor
a fair chance at the lock. Some things you could try to see if they help:

- add smp_mb() after the unlock
- replace cpu_relax() with usleep()
- use an xchcg instruction to do the unlock, like i386 does when
  CONFIG_X86_OOSTORE is set


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-14 16:04                 ` Miklos Szeredi
  2007-06-15 21:25                   ` Chuck Ebbert
@ 2007-06-16 10:37                   ` Ingo Molnar
  2007-06-17 21:46                     ` Miklos Szeredi
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-16 10:37 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: chris, linux-kernel, tglx, Linus Torvalds, Andrew Morton


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> I've got some more info about this bug.  It is gathered with 
> nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of 
> calling die_nmi() just prints a line and calls show_registers().

great!

> The pattern that emerges is that on CPU0 we have an interrupt, which 
> is trying to acquire the rq lock, but can't.
> 
> On CPU1 we have strace which is doing wait_task_inactive(), which sort 
> of spins acquiring and releasing the rq lock.  I've checked some of 
> the traces and it is just before acquiring the rq lock, or just after 
> releasing it, but is not actually holding it.
> 
> So is it possible that wait_task_inactive() could be starving the 
> other waiters of the rq spinlock?  Any ideas?

hm, this is really interesting, and indeed a smoking gun. The T60 has a 
Core2Duo and i've _never_ seen MESI starvation happen on dual-core 
single-socket CPUs! (The only known serious MESI starvation i know about 
is on multi-socket Opterons: there the trylock loop of spinlock 
debugging is known to starve some CPUs out of those locks that are being 
polled, so we had to turn off that aspect of spinlock debugging.)

wait_task_inactive(), although it busy-loops, is pretty robust: it does 
a proper spin-lock/spin-unlock sequence and has a cpu_relax() inbetween. 
Furthermore, the rep_nop() that cpu_relax() is based on is 
unconditional, so it's not like we could somehow end up not having the 
REP; NOP sequence there (which should make the lock polling even more 
fair)

could you try the quick hack below, ontop of cfs-v17? It adds two things 
to wait_task_inactive():

- a cond_resched() [in case you are running !PREEMPT]

- use MONITOR+MWAIT to monitor memory transactions to the rq->curr 
  cacheline. This should make the polling loop definitely fair.

If this solves the problem on your box then i'll do a proper fix and 
introduce a cpu_relax_memory_change(*addr) type of API to around 
monitor/mwait. This patch boots fine on my T60 - but i never saw your 
problem.

[ btw., utrace IIRC fixes ptrace to get rid of wait_task_interactive(). ]

	Ingo

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -834,6 +834,16 @@ repeat:
 		cpu_relax();
 		if (preempted)
 			yield();
+		else
+			cond_resched();
+		/*
+		 * Wait for "curr" to change:
+		 */
+		__monitor((void *)&rq->curr, 0, 0);
+		smp_mb();
+		if (rq->curr != p)
+			__mwait(0, 0);
+
 		goto repeat;
 	}
 	task_rq_unlock(rq, &flags);

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-16 10:37                   ` Ingo Molnar
@ 2007-06-17 21:46                     ` Miklos Szeredi
  2007-06-18  6:43                       ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-17 21:46 UTC (permalink / raw)
  To: mingo; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm

Chuck, Ingo, thanks for the responses.

> > The pattern that emerges is that on CPU0 we have an interrupt, which 
> > is trying to acquire the rq lock, but can't.
> > 
> > On CPU1 we have strace which is doing wait_task_inactive(), which sort 
> > of spins acquiring and releasing the rq lock.  I've checked some of 
> > the traces and it is just before acquiring the rq lock, or just after 
> > releasing it, but is not actually holding it.
> > 
> > So is it possible that wait_task_inactive() could be starving the 
> > other waiters of the rq spinlock?  Any ideas?
> 
> hm, this is really interesting, and indeed a smoking gun. The T60 has a 
> Core2Duo and i've _never_ seen MESI starvation happen on dual-core 
> single-socket CPUs! (The only known serious MESI starvation i know about 
> is on multi-socket Opterons: there the trylock loop of spinlock 
> debugging is known to starve some CPUs out of those locks that are being 
> polled, so we had to turn off that aspect of spinlock debugging.)
> 
> wait_task_inactive(), although it busy-loops, is pretty robust: it does 
> a proper spin-lock/spin-unlock sequence and has a cpu_relax() inbetween. 
> Furthermore, the rep_nop() that cpu_relax() is based on is 
> unconditional, so it's not like we could somehow end up not having the 
> REP; NOP sequence there (which should make the lock polling even more 
> fair)
> 
> could you try the quick hack below, ontop of cfs-v17? It adds two things 
> to wait_task_inactive():
> 
> - a cond_resched() [in case you are running !PREEMPT]
> 
> - use MONITOR+MWAIT to monitor memory transactions to the rq->curr 
>   cacheline. This should make the polling loop definitely fair.

Is it not possible for the mwait to get stuck?

> If this solves the problem on your box then i'll do a proper fix and 
> introduce a cpu_relax_memory_change(*addr) type of API to around 
> monitor/mwait. This patch boots fine on my T60 - but i never saw your 
> problem.

Yes, the patch does make the pauses go away.  In fact just the
smp_mb() seems to suffice.

> [ btw., utrace IIRC fixes ptrace to get rid of wait_task_interactive(). ]

I looked at the utrace patch, and it still has wait_task_inactive(),
and I can still reproduce the freezes with the utrace patch applied.

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-17 21:46                     ` Miklos Szeredi
@ 2007-06-18  6:43                       ` Ingo Molnar
  2007-06-18  7:24                         ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18  6:43 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> > could you try the quick hack below, ontop of cfs-v17? It adds two 
> > things to wait_task_inactive():
> > 
> > - a cond_resched() [in case you are running !PREEMPT]
> > 
> > - use MONITOR+MWAIT to monitor memory transactions to the rq->curr
> >   cacheline. This should make the polling loop definitely fair.
> 
> Is it not possible for the mwait to get stuck?

it is - when the other CPU does nothing.

> > If this solves the problem on your box then i'll do a proper fix and 
> > introduce a cpu_relax_memory_change(*addr) type of API to around 
> > monitor/mwait. This patch boots fine on my T60 - but i never saw 
> > your problem.
> 
> Yes, the patch does make the pauses go away.  In fact just the 
> smp_mb() seems to suffice.

cool! Could you send me the smallest patch you tried that still made the 
hangs go away?

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  6:43                       ` Ingo Molnar
@ 2007-06-18  7:24                         ` Miklos Szeredi
  2007-06-18  8:12                           ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-18  7:24 UTC (permalink / raw)
  To: mingo; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm

> > > If this solves the problem on your box then i'll do a proper fix and 
> > > introduce a cpu_relax_memory_change(*addr) type of API to around 
> > > monitor/mwait. This patch boots fine on my T60 - but i never saw 
> > > your problem.
> > 
> > Yes, the patch does make the pauses go away.  In fact just the 
> > smp_mb() seems to suffice.
> 
> cool! Could you send me the smallest patch you tried that still made the 
> hangs go away?

My previous attempt was just commenting out parts of your patch.  But
maybe it's more logical to move the barrier to immediately after the
unlock.

With this patch I can't reproduce the problem, which may not mean very
much, since it was rather a "fragile" bug anyway.  But at least the
fix looks pretty harmless.

Thanks,
Miklos

Index: linux-2.6.22-rc4/kernel/sched.c
===================================================================
--- linux-2.6.22-rc4.orig/kernel/sched.c	2007-06-18 08:59:17.000000000 +0200
+++ linux-2.6.22-rc4/kernel/sched.c	2007-06-18 09:04:13.000000000 +0200
@@ -1168,6 +1168,11 @@ repeat:
 		/* If it's preempted, we yield.  It could be a while. */
 		preempted = !task_running(rq, p);
 		task_rq_unlock(rq, &flags);
+		/*
+		 * Without this barrier, wait_task_inactive() can starve
+		 * waiters of rq->lock (observed on Core2Duo)
+		 */
+		smp_mb();
 		cpu_relax();
 		if (preempted)
 			yield();

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  7:24                         ` Miklos Szeredi
@ 2007-06-18  8:12                           ` Ingo Molnar
  2007-06-18  8:20                             ` Andrew Morton
                                               ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18  8:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> My previous attempt was just commenting out parts of your patch.  But 
> maybe it's more logical to move the barrier to immediately after the 
> unlock.
> 
> With this patch I can't reproduce the problem, which may not mean very 
> much, since it was rather a "fragile" bug anyway.  But at least the 
> fix looks pretty harmless.

>  		task_rq_unlock(rq, &flags);
> +		/*
> +		 * Without this barrier, wait_task_inactive() can starve
> +		 * waiters of rq->lock (observed on Core2Duo)
> +		 */
> +		smp_mb();
>  		cpu_relax();

yeah. The problem is, that the open-coded loop there is totally fine, 
and we have similar loops elsewhere, so this problem could hit us again, 
in an even harder to debug place! Since this affects our basic SMP 
primitives, quite some caution is warranted i think.

So ... to inquire about the scope of the problem, another possibility 
would be for the _spin loop_ being 'too nice', not wait_task_inactive() 
being too agressive!

To test this theory, could you try the patch below, does this fix your 
hangs too? This change causes the memory access of the "easy" spin-loop 
portion to be more agressive: after the REP; NOP we'd not do the 
'easy-loop' with a simple CMPB, but we'd re-attempt the atomic op. (in 
theory the non-LOCK-ed memory accesses should have a similar effect, but 
maybe the Core2Duo has some special optimization for non-LOCK-ed 
cacheline accesses that causes cacheline starvation?)

	Ingo

---------------------------------------------------->
Subject: [patch] x86: fix spin-loop starvation bug
From: Ingo Molnar <mingo@elte.hu>

Miklos Szeredi reported very long pauses (several seconds, sometimes 
more) on his T60 (with a Core2Duo) which he managed to track down to 
wait_task_inactive()'s open-coded busy-loop. He observed that an 
interrupt on one core tries to acquire the runqueue-lock but does not 
succeed in doing so for a very long time - while wait_task_inactive() on 
the other core loops waiting for the first core to deschedule a task 
(which it wont do while spinning in an interrupt handler).

The problem is: both the spin_lock() code and the wait_task_inactive() 
loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
guaranteed MESI-fairness to the two cores - but that didnt happen: one 
of the cores was able to monopolize the cacheline that holds the 
runqueue lock, for extended periods of time.

This patch changes the spin-loop to assert an atomic op after every REP 
NOP instance - this will cause the CPU to express its "MESI interest" in 
that cacheline after every REP NOP.

Reported-and-debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-i386/spinlock.h    |   16 ++++------------
 include/asm-x86_64/processor.h |    8 ++++++--
 include/asm-x86_64/spinlock.h  |   15 +++------------
 3 files changed, 13 insertions(+), 26 deletions(-)

Index: linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-i386/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
@@ -37,10 +37,7 @@ static inline void __raw_spin_lock(raw_s
 	asm volatile("\n1:\t"
 		     LOCK_PREFIX " ; decb %0\n\t"
 		     "jns 3f\n"
-		     "2:\t"
-		     "rep;nop\n\t"
-		     "cmpb $0,%0\n\t"
-		     "jle 2b\n\t"
+		     "rep; nop\n\t"
 		     "jmp 1b\n"
 		     "3:\n\t"
 		     : "+m" (lock->slock) : : "memory");
@@ -65,21 +62,16 @@ static inline void __raw_spin_lock_flags
 		"testl $0x200, %[flags]\n\t"
 		"jz 4f\n\t"
 		STI_STRING "\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		CLI_STRING "\n\t"
 		"jmp 1b\n"
 		"4:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jg 1b\n\t"
+		"rep; nop\n\t"
 		"jmp 4b\n"
 		"5:\n\t"
 		: [slock] "+m" (lock->slock)
 		: [flags] "r" (flags)
-	 	  CLI_STI_INPUT_ARGS
+		  CLI_STI_INPUT_ARGS
 		: "memory" CLI_STI_CLOBBERS);
 }
 #endif
Index: linux-cfs-2.6.22-rc5.q/include/asm-x86_64/processor.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-x86_64/processor.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-x86_64/processor.h
@@ -358,7 +358,7 @@ struct extended_sigtable {
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
 static inline void rep_nop(void)
 {
-	__asm__ __volatile__("rep;nop": : :"memory");
+	__asm__ __volatile__("rep; nop" : : : "memory");
 }
 
 /* Stop speculative execution */
@@ -389,7 +389,11 @@ static inline void prefetchw(void *x) 
 
 #define spin_lock_prefetch(x)  prefetchw(x)
 
-#define cpu_relax()   rep_nop()
+static inline void cpu_relax(void)
+{
+	smp_mb(); /* Core2Duo needs this to not starve other cores */
+	rep_nop();
+}
 
 /*
  *      NSC/Cyrix CPU indexed register access macros
Index: linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-x86_64/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
@@ -28,10 +28,7 @@ static inline void __raw_spin_lock(raw_s
 		"\n1:\t"
 		LOCK_PREFIX " ; decl %0\n\t"
 		"jns 2f\n"
-		"3:\n"
-		"rep;nop\n\t"
-		"cmpl $0,%0\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
 }
@@ -49,16 +46,10 @@ static inline void __raw_spin_lock_flags
 		"testl $0x200, %1\n\t"	/* interrupts were disabled? */
 		"jz 4f\n\t"
 	        "sti\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		"cli\n\t"
 		"jmp 1b\n"
-		"4:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jg 1b\n\t"
+		"rep; nop\n\t"
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:12                           ` Ingo Molnar
@ 2007-06-18  8:20                             ` Andrew Morton
  2007-06-19  4:22                               ` Ravikiran G Thirumalai
  2007-06-18  8:25                             ` Miklos Szeredi
  2007-06-18 16:34                             ` Linus Torvalds
  2 siblings, 1 reply; 88+ messages in thread
From: Andrew Morton @ 2007-06-18  8:20 UTC (permalink / raw)
  To: Ingo Molnar, Ravikiran G Thirumalai
  Cc: Miklos Szeredi, cebbert, chris, linux-kernel, tglx, torvalds

On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> ---------------------------------------------------->
> Subject: [patch] x86: fix spin-loop starvation bug
> From: Ingo Molnar <mingo@elte.hu>
> 
> Miklos Szeredi reported very long pauses (several seconds, sometimes 
> more) on his T60 (with a Core2Duo) which he managed to track down to 
> wait_task_inactive()'s open-coded busy-loop. He observed that an 
> interrupt on one core tries to acquire the runqueue-lock but does not 
> succeed in doing so for a very long time - while wait_task_inactive() on 
> the other core loops waiting for the first core to deschedule a task 
> (which it wont do while spinning in an interrupt handler).
> 
> The problem is: both the spin_lock() code and the wait_task_inactive() 
> loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
> guaranteed MESI-fairness to the two cores - but that didnt happen: one 
> of the cores was able to monopolize the cacheline that holds the 
> runqueue lock, for extended periods of time.
> 
> This patch changes the spin-loop to assert an atomic op after every REP 
> NOP instance - this will cause the CPU to express its "MESI interest" in 
> that cacheline after every REP NOP.

Kiran, if you're still able to reproduce that zone->lru_lock starvation problem,
this would be a good one to try...

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:12                           ` Ingo Molnar
  2007-06-18  8:20                             ` Andrew Morton
@ 2007-06-18  8:25                             ` Miklos Szeredi
  2007-06-18  8:31                               ` Ingo Molnar
  2007-06-18 16:34                             ` Linus Torvalds
  2 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-18  8:25 UTC (permalink / raw)
  To: mingo; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm

> To test this theory, could you try the patch below, does this fix your 
> hangs too?

Not tried yet, but obviously it does, since it's a superset of the
previous fix.  I could try without the smb_mb(), but see below.

> This change causes the memory access of the "easy" spin-loop portion
> to be more agressive: after the REP; NOP we'd not do the 'easy-loop'
> with a simple CMPB, but we'd re-attempt the atomic op.

It looks as if this is going to overflow of the lock counter, no?

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:25                             ` Miklos Szeredi
@ 2007-06-18  8:31                               ` Ingo Molnar
  2007-06-18  8:34                                 ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18  8:31 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm,
	Ravikiran G Thirumalai


(Ravikiran Cc:-ed too)

* Miklos Szeredi <miklos@szeredi.hu> wrote:

> > To test this theory, could you try the patch below, does this fix 
> > your hangs too?
> 
> Not tried yet, but obviously it does, since it's a superset of the 
> previous fix.  I could try without the smb_mb(), but see below.

oops - the 64-bit processor.h bits were included by accident - updated 
patch below.

> > This change causes the memory access of the "easy" spin-loop portion 
> > to be more agressive: after the REP; NOP we'd not do the 'easy-loop' 
> > with a simple CMPB, but we'd re-attempt the atomic op.
> 
> It looks as if this is going to overflow of the lock counter, no?

hm, what do you mean? There's no lock counter.

	Ingo

-------------------------->
Subject: [patch] x86: fix spin-loop starvation bug
From: Ingo Molnar <mingo@elte.hu>

Miklos Szeredi reported very long pauses (several seconds, sometimes 
more) on his T60 (with a Core2Duo) which he managed to track down to 
wait_task_inactive()'s open-coded busy-loop. He observed that an 
interrupt on one core tries to acquire the runqueue-lock but does not 
succeed in doing so for a very long time - while wait_task_inactive() on 
the other core loops waiting for the first core to deschedule a task 
(which it wont do while spinning in an interrupt handler).

The problem is: both the spin_lock() code and the wait_task_inactive() 
loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
guaranteed MESI-fairness to the two cores - but that didnt happen: one 
of the cores was able to monopolize the cacheline that holds the 
runqueue lock, for extended periods of time.

This patch changes the spin-loop to assert an atomic op after every REP 
NOP instance - this will cause the CPU to express its "MESI interest" in 
that cacheline after every REP NOP.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-i386/spinlock.h   |   16 ++++------------
 include/asm-x86_64/spinlock.h |   15 +++------------
 2 files changed, 7 insertions(+), 24 deletions(-)

Index: linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-i386/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
@@ -37,10 +37,7 @@ static inline void __raw_spin_lock(raw_s
 	asm volatile("\n1:\t"
 		     LOCK_PREFIX " ; decb %0\n\t"
 		     "jns 3f\n"
-		     "2:\t"
-		     "rep;nop\n\t"
-		     "cmpb $0,%0\n\t"
-		     "jle 2b\n\t"
+		     "rep; nop\n\t"
 		     "jmp 1b\n"
 		     "3:\n\t"
 		     : "+m" (lock->slock) : : "memory");
@@ -65,21 +62,16 @@ static inline void __raw_spin_lock_flags
 		"testl $0x200, %[flags]\n\t"
 		"jz 4f\n\t"
 		STI_STRING "\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		CLI_STRING "\n\t"
 		"jmp 1b\n"
 		"4:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jg 1b\n\t"
+		"rep; nop\n\t"
 		"jmp 4b\n"
 		"5:\n\t"
 		: [slock] "+m" (lock->slock)
 		: [flags] "r" (flags)
-	 	  CLI_STI_INPUT_ARGS
+		  CLI_STI_INPUT_ARGS
 		: "memory" CLI_STI_CLOBBERS);
 }
 #endif
Index: linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-x86_64/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
@@ -28,10 +28,7 @@ static inline void __raw_spin_lock(raw_s
 		"\n1:\t"
 		LOCK_PREFIX " ; decl %0\n\t"
 		"jns 2f\n"
-		"3:\n"
-		"rep;nop\n\t"
-		"cmpl $0,%0\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
 }
@@ -49,16 +46,10 @@ static inline void __raw_spin_lock_flags
 		"testl $0x200, %1\n\t"	/* interrupts were disabled? */
 		"jz 4f\n\t"
 	        "sti\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		"cli\n\t"
 		"jmp 1b\n"
-		"4:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jg 1b\n\t"
+		"rep; nop\n\t"
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:31                               ` Ingo Molnar
@ 2007-06-18  8:34                                 ` Miklos Szeredi
  2007-06-18  9:18                                   ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-18  8:34 UTC (permalink / raw)
  To: mingo; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran

> > > This change causes the memory access of the "easy" spin-loop portion 
> > > to be more agressive: after the REP; NOP we'd not do the 'easy-loop' 
> > > with a simple CMPB, but we'd re-attempt the atomic op.
> > 
> > It looks as if this is going to overflow of the lock counter, no?
> 
> hm, what do you mean? There's no lock counter.

I mean, the repeated calls to decb will pretty soon make lock->slock
wrap around.

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:34                                 ` Miklos Szeredi
@ 2007-06-18  9:18                                   ` Ingo Molnar
  2007-06-18  9:38                                     ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18  9:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > > This change causes the memory access of the "easy" spin-loop portion 
> > > > to be more agressive: after the REP; NOP we'd not do the 'easy-loop' 
> > > > with a simple CMPB, but we'd re-attempt the atomic op.
> > > 
> > > It looks as if this is going to overflow of the lock counter, no?
> > 
> > hm, what do you mean? There's no lock counter.
> 
> I mean, the repeated calls to decb will pretty soon make lock->slock 
> wrap around.

ugh, indeed, bad thinko on my part. I'll rework this.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  9:18                                   ` Ingo Molnar
@ 2007-06-18  9:38                                     ` Ingo Molnar
  2007-06-18  9:44                                       ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18  9:38 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran


* Ingo Molnar <mingo@elte.hu> wrote:

> > > > > This change causes the memory access of the "easy" spin-loop 
> > > > > portion to be more agressive: after the REP; NOP we'd not do 
> > > > > the 'easy-loop' with a simple CMPB, but we'd re-attempt the 
> > > > > atomic op.
> > > > 
> > > > It looks as if this is going to overflow of the lock counter, 
> > > > no?
> > > 
> > > hm, what do you mean? There's no lock counter.
> > 
> > I mean, the repeated calls to decb will pretty soon make lock->slock 
> > wrap around.
> 
> ugh, indeed, bad thinko on my part. I'll rework this.

how about the patch below? Boot-tested on 32-bit. As a side-effect this 
change also removes the 255 CPUs limit from the 32-bit kernel.

	Ingo

------------------------->
Subject: [patch] x86: fix spin-loop starvation bug
From: Ingo Molnar <mingo@elte.hu>

Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop. He observed that an
interrupt on one core tries to acquire the runqueue-lock but does not
succeed in doing so for a very long time - while wait_task_inactive() on
the other core loops waiting for the first core to deschedule a task
(which it wont do while spinning in an interrupt handler).

The problem is: both the spin_lock() code and the wait_task_inactive()
loop uses cpu_relax()/rep_nop(), so in theory the CPU should have
guaranteed MESI-fairness to the two cores - but that didnt happen: one
of the cores was able to monopolize the cacheline that holds the
runqueue lock, for extended periods of time.

This patch changes the spin-loop to assert an atomic op after every REP
NOP instance - this will cause the CPU to express its "MESI interest" in
that cacheline after every REP NOP.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-i386/spinlock.h   |   27 ++++++++++-----------------
 include/asm-x86_64/spinlock.h |   33 ++++++++++++++++-----------------
 2 files changed, 26 insertions(+), 34 deletions(-)

Index: linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-i386/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
@@ -35,15 +35,12 @@ static inline int __raw_spin_is_locked(r
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
 	asm volatile("\n1:\t"
-		     LOCK_PREFIX " ; decb %0\n\t"
-		     "jns 3f\n"
-		     "2:\t"
-		     "rep;nop\n\t"
-		     "cmpb $0,%0\n\t"
-		     "jle 2b\n\t"
+		     LOCK_PREFIX " ; btrl %[zero], %[slock]\n\t"
+		     "jc 3f\n"
+		     "rep; nop\n\t"
 		     "jmp 1b\n"
 		     "3:\n\t"
-		     : "+m" (lock->slock) : : "memory");
+		     : [slock] "+m" (lock->slock) : [zero] "Ir" (0) : "memory");
 }
 
 /*
@@ -59,27 +56,23 @@ static inline void __raw_spin_lock_flags
 {
 	asm volatile(
 		"\n1:\t"
-		LOCK_PREFIX " ; decb %[slock]\n\t"
+		LOCK_PREFIX " ; btrl %[zero], %[slock]\n\t"
 		"jns 5f\n"
 		"2:\t"
 		"testl $0x200, %[flags]\n\t"
 		"jz 4f\n\t"
 		STI_STRING "\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		CLI_STRING "\n\t"
 		"jmp 1b\n"
 		"4:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jg 1b\n\t"
+		"rep; nop\n\t"
 		"jmp 4b\n"
 		"5:\n\t"
 		: [slock] "+m" (lock->slock)
-		: [flags] "r" (flags)
-	 	  CLI_STI_INPUT_ARGS
+		: [zero] "Ir" (0),
+		  [flags] "r" (flags)
+		  CLI_STI_INPUT_ARGS
 		: "memory" CLI_STI_CLOBBERS);
 }
 #endif
Index: linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-x86_64/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
@@ -26,14 +26,15 @@ static inline void __raw_spin_lock(raw_s
 {
 	asm volatile(
 		"\n1:\t"
-		LOCK_PREFIX " ; decl %0\n\t"
+		LOCK_PREFIX " ; btrl %[zero], %[slock]\n\t"
 		"jns 2f\n"
-		"3:\n"
-		"rep;nop\n\t"
-		"cmpl $0,%0\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		"jmp 1b\n"
-		"2:\t" : "=m" (lock->slock) : : "memory");
+		"2:\t"
+		: [slock] "+m" (lock->slock)
+		: [zero] "Ir" (0)
+		: "memory"
+	);
 }
 
 /*
@@ -44,24 +45,22 @@ static inline void __raw_spin_lock_flags
 {
 	asm volatile(
 		"\n1:\t"
-		LOCK_PREFIX " ; decl %0\n\t"
+		LOCK_PREFIX " ; btrl %[zero], %[slock]\n\t"
 		"jns 5f\n"
-		"testl $0x200, %1\n\t"	/* interrupts were disabled? */
+		"testl $0x200, %[flags]\n\t"	/* were interrupts disabled? */
 		"jz 4f\n\t"
 	        "sti\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jle 3b\n\t"
+		"rep; nop\n\t"
 		"cli\n\t"
 		"jmp 1b\n"
-		"4:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jg 1b\n\t"
+		"rep; nop\n\t"
 		"jmp 4b\n"
 		"5:\n\t"
-		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+		: [slock] "+m" (lock->slock)
+		: [zero] "Ir" (0),
+		  [flags] "r" ((unsigned)flags)
+		: "memory"
+	);
 }
 #endif
 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  9:38                                     ` Ingo Molnar
@ 2007-06-18  9:44                                       ` Ingo Molnar
  2007-06-18 10:18                                         ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18  9:44 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran


* Ingo Molnar <mingo@elte.hu> wrote:

> how about the patch below? Boot-tested on 32-bit. As a side-effect 
> this change also removes the 255 CPUs limit from the 32-bit kernel.

boot-tested on 64-bit too now.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  9:44                                       ` Ingo Molnar
@ 2007-06-18 10:18                                         ` Miklos Szeredi
  2007-06-18 12:36                                           ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-18 10:18 UTC (permalink / raw)
  To: mingo; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > how about the patch below? Boot-tested on 32-bit. As a side-effect 
> > this change also removes the 255 CPUs limit from the 32-bit kernel.
> 
> boot-tested on 64-bit too now.

Strange, I can't even get past the compile stage ;)

  CC      kernel/spinlock.o
{standard input}: Assembler messages:
{standard input}:207: Error: backward ref to unknown label "4:"

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 10:18                                         ` Miklos Szeredi
@ 2007-06-18 12:36                                           ` Ingo Molnar
  2007-06-18 13:10                                             ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18 12:36 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran


* Miklos Szeredi <miklos@szeredi.hu> wrote:

> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > how about the patch below? Boot-tested on 32-bit. As a side-effect 
> > > this change also removes the 255 CPUs limit from the 32-bit kernel.
> > 
> > boot-tested on 64-bit too now.
> 
> Strange, I can't even get past the compile stage ;)
> 
>   CC      kernel/spinlock.o
> {standard input}: Assembler messages:
> {standard input}:207: Error: backward ref to unknown label "4:"

oh, sorry - i built it with !PREEMPT, which doesnt make use of the flags 
thing. I fixed the build and have cleaned up and simplified that code 
some more - does the patch below work for you? (it does for me on both 
32-bit and 64-bit)

	Ingo

----------------------------->
Subject: [patch] x86: fix spin-loop starvation bug
From: Ingo Molnar <mingo@elte.hu>

Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop. He observed that an
interrupt on one core tries to acquire the runqueue-lock but does not
succeed in doing so for a very long time - while wait_task_inactive() on
the other core loops waiting for the first core to deschedule a task
(which it wont do while spinning in an interrupt handler).

The problem is: both the spin_lock() code and the wait_task_inactive()
loop uses cpu_relax()/rep_nop(), so in theory the CPU should have
guaranteed MESI-fairness to the two cores - but that didnt happen: one
of the cores was able to monopolize the cacheline that holds the
runqueue lock, for extended periods of time.

This patch changes the spin-loop to assert an atomic op after every REP
NOP instance - this will cause the CPU to express its "MESI interest" in
that cacheline after every REP NOP.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-i386/spinlock.h   |   94 ++++++++++++++++--------------------------
 include/asm-x86_64/spinlock.h |   69 +++++++++++++++---------------
 2 files changed, 72 insertions(+), 91 deletions(-)

Index: linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-i386/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-i386/spinlock.h
@@ -29,21 +29,24 @@
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *x)
 {
-	return *(volatile signed char *)(&(x)->slock) <= 0;
+	return *(volatile unsigned int *)(&(x)->slock) == 0;
 }
 
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-	asm volatile("\n1:\t"
-		     LOCK_PREFIX " ; decb %0\n\t"
-		     "jns 3f\n"
-		     "2:\t"
-		     "rep;nop\n\t"
-		     "cmpb $0,%0\n\t"
-		     "jle 2b\n\t"
-		     "jmp 1b\n"
-		     "3:\n\t"
-		     : "+m" (lock->slock) : : "memory");
+	asm volatile(
+	"1:							\n"
+		/* copy bit 0 to carry-flag and clear bit 0 */
+		LOCK_PREFIX " ; btrl $0, %[slock]		\n"
+	"	jc 2f						\n"
+		/* PAUSE */
+	"	rep; nop					\n"
+	"	jmp 1b						\n"
+	"2:							\n"
+		: [slock] "+m" (lock->slock)
+		:
+		: "memory", "cc"
+	);
 }
 
 /*
@@ -58,69 +61,46 @@ static inline void __raw_spin_lock(raw_s
 static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
 {
 	asm volatile(
-		"\n1:\t"
-		LOCK_PREFIX " ; decb %[slock]\n\t"
-		"jns 5f\n"
-		"2:\t"
-		"testl $0x200, %[flags]\n\t"
-		"jz 4f\n\t"
-		STI_STRING "\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jle 3b\n\t"
-		CLI_STRING "\n\t"
-		"jmp 1b\n"
-		"4:\t"
-		"rep;nop\n\t"
-		"cmpb $0, %[slock]\n\t"
-		"jg 1b\n\t"
-		"jmp 4b\n"
-		"5:\n\t"
+	"1:							 \n"
+		/* copy bit 0 to carry-flag and clear bit 0 */
+		LOCK_PREFIX " ; btrl $0, %[slock]		 \n"
+	"	jc 3f						 \n"
+		/* were interrupts disabled? */
+	"	testl $0x200, %[flags]				 \n"
+	"	jz 2f						 \n"
+		STI_STRING					"\n"
+		/* PAUSE */
+	"2:	rep; nop					 \n"
+		CLI_STRING					"\n"
+	"	jmp 1b						 \n"
+	"3:							 \n"
 		: [slock] "+m" (lock->slock)
-		: [flags] "r" (flags)
-	 	  CLI_STI_INPUT_ARGS
-		: "memory" CLI_STI_CLOBBERS);
+		: [flags] "r"  (flags)
+		  CLI_STI_INPUT_ARGS
+		: "memory", "cc" CLI_STI_CLOBBERS);
 }
 #endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 {
-	char oldval;
+	unsigned int oldval;
+
 	asm volatile(
-		"xchgb %b0,%1"
+		"xchgl %0, %1"
 		:"=q" (oldval), "+m" (lock->slock)
 		:"0" (0) : "memory");
-	return oldval > 0;
+
+	return oldval != 0;
 }
 
 /*
- * __raw_spin_unlock based on writing $1 to the low byte.
- * This method works. Despite all the confusion.
- * (except on PPro SMP or if we are using OOSTORE, so we use xchgb there)
- * (PPro errata 66, 92)
+ * __raw_spin_unlock based on writing $1 to the lock.
  */
-
-#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
-
-static inline void __raw_spin_unlock(raw_spinlock_t *lock)
-{
-	asm volatile("movb $1,%0" : "+m" (lock->slock) :: "memory");
-}
-
-#else
-
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
-	char oldval = 1;
-
-	asm volatile("xchgb %b0, %1"
-		     : "=q" (oldval), "+m" (lock->slock)
-		     : "0" (oldval) : "memory");
+	asm volatile("movl $1, %0" : "+m" (lock->slock) : : "memory");
 }
 
-#endif
-
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
 {
 	while (__raw_spin_is_locked(lock))
Index: linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
===================================================================
--- linux-cfs-2.6.22-rc5.q.orig/include/asm-x86_64/spinlock.h
+++ linux-cfs-2.6.22-rc5.q/include/asm-x86_64/spinlock.h
@@ -19,21 +19,24 @@
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
-	return *(volatile signed int *)(&(lock)->slock) <= 0;
+	return *(volatile unsigned int *)(&(lock)->slock) == 0;
 }
 
 static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
 	asm volatile(
-		"\n1:\t"
-		LOCK_PREFIX " ; decl %0\n\t"
-		"jns 2f\n"
-		"3:\n"
-		"rep;nop\n\t"
-		"cmpl $0,%0\n\t"
-		"jle 3b\n\t"
-		"jmp 1b\n"
-		"2:\t" : "=m" (lock->slock) : : "memory");
+	"1:							\n"
+		/* copy bit 0 to carry-flag and clear bit 0 */
+		LOCK_PREFIX " ; btrl $0, %[slock]		\n"
+	"	jc 2f						\n"
+		/* PAUSE */
+	"	rep; nop					\n"
+	"	jmp 1b						\n"
+	"2:							\n"
+		: [slock] "+m" (lock->slock)
+		:
+		: "memory", "cc"
+	);
 }
 
 /*
@@ -43,43 +46,41 @@ static inline void __raw_spin_lock(raw_s
 static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags)
 {
 	asm volatile(
-		"\n1:\t"
-		LOCK_PREFIX " ; decl %0\n\t"
-		"jns 5f\n"
-		"testl $0x200, %1\n\t"	/* interrupts were disabled? */
-		"jz 4f\n\t"
-	        "sti\n"
-		"3:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jle 3b\n\t"
-		"cli\n\t"
-		"jmp 1b\n"
-		"4:\t"
-		"rep;nop\n\t"
-		"cmpl $0, %0\n\t"
-		"jg 1b\n\t"
-		"jmp 4b\n"
-		"5:\n\t"
-		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+	"1:							\n"
+		/* copy bit 0 to carry-flag and clear bit 0 */
+		LOCK_PREFIX " ; btrl $0, %[slock]		\n"
+	"	jc 3f						\n"
+		/* were interrupts disabled? */
+	"	testl $0x200, %[flags]				\n"
+	"	jz 2f						\n"
+	"	sti						\n"
+		/* PAUSE */
+	"2:	rep; nop					\n"
+	"	cli						\n"
+	"	jmp 1b						\n"
+	"3:							\n"
+		: [slock] "+m" (lock->slock)
+		: [flags] "r"  ((unsigned int)flags)
+		: "memory", "cc"
+	);
 }
 #endif
 
 static inline int __raw_spin_trylock(raw_spinlock_t *lock)
 {
-	int oldval;
+	unsigned int oldval;
 
 	asm volatile(
-		"xchgl %0,%1"
-		:"=q" (oldval), "=m" (lock->slock)
+		"xchgl %0, %1"
+		:"=q" (oldval), "+m" (lock->slock)
 		:"0" (0) : "memory");
 
-	return oldval > 0;
+	return oldval != 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
-	asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
+	asm volatile("movl $1, %0" : "+m" (lock->slock) : : "memory");
 }
 
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 12:36                                           ` Ingo Molnar
@ 2007-06-18 13:10                                             ` Miklos Szeredi
  0 siblings, 0 replies; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-18 13:10 UTC (permalink / raw)
  To: mingo; +Cc: cebbert, chris, linux-kernel, tglx, torvalds, akpm, kiran

> > > * Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > > > how about the patch below? Boot-tested on 32-bit. As a side-effect 
> > > > this change also removes the 255 CPUs limit from the 32-bit kernel.
> > > 
> > > boot-tested on 64-bit too now.
> > 
> > Strange, I can't even get past the compile stage ;)
> > 
> >   CC      kernel/spinlock.o
> > {standard input}: Assembler messages:
> > {standard input}:207: Error: backward ref to unknown label "4:"
> 
> oh, sorry - i built it with !PREEMPT, which doesnt make use of the flags 
> thing. I fixed the build and have cleaned up and simplified that code 
> some more - does the patch below work for you? (it does for me on both 
> 32-bit and 64-bit)

Thanks.  The patch boots, and...  doesn't solve the bug.  Weird.

CPU bug?  I've upgraded the BIOS not such a long time ago.

I guess now we know what the problem is, it would be pretty easy to
create some test code, that uses two threads, one of which loops in
lock/unlock/rep_nop and the other that tries to acquire the lock and
measures latency.

Should I try to do that?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:12                           ` Ingo Molnar
  2007-06-18  8:20                             ` Andrew Morton
  2007-06-18  8:25                             ` Miklos Szeredi
@ 2007-06-18 16:34                             ` Linus Torvalds
  2007-06-18 17:41                               ` Miklos Szeredi
                                                 ` (2 more replies)
  2 siblings, 3 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-18 16:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm



On Mon, 18 Jun 2007, Ingo Molnar wrote:
> 
> To test this theory, could you try the patch below, does this fix your 
> hangs too?

I really think this the the wrong approach, although *testing* it makes 
sense.

I think we need to handle loops that take, release, and then immediately 
re-take differently.

Such loops are _usually_ of the form where they really just release the 
lock for some latency reason, but in this case I think it's actually just 
a bug.

That code does:

	if (unlikely(p->array || task_running(rq, p))) {

to decide if it needs to just unlock and repeat, but then to decide if it 
need to *yield* it only uses *one* of those tests (namely 

	preempted = !task_running(rq, p);
	..
	if (preempted)
		yield();

and I think that's just broken. It basically says:

 - if the task is running, I will busy-loop on getting/releasing the 
   task_rq_lock

and that is the _real_ bug here.

Trying to make the spinlocks do somethign else than what they do is just 
papering over the real bug. The real bug is that anybody who just 
busy-loops getting a lock is wasting resources so much that we should not 
be at all surprised that some multi-core or NUMA situations will get 
starvation.

Blaming some random Core 2 hardware implementation issue that just makes 
it show up is wrong. It's a software bug, plain and simple.

So how about this diff? The diff looks big, but the *code* is actually 
simpler and shorter, I just added tons of comments, which is what blows it 
up.

The new *code* looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
	        rq = task_rq(p);
	        while (task_running(rq, p))
	                cpu_relax();

		/* Get the *real* values */
	        rq = task_rq_lock(p, &flags);
	        running = task_running(rq, p);
	        array = p->array;
	        task_rq_unlock(rq, &flags);

		/* Check them.. */
	        if (unlikely(running)) {
	                cpu_relax();
	                goto repeat;
	        }

	        if (unlikely(array)) {
	                yield();
	                goto repeat;
	        }

and while I haven't tested it, it looks fairly obviously correct, even if 
it's a bit subtle.

Basically, that first "while()" loop is done entirely without any locking 
at all, and so it's possibly "incorrect", but we don't care. Both the 
runqueue used, and the "task_running()" check might be the wrong tests, 
but they won't oops - they just mean that we might get the wrong results 
due to lack of locking.

So then we get the proper (and careful) rq lock, and check the 
running/runnable state _safely_. And if it turns out that our 
quick-and-dirty and unsafe loop was wrong after all, we just go back and 
try it all again.

Safe, simple, efficient. And we don't ever hold the lock for long times, 
and we will never *loop* by taking, releasing and re-taking the lock. 

Hmm? Untested, I know. Maybe I overlooked something. But even the 
generated assembly code looks fine (much better than it looked before!)

		Linus

----
 kernel/sched.c |   69 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 13cdab3..66445e1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1159,21 +1159,72 @@ void wait_task_inactive(struct task_struct *p)
 {
 	unsigned long flags;
 	struct rq *rq;
-	int preempted;
+	struct prio_array *array;
+	int running;
 
 repeat:
+	/*
+	 * We do the initial early heuristics without holding
+	 * any task-queue locks at all. We'll only try to get
+	 * the runqueue lock when things look like they will
+	 * work out!
+	 */
+	rq = task_rq(p);
+
+	/*
+	 * If the task is actively running on another CPU
+	 * still, just relax and busy-wait without holding
+	 * any locks.
+	 *
+	 * NOTE! Since we don't hold any locks, it's not
+	 * even sure that "rq" stays as the right runqueue!
+	 * But we don't care, since "task_running()" will
+	 * return false if the runqueue has changed and p
+	 * is actually now running somewhere else!
+	 */
+	while (task_running(rq, p))
+		cpu_relax();
+
+	/*
+	 * Ok, time to look more closely! We need the rq
+	 * lock now, to be *sure*. If we're wrong, we'll
+	 * just go back and repeat.
+	 */
 	rq = task_rq_lock(p, &flags);
-	/* Must be off runqueue entirely, not preempted. */
-	if (unlikely(p->array || task_running(rq, p))) {
-		/* If it's preempted, we yield.  It could be a while. */
-		preempted = !task_running(rq, p);
-		task_rq_unlock(rq, &flags);
+	running = task_running(rq, p);
+	array = p->array;
+	task_rq_unlock(rq, &flags);
+
+	/*
+	 * Was it really running after all now that we
+	 * checked with the proper locks actually held?
+	 *
+	 * Oops. Go back and try again..
+	 */
+	if (unlikely(running)) {
 		cpu_relax();
-		if (preempted)
-			yield();
 		goto repeat;
 	}
-	task_rq_unlock(rq, &flags);
+
+	/*
+	 * It's not enough that it's not actively running,
+	 * it must be off the runqueue _entirely_, and not
+	 * preempted!
+	 *
+	 * So if it wa still runnable (but just not actively
+	 * running right now), it's preempted, and we should
+	 * yield - it could be a while.
+	 */
+	if (unlikely(array)) {
+		yield();
+		goto repeat;
+	}
+
+	/*
+	 * Ahh, all good. It wasn't running, and it wasn't
+	 * runnable, which means that it will never become
+	 * running in the future either. We're all done!
+	 */
 }
 
 /***

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 16:34                             ` Linus Torvalds
@ 2007-06-18 17:41                               ` Miklos Szeredi
  2007-06-18 17:48                                 ` Linus Torvalds
  2007-06-18 18:00                               ` Ingo Molnar
  2007-06-20  9:36                               ` Jarek Poplawski
  2 siblings, 1 reply; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-18 17:41 UTC (permalink / raw)
  To: torvalds; +Cc: mingo, cebbert, chris, linux-kernel, tglx, akpm

> Hmm? Untested, I know. Maybe I overlooked something. But even the 
> generated assembly code looks fine (much better than it looked before!)

Boots and runs fine.  Fixes the freezes as well, which is not such a
big surprise, since basically any change in that function seems to do
that ;)

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 17:41                               ` Miklos Szeredi
@ 2007-06-18 17:48                                 ` Linus Torvalds
  2007-06-18 18:02                                   ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-18 17:48 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: mingo, cebbert, chris, linux-kernel, tglx, akpm



On Mon, 18 Jun 2007, Miklos Szeredi wrote:
>
> > Hmm? Untested, I know. Maybe I overlooked something. But even the 
> > generated assembly code looks fine (much better than it looked before!)
> 
> Boots and runs fine.  Fixes the freezes as well, which is not such a
> big surprise, since basically any change in that function seems to do
> that ;)

Yeah, and that code really was *designed* to make all "locking in a loop" 
go away. So unlike the patches adding random barriers and cpu_relax() 
calls (which might fix it for some *particular* hw setup), I pretty much 
guarantee that you can never get that code into a situation where there is 
some lock being starved on *any* hw setup.

Ingo, an ack for the patch, and I'll just apply it?

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 16:34                             ` Linus Torvalds
  2007-06-18 17:41                               ` Miklos Szeredi
@ 2007-06-18 18:00                               ` Ingo Molnar
  2007-06-18 18:25                                 ` Linus Torvalds
  2007-06-20  9:36                               ` Jarek Poplawski
  2 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18 18:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> That code does:
> 
> 	if (unlikely(p->array || task_running(rq, p))) {
> 
> to decide if it needs to just unlock and repeat, but then to decide if 
> it need to *yield* it only uses *one* of those tests (namely
> 
> 	preempted = !task_running(rq, p);
> 	..
> 	if (preempted)
> 		yield();
> 
> and I think that's just broken. It basically says:
> 
>  - if the task is running, I will busy-loop on getting/releasing the 
>    task_rq_lock
> 
> and that is the _real_ bug here.
> 
> Trying to make the spinlocks do somethign else than what they do is 
> just papering over the real bug. The real bug is that anybody who just 
> busy-loops getting a lock is wasting resources so much that we should 
> not be at all surprised that some multi-core or NUMA situations will 
> get starvation.
> 
> Blaming some random Core 2 hardware implementation issue that just 
> makes it show up is wrong. It's a software bug, plain and simple.

yeah, agreed. wait_task_inactive() is butt-ugly, and Roland i think 
found a way to get rid of it in utrace (but it's not implemented yet, 
boggle) - but nevertheless this needs fixing for .22.

> So how about this diff? The diff looks big, but the *code* is actually 
> simpler and shorter, I just added tons of comments, which is what 
> blows it up.

> 
> The new *code* looks like this:
> 
> 	repeat:
> 		/* Unlocked, optimistic looping! */
> 	        rq = task_rq(p);
> 	        while (task_running(rq, p))
> 	                cpu_relax();

ok. Do we have an guarantee that cpu_relax() is also an smp_rmb()?

> 
> 		/* Get the *real* values */
> 	        rq = task_rq_lock(p, &flags);
> 	        running = task_running(rq, p);
> 	        array = p->array;
> 	        task_rq_unlock(rq, &flags);
> 
> 		/* Check them.. */
> 	        if (unlikely(running)) {
> 	                cpu_relax();
> 	                goto repeat;
> 	        }
> 
> 	        if (unlikely(array)) {
> 	                yield();
> 	                goto repeat;
> 	        }

hm, this might still go into a non-nice busy loop on SMP: one cpu runs 
the strace, another one runs two tasks, one of which is runnable but not 
on the runqueue (the one we are waiting for). In that case we'd call 
yield() on this CPU in a loop (and likely wont pull that task over from 
that CPU). And yield() itself is a high-frequency rq-lock touching thing 
too, just a bit heavier than the other path in the wait function.

> Hmm? Untested, I know. Maybe I overlooked something. But even the 
> generated assembly code looks fine (much better than it looked 
> before!)

it looks certainly better and cleaner than what we had before!

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 17:48                                 ` Linus Torvalds
@ 2007-06-18 18:02                                   ` Ingo Molnar
  0 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-18 18:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Boots and runs fine.  Fixes the freezes as well, which is not such a 
> > big surprise, since basically any change in that function seems to 
> > do that ;)
> 
> Yeah, and that code really was *designed* to make all "locking in a 
> loop" go away. So unlike the patches adding random barriers and 
> cpu_relax() calls (which might fix it for some *particular* hw setup), 
> I pretty much guarantee that you can never get that code into a 
> situation where there is some lock being starved on *any* hw setup.
> 
> Ingo, an ack for the patch, and I'll just apply it?

yeah. It's definitely better than what we had before. Maybe one day 
someone frees us from this function =B-) We already had like 2-3 nasty 
SMP bugs related to it in the last 10 years.

Acked-by: Ingo Molnar <mingo@elte.hu>

and kudos to Miklos for the patience to debug this bug!

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 18:00                               ` Ingo Molnar
@ 2007-06-18 18:25                                 ` Linus Torvalds
  0 siblings, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-18 18:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm



On Mon, 18 Jun 2007, Ingo Molnar wrote:
> 
> ok. Do we have an guarantee that cpu_relax() is also an smp_rmb()?

The common use for cpu_relax() is basically for code that does

	while (*ptr != val)
		cpu_relax();

so yes, an architecture that doesn't notice writes by other CPU's on its 
own had *better* have an implied read barrier in its "cpu_relax()" 
implementation. For example, the irq handling code does

	while (desc->status & IRQ_INPROGRESS)
		cpu_relax();

which is explicitly about waiting for another CPU to get out of their 
interrupt handler. And one classic use for it in drivers is obviously the

	while (time_before (jiffies, next))
		cpu_relax();

kind of setup (and "jiffies" may well be updated on another CPU: the fact 
that it is "volatile" is just a *compiler* barrier just like cpu_relax() 
itself will also be, not a "smp_rmb()" kind of hardware barrier).

So we could certainly add the smp_rmb() to make it more explicit, and it 
wouldn't be *wrong*.

But quite frankly, I'd personally rather not - if it were to make a 
difference in some situation, it would just be papering over a bug in 
cpu_relax() itself.

The whole point of cpu_relax() is about busy-looping, after all. And the 
only thing you really *can* busy-loop on in a CPU is basically a memory 
value.

So the smp_rmb() would I think distract from the issue, and at best paper 
over some totally separate bug.

> hm, this might still go into a non-nice busy loop on SMP: one cpu runs 
> the strace, another one runs two tasks, one of which is runnable but not 
> on the runqueue (the one we are waiting for). In that case we'd call 
> yield() on this CPU in a loop

Sure. I agree - we can get into a loop that calls yield(). But I think a 
loop that calls yield() had better be ok - we're explicitly giving the 
scheduler the ability to to schedule anything else that is relevant. 

So I think yield()'ing is fundamentally different from busy-looping any 
other way. 

Would it be better to be able to have a wait-queue, and actually *sleep* 
on it, and not even busy-loop using yield? Yeah, possibly. I cannot 
personally bring myself to care about that kind of corner-case situation, 
though.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18  8:20                             ` Andrew Morton
@ 2007-06-19  4:22                               ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 88+ messages in thread
From: Ravikiran G Thirumalai @ 2007-06-19  4:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Miklos Szeredi, cebbert, chris, linux-kernel, tglx,
	torvalds, shai

On Mon, Jun 18, 2007 at 01:20:55AM -0700, Andrew Morton wrote:
> On Mon, 18 Jun 2007 10:12:04 +0200 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > ---------------------------------------------------->
> > Subject: [patch] x86: fix spin-loop starvation bug
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > Miklos Szeredi reported very long pauses (several seconds, sometimes 
> > more) on his T60 (with a Core2Duo) which he managed to track down to 
> > wait_task_inactive()'s open-coded busy-loop. He observed that an 
> > interrupt on one core tries to acquire the runqueue-lock but does not 
> > succeed in doing so for a very long time - while wait_task_inactive() on 
> > the other core loops waiting for the first core to deschedule a task 
> > (which it wont do while spinning in an interrupt handler).
> > 
> > The problem is: both the spin_lock() code and the wait_task_inactive() 
> > loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
> > guaranteed MESI-fairness to the two cores - but that didnt happen: one 
> > of the cores was able to monopolize the cacheline that holds the 
> > runqueue lock, for extended periods of time.
> > 
> > This patch changes the spin-loop to assert an atomic op after every REP 
> > NOP instance - this will cause the CPU to express its "MESI interest" in 
> > that cacheline after every REP NOP.
> 
> Kiran, if you're still able to reproduce that zone->lru_lock starvation problem,
> this would be a good one to try...

We tried this approach a week back (speak of co-incidences), and it did not
help the problem.  I'd changed calls to the zone->lru_lock spin_lock
to do spin_trylock in a while loop with cpu_relax instead.  It did not help,
This was on top of 2.6.17 kernels.  But the good news is 2.6.21, as
is does not have the starvation issue -- that is, zone->lru_lock does not
seem to get contended that much under the same workload.

However, this was not on the same hardware I reported zone->lru_lock
contention on (8 socket dual core opteron).  I don't have access to it 
anymore :(

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-18 16:34                             ` Linus Torvalds
  2007-06-18 17:41                               ` Miklos Szeredi
  2007-06-18 18:00                               ` Ingo Molnar
@ 2007-06-20  9:36                               ` Jarek Poplawski
  2007-06-20 17:34                                 ` Linus Torvalds
  2 siblings, 1 reply; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-20  9:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm

On 18-06-2007 18:34, Linus Torvalds wrote:
> 
> On Mon, 18 Jun 2007, Ingo Molnar wrote:
>> To test this theory, could you try the patch below, does this fix your 
>> hangs too?
> 
> I really think this the the wrong approach, although *testing* it makes 
> sense.
> 
> I think we need to handle loops that take, release, and then immediately 
> re-take differently.
> 
> Such loops are _usually_ of the form where they really just release the 
> lock for some latency reason, but in this case I think it's actually just 
> a bug.
> 
> That code does:
> 
> 	if (unlikely(p->array || task_running(rq, p))) {
> 
> to decide if it needs to just unlock and repeat, but then to decide if it 
> need to *yield* it only uses *one* of those tests (namely 
> 
> 	preempted = !task_running(rq, p);
> 	..
> 	if (preempted)
> 		yield();
> 
> and I think that's just broken. It basically says:
> 
>  - if the task is running, I will busy-loop on getting/releasing the 
>    task_rq_lock
> 
> and that is the _real_ bug here.

I don't agree with this (+ I know it doesn't matter).

The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
And here they are simply lawlessly not fair.

I cannot see any reason why any of tasks doing simultaneously
"busy-loop on getting/releasing" a spinlock should starve almost
to death another one doing the same (or simply waiting to get this
lock) even without cpu_relax. Of course, lawfulness of such
behavior is questionable, and should be fixed like here.
 
> Trying to make the spinlocks do somethign else than what they do is just 
> papering over the real bug. The real bug is that anybody who just 
> busy-loops getting a lock is wasting resources so much that we should not 
> be at all surprised that some multi-core or NUMA situations will get 
> starvation.

On the other hand it seems spinlocks should be at least a little
more immune to such bugs: slowdown is OK but not freezing. Current
behavior could suggest this unfairness could harm some tasks even
without any loops present - but it's not visible enough.

So, I'm surprised this thread seems to stop after this patch, and
there is no try to make the most of this ideal testing case to
improve spinlock design btw.

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-20  9:36                               ` Jarek Poplawski
@ 2007-06-20 17:34                                 ` Linus Torvalds
  2007-06-21  7:30                                   ` Ingo Molnar
  2007-06-21  7:38                                   ` Jarek Poplawski
  0 siblings, 2 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-20 17:34 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm



On Wed, 20 Jun 2007, Jarek Poplawski wrote:
> 
> I don't agree with this (+ I know it doesn't matter).
> 
> The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
> And here they are simply lawlessly not fair.

Well, that's certainly a valid standpoint. I wouldn't claim you're 
_wrong_.

At least in theory.

In *practice*, fair spinlocks are simply not possible. Not as in "it's 
hard to do", but as in "you simply cannot do it".

The thing is, most spinlocks get held for a rather short time (if that 
wasn't the case, you'd use something like a mutex), and it's acually a 
short time not on a "human scale", but on a "CPU core scale".

IOW, the cost of doing the cacheline bouncing is often equal to, or bigger 
than, the actual cost of the operation that happens inside the spinlock!

What does this result in? It automatically means that software simply 
*cannot* do fair spinlocks, because all the timing costs are in parts that 
software doesn't even have any visibility into, or control over!

Yeah, you could add artificial delays, by doing things like cycle counting 
around the spinlock operation to see if you got the lock when it was 
_contended_, or whether you got a lock that wasn't, and then adding some 
statistics etc, but at that point, you don't have a spinlock any more, you 
have something else. I don't know what to call it.

And you could add flags like "this spinlock is getting a lot of 
contention", try some other algorithm etc. But it's all complicated and 
against the whole *point* of a spinlock, which is to get in and out as 
fast as humanly possible.

So in practice, spinlock fairness is inherently tied to the hardware 
behaviour of the cache coherency algorithm.

Which gets us to the next level: we can consider hardware that isn't 
"fair" in its cache coherency to be buggy hardware.

That's also a perfectly fine standpoint, and it's actually one that I have 
a lot of sympathy for. I think it's much easier to some degree to do 
fairness in the cache coherency than at any other level, because it's 
something where the hardware really *could* do things like counting 
bouncing etc.

However, while it's a perfectly fine standpoint, it's also totally 
unrealistic. First off, the hardware doesn't even know whether the 
spinlock "failed" or not on any architecture platform I'm aware of. On 
x86, the spinlock operation under Linux is actually just an atomic 
decrement, and it so happens that the rules for failure was that it didn't 
go negative. But that's just a Linux internal rule - the hardware doesn't 
know.

So you'd have to actualyl have some specific sequence with specific 
fairness rules (maybe you could make the rule be that a failed atomic 
"cmpxchg" counts as a real failure, and if you give higher priority to 
cores with lots of failures etc).

IOW, it's certainly possible *in*theory* to try to make hardware that has 
fairness guarantees. However, any hardware designer will tell you that 
 (a) they have enough problems as it is
 (b) it's simply not worth their time, since there are other (simpler) 
     things that are much much more important.

So the end result:
 - hardware won't do it, and they'd be crazy to really even try (apart 
   from maybe some really simplistic stuff that doesn't _guarantee_ 
   anything, but maybe helps fairness a bit)
 - software cannot do it, without turning spinlocks into something really 
   slow and complicated, at which point they've lost all meaning, and you 
   should just use a higher-level construct like a mutex instead.

In other words, spinlocks are optimized for *lack* of contention. If a 
spinlock has contention, you don't try to make the spinlock "fair". No, 
you try to fix the contention instead!

That way, you fix many things. You probably speed things up, _and_ you 
make it fairer. It might sometimes take some brains and effort, but it's 
worth it.

The patch I sent out was an example of that. You *can* fix contention 
problems. Does it take clever approaches? Yes. It's why we have hashed 
spinlocks, RCU, and code sequences that are entirely lockless and use 
optimistic approaches. And suddenly you get fairness *and* performance!

It's a win-win situation. It does require a bit of effort, but hey, we're 
good at effort.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-20 17:34                                 ` Linus Torvalds
@ 2007-06-21  7:30                                   ` Ingo Molnar
  2007-06-21 15:50                                     ` Linus Torvalds
  2007-06-21  7:38                                   ` Jarek Poplawski
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21  7:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jarek Poplawski, Miklos Szeredi, cebbert, chris, linux-kernel,
	tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> In other words, spinlocks are optimized for *lack* of contention. If a 
> spinlock has contention, you don't try to make the spinlock "fair". 
> No, you try to fix the contention instead!

yeah, and if there's no easy solution, change it to a mutex. Fastpath 
performance of spinlocks and mutexes is essentially the same, and if 
there's any measurable contention then the scheduler is pretty good at 
sorting things out. Say if the average contention is longer than 10-20 
microseconds then likely we could already win by scheduling away to some 
other task. (the best is of course to have no contention at all - but 
there are causes where it is real hard, and there are cases where it's 
outright unmaintainable.)

Hw makers are currently producing transistors disproportionatly faster 
than humans are producing parallel code, as a result of which we've got 
more CPU cache than ever, even taking natural application bloat into 
account. (it just makes no sense to spend those transistors on 
parallelism when applications are just not making use of it yet. Plus 
caches are a lot less power intense than functional units of the CPU, 
and the limit these days is power input.)

So scheduling more frequently and more agressively makes more sense than 
ever before and that trend will likely not stop for some time to come. 

> The patch I sent out was an example of that. You *can* fix contention 
> problems. Does it take clever approaches? Yes. It's why we have hashed 
> spinlocks, RCU, and code sequences that are entirely lockless and use 
> optimistic approaches. And suddenly you get fairness *and* 
> performance!

what worries me a bit though is that my patch that made spinlocks 
equally agressive to that loop didnt solve the hangs! So there is some 
issue we dont understand yet - why was the wait_inactive_task() 
open-coded spin-trylock loop starving the other core which had ... an 
open-coded spin-trylock loop coded up in assembly? And we've got a 
handful of other open-coded loops in the kernel (networking for example) 
so this issue could come back and haunt us in a situation where we dont 
have a gifted hacker like Miklos being able to spend _weeks_ to track 
down the problem...

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-20 17:34                                 ` Linus Torvalds
  2007-06-21  7:30                                   ` Ingo Molnar
@ 2007-06-21  7:38                                   ` Jarek Poplawski
  2007-06-21  8:39                                     ` Ingo Molnar
  2007-06-21 16:01                                     ` Linus Torvalds
  1 sibling, 2 replies; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-21  7:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm

On Wed, Jun 20, 2007 at 10:34:15AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 20 Jun 2007, Jarek Poplawski wrote:
> > 
> > I don't agree with this (+ I know it doesn't matter).
> > 
> > The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
> > And here they are simply lawlessly not fair.
> 
> Well, that's certainly a valid standpoint. I wouldn't claim you're 
> _wrong_.
> 
> At least in theory.
> 
> In *practice*, fair spinlocks are simply not possible. Not as in "it's 
> hard to do", but as in "you simply cannot do it".

I think we can agree it was more about some minimal fairness.

...
> Which gets us to the next level: we can consider hardware that isn't 
> "fair" in its cache coherency to be buggy hardware.

IMHO, we shouldn't try to blame the hardware until we know exactly
the source of this bug.

...
> In other words, spinlocks are optimized for *lack* of contention. If a 
> spinlock has contention, you don't try to make the spinlock "fair". No, 
> you try to fix the contention instead!
> 
> That way, you fix many things. You probably speed things up, _and_ you 
> make it fairer. It might sometimes take some brains and effort, but it's 
> worth it.

I could agree with this, but only when we know exactly what place
should be fixed if we don't care about speed. Then, of course, the
cost could be estimated. But after last Ingo's patch I'm not sure
he, or anybody else here, has this knowledge. I'd also remind that
adding one smp_mb() also did the work, and it doesn't look like a
big performance hit. We should only better know why this works.

> The patch I sent out was an example of that. You *can* fix contention 
> problems. Does it take clever approaches? Yes. It's why we have hashed 
> spinlocks, RCU, and code sequences that are entirely lockless and use 
> optimistic approaches. And suddenly you get fairness *and* performance!
> 
> It's a win-win situation. It does require a bit of effort, but hey, we're 
> good at effort.

Not necessarily so. Until the exact reason isn't known "for sure",
this one place could be fixed, but the same problem could appear
somewhere else in more masked form or is far less repeatable.

BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
even sure it's necessarily the spinlock's problem (but I don't exclude
this possibility yet). It seems both processors use task_rq_lock(), so
there could be also a problem with that loop. The way the correctness
of the taken lock is verified is racy: there is a small probability
that if we have taken the wrong lock the check inside the loop is done
just before the value is beeing changed elsewhere under the right lock.
Another possible problem could be a result of some wrong optimization
or wrong propagation of change of this task_rq(p) value.

Thanks for response & best regards,
Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21  7:38                                   ` Jarek Poplawski
@ 2007-06-21  8:39                                     ` Ingo Molnar
  2007-06-21 11:09                                       ` Jarek Poplawski
  2007-06-21 16:01                                     ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21  8:39 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Linus Torvalds, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm


* Jarek Poplawski <jarkao2@o2.pl> wrote:

> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not 
> even sure it's necessarily the spinlock's problem (but I don't exclude 
> this possibility yet). It seems both processors use task_rq_lock(), so 
> there could be also a problem with that loop. The way the correctness 
> of the taken lock is verified is racy: there is a small probability 
> that if we have taken the wrong lock the check inside the loop is done 
> just before the value is beeing changed elsewhere under the right 
> lock. Another possible problem could be a result of some wrong 
> optimization or wrong propagation of change of this task_rq(p) value.

ok, could you elaborate this in a bit more detail? You say it's racy - 
any correctness bug in task_rq_lock() will cause the kernel to blow up 
in spectacular ways. It's a fairly straightforward loop:

 static inline struct rq *__task_rq_lock(struct task_struct *p)
         __acquires(rq->lock)
 {
         struct rq *rq;

 repeat_lock_task:
         rq = task_rq(p);
         spin_lock(&rq->lock);
         if (unlikely(rq != task_rq(p))) {
                 spin_unlock(&rq->lock);
                 goto repeat_lock_task;
         }
         return rq;
 }

the result of task_rq() depends on p->thread_info->cpu wich will only 
change if a task has migrated over to another CPU. That is a 
fundamentally 'slow' operation, but even if a task does it intentionally 
in a high frequency way (for example via repeated calls to 
sched_setaffinity) there's no way it could be faster than the spinlock 
code here. So ... what problems can you see with it?

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21  8:39                                     ` Ingo Molnar
@ 2007-06-21 11:09                                       ` Jarek Poplawski
  0 siblings, 0 replies; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-21 11:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm

On Thu, Jun 21, 2007 at 10:39:31AM +0200, Ingo Molnar wrote:
> 
> * Jarek Poplawski <jarkao2@o2.pl> wrote:
> 
> > BTW, I've looked a bit at these NMI watchdog traces, and now I'm not 
> > even sure it's necessarily the spinlock's problem (but I don't exclude 
> > this possibility yet). It seems both processors use task_rq_lock(), so 
> > there could be also a problem with that loop. The way the correctness 
> > of the taken lock is verified is racy: there is a small probability 
> > that if we have taken the wrong lock the check inside the loop is done 
> > just before the value is beeing changed elsewhere under the right 
> > lock. Another possible problem could be a result of some wrong 
> > optimization or wrong propagation of change of this task_rq(p) value.
> 
> ok, could you elaborate this in a bit more detail? You say it's racy - 
> any correctness bug in task_rq_lock() will cause the kernel to blow up 
> in spectacular ways. It's a fairly straightforward loop:
> 
>  static inline struct rq *__task_rq_lock(struct task_struct *p)
>          __acquires(rq->lock)
>  {
>          struct rq *rq;
> 
>  repeat_lock_task:
>          rq = task_rq(p);
>          spin_lock(&rq->lock);
>          if (unlikely(rq != task_rq(p))) {
>                  spin_unlock(&rq->lock);
>                  goto repeat_lock_task;
>          }
>          return rq;
>  }
> 
> the result of task_rq() depends on p->thread_info->cpu wich will only 
> change if a task has migrated over to another CPU. That is a 
> fundamentally 'slow' operation, but even if a task does it intentionally 
> in a high frequency way (for example via repeated calls to 
> sched_setaffinity) there's no way it could be faster than the spinlock 
> code here. So ... what problems can you see with it?

OK, you are right - I withdraw this "idea". Sorry!

Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21  7:30                                   ` Ingo Molnar
@ 2007-06-21 15:50                                     ` Linus Torvalds
  2007-06-21 16:08                                       ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 15:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Miklos Szeredi, cebbert, chris, linux-kernel,
	tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> what worries me a bit though is that my patch that made spinlocks 
> equally agressive to that loop didnt solve the hangs!

Your parch kept doing "spin_trylock()", didn't it?

That's a read-modify-write thing, and keeps bouncing the cacheline back 
and forth, and together with the fact that even *after* you get the 
spinlock the "wait_for_inactive()" would actually end up looping back, 
releasing it, and re-getting it.

So the problem was that "wait_for_inactive()" kept the lock (because it 
actually *got* it), and looped over getting it, and because it was an 
exclusive cacheline ownership, that implies that somebody else is not 
getting it, and is kept from ever getting it.

So trying to use "trylock" doesn't help. It still has all the same bad 
sides - it still gets the lock (getting the lock wasn't the problem: 
_holding_ the lock was the problem), and it still kept the cache line for 
the lock on one core.

The only way to avoid lock contention is to avoid any exclusive use at 
all.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21  7:38                                   ` Jarek Poplawski
  2007-06-21  8:39                                     ` Ingo Molnar
@ 2007-06-21 16:01                                     ` Linus Torvalds
  2007-06-22 10:38                                       ` Jarek Poplawski
  1 sibling, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 16:01 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Jarek Poplawski wrote:
> 
> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
> even sure it's necessarily the spinlock's problem (but I don't exclude
> this possibility yet). It seems both processors use task_rq_lock(), so
> there could be also a problem with that loop.

I agree that that is also this kind of "loop getting a lock" loop, but if 
you can see it actually happening there, we have some other (and major) 
bug.

That loop is indeed a loop, but realistically speaking it should never be 
run through more than once. The only reason it's a loop is that there's a 
small small race where we get the information at the wrong time, get the 
wrong lock, and try again.

So the loop certainly *can* trigger (or it would be pointless), but I'd 
normally expect it not to, and even if it does end up looping around it 
should happen maybe *one* more time, absolutely not "get stuck" in the 
loop for any appreciable number of iterations.

So I don't see how you could possibly having two different CPU's getting 
into some lock-step in that loop: changing "task_rq()" is a really quite 
heavy operation (it's about migrating between CPU's), and generally 
happens at a fairly low frequency (ie "a couple of times a second" kind of 
thing, not "tight CPU loop").

But bugs happen..

> Another possible problem could be a result of some wrong optimization
> or wrong propagation of change of this task_rq(p) value.

I agree, but that kind of bug would likely not cause temporary hangs, but 
actual "the machine is dead" operations. If you get the totally *wrong* 
value due to some systematic bug, you'd be waiting forever for it to 
match, not get into a loop for a half second and then it clearing up..

But I don't think we can throw the "it's another bug" theory _entirely_ 
out the window.

That said, both Ingo and me have done "fairness" testing of hardware, and 
yes, if you have a reasonably tight loop with spinlocks, existing hardware 
really *is* unfair. Not by a "small amount" either - I had this program 
that did statistics (and Ingo improved on it and had some other tests 
too), and basically if you have two CPU's that try to get the same 
spinlock, they really *can* get into situations where one of them gets it 
millions of times in a row, and the other _never_ gets it.

But it only happens with badly coded software: the rule simply is that you 
MUST NOT release and immediately re-acquire the same spinlock on the same 
core, because as far as other cores are concerned, that's basically the 
same as never releasing it in the first place.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 15:50                                     ` Linus Torvalds
@ 2007-06-21 16:08                                       ` Ingo Molnar
  2007-06-21 16:32                                         ` Linus Torvalds
  2007-06-21 16:44                                         ` Chuck Ebbert
  0 siblings, 2 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jarek Poplawski, Miklos Szeredi, cebbert, chris, linux-kernel,
	tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > what worries me a bit though is that my patch that made spinlocks 
> > equally agressive to that loop didnt solve the hangs!
> 
> Your parch kept doing "spin_trylock()", didn't it?

yeah - it changed spin_lock()'s assembly to do a "LOCK BTRL", which is a 
trylock which tries to dirty the cacheline. There was a "REP NOP" after 
it and a loop back to the "LOCK BTRL".

> That's a read-modify-write thing, and keeps bouncing the cacheline 
> back and forth, and together with the fact that even *after* you get 
> the spinlock the "wait_for_inactive()" would actually end up looping 
> back, releasing it, and re-getting it.
> 
> So the problem was that "wait_for_inactive()" kept the lock (because 
> it actually *got* it), and looped over getting it, and because it was 
> an exclusive cacheline ownership, that implies that somebody else is 
> not getting it, and is kept from ever getting it.

ok, it's not completely clear where exactly the other core was spinning, 
but i took it from Miklos' observations that the other core was hanging 
in the _very same_ task_rq_lock() - which is a true spinlock as well 
that acquires it. So on one core the spin_lock() was starving, on 
another one it was always succeeding.

> So trying to use "trylock" doesn't help. It still has all the same bad 
> sides - it still gets the lock (getting the lock wasn't the problem: 
> _holding_ the lock was the problem), and it still kept the cache line 
> for the lock on one core.

so the problem was not the trylock based spin_lock() itself (no matter 
how it's structured in the assembly), the problem was actually modifying 
the lock and re-modifying it again and again in a very tight 
high-frequency loop, and hence not giving it to the other core?

> The only way to avoid lock contention is to avoid any exclusive use at 
> all.

yeah - i'm not at all arguing in favor of the BTRL patch i did: i always 
liked the 'nicer' inner loop of spinlocks, which could btw also easily 
use MONITOR/MWAIT. (my patch is also quite close to what we did in 
spinlocks many years ago, so it's more of a step backwards than real 
progress.)

So it seems the problem was that if a core kept _truly_ modifying a 
cacheline via atomics in a high enough frequency, it could artificially 
starve the other core. (which would keep waiting for the cacheline to be 
released one day, and which kept the first core from ever making any 
progress) To me that looks like a real problem on the hardware side - 
shouldnt cacheline ownership be arbitrated a bit better than that?

Up to the point where some external event (perhaps a periodic SMM 
related to thermal management) broke the deadlock/livelock scenario?

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 16:08                                       ` Ingo Molnar
@ 2007-06-21 16:32                                         ` Linus Torvalds
  2007-06-21 16:44                                         ` Chuck Ebbert
  1 sibling, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 16:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jarek Poplawski, Miklos Szeredi, cebbert, chris, linux-kernel,
	tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> so the problem was not the trylock based spin_lock() itself (no matter 
> how it's structured in the assembly), the problem was actually modifying 
> the lock and re-modifying it again and again in a very tight 
> high-frequency loop, and hence not giving it to the other core?

That's what I think, yes.

And it matches the behaviour we've seen before (remember the whole thread 
about whether Opteron or Core 2 hardware is "more fair" between cores?)

It all simply boils down to the fact that releasing and almost immediately 
re-taking a lock is "invisible" to outside cores, because it will happen 
entirely within the L1 cache of one core if the cacheline is already in 
"owned" state.

Another core that spins wildly trying to get it ends up using a much 
slower "beat" (the cache coherency clock), and with just a bit of bad luck 
the L1 cache would simply never get probed, and the line never stolen, at 
the point where the lock happens to be released.

The fact that writes (as in the store that releases the lock) 
automatically get delayed by any x86 core by the store buffer, and the 
fact that atomic read-modify-write cycles do *not* get delayed, just means 
that if the "spinlock release" was "fairly close" to the "reacquire" in a 
software sense, the hardware will actually make them *even*closer*.

So you could make the spinlock release be a read-modify-write cycle, and 
it would make the spinlock much slower, but it would also make it much 
more likely that the other core will *see* the release.

For the same reasons, if you add a "smp_mb()" in between the release and 
the re-acquire, the other core is much more likely to see it: it means 
that the release won't be delayed, and thus it just won't be as "close" to 
the re-acquire.

So you can *hide* this problem in many ways, but it's still just hiding 
it.

The proper fix is to not do that kind of "release and re-acquire" 
behaviour in a loop.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 16:08                                       ` Ingo Molnar
  2007-06-21 16:32                                         ` Linus Torvalds
@ 2007-06-21 16:44                                         ` Chuck Ebbert
  2007-06-21 17:31                                           ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Chuck Ebbert @ 2007-06-21 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Jarek Poplawski, Miklos Szeredi, chris,
	linux-kernel, tglx, akpm

On 06/21/2007 12:08 PM, Ingo Molnar wrote:
> yeah - i'm not at all arguing in favor of the BTRL patch i did: i always 
> liked the 'nicer' inner loop of spinlocks, which could btw also easily 
> use MONITOR/MWAIT.

The "nice" inner loop is necessary or else it would generate huge amounts
of bus traffic while spinning.

> So it seems the problem was that if a core kept _truly_ modifying a 
> cacheline via atomics in a high enough frequency, it could artificially 
> starve the other core. (which would keep waiting for the cacheline to be 
> released one day, and which kept the first core from ever making any 
> progress) To me that looks like a real problem on the hardware side - 
> shouldnt cacheline ownership be arbitrated a bit better than that?
> 

A while ago I showed that spinlocks were a lot more fair when doing
unlock with the xchg instruction on x86. Probably the arbitration is all
screwed up because we use a mov instruction, which while atomic is not
locked.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 16:44                                         ` Chuck Ebbert
@ 2007-06-21 17:31                                           ` Linus Torvalds
  2007-06-21 18:29                                             ` Eric Dumazet
                                                               ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 17:31 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Ingo Molnar, Jarek Poplawski, Miklos Szeredi, chris,
	linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Chuck Ebbert wrote:
> 
> A while ago I showed that spinlocks were a lot more fair when doing
> unlock with the xchg instruction on x86. Probably the arbitration is all
> screwed up because we use a mov instruction, which while atomic is not
> locked.

No, the cache line arbitration doesn't know anything about "locked" vs 
"unlocked" instructions (it could, but there really is no point).

The real issue is that locked instructions on x86 are serializing, which 
makes them extremely slow (compared to just a write), and then by being 
slow they effectively just make the window for another core bigger.

IOW, it doesn't "fix" anything, it just hides the bug with timing.

You can hide the problem other ways by just increasing the delay between 
the unlock and the lock (and adding one or more serializing instruction in 
between is generally the best way of doing that, simply because otherwise 
micro- architecture may just be re-ordering things on you, so that your 
"delay" isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes things 
"fairer" by making *both* sides suck more, but especially if both sides 
are actually the same exact thing, I could well imagine that they'd both 
just suck equally, and get into some pattern where they are now both 
slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc, 
it's really *really* hard to be totally unlucky, and after a while you're 
likely to break out of the steplock on your own, just because the CPU's 
get interrupted by something else.

It's in fact entirely possible that the long freezes have always been 
there, but the NOHZ option meant that we had much longer stretches of time 
without things like timer interrupts to jumble up the timing! So maybe the 
freezes existed before, but with timer interrupts happening hundreds of 
times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it 
could be one of the reasons why this showed up as a "new" problem, even 
though I don't think the "wait_for_inactive()" thing has changed lately.)

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 17:31                                           ` Linus Torvalds
@ 2007-06-21 18:29                                             ` Eric Dumazet
  2007-06-21 18:44                                               ` Linus Torvalds
  2007-06-21 20:16                                             ` Ingo Molnar
  2007-06-21 20:18                                             ` Ingo Molnar
  2 siblings, 1 reply; 88+ messages in thread
From: Eric Dumazet @ 2007-06-21 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Ingo Molnar, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm

On Thu, 21 Jun 2007 10:31:53 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 21 Jun 2007, Chuck Ebbert wrote:
> > 
> > A while ago I showed that spinlocks were a lot more fair when doing
> > unlock with the xchg instruction on x86. Probably the arbitration is all
> > screwed up because we use a mov instruction, which while atomic is not
> > locked.
> 
> No, the cache line arbitration doesn't know anything about "locked" vs 
> "unlocked" instructions (it could, but there really is no point).
> 
> The real issue is that locked instructions on x86 are serializing, which 
> makes them extremely slow (compared to just a write), and then by being 
> slow they effectively just make the window for another core bigger.
> 

This reminds me Nick's proposal of 'queued spinlocks' 3 months ago

Maybe this should be re-considered ? (unlock is still a non atomic op, 
so we dont pay the serializing cost twice)

http://lwn.net/Articles/227506/

extract : 

Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still able to handle 2^16 CPUs.

The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
"atomic_inc_return" on the head index, and keeps the returned value as
a ticket. The CPU then spins until the tail index is equal to that
ticket.

To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).

Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a
large cost for highly contended locks [O(N) cacheline transfers vs
O(1) per lock aquisition, where N is the number of CPUs contending].
The benefit is is that contended locks will not cause any starvation.

Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for -rt kernel or systems with starvation issues. 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 18:29                                             ` Eric Dumazet
@ 2007-06-21 18:44                                               ` Linus Torvalds
  2007-06-21 19:35                                                 ` Linus Torvalds
                                                                   ` (3 more replies)
  0 siblings, 4 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 18:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Chuck Ebbert, Ingo Molnar, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Eric Dumazet wrote:
> 
> This reminds me Nick's proposal of 'queued spinlocks' 3 months ago
> 
> Maybe this should be re-considered ? (unlock is still a non atomic op, 
> so we dont pay the serializing cost twice)

No. The point is simple:

	IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we 
have a patch for the problem that proves my point: the _proper_ way to fix 
things is to just not do the bad thing, instead of trying to allow the bad 
behaviour and try to handle it.

Things like queued spinlocks just make excuses for bad code. 

We don't do nesting locking either, for exactly the same reason. Are 
nesting locks "easier"? Absolutely. They are also almost always a sign of 
a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
to encourage bad programming!

> extract : 
> 
> Implement queued spinlocks for i386. This shouldn't increase the size of
> the spinlock structure, while still able to handle 2^16 CPUs.

Umm. i386 spinlocks could and should be *one*byte*.

In fact, I don't even know why they are wasting four bytes right now: the 
fact that somebody made them an "int" just wastes memory. All the actual 
code uses "decb", so it's not even a question of safety. I wonder why we 
have that 32-bit thing and the ugly casts.

Ingo, any memory of that?

(And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think 
such an insane machine has ever existed).

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 18:44                                               ` Linus Torvalds
@ 2007-06-21 19:35                                                 ` Linus Torvalds
  2007-06-21 20:09                                                   ` Ingo Molnar
  2007-06-21 20:36                                                   ` [BUG] long freezes on thinkpad t60 Eric Dumazet
  2007-06-21 19:56                                                 ` Ingo Molnar
                                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 19:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Chuck Ebbert, Ingo Molnar, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Linus Torvalds wrote:
> 
> We don't do nesting locking either, for exactly the same reason. Are 
> nesting locks "easier"? Absolutely. They are also almost always a sign of 
> a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
> to encourage bad programming!

Side note, and as a "truth in advertising" section: I'll have to admit 
that I argued against fair semaphores on the same grounds. I was wrong 
then (and eventually admitted it, and we obviously try to make our mutexes 
and semaphores fair these days!), and maybe I'm wrong now.

If somebody can actually come up with a sequence where we have spinlock 
starvation, and it's not about an example of bad locking, and nobody 
really can come up with any other way to fix it, we may eventually have to 
add the notion of "fair spinlocks".

So my arguments are purely pragmatic. It's not that I hate fairness per 
se. I dislike it only when it's used to "solve" (aka hide) other problems.

In the end, some situations do need fairness, and the fact that aiming for 
fairness is often harder, slower, and more complicated than not doing so 
at that point turns into a non-argument. If you need it, you need it.

I just don't think we need it, and we're better off solving problems other 
ways.

(For example, we might also solve such problems by creating a separate
"fair_spin_lock" abstraction, and only making the particular users that 
need it actually use it. It would depend a bit on whether the cost of 
implementing the fairness is noticeable enough for it to be worth having 
a separate construct for it).

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 18:44                                               ` Linus Torvalds
  2007-06-21 19:35                                                 ` Linus Torvalds
@ 2007-06-21 19:56                                                 ` Ingo Molnar
  2007-06-21 20:10                                                   ` Linus Torvalds
  2007-06-21 20:12                                                 ` Ingo Molnar
  2007-06-26  8:42                                                 ` Nick Piggin
  3 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Umm. i386 spinlocks could and should be *one*byte*.
> 
> In fact, I don't even know why they are wasting four bytes right now: 
> the fact that somebody made them an "int" just wastes memory. All the 
> actual code uses "decb", so it's not even a question of safety. I 
> wonder why we have that 32-bit thing and the ugly casts.
> 
> Ingo, any memory of that?

no real reason that i can recall - i guess nobody dared to touch it 
because it used to have that 'volatile', indicating black voodoo ;-) Now 
that the bad stigma has been removed, we could try the patch below. It 
boots fine here, and we save 1K of kernel text size:

     text    data     bss     dec     hex filename
  6236003  611992  401408 7249403  6e9dfb vmlinux.before
  6235075  611992  401408 7248475  6e9a5b vmlinux.after

I can understand why no data is saved by this change: gcc is aligning 
the next field to a natural boundary anyway and we dont really have 
arrays of spinlocks (fortunately). [and we save no data even if using 
the ((packed)) attribute.] Perhaps some data structure that is never in 
the kernel image itself still got smaller? Any good way to determine 
that?

But why is the text size different? Ah: i think it's spin_lock_init() 
getting shorter :-)

but this is certainly not something for 2.6.22, it's an early 2.6.23 
matter i suspect.

	Ingo

------------------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to 
be unsigned char. This saves a bit of kernel text size:

   text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-i386/spinlock_types.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif
 
 typedef struct {
-	unsigned int slock;
+	unsigned char slock;
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 19:35                                                 ` Linus Torvalds
@ 2007-06-21 20:09                                                   ` Ingo Molnar
  2007-06-21 20:14                                                     ` Linus Torvalds
  2007-06-21 20:36                                                   ` [BUG] long freezes on thinkpad t60 Eric Dumazet
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> If somebody can actually come up with a sequence where we have 
> spinlock starvation, and it's not about an example of bad locking, and 
> nobody really can come up with any other way to fix it, we may 
> eventually have to add the notion of "fair spinlocks".

there was one bad case i can remember: the spinlock debugging code had a 
trylock open-coded loop and on certain Opterons CPUs were starving each 
other. This used to trigger with the ->tree_lock rwlock i think, on 
heavy MM loads. The starvation got so bad that the NMI watchdog started 
triggering ...

interestingly, this only triggered for certain rwlocks. Thus we, after a 
few failed attempts to pacify this open-coded loop, currently have that 
code disabled in lib/spinlock_debug.c:

 #if 0           /* This can cause lockups */
 static void __write_lock_debug(rwlock_t *lock)
 {
         u64 i;
         u64 loops = loops_per_jiffy * HZ;
         int print_once = 1;

         for (;;) {
                 for (i = 0; i < loops; i++) {
                         if (__raw_write_trylock(&lock->raw_lock))
                                 return;
                         __delay(1);
                 }

the weird thing is that we still have the _very same_ construct in 
__spin_lock_debug():

                 for (i = 0; i < loops; i++) {
                         if (__raw_spin_trylock(&lock->raw_lock))
                                 return;
                         __delay(1);
                 }

if there are any problems with this then people are not complaining loud 
enough :-)

note that because this is a trylock based loop, the acquire+release 
sequence problem should not apply to this problem.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 19:56                                                 ` Ingo Molnar
@ 2007-06-21 20:10                                                   ` Linus Torvalds
  2007-06-21 20:23                                                     ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 20:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> I can understand why no data is saved by this change: gcc is aligning 
> the next field to a natural boundary anyway and we dont really have 
> arrays of spinlocks (fortunately).

Actually, some data structures could well shrink.

Look at "struct task_struct", for example. Right now it has two spinlocks 
right next to each other (alloc_lock and pi_lock).

Other data structures may have things like bitfields etc.

But yeah, I'd not expect that to be very common, and in some cases you 
might have to re-order data structures to take advantage of better 
packing, and even then it's probably not all that noticeable.

> but this is certainly not something for 2.6.22, it's an early 2.6.23 
> matter i suspect.

Oh, absolutely.

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 18:44                                               ` Linus Torvalds
  2007-06-21 19:35                                                 ` Linus Torvalds
  2007-06-21 19:56                                                 ` Ingo Molnar
@ 2007-06-21 20:12                                                 ` Ingo Molnar
  2007-06-26  8:42                                                 ` Nick Piggin
  3 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't 
> think such an insane machine has ever existed).

and if people _really_ want to boot a large-smp 32-bit kernel on some 
new, tons-of-cpus box, as a workaround they can enable the spinlock 
debugging code, which has no limitation on the number of CPUs supported.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:09                                                   ` Ingo Molnar
@ 2007-06-21 20:14                                                     ` Linus Torvalds
  2007-06-21 20:30                                                       ` Ingo Molnar
  2007-06-21 20:42                                                       ` [patch] spinlock debug: make looping nicer Ingo Molnar
  0 siblings, 2 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 20:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > If somebody can actually come up with a sequence where we have 
> > spinlock starvation, and it's not about an example of bad locking, and 
> > nobody really can come up with any other way to fix it, we may 
> > eventually have to add the notion of "fair spinlocks".
> 
> there was one bad case i can remember: the spinlock debugging code had a 
> trylock open-coded loop and on certain Opterons CPUs were starving each 
> other.

But this is a perfect example of exactly what I'm talking about:

 THAT CODE IS HORRIBLY BUGGY!

It's not the spinlocks that are broken, it's that damn code.

>          for (;;) {
>                  for (i = 0; i < loops; i++) {
>                          if (__raw_write_trylock(&lock->raw_lock))
>                                  return;
>                          __delay(1);
>                  }

What a piece of crap. 

Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
dammit!

The only correct way to wait for a lock is:

  (a) try it *once* with an atomic r-m-w 
  (b) loop over just _reading_ it (and something that implies a memory 
      barrier, _not_ "__delay()". Use "cpu_relax()" or "smp_rmb()")
  (c) rinse and repeat.

and code like the above should just be shot on sight.

So don't blame the spinlocks or the hardware for crap code.

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 17:31                                           ` Linus Torvalds
  2007-06-21 18:29                                             ` Eric Dumazet
@ 2007-06-21 20:16                                             ` Ingo Molnar
  2007-06-22  8:17                                               ` Ingo Molnar
  2007-06-21 20:18                                             ` Ingo Molnar
  2 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Jarek Poplawski, Miklos Szeredi, chris,
	linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It's in fact entirely possible that the long freezes have always been 
> there, but the NOHZ option meant that we had much longer stretches of 
> time without things like timer interrupts to jumble up the timing! So 
> maybe the freezes existed before, but with timer interrupts happening 
> hundreds of times a second, they weren't noticeable to humans.

the freezes that Miklos was seeing were hardirq contexts blocking in 
task_rq_lock() - that is done with interrupts disabled. (Miklos i think 
also tried !NOHZ kernels and older kernels, with a similar result.)

plus on the ptrace side, the wait_task_inactive() code had most of its 
overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
likely while we were still holding the runqueue lock!

i think the only thing that eventually got Miklos' laptop out of the 
wedge were timer irqs hitting the ptrace CPU in exactly those 
instructions where it was not holding the runqueue lock. (or perhaps an 
asynchronous SMM event delaying it for a long time)

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 17:31                                           ` Linus Torvalds
  2007-06-21 18:29                                             ` Eric Dumazet
  2007-06-21 20:16                                             ` Ingo Molnar
@ 2007-06-21 20:18                                             ` Ingo Molnar
  2007-06-21 20:36                                               ` Linus Torvalds
  2 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Jarek Poplawski, Miklos Szeredi, chris,
	linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> No, the cache line arbitration doesn't know anything about "locked" vs 
> "unlocked" instructions (it could, but there really is no point).
> 
> The real issue is that locked instructions on x86 are serializing, 
> which makes them extremely slow (compared to just a write), and then 
> by being slow they effectively just make the window for another core 
> bigger.
> 
> IOW, it doesn't "fix" anything, it just hides the bug with timing.

yeah. I think Linux is i think the only OS on the planet that is using 
the movb trick for unlock, it even triggered a hardware erratum ;) So it 
might surprise some hw makers who might rely on the heuristics that each 
critical section lock and unlock is a LOCK-ed instruction.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:10                                                   ` Linus Torvalds
@ 2007-06-21 20:23                                                     ` Ingo Molnar
  0 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > I can understand why no data is saved by this change: gcc is 
> > aligning the next field to a natural boundary anyway and we dont 
> > really have arrays of spinlocks (fortunately).
> 
> Actually, some data structures could well shrink.
> 
> Look at "struct task_struct", for example. Right now it has two 
> spinlocks right next to each other (alloc_lock and pi_lock).

yeah. We've got init_task's task-struct embedded in the vmlinux, but 
it's aligned to 32 bytes, which probably hides this effect. We'd only 
see it if the size change just happened to bring a key data structure 
(which is also embedded in the vmlinux) just below a modulo 32 bytes 
boundary. The chance for that is around 6:32 per 'merge' event. That 
means that there cannot be all that many such cases ;-)

anyway, the shorter init sequence is worth it already.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:14                                                     ` Linus Torvalds
@ 2007-06-21 20:30                                                       ` Ingo Molnar
  2007-06-21 20:48                                                         ` Linus Torvalds
  2007-06-21 20:42                                                       ` [patch] spinlock debug: make looping nicer Ingo Molnar
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >          for (;;) {
> >                  for (i = 0; i < loops; i++) {
> >                          if (__raw_write_trylock(&lock->raw_lock))
> >                                  return;
> >                          __delay(1);
> >                  }
> 
> What a piece of crap. 
> 
> Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
> dammit!
> 
> The only correct way to wait for a lock is:
> 
>   (a) try it *once* with an atomic r-m-w 
>   (b) loop over just _reading_ it (and something that implies a memory 
>       barrier, _not_ "__delay()". Use "cpu_relax()" or "smp_rmb()")
>   (c) rinse and repeat.

damn, i first wrote up an explanation about why that ugly __delay(1) is 
there (it almost hurts my eyes when i look at it!) but then deleted it 
as superfluous :-/

really, it's not because i'm stupid (although i might still be stupid 
for other resons ;-), it wasnt there in earlier spin-debug versions. We 
even had an inner spin_is_locked() loop at a stage (and should add it 
again).

the reason for the __delay(1) was really mundane: to be able to figure 
out when to print a 'we locked up' message to the user. If it's 1 
second, it causes false positive on some systems. If it's 10 minutes, 
people press reset before we print out any useful data. It used to be 
just a loop of rep_nop()s, but that was hard to calibrate: on certain 
newer hardware it was triggering as fast as in 2 seconds, causing many 
false positives. We cannot use jiffies nor any other clocksource in this 
debug code.

so i settled for the butt-ugly but working __delay(1) thing, to be able 
to time the debug messages.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:18                                             ` Ingo Molnar
@ 2007-06-21 20:36                                               ` Linus Torvalds
  0 siblings, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 20:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chuck Ebbert, Jarek Poplawski, Miklos Szeredi, chris,
	linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> yeah. I think Linux is i think the only OS on the planet that is using 
> the movb trick for unlock, it even triggered a hardware erratum ;)

I'm pretty sure others do it too.

Maybe not on an OS level (but I actually doubt that - I'd be surprised if 
Windows doesn't do the exact same thing), but I know for a fact that a lot 
of people in threaded libraries end up depending very much on the "simple 
store closes a locked section".

Just googling for "xchg" "mov" "spinlock" "-linux" shows discussion boards 
for Windows developers with open-coded spinlocks like


	int ResourceFlag = 0; // 0=Free, 1=Inuse
	...
	// Wait until we get the resource
	while(InterlockedExchange(&ResourceFlag, 1) != 0) {
	   Sleep(0); } // Wait a tad
	// Have the resource
	... // do your thing
	ResourceFlag = 0; // Release the resource


and that's definitely Windows code, not some Linux person doing it.

And this is from an OS2 forum

	unsigned owned=0;

	void request() {
	  while(LockedExchanged(&owned,1)!=0)
	    ;
	}

	void release() {
	  owned = 0;
	}

so it's not even something unusual.

So while arguably these people don't know (and don't care) about subtle 
issues like memory ordering, I can *guarantee* that a lot of programs 
depend on them, even if that dependency may often come from a lack of 
knowledge, rather than actively understanding what we do like in the Linux 
kernel community.

(And yes, they rely on compilers not reordering either. Tough.)

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 19:35                                                 ` Linus Torvalds
  2007-06-21 20:09                                                   ` Ingo Molnar
@ 2007-06-21 20:36                                                   ` Eric Dumazet
  1 sibling, 0 replies; 88+ messages in thread
From: Eric Dumazet @ 2007-06-21 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Ingo Molnar, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm

Linus Torvalds a écrit :
> 
> On Thu, 21 Jun 2007, Linus Torvalds wrote:
>> We don't do nesting locking either, for exactly the same reason. Are 
>> nesting locks "easier"? Absolutely. They are also almost always a sign of 
>> a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
>> to encourage bad programming!
> 
> Side note, and as a "truth in advertising" section: I'll have to admit 
> that I argued against fair semaphores on the same grounds. I was wrong 
> then (and eventually admitted it, and we obviously try to make our mutexes 
> and semaphores fair these days!), and maybe I'm wrong now.
> 
> If somebody can actually come up with a sequence where we have spinlock 
> starvation, and it's not about an example of bad locking, and nobody 
> really can come up with any other way to fix it, we may eventually have to 
> add the notion of "fair spinlocks".
> 

I tried to find such a sequence, but I think its more a matter of hardware 
evolution, and some degenerated cases.

In some years (months ?), it might possible to starve say the file struct 
spinlock of a process in a open()/close() infinite loop. This because the 
number of instruction per 'memory cache line transfert between cpus/core' is 
raising.

But then one can say its a bug in user code :)

Another way to starve kernel might be a loop doing settime() , since seqlock 
are quite special in serialization :

Only seqlock's writers perform atomic ops, readers could be starved because of 
some hardware 'optimization'.


> So my arguments are purely pragmatic. It's not that I hate fairness per 
> se. I dislike it only when it's used to "solve" (aka hide) other problems.
> 
> In the end, some situations do need fairness, and the fact that aiming for 
> fairness is often harder, slower, and more complicated than not doing so 
> at that point turns into a non-argument. If you need it, you need it.

Maybe some *big* NUMA machines really want this fairness (even if it cost some 
cycles as pointed by Davide in http://lkml.org/lkml/2007/3/29/246 ) , I am 
just guessing since I cannot test such monsters. I tested Davide program on a 
Dual Opteron and got some perf difference.


$ ./qspins  -n 2
now testing: TICKLOCK
timeres=4000
uscycles=1991
AVG[0]: 2195.250000 cycles/loop
SIG[0]: 11.813657
AVG[1]: 2212.312500 cycles/loop
SIG[1]: 38.038991

$ ./qspins  -n 2 -s
now testing: SPINLOCK
timeres=4000
uscycles=1991
AVG[0]: 2066.000000 cycles/loop
SIG[0]: 0.000000
AVG[1]: 2115.687500 cycles/loop
SIG[1]: 63.083000


> 
> I just don't think we need it, and we're better off solving problems other 
> ways.
> 
> (For example, we might also solve such problems by creating a separate
> "fair_spin_lock" abstraction, and only making the particular users that 
> need it actually use it. It would depend a bit on whether the cost of 
> implementing the fairness is noticeable enough for it to be worth having 
> a separate construct for it).
> 
> 		Linus
> 
> 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [patch] spinlock debug: make looping nicer
  2007-06-21 20:14                                                     ` Linus Torvalds
  2007-06-21 20:30                                                       ` Ingo Molnar
@ 2007-06-21 20:42                                                       ` Ingo Molnar
  2007-06-21 20:58                                                         ` Linus Torvalds
  1 sibling, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Anybody who ever waits for a lock by busy-looping over it is BUGGY, 
> dammit!

btw., back then we also tried a spin_is_locked() based inner loop but it 
didnt help the ->tree_lock lockups either. In any case i very much agree 
that the 'nicer' looping should be added again - the patch below does 
that. (build and boot tested)

and the reason that this didnt help the ->tree_lock lockup is likely the 
same why wait_task_inactive() broke _independently_ of the 'niceness' of 
the spin-lock operation: there were too few instructions between 
releasing the lock and re-acquiring it again can cause permanent 
starvation of another CPU. No amount of logic on the spinning side can 
overcome this, if acquire/release critical sections are following each 
other too fast.

	Ingo

------------------------------>
Subject: [patch] spinlock debug: make looping nicer
From: Ingo Molnar <mingo@elte.hu>

make the spin-trylock loops nicer - and reactive the read and
write loops as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/spinlock_debug.c |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -106,9 +106,14 @@ static void __spin_lock_debug(spinlock_t
 
 	for (;;) {
 		for (i = 0; i < loops; i++) {
+			/*
+			 * Ugly: we do the __delay() so that we know how
+			 * long to loop before printing a debug message:
+			 */
+			while (spin_is_locked(lock))
+				__delay(1);
 			if (__raw_spin_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
 		}
 		/* lockup suspected: */
 		if (print_once) {
@@ -167,7 +172,6 @@ static void rwlock_bug(rwlock_t *lock, c
 
 #define RWLOCK_BUG_ON(cond, lock, msg) if (unlikely(cond)) rwlock_bug(lock, msg)
 
-#if 0		/* __write_lock_debug() can lock up - maybe this can too? */
 static void __read_lock_debug(rwlock_t *lock)
 {
 	u64 i;
@@ -176,9 +180,10 @@ static void __read_lock_debug(rwlock_t *
 
 	for (;;) {
 		for (i = 0; i < loops; i++) {
+			while (!read_can_lock(lock))
+				__delay(1);
 			if (__raw_read_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
 		}
 		/* lockup suspected: */
 		if (print_once) {
@@ -191,12 +196,11 @@ static void __read_lock_debug(rwlock_t *
 		}
 	}
 }
-#endif
 
 void _raw_read_lock(rwlock_t *lock)
 {
 	RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic");
-	__raw_read_lock(&lock->raw_lock);
+	__read_lock_debug(lock);
 }
 
 int _raw_read_trylock(rwlock_t *lock)
@@ -242,7 +246,6 @@ static inline void debug_write_unlock(rw
 	lock->owner_cpu = -1;
 }
 
-#if 0		/* This can cause lockups */
 static void __write_lock_debug(rwlock_t *lock)
 {
 	u64 i;
@@ -251,9 +254,10 @@ static void __write_lock_debug(rwlock_t 
 
 	for (;;) {
 		for (i = 0; i < loops; i++) {
+			while (!write_can_lock(lock))
+				__delay(1);
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
 		}
 		/* lockup suspected: */
 		if (print_once) {
@@ -266,12 +270,11 @@ static void __write_lock_debug(rwlock_t 
 		}
 	}
 }
-#endif
 
 void _raw_write_lock(rwlock_t *lock)
 {
 	debug_write_lock_before(lock);
-	__raw_write_lock(&lock->raw_lock);
+	__write_lock_debug(lock);
 	debug_write_lock_after(lock);
 }
 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:30                                                       ` Ingo Molnar
@ 2007-06-21 20:48                                                         ` Linus Torvalds
  2007-06-21 21:06                                                           ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 20:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> damn, i first wrote up an explanation about why that ugly __delay(1) is 
> there (it almost hurts my eyes when i look at it!) but then deleted it 
> as superfluous :-/

I'm fine with a delay, but the __delay(1) is simply not "correct". It 
doesn't do anything.

"udelay()" waits for a certain time. Use that. 

> the reason for the __delay(1) was really mundane: to be able to figure 
> out when to print a 'we locked up' message to the user.

No it does not.

You may think it does, but it does nothing of the sort.

Use "udelay()" or somethign that actually takes a *time*.

Just __delay() is nothing but a loop, and calling it with an argument of 1 
is stupid and buggy. 

The only *possibly* valid use of "__delay()" implies using a counter that 
is based on the "loops_per_sec" thing, which depends on what the delay  
function actually is.

For example, the delay function may well turn out to be this:

        __asm__ __volatile__(
                "\tjmp 1f\n"
                ".align 16\n"
                "1:\tjmp 2f\n"
                ".align 16\n"
                "2:\tdecl %0\n\tjns 2b"
                :"=&a" (d0)
                :"0" (loops));

Notice? "Your code, it does nothing!"

When I said that the code was buggy, I meant it.

It has nothing to do with spinlocks. And "__delay(1)" is *always* a bug.

You migth want to replace it with

	smp_rmb();
	udelay(1);

instead, at which point it *does* something: it has that read barrier 
(which is not actually needed on x86, but whatever), and it has a delay 
that is *meaningful*.

A plain "__delay(1)" is neither.

So let me repeat my statement: "What a piece of crap".

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch] spinlock debug: make looping nicer
  2007-06-21 20:42                                                       ` [patch] spinlock debug: make looping nicer Ingo Molnar
@ 2007-06-21 20:58                                                         ` Linus Torvalds
  2007-06-21 21:15                                                           ` Ingo Molnar
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-21 20:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm



On Thu, 21 Jun 2007, Ingo Molnar wrote:
> 
> btw., back then we also tried a spin_is_locked() based inner loop but it 
> didnt help the ->tree_lock lockups either. In any case i very much agree 
> that the 'nicer' looping should be added again - the patch below does 
> that. (build and boot tested)

Ok, I'm definitely not going to apply it right now, though.

> and the reason that this didnt help the ->tree_lock lockup is likely the 
> same why wait_task_inactive() broke _independently_ of the 'niceness' of 
> the spin-lock operation: there were too few instructions between 
> releasing the lock and re-acquiring it again can cause permanent 
> starvation of another CPU. No amount of logic on the spinning side can 
> overcome this, if acquire/release critical sections are following each 
> other too fast.

Exactly.

The only way to handle that case is to make sure that the person who 
*gets* the spinlock will slow down. The person who doesn't get it can't do 
anything at all about the fact that he's locked out.

A way to do that (as already mentioned) is to have a "this lock is 
contended" flag, and have the person who gets the lock do something about 
it (where the "something" might actually be as simple as saying "When I 
release a lock that somebody marked as having lots of contention, I will 
clear the contention flag, and then just delay myself").

Side note: that trivial approach only really helps for a *single* thread 
that gets it very much (like the example in wait_task_inactive). For true 
contention with multiple different CPU's that can *all* have the bad 
behaviour, you do actually need real queueing.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:48                                                         ` Linus Torvalds
@ 2007-06-21 21:06                                                           ` Ingo Molnar
  0 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > damn, i first wrote up an explanation about why that ugly __delay(1) is 
> > there (it almost hurts my eyes when i look at it!) but then deleted it 
> > as superfluous :-/
> 
> I'm fine with a delay, but the __delay(1) is simply not "correct". It 
> doesn't do anything.

it's a bit trickier than that. Yes, it's a simple 1-entry loop and thus 
makes little sense to call. But it's a loop that got boot-time 
calibrated, so we can do this in the spinlock-debug code:

        u64 loops = loops_per_jiffy * HZ;

this guarantees that we will loop for _at least_ 1 second before 
printing a message. (in practice it's much longer, especially with the 
current naive trylock approach)

Why? Because as part of the activities that the spin-loop does, we also 
do everything that an mdelay(1000) would do. We do it 'piecemail-wise', 
and we do it very inefficiently, but the lower time-bound should be 
guaranteed. This is done because most of the problems were caused by too 
short looping and bogus debug printouts. So this is basically an 
open-coded udelay implementation.

Furthermore, with the spin_is_locked() fix i just sent, the __loop(1) 
solution should actually be quite close to a real udelay() thing, 
without the delay effect. It would probably more accurate to increase it 
to loops_per_jiffy*10*HZ/16 and to call a __loop(16) thing, to get 
closer timings. (but then some people would argue 'why dont you take the 
lock as soon as it's released'.)

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch] spinlock debug: make looping nicer
  2007-06-21 20:58                                                         ` Linus Torvalds
@ 2007-06-21 21:15                                                           ` Ingo Molnar
  2007-06-22  7:00                                                             ` Jarek Poplawski
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-21 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Jarek Poplawski, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > 
> > btw., back then we also tried a spin_is_locked() based inner loop 
> > but it didnt help the ->tree_lock lockups either. In any case i very 
> > much agree that the 'nicer' looping should be added again - the 
> > patch below does that. (build and boot tested)
> 
> Ok, I'm definitely not going to apply it right now, though.

yeah, very much so. Lots of distros ship with spinlock debugging enabled 
(at least in their betas), we dont want to break that accidentally.

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch] spinlock debug: make looping nicer
  2007-06-21 21:15                                                           ` Ingo Molnar
@ 2007-06-22  7:00                                                             ` Jarek Poplawski
  0 siblings, 0 replies; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-22  7:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Eric Dumazet, Chuck Ebbert, Miklos Szeredi,
	chris, linux-kernel, tglx, akpm

On Thu, Jun 21, 2007 at 11:15:25PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, 21 Jun 2007, Ingo Molnar wrote:
> > > 
> > > btw., back then we also tried a spin_is_locked() based inner loop 
> > > but it didnt help the ->tree_lock lockups either. In any case i very 
> > > much agree that the 'nicer' looping should be added again - the 
> > > patch below does that. (build and boot tested)
> > 
> > Ok, I'm definitely not going to apply it right now, though.
> 
> yeah, very much so. Lots of distros ship with spinlock debugging enabled 
> (at least in their betas), we dont want to break that accidentally.

But, I hope, Miklos will manage to find some time to try this
patch or spinlock debugging on, to confirm the nature isn't
mocking us here... 

Miklos, I'd appreciate it very much if you could also exclude my
another extremely silly suspicion, and try with only cpu_relax()
instead of yield() (but without removing if (preempted)) in
wait_task_inactive().

Thanks & regards,
Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 20:16                                             ` Ingo Molnar
@ 2007-06-22  8:17                                               ` Ingo Molnar
  2007-06-23 10:36                                                 ` Miklos Szeredi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2007-06-22  8:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Jarek Poplawski, Miklos Szeredi, chris,
	linux-kernel, tglx, akpm


* Ingo Molnar <mingo@elte.hu> wrote:

> the freezes that Miklos was seeing were hardirq contexts blocking in 
> task_rq_lock() - that is done with interrupts disabled. (Miklos i 
> think also tried !NOHZ kernels and older kernels, with a similar 
> result.)
> 
> plus on the ptrace side, the wait_task_inactive() code had most of its 
> overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
> likely while we were still holding the runqueue lock!
> 
> i think the only thing that eventually got Miklos' laptop out of the 
> wedge were timer irqs hitting the ptrace CPU in exactly those 
> instructions where it was not holding the runqueue lock. (or perhaps 
> an asynchronous SMM event delaying it for a long time)

even considering that the 'LOCK'-ed intruction was the heaviest in the 
busy-loop, the numbers still just dont add up to 'tens of seconds of 
lockups', so there must be something else happening too.

So here's an addition to the existing theories: the Core2Duo is a 
4-issue CPU architecture. Now, why does this matter? It matters to the 
timing of the delivery of interrupts. For example, on a 3-issue 
architecture, the instruction level profile of well-cached workloads 
often looks like this:

c05a3b71:      710      89 d6                   mov    %edx,%esi
c05a3b73:        0      8b 55 c0                mov    0xffffffc0(%ebp),%edx
c05a3b76:        0      89 c3                   mov    %eax,%ebx
c05a3b78:      775      8b 82 e8 00 00 00       mov    0xe8(%edx),%eax
c05a3b7e:        0      8b 48 18                mov    0x18(%eax),%ecx
c05a3b81:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
c05a3b84:      792      89 1c 24                mov    %ebx,(%esp)
c05a3b87:        0      89 74 24 04             mov    %esi,0x4(%esp)
c05a3b8b:        0      ff d1                   call   *%ecx
c05a3b8d:        0      8b 4d c8                mov    0xffffffc8(%ebp),%ecx
c05a3b90:      925      8b 41 6c                mov    0x6c(%ecx),%eax
c05a3b93:        0      39 41 10                cmp    %eax,0x10(%ecx)
c05a3b96:        0      0f 85 a8 01 00 00       jne    c05a3d44 <schedule+0x2a4>
c05a3b9c:      949      89 da                   mov    %ebx,%edx
c05a3b9e:        0      89 f1                   mov    %esi,%ecx
c05a3ba0:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax

the second column is the number of times the profiling interrupt has hit 
that particular instruction.

Note the many zero entries - this means that for instructions that are 
well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
to within a bundle of micro-ops that the decoder will issue! The above 
workload was a plain lat_ctx, so nothing special, and interrupts and DMA 
traffic were coming and going. Still the bundling of instructions was 
very strong.

There's no guarantee of 'instruction bundling': a cachemiss can still 
stall the pipeline and allow an interrupt to hit any instruction [where 
interrupt delivery is valid], but on a well-cached workload like the 
above, even a 3-issue architecture can effectively 'merge' instructions 
to each other, and can make them essentially 'atomic' as far as external 
interrupts go.

[ also note another interesting thing in the profile above: the
  CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' 
  effect that is even larger than 3 instructions. ]

i think that is what might have happened on Miklos's laptop too: the 
'movb' of the spin_unlock() done by the wait_task_inactive() got 
'bundled' together with the first LOCK instruction that took it again, 
making it very unlikely for a timer interrupt to ever hit that small 
window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely 
a simple NOP, because the Core2Duo is not an SMT platform.

to check this theory, adding 3 NOPs to the critical section should make 
the lockups a lot less prominent too. (While NOPs are not actually 
'issued', they do take up decoder bandwidth, so they hopefully are able 
to break up any 'bundle' of instructions.)

Miklos, if you've got some time to test this - could you revert the 
fa490cfd15d7 commit and apply the patch below - does it have any impact 
on the lockups you were experiencing?

	Ingo

---
 kernel/sched.c |    1 +
 1 file changed, 1 insertion(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1131,6 +1131,7 @@ repeat:
 		preempted = !task_running(rq, p);
 		task_rq_unlock(rq, &flags);
 		cpu_relax();
+		asm volatile ("nop; nop; nop;");
 		if (preempted)
 			yield();
 		goto repeat;

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 16:01                                     ` Linus Torvalds
@ 2007-06-22 10:38                                       ` Jarek Poplawski
  0 siblings, 0 replies; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-22 10:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Miklos Szeredi, cebbert, chris, linux-kernel, tglx, akpm

On Thu, Jun 21, 2007 at 09:01:28AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 21 Jun 2007, Jarek Poplawski wrote:
...
> So I don't see how you could possibly having two different CPU's getting 
> into some lock-step in that loop: changing "task_rq()" is a really quite 
> heavy operation (it's about migrating between CPU's), and generally 
> happens at a fairly low frequency (ie "a couple of times a second" kind of 
> thing, not "tight CPU loop").

Yes, I've agreed with Ingo it was only one of my illusions...

> But bugs happen..
> 
> > Another possible problem could be a result of some wrong optimization
> > or wrong propagation of change of this task_rq(p) value.
> 
> I agree, but that kind of bug would likely not cause temporary hangs, but 
> actual "the machine is dead" operations. If you get the totally *wrong* 
> value due to some systematic bug, you'd be waiting forever for it to 
> match, not get into a loop for a half second and then it clearing up..
> 
> But I don't think we can throw the "it's another bug" theory _entirely_ 
> out the window.

Alas, I'm the last here who should talk with you or Ingo about
hardware, but my point is that until it's not 100% proven this
is spinlocks vs. cpu case any nearby possibilities should be
considered. One of them is this loop, which can ... loop. Of
course we can't see any reason for this, but if something is
theoretically allowed it can happen. Here it's enough the
task_rq(p) is for some very unprobable reason (maybe buggy
hardware, code or compiling) cached or read too late, and
maybe it's only on this one only notebook in the world? If
you're sure this is not the case, let's forget about this.
Of course, I don't mean it's a direct optimization. I think
about something that could be triggered by something like this
tested smp_mb() in entirely another place. IMHO, it would be
very interesting to assure if this barrier fixed the spinlock
or maybe some variable.

There is also another interesting question: if it was only about
spinlocks (which may be) why on those watchdog traces there
are mainly two "fighters": wait_task_inactive() and
try_to_wake_up(). It seems there should be seen more clients
of task_rq_lock(). So, my another unlikely idea is: maybe
for some other strange, unprobable but only theoretically
possibility there could be something wrong around this
yield() in wait_task_inactive(). The comment above this
function reads about the task that *will* unschedule.
So, maybe it would be not nice if e.g. during this yield()
something wakes it up?

...
> But it only happens with badly coded software: the rule simply is that you 
> MUST NOT release and immediately re-acquire the same spinlock on the same 
> core, because as far as other cores are concerned, that's basically the 
> same as never releasing it in the first place.

I totally agree! Let's only get certainty there is nothing
more here. BTW, is there any reason not to add some test with
a warning under spinlock debugging against such badly behaving
places?

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-22  8:17                                               ` Ingo Molnar
@ 2007-06-23 10:36                                                 ` Miklos Szeredi
  2007-06-23 16:39                                                   ` Linus Torvalds
  2007-06-25  6:45                                                   ` Jarek Poplawski
  0 siblings, 2 replies; 88+ messages in thread
From: Miklos Szeredi @ 2007-06-23 10:36 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, cebbert, jarkao2, miklos, chris, linux-kernel, tglx, akpm

> > the freezes that Miklos was seeing were hardirq contexts blocking in 
> > task_rq_lock() - that is done with interrupts disabled. (Miklos i 
> > think also tried !NOHZ kernels and older kernels, with a similar 
> > result.)
> > 
> > plus on the ptrace side, the wait_task_inactive() code had most of its 
> > overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
> > likely while we were still holding the runqueue lock!
> > 
> > i think the only thing that eventually got Miklos' laptop out of the 
> > wedge were timer irqs hitting the ptrace CPU in exactly those 
> > instructions where it was not holding the runqueue lock. (or perhaps 
> > an asynchronous SMM event delaying it for a long time)
> 
> even considering that the 'LOCK'-ed intruction was the heaviest in the 
> busy-loop, the numbers still just dont add up to 'tens of seconds of 
> lockups', so there must be something else happening too.
> 
> So here's an addition to the existing theories: the Core2Duo is a 
> 4-issue CPU architecture. Now, why does this matter? It matters to the 
> timing of the delivery of interrupts. For example, on a 3-issue 
> architecture, the instruction level profile of well-cached workloads 
> often looks like this:
> 
> c05a3b71:      710      89 d6                   mov    %edx,%esi
> c05a3b73:        0      8b 55 c0                mov    0xffffffc0(%ebp),%edx
> c05a3b76:        0      89 c3                   mov    %eax,%ebx
> c05a3b78:      775      8b 82 e8 00 00 00       mov    0xe8(%edx),%eax
> c05a3b7e:        0      8b 48 18                mov    0x18(%eax),%ecx
> c05a3b81:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
> c05a3b84:      792      89 1c 24                mov    %ebx,(%esp)
> c05a3b87:        0      89 74 24 04             mov    %esi,0x4(%esp)
> c05a3b8b:        0      ff d1                   call   *%ecx
> c05a3b8d:        0      8b 4d c8                mov    0xffffffc8(%ebp),%ecx
> c05a3b90:      925      8b 41 6c                mov    0x6c(%ecx),%eax
> c05a3b93:        0      39 41 10                cmp    %eax,0x10(%ecx)
> c05a3b96:        0      0f 85 a8 01 00 00       jne    c05a3d44 <schedule+0x2a4>
> c05a3b9c:      949      89 da                   mov    %ebx,%edx
> c05a3b9e:        0      89 f1                   mov    %esi,%ecx
> c05a3ba0:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
> 
> the second column is the number of times the profiling interrupt has hit 
> that particular instruction.
> 
> Note the many zero entries - this means that for instructions that are 
> well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
> to within a bundle of micro-ops that the decoder will issue! The above 
> workload was a plain lat_ctx, so nothing special, and interrupts and DMA 
> traffic were coming and going. Still the bundling of instructions was 
> very strong.
> 
> There's no guarantee of 'instruction bundling': a cachemiss can still 
> stall the pipeline and allow an interrupt to hit any instruction [where 
> interrupt delivery is valid], but on a well-cached workload like the 
> above, even a 3-issue architecture can effectively 'merge' instructions 
> to each other, and can make them essentially 'atomic' as far as external 
> interrupts go.
> 
> [ also note another interesting thing in the profile above: the
>   CALL *%ecx was likely BTB-optimized and hence we have a 'bundling' 
>   effect that is even larger than 3 instructions. ]
> 
> i think that is what might have happened on Miklos's laptop too: the 
> 'movb' of the spin_unlock() done by the wait_task_inactive() got 
> 'bundled' together with the first LOCK instruction that took it again, 
> making it very unlikely for a timer interrupt to ever hit that small 
> window in wait_task_inactive(). The cpu_relax()'s "REP; NOP" was likely 
> a simple NOP, because the Core2Duo is not an SMT platform.
> 
> to check this theory, adding 3 NOPs to the critical section should make 
> the lockups a lot less prominent too. (While NOPs are not actually 
> 'issued', they do take up decoder bandwidth, so they hopefully are able 
> to break up any 'bundle' of instructions.)
> 
> Miklos, if you've got some time to test this - could you revert the 
> fa490cfd15d7 commit and apply the patch below - does it have any impact 
> on the lockups you were experiencing?

No.  If anything it made the freezes somwhat more frequent.

And it's not a NO_HZ kernel.

What I notice is that the interrupt distribution between the CPUs is
very asymmetric like this:

           CPU0       CPU1
  0:     220496         42   IO-APIC-edge      timer
  1:       3841          0   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc
  9:       2756          0   IO-APIC-fasteoi   acpi
 12:       2638          0   IO-APIC-edge      i8042
 14:       7776          0   IO-APIC-edge      ide0
 16:       6083          0   IO-APIC-fasteoi   uhci_hcd:usb2
 17:      34414          3   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:         32          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
313:      11405          1   PCI-MSI-edge      eth0
314:      29417         10   PCI-MSI-edge      ahci
NMI:        164        118
LOC:     220499     220463
ERR:          0

and the freezes don't really change that.  And the NMI traces show,
that it's always CPU1 which is spinning in wait_task_inactive().

Miklos

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-23 10:36                                                 ` Miklos Szeredi
@ 2007-06-23 16:39                                                   ` Linus Torvalds
  2007-06-25  6:45                                                   ` Jarek Poplawski
  1 sibling, 0 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-23 16:39 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: mingo, cebbert, jarkao2, chris, linux-kernel, tglx, akpm



On Sat, 23 Jun 2007, Miklos Szeredi wrote:
> 
> What I notice is that the interrupt distribution between the CPUs is
> very asymmetric like this:
> 
>            CPU0       CPU1
>   0:     220496         42   IO-APIC-edge      timer
>   1:       3841          0   IO-APIC-edge      i8042
...
> LOC:     220499     220463
> ERR:          0
> 
> and the freezes don't really change that.  And the NMI traces show,
> that it's always CPU1 which is spinning in wait_task_inactive().

Well, the LOC thing is for the local apic timer, so while regular 
interrupts are indeed very skewed, both CPU's is nicely getting the local 
apic timer thing..

That said, the timer interrupt generally happens just a few hundred times 
a second, and if there's just a higher likelihood that it happens when the 
spinlock is taken, then half-a-second pauses could easily be just because 
even when the interrupt happens, it could be skewed to happen when the 
lock is held.

And that definitely is the case: the most expensive instruction _by_far_ 
in that loop is the actual locked instruction that acquires the lock 
(especially with the cache-line bouncing around), so an interrupt would be 
much more likely to happen right after that one rather than after the 
store that releases the lock, which can be buffered.

It can be quite interesting to look at instruction-level cycle profiling 
with oprofile, just to see where the costs are..

		Linus



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-23 10:36                                                 ` Miklos Szeredi
  2007-06-23 16:39                                                   ` Linus Torvalds
@ 2007-06-25  6:45                                                   ` Jarek Poplawski
  1 sibling, 0 replies; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-25  6:45 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: mingo, torvalds, cebbert, chris, linux-kernel, tglx, akpm

On Sat, Jun 23, 2007 at 12:36:08PM +0200, Miklos Szeredi wrote:
...
> And it's not a NO_HZ kernel.
...

BTW, maybe I've missed this and it's unconnected, but I hope the
first config has been changed - especially this CONFIG_AGP_AMD64 = y,
and this bug from mm/slab.c has gone long ago...

Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-21 18:44                                               ` Linus Torvalds
                                                                   ` (2 preceding siblings ...)
  2007-06-21 20:12                                                 ` Ingo Molnar
@ 2007-06-26  8:42                                                 ` Nick Piggin
  2007-06-26 10:56                                                   ` Jarek Poplawski
  2007-06-26 17:23                                                   ` Linus Torvalds
  3 siblings, 2 replies; 88+ messages in thread
From: Nick Piggin @ 2007-06-26  8:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm

Linus Torvalds wrote:
> 
> On Thu, 21 Jun 2007, Eric Dumazet wrote:
> 
>>This reminds me Nick's proposal of 'queued spinlocks' 3 months ago
>>
>>Maybe this should be re-considered ? (unlock is still a non atomic op, 
>>so we dont pay the serializing cost twice)
> 
> 
> No. The point is simple:
> 
> 	IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!
> 
> I don't understand why this is even controversial. Especially since we 
> have a patch for the problem that proves my point: the _proper_ way to fix 
> things is to just not do the bad thing, instead of trying to allow the bad 
> behaviour and try to handle it.
> 
> Things like queued spinlocks just make excuses for bad code. 
> 
> We don't do nesting locking either, for exactly the same reason. Are 
> nesting locks "easier"? Absolutely. They are also almost always a sign of 
> a *bug*. So making spinlocks and/or mutexes nest by default is just a way 
> to encourage bad programming!

Hmm, not that I have a strong opinion one way or the other, but I
don't know that they would encourage bad code. They are not going to
reduce latency under a locked section, but will improve determinism
in the contended case.

They should also improve performance in heavily contended case due to
the nature of how they spin, but I know that's not something you want
to hear about. And theoretically there should be no reason why xadd is
any slower than dec and look at the status flags, should there? I never
implementedit in optimised assembly to test though...

Some hardware seems to have no idea of fair cacheline scheduling, and
especially when there are more than 2 CPUs contending for the
cacheline, there can be large starvations.

And actually some times we have code that really wants to drop the
lock and queue behind other contenders. Most of the lockbreak stuff
for example.

Suppose we could have a completely fair spinlock primitive that has
*virtually* no downsides over the unfair version, you'd take the fair
one, right?

Not that I'm saying they'd ever be a good solution to bad code, but I
do think fairness is better than none, all else being equal.


>>extract : 
>>
>>Implement queued spinlocks for i386. This shouldn't increase the size of
>>the spinlock structure, while still able to handle 2^16 CPUs.
> 
> 
> Umm. i386 spinlocks could and should be *one*byte*.
> 
> In fact, I don't even know why they are wasting four bytes right now: the 
> fact that somebody made them an "int" just wastes memory. All the actual 
> code uses "decb", so it's not even a question of safety. I wonder why we 
> have that 32-bit thing and the ugly casts.

Anyway, I think the important point is that they can remain within 4
bytes, which is obviously the most important boundary (they could be
2 bytes on i386).

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-26  8:42                                                 ` Nick Piggin
@ 2007-06-26 10:56                                                   ` Jarek Poplawski
  2007-06-26 17:23                                                   ` Linus Torvalds
  1 sibling, 0 replies; 88+ messages in thread
From: Jarek Poplawski @ 2007-06-26 10:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Eric Dumazet, Chuck Ebbert, Ingo Molnar,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm

On Tue, Jun 26, 2007 at 06:42:10PM +1000, Nick Piggin wrote:
...
> They should also improve performance in heavily contended case due to
> the nature of how they spin, but I know that's not something you want
> to hear about. And theoretically there should be no reason why xadd is
> any slower than dec and look at the status flags, should there? I never
> implementedit in optimised assembly to test though...
...

BTW, could you explain why the below diagnose doesn't relate
to your solution?

On 06/21/2007 12:08 PM, Ingo Molnar wrote:
...
> So it seems the problem was that if a core kept _truly_ modifying a 
> cacheline via atomics in a high enough frequency, it could artificially 
> starve the other core. (which would keep waiting for the cacheline to be 
> released one day, and which kept the first core from ever making any 
> progress) To me that looks like a real problem on the hardware side - 
> shouldnt cacheline ownership be arbitrated a bit better than that?
> 

Thanks & regards,
Jarek P.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-26  8:42                                                 ` Nick Piggin
  2007-06-26 10:56                                                   ` Jarek Poplawski
@ 2007-06-26 17:23                                                   ` Linus Torvalds
  2007-06-27  5:23                                                     ` Nick Piggin
  1 sibling, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-26 17:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm



On Tue, 26 Jun 2007, Nick Piggin wrote:
> 
> Hmm, not that I have a strong opinion one way or the other, but I
> don't know that they would encourage bad code. They are not going to
> reduce latency under a locked section, but will improve determinism
> in the contended case.

xadd really generally *is* slower than an add. One is often microcoded, 
the other is not.

But the real problem is that your "unlock" sequence is now about two 
orders of magnitude slower than it used to be. So it used to be that a 
spinlocked sequence only had a single synchronization point, now it has 
two. *That* is really bad, and I guarantee that it makes your spinlocks 
effectively twice as slow for the non-contended parts.

But your xadd thing might be worth looking at, just to see how expensive 
it is. As an _alternative_ to spinlocks, it's certainly viable.

(Side note: why make it a word? Word operations are slower on many x86 
implementations, because they add yet another prefix. You only need a 
byte)

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-26 17:23                                                   ` Linus Torvalds
@ 2007-06-27  5:23                                                     ` Nick Piggin
  2007-06-27  6:04                                                       ` Linus Torvalds
  0 siblings, 1 reply; 88+ messages in thread
From: Nick Piggin @ 2007-06-27  5:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm

Linus Torvalds wrote:
> 
> On Tue, 26 Jun 2007, Nick Piggin wrote:
> 
>>Hmm, not that I have a strong opinion one way or the other, but I
>>don't know that they would encourage bad code. They are not going to
>>reduce latency under a locked section, but will improve determinism
>>in the contended case.
> 
> 
> xadd really generally *is* slower than an add. One is often microcoded, 
> the other is not.

Oh. I found xadd to be not hugely slower on my P4, but it was a little
bit.


> But the real problem is that your "unlock" sequence is now about two 
> orders of magnitude slower than it used to be. So it used to be that a 
> spinlocked sequence only had a single synchronization point, now it has 
> two. *That* is really bad, and I guarantee that it makes your spinlocks 
> effectively twice as slow for the non-contended parts.

I don't know why my unlock sequence should be that much slower? Unlocked
mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
twice as slow (IIRC).


> But your xadd thing might be worth looking at, just to see how expensive 
> it is. As an _alternative_ to spinlocks, it's certainly viable.
> 
> (Side note: why make it a word? Word operations are slower on many x86 
> implementations, because they add yet another prefix. You only need a 
> byte)

No real reason I guess. I'll change it.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27  5:23                                                     ` Nick Piggin
@ 2007-06-27  6:04                                                       ` Linus Torvalds
  2007-06-27  6:20                                                         ` Nick Piggin
  2007-06-27 19:47                                                         ` Linus Torvalds
  0 siblings, 2 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-27  6:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm



On Wed, 27 Jun 2007, Nick Piggin wrote:
> 
> I don't know why my unlock sequence should be that much slower? Unlocked
> mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
> twice as slow (IIRC).

Oh, that releasing "add" can be unlocked, and only the holder of the lock 
ever touches that field?

I must not have looked closely enough. In that case, I withdraw that 
objection, and the sequence-number-based spinlock sounds like a perfectly 
fine one.

Yes, the add will be slightly slower than the plain byte move, and the 
locked xadd will be slightly slower than a regular locked add, but 
compared to the serialization cost, that should be small. For some reason 
I thought you needed a locked instruction for the unlock too.

So try it with just a byte counter, and test some stupid micro-benchmark 
on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
it the normal spinlock sequence just because it isn't noticeably slower.

In fact, I think a "incb <mem>" instruction is even a byte shorter than 
"movb $1,mem", and with "unlock" being inlined, that could actually be a 
slight _win_.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27  6:04                                                       ` Linus Torvalds
@ 2007-06-27  6:20                                                         ` Nick Piggin
  2007-06-27 19:47                                                         ` Linus Torvalds
  1 sibling, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2007-06-27  6:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm

Linus Torvalds wrote:
> 
> On Wed, 27 Jun 2007, Nick Piggin wrote:
> 
>>I don't know why my unlock sequence should be that much slower? Unlocked
>>mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
>>twice as slow (IIRC).
> 
> 
> Oh, that releasing "add" can be unlocked, and only the holder of the lock 
> ever touches that field?

Right.


> I must not have looked closely enough. In that case, I withdraw that 
> objection, and the sequence-number-based spinlock sounds like a perfectly 
> fine one.
> 
> Yes, the add will be slightly slower than the plain byte move, and the 
> locked xadd will be slightly slower than a regular locked add, but 
> compared to the serialization cost, that should be small. For some reason 
> I thought you needed a locked instruction for the unlock too.
> 
> So try it with just a byte counter, and test some stupid micro-benchmark 
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> it the normal spinlock sequence just because it isn't noticeably slower.
> 
> In fact, I think a "incb <mem>" instruction is even a byte shorter than 
> "movb $1,mem", and with "unlock" being inlined, that could actually be a 
> slight _win_.

OK, I'll try running some tests and get back to you on it.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27  6:04                                                       ` Linus Torvalds
  2007-06-27  6:20                                                         ` Nick Piggin
@ 2007-06-27 19:47                                                         ` Linus Torvalds
  2007-06-27 20:10                                                           ` Ingo Molnar
                                                                             ` (2 more replies)
  1 sibling, 3 replies; 88+ messages in thread
From: Linus Torvalds @ 2007-06-27 19:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm


Nick,
 call me a worry-wart, but I slept on this, and started worrying..

On Tue, 26 Jun 2007, Linus Torvalds wrote:
> 
> So try it with just a byte counter, and test some stupid micro-benchmark 
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> it the normal spinlock sequence just because it isn't noticeably slower.

So I thought about this a bit more, and I like your sequence counter 
approach, but it still worried me.

In the current spinlock code, we have a very simple setup for a 
successful grab of the spinlock:

	CPU#0					CPU#1

	A (= code before the spinlock)
						lock release

	lock decb mem	(serializing instruction)

	B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A.

With the sequence counters, the situation is more complex:

	CPU #0					CPU #1

	A (= code before the spinlock)

	lock xadd mem	(serializing instruction)

	B (= code afte xadd, but not inside lock)

						lock release

	cmp head, tail

	C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about. 
The thing is, I can guarantee by the Intel memory ordering rules that 
neither B nor C will ever have memops that leak past the "xadd", but I'm 
not at all as sure that we cannot have memops in C that leak into B!

And B really isn't protected by the lock - it may run while another CPU 
still holds the lock, and we know the other CPU released it only as part 
of the compare. But that compare isn't a serializing instruction!

IOW, I could imagine a load inside C being speculated, and being moved 
*ahead* of the load that compares the spinlock head with the tail! IOW, 
the load that is _inside_ the spinlock has effectively moved to outside 
the protected region, and the spinlock isn't really a reliable mutual 
exclusion barrier any more!

(Yes, there is a data-dependency on the compare, but it is only used for a 
conditional branch, and conditional branches are control dependencies and 
can be speculated, so CPU speculation can easily break that apparent 
dependency chain and do later loads *before* the spinlock load completes!)

Now, I have good reason to believe that all Intel and AMD CPU's have a 
stricter-than-documented memory ordering, and that your spinlock may 
actually work perfectly well. But it still worries me. As far as I can 
tell, there's a theoretical problem with your spinlock implementation.

So I'd like you to ask around some CPU people, and get people from both 
Intel and AMD to sign off on your spinlocks as safe. I suspect you already 
have the required contacts, but if you don't, I can send things off to the 
appropriate people at least inside Intel.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27 19:47                                                         ` Linus Torvalds
@ 2007-06-27 20:10                                                           ` Ingo Molnar
  2007-06-27 20:17                                                           ` Davide Libenzi
  2007-07-02  7:06                                                           ` Nick Piggin
  2 siblings, 0 replies; 88+ messages in thread
From: Ingo Molnar @ 2007-06-27 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Eric Dumazet, Chuck Ebbert, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> With the sequence counters, the situation is more complex:
> 
> 	CPU #0					CPU #1
> 
> 	A (= code before the spinlock)
> 
> 	lock xadd mem	(serializing instruction)
> 
> 	B (= code afte xadd, but not inside lock)
> 
> 						lock release
> 
> 	cmp head, tail
> 
> 	C (= code inside the lock)
> 
> Now, B is basically the empty set, but that's not the issue I worry 
> about. The thing is, I can guarantee by the Intel memory ordering 
> rules that neither B nor C will ever have memops that leak past the 
> "xadd", but I'm not at all as sure that we cannot have memops in C 
> that leak into B!
> 
> And B really isn't protected by the lock - it may run while another 
> CPU still holds the lock, and we know the other CPU released it only 
> as part of the compare. But that compare isn't a serializing 
> instruction!
> 
> IOW, I could imagine a load inside C being speculated, and being moved 
> *ahead* of the load that compares the spinlock head with the tail! 
> IOW, the load that is _inside_ the spinlock has effectively moved to 
> outside the protected region, and the spinlock isn't really a reliable 
> mutual exclusion barrier any more!
> 
> (Yes, there is a data-dependency on the compare, but it is only used 
> for a conditional branch, and conditional branches are control 
> dependencies and can be speculated, so CPU speculation can easily 
> break that apparent dependency chain and do later loads *before* the 
> spinlock load completes!)
> 
> Now, I have good reason to believe that all Intel and AMD CPU's have a 
> stricter-than-documented memory ordering, and that your spinlock may 
> actually work perfectly well. But it still worries me. As far as I can 
> tell, there's a theoretical problem with your spinlock implementation.

hm, i agree with you that this is problematic. Especially on an SMT CPU 
it would be a big architectural restriction if prefetches couldnt cross 
cache misses. (and that's the only way i could see Nick's scheme 
working: MESI coherency coupled with the speculative use of that 
cacheline's value never "surviving" a MESI invalidation of that 
cacheline. That would guarantee that once we have the lock, any 
speculative result is fully coherent and no other CPU has modified it.)

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27 19:47                                                         ` Linus Torvalds
  2007-06-27 20:10                                                           ` Ingo Molnar
@ 2007-06-27 20:17                                                           ` Davide Libenzi
  2007-06-27 22:11                                                             ` Linus Torvalds
  2007-07-02  7:06                                                           ` Nick Piggin
  2 siblings, 1 reply; 88+ messages in thread
From: Davide Libenzi @ 2007-06-27 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Eric Dumazet, Chuck Ebbert, Ingo Molnar,
	Jarek Poplawski, Miklos Szeredi, chris,
	Linux Kernel Mailing List, tglx, Andrew Morton

On Wed, 27 Jun 2007, Linus Torvalds wrote:

> On Tue, 26 Jun 2007, Linus Torvalds wrote:
> > 
> > So try it with just a byte counter, and test some stupid micro-benchmark 
> > on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
> > it the normal spinlock sequence just because it isn't noticeably slower.
> 
> So I thought about this a bit more, and I like your sequence counter 
> approach, but it still worried me.
> 
> In the current spinlock code, we have a very simple setup for a 
> successful grab of the spinlock:
> 
> 	CPU#0					CPU#1
> 
> 	A (= code before the spinlock)
> 						lock release
> 
> 	lock decb mem	(serializing instruction)
> 
> 	B (= code after the spinlock)
> 
> and there is no question that memory operations in B cannot leak into A.
> 
> With the sequence counters, the situation is more complex:
> 
> 	CPU #0					CPU #1
> 
> 	A (= code before the spinlock)
> 
> 	lock xadd mem	(serializing instruction)
> 
> 	B (= code afte xadd, but not inside lock)
> 
> 						lock release
> 
> 	cmp head, tail
> 
> 	C (= code inside the lock)
> 
> Now, B is basically the empty set, but that's not the issue I worry about. 
> The thing is, I can guarantee by the Intel memory ordering rules that 
> neither B nor C will ever have memops that leak past the "xadd", but I'm 
> not at all as sure that we cannot have memops in C that leak into B!
> 
> And B really isn't protected by the lock - it may run while another CPU 
> still holds the lock, and we know the other CPU released it only as part 
> of the compare. But that compare isn't a serializing instruction!
> 
> IOW, I could imagine a load inside C being speculated, and being moved 
> *ahead* of the load that compares the spinlock head with the tail! IOW, 
> the load that is _inside_ the spinlock has effectively moved to outside 
> the protected region, and the spinlock isn't really a reliable mutual 
> exclusion barrier any more!
> 
> (Yes, there is a data-dependency on the compare, but it is only used for a 
> conditional branch, and conditional branches are control dependencies and 
> can be speculated, so CPU speculation can easily break that apparent 
> dependency chain and do later loads *before* the spinlock load completes!)
> 
> Now, I have good reason to believe that all Intel and AMD CPU's have a 
> stricter-than-documented memory ordering, and that your spinlock may 
> actually work perfectly well. But it still worries me. As far as I can 
> tell, there's a theoretical problem with your spinlock implementation.

Nice catch ;) But wasn't Intel suggesting in not relying on the old 
"strict" ordering rules? IOW shouldn't an mfence always be there? Not only 
loads could leak up into the wait phase, but stores too, if they have no 
dependency with the "head" and "tail" loads.



- Davide



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27 20:17                                                           ` Davide Libenzi
@ 2007-06-27 22:11                                                             ` Linus Torvalds
  2007-06-27 23:30                                                               ` Davide Libenzi
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-27 22:11 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Nick Piggin, Eric Dumazet, Chuck Ebbert, Ingo Molnar,
	Jarek Poplawski, Miklos Szeredi, chris,
	Linux Kernel Mailing List, tglx, Andrew Morton



On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > 
> > Now, I have good reason to believe that all Intel and AMD CPU's have a 
> > stricter-than-documented memory ordering, and that your spinlock may 
> > actually work perfectly well. But it still worries me. As far as I can 
> > tell, there's a theoretical problem with your spinlock implementation.
> 
> Nice catch ;) But wasn't Intel suggesting in not relying on the old 
> "strict" ordering rules?

Actually, both Intel and AMD engineers have been talking about making the 
documentation _stricter_, rather than looser. They apparently already are 
pretty damn strict, because not being stricter than the docs imply just 
ends up exposing too many potential problems in software that didn't 
really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee 
that the result is 100% equivalent with a totally in-order machine. One 
way you do that is to keep track of the cacheline for any speculative 
loads, and if it gets invalidated before the speculative instruction has 
completed, you just throw the speculation away.

End result: you can do any amount of speculation you damn well please at a 
micro-architectural level, but if the speculation would ever have been 
architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity of hw engineer 
wanna-be: I'm not saying that that is necessarily what Intel or AMD 
actually ever do, and they may have other approaches entirely).

> IOW shouldn't an mfence always be there? Not only loads could leak up 
> into the wait phase, but stores too, if they have no dependency with the 
> "head" and "tail" loads.

Stores never "leak up". They only ever leak down (ie past subsequent loads 
or stores), so you don't need to worry about them. That's actually already 
documented (although not in those terms), and if it wasn't true, then we 
couldn't do the spin unlock with just a regular store anyway.

(There's basically never any reason to "speculate" stores before other mem 
ops. It's hard, and pointless. Stores you want to just buffer and move as 
_late_ as possible, loads you want to speculate and move as _early_ as 
possible. Anything else doesn't make sense).

So I'm fairly sure that the only thing you really need to worry about in 
this thing is the load-load ordering (the load for the spinlock compare vs 
any loads "inside" the spinlock), and I'm reasonably certain that no 
existing x86 (and likely no future x86) will make load-load reordering 
effects architecturally visible, even if the implementation may do so 
*internally* when it's not possible to see it in the end result.

			Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27 22:11                                                             ` Linus Torvalds
@ 2007-06-27 23:30                                                               ` Davide Libenzi
  2007-06-28  0:46                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 88+ messages in thread
From: Davide Libenzi @ 2007-06-27 23:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Eric Dumazet, Chuck Ebbert, Ingo Molnar,
	Jarek Poplawski, Miklos Szeredi, chris,
	Linux Kernel Mailing List, tglx, Andrew Morton

On Wed, 27 Jun 2007, Linus Torvalds wrote:

> > IOW shouldn't an mfence always be there? Not only loads could leak up 
> > into the wait phase, but stores too, if they have no dependency with the 
> > "head" and "tail" loads.
> 
> Stores never "leak up". They only ever leak down (ie past subsequent loads 
> or stores), so you don't need to worry about them. That's actually already 
> documented (although not in those terms), and if it wasn't true, then we 
> couldn't do the spin unlock with just a regular store anyway.

Yes, Intel has never done that. They'll probably never do it since it'll 
break a lot of system software (unless they use a new mode-bit that 
allows system software to enable lose-ordering). Although I clearly 
remember to have read in one of their P4 optimization manuals to not 
assume this in the future.



- Davide



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27 23:30                                                               ` Davide Libenzi
@ 2007-06-28  0:46                                                                 ` Linus Torvalds
  2007-06-28  3:03                                                                   ` Davide Libenzi
  0 siblings, 1 reply; 88+ messages in thread
From: Linus Torvalds @ 2007-06-28  0:46 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Nick Piggin, Eric Dumazet, Chuck Ebbert, Ingo Molnar,
	Jarek Poplawski, Miklos Szeredi, chris,
	Linux Kernel Mailing List, tglx, Andrew Morton



On Wed, 27 Jun 2007, Davide Libenzi wrote:

> On Wed, 27 Jun 2007, Linus Torvalds wrote:
> > 
> > Stores never "leak up". They only ever leak down (ie past subsequent loads 
> > or stores), so you don't need to worry about them. That's actually already 
> > documented (although not in those terms), and if it wasn't true, then we 
> > couldn't do the spin unlock with just a regular store anyway.
> 
> Yes, Intel has never done that. They'll probably never do it since it'll 
> break a lot of system software (unless they use a new mode-bit that 
> allows system software to enable lose-ordering). Although I clearly 
> remember to have read in one of their P4 optimization manuals to not 
> assume this in the future.

That optimization manual was confused. 

The Intel memory ordering documentation *clearly* states that only reads 
pass writes, not the other way around.

Some very confused people have thought that "pass" is a two-way thing. 
It's not. "Passing" in the Intel memory ordering means "go _ahead_ of", 
exactly the same way it means in traffic. You don't "pass" people by 
falling behind them.

It's also obvious from reading the manual, because any other reading would 
be very strange: it says

 1. Reads can be carried out speculatively and in any order

 2. Reads can pass buffered writes, but the processor is self-consistent

 3. Writes to memory are always carried out in program order [.. and then 
    lists exceptions that are not interesting - it's clflush and the 
    non-temporal stores, not any normal writes ]

 4. Writes can be buffered

 5. Writes are not performed speculatively; they are only performed for 
    instructions that have actually been retired.

 6. Data from buffered writes can be forwarded to waiting reads within the 
    processor.

 7. Reads or writes cannot pass (be carried out ahead of) I/O 
    instructions, locked instructions or serializing instructions.

 8. Reads cannot pass LFENCE and MFENCE instructions.

 9. Writes cannot pass SFENCE or MFENCE instructions.

The thing to note is:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it 
    clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass" 
    means "be carried out ahead of" in (7). 

    HOWEVER, it should be obvious in (2) even _without_ the explicit 
    clarification in (7) that "pass" is a one-way thing, because otherwise 
    (2) is totally _meaningless_. It would be meaningless for two reasons:

     - (1) already said that reads can be done in any order, so if that 
       was a "any order wrt writes", then (2) would be pointless. So (2) 
       must mean something *else* than "any order", and the only sane 
       reading of it that isn't "any order" is that "pass" is a one-way 
       thing: you pass somebody when you go ahead of them, you do *not* 
       pass somebody when you fall behind them!

     - if (2) really meant that reads and writes can just be re-ordered, 
       then the choice of words makes no sense. It would be much more 
       sensible to say that "reads can be carried out in any order wrt 
       writes", instead of talking explicitly about "passing buffered 
       writes"

Anyway, I'm pretty damn sure my reading is correct. And no, it's not a "it 
happens to work". It's _architecturally_required_ to work, and nobody has 
ever complained about the use of a simple store to unlock a spinlock 
(which would only work if the "reads can pass" only means "*later* reads 
can pass *earlier* writes").

And it turns out that I think #1 is going away. Yes, the uarch will 
internally re-order reads, wrt each other, but if it isn't architecturally 
visible, then from an architectural standpoint #1 simply doesn't happen.

I can't guarantee that will happen, of course, but from talking to both 
AMD and Intel people, I think that they'll just document the stricter 
rules as the de-facto rules.

		Linus

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-28  0:46                                                                 ` Linus Torvalds
@ 2007-06-28  3:03                                                                   ` Davide Libenzi
  0 siblings, 0 replies; 88+ messages in thread
From: Davide Libenzi @ 2007-06-28  3:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Eric Dumazet, Chuck Ebbert, Ingo Molnar,
	Jarek Poplawski, Miklos Szeredi, chris,
	Linux Kernel Mailing List, tglx, Andrew Morton

On Wed, 27 Jun 2007, Linus Torvalds wrote:

> On Wed, 27 Jun 2007, Davide Libenzi wrote:
> 
> > On Wed, 27 Jun 2007, Linus Torvalds wrote:
> > > 
> > > Stores never "leak up". They only ever leak down (ie past subsequent loads 
> > > or stores), so you don't need to worry about them. That's actually already 
> > > documented (although not in those terms), and if it wasn't true, then we 
> > > couldn't do the spin unlock with just a regular store anyway.
> > 
> > Yes, Intel has never done that. They'll probably never do it since it'll 
> > break a lot of system software (unless they use a new mode-bit that 
> > allows system software to enable lose-ordering). Although I clearly 
> > remember to have read in one of their P4 optimization manuals to not 
> > assume this in the future.
> 
> That optimization manual was confused. 
> 
> The Intel memory ordering documentation *clearly* states that only reads 
> pass writes, not the other way around.

Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I 
remember them saying to not assume any ordering besides the data 
dependency and the CPU self-consistency in the future CPUs, and to use 
*fence instructions when certain semantics were required.
But google did not help me in finding that doc, so maybe I were really on 
crack :)



- Davide



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [BUG] long freezes on thinkpad t60
  2007-06-27 19:47                                                         ` Linus Torvalds
  2007-06-27 20:10                                                           ` Ingo Molnar
  2007-06-27 20:17                                                           ` Davide Libenzi
@ 2007-07-02  7:06                                                           ` Nick Piggin
  2 siblings, 0 replies; 88+ messages in thread
From: Nick Piggin @ 2007-07-02  7:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Chuck Ebbert, Ingo Molnar, Jarek Poplawski,
	Miklos Szeredi, chris, linux-kernel, tglx, akpm

Linus Torvalds wrote:
> Nick,
>  call me a worry-wart, but I slept on this, and started worrying..
> 
> On Tue, 26 Jun 2007, Linus Torvalds wrote:
> 
>>So try it with just a byte counter, and test some stupid micro-benchmark 
>>on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
>>it the normal spinlock sequence just because it isn't noticeably slower.
> 
> 
> So I thought about this a bit more, and I like your sequence counter 
> approach, but it still worried me.
> 
> In the current spinlock code, we have a very simple setup for a 
> successful grab of the spinlock:
> 
> 	CPU#0					CPU#1
...

Yeah, thanks.

> Now, I have good reason to believe that all Intel and AMD CPU's have a 
> stricter-than-documented memory ordering, and that your spinlock may 
> actually work perfectly well. But it still worries me. As far as I can 
> tell, there's a theoretical problem with your spinlock implementation.
> 
> So I'd like you to ask around some CPU people, and get people from both 
> Intel and AMD to sign off on your spinlocks as safe. I suspect you already 
> have the required contacts, but if you don't, I can send things off to the 
> appropriate people at least inside Intel.

Haven't made too much progress on this, but I have asked someone@amd who
might be able to at least know the right person to ask :P (might be faster
to ask Andi to ask :))

If you know someone at Intel then that would be appreciated.

It would be nice if it is safe (and can be guaranteed to be safe in future).

However OTOH, the fastpath may be even faster if we do it in a "definitely
safe" way.

That is, do the xaddw against 16-bits with the head in 8 of those and the
tail in the other 8. Then compare the byte registers of the register
returned by xaddw for the test. Although the xaddw is going to be slower
than an xaddb, this way we subsequently avoid the extra load completely,
while avoiding ordering issues.

In the slowpath we would have to have a token locked op in there (like
the current spinlocks do), but this could be taken out iff our inquiries
come back positive.

Anyway, I'll try redoing the patch and getting some numbers.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2007-07-02  7:06 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-24 12:04 [BUG] long freezes on thinkpad t60 Miklos Szeredi
2007-05-24 12:54 ` Ingo Molnar
2007-05-24 14:03   ` Miklos Szeredi
2007-05-24 14:10     ` Ingo Molnar
2007-05-24 14:28       ` Miklos Szeredi
2007-05-24 14:42         ` Ingo Molnar
2007-05-24 14:44         ` Ingo Molnar
2007-05-24 17:09           ` Miklos Szeredi
2007-05-24 21:01             ` Ingo Molnar
2007-05-25  9:51               ` Miklos Szeredi
2007-06-14 16:04                 ` Miklos Szeredi
2007-06-15 21:25                   ` Chuck Ebbert
2007-06-16 10:37                   ` Ingo Molnar
2007-06-17 21:46                     ` Miklos Szeredi
2007-06-18  6:43                       ` Ingo Molnar
2007-06-18  7:24                         ` Miklos Szeredi
2007-06-18  8:12                           ` Ingo Molnar
2007-06-18  8:20                             ` Andrew Morton
2007-06-19  4:22                               ` Ravikiran G Thirumalai
2007-06-18  8:25                             ` Miklos Szeredi
2007-06-18  8:31                               ` Ingo Molnar
2007-06-18  8:34                                 ` Miklos Szeredi
2007-06-18  9:18                                   ` Ingo Molnar
2007-06-18  9:38                                     ` Ingo Molnar
2007-06-18  9:44                                       ` Ingo Molnar
2007-06-18 10:18                                         ` Miklos Szeredi
2007-06-18 12:36                                           ` Ingo Molnar
2007-06-18 13:10                                             ` Miklos Szeredi
2007-06-18 16:34                             ` Linus Torvalds
2007-06-18 17:41                               ` Miklos Szeredi
2007-06-18 17:48                                 ` Linus Torvalds
2007-06-18 18:02                                   ` Ingo Molnar
2007-06-18 18:00                               ` Ingo Molnar
2007-06-18 18:25                                 ` Linus Torvalds
2007-06-20  9:36                               ` Jarek Poplawski
2007-06-20 17:34                                 ` Linus Torvalds
2007-06-21  7:30                                   ` Ingo Molnar
2007-06-21 15:50                                     ` Linus Torvalds
2007-06-21 16:08                                       ` Ingo Molnar
2007-06-21 16:32                                         ` Linus Torvalds
2007-06-21 16:44                                         ` Chuck Ebbert
2007-06-21 17:31                                           ` Linus Torvalds
2007-06-21 18:29                                             ` Eric Dumazet
2007-06-21 18:44                                               ` Linus Torvalds
2007-06-21 19:35                                                 ` Linus Torvalds
2007-06-21 20:09                                                   ` Ingo Molnar
2007-06-21 20:14                                                     ` Linus Torvalds
2007-06-21 20:30                                                       ` Ingo Molnar
2007-06-21 20:48                                                         ` Linus Torvalds
2007-06-21 21:06                                                           ` Ingo Molnar
2007-06-21 20:42                                                       ` [patch] spinlock debug: make looping nicer Ingo Molnar
2007-06-21 20:58                                                         ` Linus Torvalds
2007-06-21 21:15                                                           ` Ingo Molnar
2007-06-22  7:00                                                             ` Jarek Poplawski
2007-06-21 20:36                                                   ` [BUG] long freezes on thinkpad t60 Eric Dumazet
2007-06-21 19:56                                                 ` Ingo Molnar
2007-06-21 20:10                                                   ` Linus Torvalds
2007-06-21 20:23                                                     ` Ingo Molnar
2007-06-21 20:12                                                 ` Ingo Molnar
2007-06-26  8:42                                                 ` Nick Piggin
2007-06-26 10:56                                                   ` Jarek Poplawski
2007-06-26 17:23                                                   ` Linus Torvalds
2007-06-27  5:23                                                     ` Nick Piggin
2007-06-27  6:04                                                       ` Linus Torvalds
2007-06-27  6:20                                                         ` Nick Piggin
2007-06-27 19:47                                                         ` Linus Torvalds
2007-06-27 20:10                                                           ` Ingo Molnar
2007-06-27 20:17                                                           ` Davide Libenzi
2007-06-27 22:11                                                             ` Linus Torvalds
2007-06-27 23:30                                                               ` Davide Libenzi
2007-06-28  0:46                                                                 ` Linus Torvalds
2007-06-28  3:03                                                                   ` Davide Libenzi
2007-07-02  7:06                                                           ` Nick Piggin
2007-06-21 20:16                                             ` Ingo Molnar
2007-06-22  8:17                                               ` Ingo Molnar
2007-06-23 10:36                                                 ` Miklos Szeredi
2007-06-23 16:39                                                   ` Linus Torvalds
2007-06-25  6:45                                                   ` Jarek Poplawski
2007-06-21 20:18                                             ` Ingo Molnar
2007-06-21 20:36                                               ` Linus Torvalds
2007-06-21  7:38                                   ` Jarek Poplawski
2007-06-21  8:39                                     ` Ingo Molnar
2007-06-21 11:09                                       ` Jarek Poplawski
2007-06-21 16:01                                     ` Linus Torvalds
2007-06-22 10:38                                       ` Jarek Poplawski
2007-05-24 22:08 ` Henrique de Moraes Holschuh
2007-05-24 22:13   ` Kok, Auke
2007-05-25  6:58     ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).