LKML Archive on lore.kernel.org
* Slow file transfer speeds with CFQ IO scheduler in some cases
@ 2008-11-09 18:04 Vitaly V. Bursov
From: Vitaly V. Bursov @ 2008-11-09 18:04 UTC
  To: linux-kernel

Hello,

I'm building a small server system with an OpenVZ kernel and have run into
some IO performance problems. Reading a single file via NFS delivers
around 9 MB/s over a gigabit network, but when reading, say, 2 different
files (or the same file twice) at the same time, I get >60 MB/s.
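
Roughly, the test looks like this (the NFS mount point and file names here
are just placeholders, not the exact paths used):

$ dd if=/mnt/nfs/testfile of=/dev/null bs=1M

$ dd if=/mnt/nfs/file1 of=/dev/null bs=1M &
$ dd if=/mnt/nfs/file2 of=/dev/null bs=1M &
$ wait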

Changing the IO scheduler to deadline or anticipatory fixes the problem.
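
The scheduler is switched at runtime through sysfs, roughly like this:

# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler
# echo deadline > /sys/block/sdb/queue/scheduler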

Tested kernels:
  OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20 MB/s with HZ=1000,
                 fast local reads)
  Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)

Vanilla performs better in the worst case, but I believe 40 MB/s is still low
considering the test results below.


Server-side Hardware:
Athlon X2, PCI-E motherboard Gigabyte GA-MA74GM-S2H
Onboard PCI-E RTL8111C
Onboard Southbridge SATA controller
Two SATA drives in degraded RAID5, AHCI+NCQ.

Client:
Athlon X2, PCI motherboard, Asus A8V Deluxe
Onboard PCI Marvell 88E8001
Onboard Southbridge SATA controller
Kernel: 2.6.27-gentoo-r2  (2.6.27.4 + some Gentoo patches)

The RTL8111C driver does not support TX checksum offloading - this
may be the source of the relatively high system CPU load below.
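
Offload settings can be double-checked on the server with, e.g.:

# ethtool -k eth0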

No readahead or other tuning - everything is at defaults.
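
If it helps, the relevant defaults can be inspected with something like
(values not recorded here):

# blockdev --getra /dev/sda
# cat /sys/block/sda/queue/read_ahead_kb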


I realize this is not a typical setup (degraded RAID5, for example) and I cannot
test the non-degraded 3-disk variant right now, but this setup may be exposing some
cfq+md weakness...
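
The degraded state is visible in /proc/mdstat; the dmesg below also shows
md1 and md3 active with 2 out of 3 devices:

$ cat /proc/mdstat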

No other tasks are running on the server.

The tests below were performed on the vanilla 2.6.27.5 kernel.

Client NFS mount options do not make a significant difference,
nor does switching between NFSv2 and NFSv3.
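
The exports are mounted with plain defaults, something like this (server name
and export path are placeholders):

# mount -t nfs server:/export /mnt/nfs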

All iostat results are 10-second averages taken while the test case was running.
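
iostat was run with extended statistics at a 10-second interval, along the
lines of:

$ iostat -x 10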

=================================================
iostat output. cfq io scheduler, "dd of=/dev/null bs=1M" over nfs
file was not in cache on server

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    4,25   52,10    0,00   43,65

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            2819,60     0,00  450,60    0,00 47413,60     0,00   105,22     7,42   16,40   2,00  90,20
sdb            2818,30     0,00  453,20    0,00 47525,60     0,00   104,87     3,79    8,35   1,42  64,20

=================================================
iostat output. deadline io scheduler, "dd of=/dev/null bs=1M" over nfs:
file was not in cache on server

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,05    0,00   27,40   22,35    0,00   50,20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            6971,70     0,00 1357,50    0,00 120888,00     0,00    89,05     5,73    4,22   0,35  47,50
sdb            6997,10     0,00 1328,00    0,00 120894,40     0,00    91,03     5,41    4,07   0,35  46,70

Anticipatory IO scheduler provides similar performance.

=================================================
cfq io scheduler, "dd of=/dev/null bs=1M" over nfs
file was cached on server

As there was no HDD IO activity, here is the client's dd output:
$ LANG=C LC_ALL=C dd if=testfile of=/dev/null bs=1M
2721+1 records in
2721+1 records out
2853588692 bytes (2.9 GB) copied, 29.4541 s, 96.9 MB/s

=================================================
cfq io scheduler, reading two different files with "dd of=/dev/null bs=1M" over nfs
files were not in cache on server

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   25,10   40,20    0,00   34,70

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            7595,90     0,00 1342,30    0,00 129604,00     0,00    96,55     5,27    3,92   0,59  78,80
sdb            7599,80     0,00 1337,80    0,00 129571,20     0,00    96,85     4,99    3,73   0,58  78,10

Copying of the first file finished at a 59.6 MB/s average, and then:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    3,65   48,60    0,00   47,75

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            2816,40     0,00  461,70    0,00 47561,60     0,00   103,01     3,48    7,54   1,27  58,80
sdb            2819,20     0,00  458,80    0,00 47684,80     0,00   103,93     7,81   17,13   2,06  94,60

The second file finished at 43.1 MB/s.




Tests on OpenVZ RHEL5 028stab059.3
=================================================
iostat output when performance is low with cfq io scheduler, "dd of=/dev/null bs=1M" over nfs:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,10    0,00    0,90   49,60    0,00   49,40

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda             919,56     0,00  195,01    0,00  8905,39     0,00    45,67     3,25   16,64   5,10  99,40
sdb             947,90     0,00  169,86    0,00  8942,12     0,00    52,64     0,85    4,99   2,07  35,13

Other cases led to pretty much the same results, except for reading files locally on the server.

With the OpenVZ kernel, local files are read pretty fast:
*deadline:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   32,72   30,02    0,00   37,27

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda           19026,50     0,00  759,10    0,00 158289,60     0,00   208,52     0,63    0,84   0,79  60,03
sdb           19034,30     0,00  751,10    0,00 158262,40     0,00   210,71     0,64    0,85   0,80  60,13

*cfq:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,05    0,00   43,61   25,62    0,00   30,72

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda           19553,10     0,00 3089,50    0,00 181121,60     0,00    58,62     1,53    0,49   0,25  76,46
sdb           19475,20     0,00 3168,90    0,00 181152,00     0,00    57,17     1,63    0,52   0,24  76,90



with vanilla:


Reading files locally with cfq:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    0,10   49,90    0,00   50,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            1975,50     0,00  385,10    0,00 34176,00     0,00    88,75     1,23    3,18   2,58  99,30
sdb            1982,20     0,00  377,20    0,00 34181,60     0,00    90,62     1,23    3,26   2,64  99,60


... and with deadline:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,40    0,00   33,25   18,10    0,00   48,25

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda           15077,70     0,00 2952,70    0,00 260654,40     0,00    88,28     2,25    0,76   0,31  93,00
sdb           15011,50     0,00 2962,60    0,00 260648,00     0,00    87,98     2,11    0,71   0,31  93,30


Process states when things go slowly:

# ps -eo pid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm | grep nfs
13177 TS       -  -5  29   0  0.0 S<   worker_thread  nfsd4
13179 TS       -   0  24   0  0.7 S    -              nfsd
13180 TS       -   0  23   1  0.8 D    sync_page      nfsd
13181 TS       -   0  24   1  0.8 D    sync_page      nfsd
13182 TS       -   0  24   1  0.8 D    sync_page      nfsd
13183 TS       -   0  24   1  0.8 D    sync_page      nfsd
13184 TS       -   0  24   1  0.8 D    sync_page      nfsd
13185 TS       -   0  21   1  0.7 D    sync_page      nfsd
13186 TS       -   0  23   0  0.8 D    sync_page      nfsd

and for dd:
5586 TS       -   0  19   1  4.2 D+   sync_page      dd

CPU load by top:

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni,  0.0%id, 99.3%wa,  0.3%hi,  0.0%si,  0.0%st



dmesg and kernel config follow
=================================================
[    0.000000] Linux version 2.6.27.5 (root@vortex) (gcc version 4.2.4 (Gentoo 4.2.4 p1.0)) #1 SMP Sun Nov 9 17:23:39 EET 2008
[    0.000000] Command line: BOOT_IMAGE=(md0)/boot/vmlinuz-2.6.27.5 iommu=memaper=3 root=/dev/md0 ro
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Centaur CentaurHauls
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
[    0.000000]  BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000dbde0000 (usable)
[    0.000000]  BIOS-e820: 00000000dbde0000 - 00000000dbde3000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000dbde3000 - 00000000dbdf0000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000dbdf0000 - 00000000dbe00000 (reserved)
[    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000120000000 (usable)
[    0.000000] last_pfn = 0x120000 max_arch_pfn = 0x3ffffffff
[    0.000000] last_pfn = 0xdbde0 max_arch_pfn = 0x3ffffffff
[    0.000000] init_memory_mapping
[    0.000000]  0000000000 - 00dbc00000 page 2M
[    0.000000]  00dbc00000 - 00dbde0000 page 4k
[    0.000000] kernel direct mapping tables up to dbde0000 @ 8000-e000
[    0.000000] last_map_addr: dbde0000 end: dbde0000
[    0.000000] init_memory_mapping
[    0.000000]  0100000000 - 0120000000 page 2M
[    0.000000] kernel direct mapping tables up to 120000000 @ c000-12000
[    0.000000] last_map_addr: 120000000 end: 120000000
[    0.000000] DMI 2.4 present.
[    0.000000] ACPI: RSDP 000F7200, 0014 (r0 GBT   )
[    0.000000] ACPI: RSDT DBDE3000, 0038 (r1 GBT    GBTUACPI 42302E31 GBTU  1010101)
[    0.000000] ACPI: FACP DBDE3040, 0074 (r1 GBT    GBTUACPI 42302E31 GBTU  1010101)
[    0.000000] ACPI: DSDT DBDE30C0, 41A8 (r1 GBT    GBTUACPI     1000 MSFT  100000C)
[    0.000000] ACPI: FACS DBDE0000, 0040
[    0.000000] ACPI: SSDT DBDE7340, 01C4 (r1 PTLTD  POWERNOW        1  LTP        1)
[    0.000000] ACPI: HPET DBDE7540, 0038 (r1 GBT    GBTUACPI 42302E31 GBTU       98)
[    0.000000] ACPI: MCFG DBDE7580, 003C (r1 GBT    GBTUACPI 42302E31 GBTU  1010101)
[    0.000000] ACPI: APIC DBDE7280, 0084 (r1 GBT    GBTUACPI 42302E31 GBTU  1010101)
[    0.000000] (6 early reservations) ==> bootmem [0000000000 - 0120000000]
[    0.000000]   #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
[    0.000000]   #1 [0000006000 - 0000008000]       TRAMPOLINE ==> [0000006000 - 0000008000]
[    0.000000]   #2 [0001000000 - 0001454014]    TEXT DATA BSS ==> [0001000000 - 0001454014]
[    0.000000]   #3 [000009f800 - 0000100000]    BIOS reserved ==> [000009f800 - 0000100000]
[    0.000000]   #4 [0000008000 - 000000c000]          PGTABLE ==> [0000008000 - 000000c000]
[    0.000000]   #5 [000000c000 - 000000d000]          PGTABLE ==> [000000c000 - 000000d000]
[    0.000000] found SMP MP-table at [ffff8800000f5840] 000f5840
[    0.000000]  [ffffe20000000000-ffffe20003ffffff] PMD -> [ffff880028200000-ffff88002c1fffff] on node 0
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA      0x00000000 -> 0x00001000
[    0.000000]   DMA32    0x00001000 -> 0x00100000
[    0.000000]   Normal   0x00100000 -> 0x00120000
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[3] active PFN ranges
[    0.000000]     0: 0x00000000 -> 0x0000009f
[    0.000000]     0: 0x00000100 -> 0x000dbde0
[    0.000000]     0: 0x00100000 -> 0x00120000
[    0.000000] On node 0 totalpages: 1031551
[    0.000000]   DMA zone: 3837 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 882200 pages, LIFO batch:31
[    0.000000]   Normal zone: 129280 pages, LIFO batch:31
[    0.000000] Detected use of extended apic ids on hypertransport bus
[    0.000000] ACPI: PM-Timer IO Port: 0x4008
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 2, version 0, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Setting APIC routing to flat
[    0.000000] ACPI: HPET id: 0x10b9a201 base: 0xfed00000
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] Allocating PCI resources starting at f1000000 (gap: f0000000:ec00000)
[    0.000000] PERCPU: Allocating 41888 bytes of per cpu data
[    0.000000] NR_CPUS: 2, nr_cpu_ids: 2, nr_node_ids 1
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 1015317
[    0.000000] Kernel command line: BOOT_IMAGE=(md0)/boot/vmlinuz-2.6.27.5 iommu=memaper=3 root=/dev/md0 ro
[    0.000000] Initializing CPU#0
[    0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[    0.000000] TSC: PIT calibration confirmed by PMTIMER.
[    0.000000] TSC: using PIT calibration value
[    0.000000] Detected 1908.813 MHz processor.
[    0.010000] Console: colour VGA+ 80x25
[    0.010000] console [tty0] enabled
[    0.010000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.010000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.010000] Checking aperture...
[    0.010000] Node 0: aperture @ 10000000 size 32 MB
[    0.010000] Aperture pointing to e820 RAM. Ignoring.
[    0.010000] Your BIOS doesn't leave a aperture memory hole
[    0.010000] Please enable the IOMMU option in the BIOS setup
[    0.010000] This costs you 256 MB of RAM
[    0.010000] Mapping aperture over 262144 KB of RAM @ 30000000
[    0.010000] Memory: 3786696k/4718592k available (2399k kernel code, 338632k reserved, 1267k data, 268k init)
[    0.010000] CPA: page pool initialized 1 of 1 pages preallocated
[    0.010000] hpet clockevent registered
[    0.010008] Calibrating delay loop (skipped), value calculated using timer frequency.. 3817.62 BogoMIPS (lpj=19088130)
[    0.010210] Mount-cache hash table entries: 256
[    0.010434] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[    0.010510] CPU: L2 Cache: 512K (64 bytes/line)
[    0.010583] tseg: 00dbe00000
[    0.010585] CPU: Physical Processor ID: 0
[    0.010658] CPU: Processor Core ID: 0
[    0.010739] using C1E aware idle routine
[    0.010830] ACPI: Core revision 20080609
[    0.020552] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.126459] CPU0: AMD Athlon(tm) X2 Dual Core Processor BE-2300 stepping 02
[    0.126599] Using local APIC timer interrupts.
[    0.130001] APIC timer calibration result 12557978
[    0.130003] Detected 12.557 MHz APIC timer.
[    0.130229] Booting processor 1/1 ip 6000
[    0.010000] Initializing CPU#1
[    0.010000] Calibrating delay using timer specific routine.. 3817.70 BogoMIPS (lpj=19088540)
[    0.010000] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[    0.010000] CPU: L2 Cache: 512K (64 bytes/line)
[    0.010000] CPU: Physical Processor ID: 0
[    0.010000] CPU: Processor Core ID: 1
[    0.290187] CPU1: AMD Athlon(tm) X2 Dual Core Processor BE-2300 stepping 02
[    0.290810] Brought up 2 CPUs
[    0.290884] Total of 2 processors activated (7635.33 BogoMIPS).
[    0.291037] CPU0 attaching sched-domain:
[    0.291040]  domain 0: span 0-1 level CPU
[    0.291042]   groups: 0 1
[    0.291047] CPU1 attaching sched-domain:
[    0.291049]  domain 0: span 0-1 level CPU
[    0.291051]   groups: 1 0
[    0.291307] net_namespace: 1456 bytes
[    0.291408] xor: automatically using best checksumming function: generic_sse
[    0.340002]    generic_sse:  5886.000 MB/sec
[    0.340075] xor: using function: generic_sse (5886.000 MB/sec)
[    0.340188] NET: Registered protocol family 16
[    0.340482] node 0 link 0: io port [c000, ffff]
[    0.340485] TOM: 00000000e0000000 aka 3584M
[    0.340560] node 0 link 0: mmio [a0000, bffff]
[    0.340563] node 0 link 0: mmio [fc000000, fe02ffff]
[    0.340566] node 0 link 0: mmio [e0000000, f7ffffff]
[    0.340569] node 0 link 0: mmio [f8000000, fbffffff]
[    0.340572] node 0 link 0: mmio [e0000000, efffffff]
[    0.340575] TOM2: 0000000120000000 aka 4608M
[    0.340649] bus: [00,03] on node 0 link 0
[    0.340651] bus: 00 index 0 io port: [0, ffff]
[    0.340653] bus: 00 index 1 mmio: [a0000, bffff]
[    0.340656] bus: 00 index 2 mmio: [f8000000, ffffffff]
[    0.340659] bus: 00 index 3 mmio: [e0000000, f7ffffff]
[    0.340661] bus: 00 index 4 mmio: [120000000, fcffffffff]
[    0.340669] ACPI: bus type pci registered
[    0.340839] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
[    0.340915] PCI: MCFG area at e0000000 reserved in E820
[    0.347486] PCI: Using MMCONFIG at e0000000 - efffffff
[    0.347562] PCI: Using configuration type 1 for base access
[    0.349147] ACPI: EC: Look up EC in DSDT
[    0.355565] ACPI: Interpreter enabled
[    0.355645] ACPI: (supports S0 S1 S5)
[    0.355828] ACPI: Using IOAPIC for interrupt routing
[    0.361023] ACPI: PCI Root Bridge [PCI0] (0000:00)
[    0.361209] pci 0000:00:06.0: PME# supported from D0 D3hot D3cold
[    0.361285] pci 0000:00:06.0: PME# disabled
[    0.361413] PCI: 0000:00:11.0 reg 10 io port: [ff00, ff07]
[    0.361420] PCI: 0000:00:11.0 reg 14 io port: [fe00, fe03]
[    0.361427] PCI: 0000:00:11.0 reg 18 io port: [fd00, fd07]
[    0.361433] PCI: 0000:00:11.0 reg 1c io port: [fc00, fc03]
[    0.361440] PCI: 0000:00:11.0 reg 20 io port: [fb00, fb0f]
[    0.361447] PCI: 0000:00:11.0 reg 24 32bit mmio: [fe02f000, fe02f3ff]
[    0.361488] PCI: 0000:00:12.0 reg 10 32bit mmio: [fe02e000, fe02efff]
[    0.361540] PCI: 0000:00:12.1 reg 10 32bit mmio: [fe02d000, fe02dfff]
[    0.361610] PCI: 0000:00:12.2 reg 10 32bit mmio: [fe02c000, fe02c0ff]
[    0.361655] pci 0000:00:12.2: supports D1
[    0.361657] pci 0000:00:12.2: supports D2
[    0.361659] pci 0000:00:12.2: PME# supported from D0 D1 D2 D3hot
[    0.361736] pci 0000:00:12.2: PME# disabled
[    0.361832] PCI: 0000:00:13.0 reg 10 32bit mmio: [fe02b000, fe02bfff]
[    0.361884] PCI: 0000:00:13.1 reg 10 32bit mmio: [fe02a000, fe02afff]
[    0.361954] PCI: 0000:00:13.2 reg 10 32bit mmio: [fe029000, fe0290ff]
[    0.361999] pci 0000:00:13.2: supports D1
[    0.362001] pci 0000:00:13.2: supports D2
[    0.362003] pci 0000:00:13.2: PME# supported from D0 D1 D2 D3hot
[    0.362080] pci 0000:00:13.2: PME# disabled
[    0.362250] PCI: 0000:00:14.1 reg 10 io port: [0, 7]
[    0.362256] PCI: 0000:00:14.1 reg 14 io port: [0, 3]
[    0.362263] PCI: 0000:00:14.1 reg 18 io port: [0, 7]
[    0.362270] PCI: 0000:00:14.1 reg 1c io port: [0, 3]
[    0.362276] PCI: 0000:00:14.1 reg 20 io port: [fa00, fa0f]
[    0.362394] PCI: 0000:00:14.5 reg 10 32bit mmio: [fe028000, fe028fff]
[    0.362513] PCI: 0000:01:05.0 reg 10 64bit mmio: [f8000000, fbffffff]
[    0.362519] PCI: 0000:01:05.0 reg 18 64bit mmio: [fdef0000, fdefffff]
[    0.362523] PCI: 0000:01:05.0 reg 20 io port: [ee00, eeff]
[    0.362527] PCI: 0000:01:05.0 reg 24 32bit mmio: [fdd00000, fddfffff]
[    0.362535] pci 0000:01:05.0: supports D1
[    0.362537] pci 0000:01:05.0: supports D2
[    0.362549] PCI: bridge 0000:00:01.0 io port: [e000, efff]
[    0.362552] PCI: bridge 0000:00:01.0 32bit mmio: [fdd00000, fdefffff]
[    0.362556] PCI: bridge 0000:00:01.0 64bit mmio pref: [f8000000, fbffffff]
[    0.362590] PCI: 0000:02:00.0 reg 10 io port: [de00, deff]
[    0.362605] PCI: 0000:02:00.0 reg 18 64bit mmio: [fdfff000, fdffffff]
[    0.362616] PCI: 0000:02:00.0 reg 20 64bit mmio: [fdfe0000, fdfeffff]
[    0.362622] PCI: 0000:02:00.0 reg 30 32bit mmio: [0, ffff]
[    0.362643] pci 0000:02:00.0: supports D1
[    0.362645] pci 0000:02:00.0: supports D2
[    0.362647] pci 0000:02:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.362723] pci 0000:02:00.0: PME# disabled
[    0.362835] PCI: bridge 0000:00:06.0 io port: [d000, dfff]
[    0.362837] PCI: bridge 0000:00:06.0 32bit mmio: [fda00000, fdafffff]
[    0.362841] PCI: bridge 0000:00:06.0 64bit mmio pref: [fdf00000, fdffffff]
[    0.362898] pci 0000:00:14.4: transparent bridge
[    0.362973] PCI: bridge 0000:00:14.4 io port: [c000, cfff]
[    0.362978] PCI: bridge 0000:00:14.4 32bit mmio: [fdc00000, fdcfffff]
[    0.362982] PCI: bridge 0000:00:14.4 32bit mmio pref: [fdb00000, fdbfffff]
[    0.362994] bus 00 -> node 0
[    0.363001] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[    0.363388] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P2P_._PRT]
[    0.363531] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCE6._PRT]
[    0.363640] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
[    0.380108] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.380697] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.381281] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.381864] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.382448] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.383031] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.383615] ACPI: PCI Interrupt Link [LNK0] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.384198] ACPI: PCI Interrupt Link [LNK1] (IRQs 3 4 5 6 7 10 11) *0, disabled.
[    0.384872] Linux Plug and Play Support v0.97 (c) Adam Belay
[    0.384988] pnp: PnP ACPI init
[    0.385069] ACPI: bus type pnp registered
[    0.388621] pnp: PnP ACPI: found 13 devices
[    0.388696] ACPI: ACPI bus type pnp unregistered
[    0.389070] SCSI subsystem initialized
[    0.389204] libata version 3.00 loaded.
[    0.389349] usbcore: registered new interface driver usbfs
[    0.389502] usbcore: registered new interface driver hub
[    0.389637] usbcore: registered new device driver usb
[    0.389965] PCI: Using ACPI for IRQ routing
[    0.450159] PCI-DMA: Disabling AGP.
[    0.453299] PCI-DMA: aperture base @ 30000000 size 262144 KB
[    0.453374] PCI-DMA: using GART IOMMU.
[    0.453454] PCI-DMA: Reserving 256MB of IOMMU area in the AGP aperture
[    0.455179] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0
[    0.455423] hpet0: 4 32-bit timers, 14318180 Hz
[    0.460016] Switched to high resolution mode on CPU 0
[    0.460130] Switched to high resolution mode on CPU 1
[    0.490040] system 00:01: ioport range 0x4100-0x411f has been reserved
[    0.490116] system 00:01: ioport range 0x228-0x22f has been reserved
[    0.490192] system 00:01: ioport range 0x40b-0x40b has been reserved
[    0.490268] system 00:01: ioport range 0x4d6-0x4d6 has been reserved
[    0.490343] system 00:01: ioport range 0xc00-0xc01 has been reserved
[    0.490418] system 00:01: ioport range 0xc14-0xc14 has been reserved
[    0.490493] system 00:01: ioport range 0xc50-0xc52 has been reserved
[    0.490568] system 00:01: ioport range 0xc6c-0xc6d has been reserved
[    0.490643] system 00:01: ioport range 0xc6f-0xc6f has been reserved
[    0.490717] system 00:01: ioport range 0xcd0-0xcd1 has been reserved
[    0.493648] system 00:01: ioport range 0xcd2-0xcd3 has been reserved
[    0.493724] system 00:01: ioport range 0xcd4-0xcdf has been reserved
[    0.493800] system 00:01: ioport range 0x4000-0x40fe has been reserved
[    0.493877] system 00:01: ioport range 0x4210-0x4217 has been reserved
[    0.493953] system 00:01: ioport range 0xb10-0xb1f has been reserved
[    0.494039] system 00:07: ioport range 0x4d0-0x4d1 has been reserved
[    0.494115] system 00:07: ioport range 0x220-0x225 has been reserved
[    0.494194] system 00:07: ioport range 0x290-0x294 has been reserved
[    0.494278] system 00:0b: iomem range 0xe0000000-0xefffffff could not be reserved
[    0.494398] system 00:0c: iomem range 0xd2a00-0xd3fff has been reserved
[    0.494474] system 00:0c: iomem range 0xf0000-0xf7fff could not be reserved
[    0.494550] system 00:0c: iomem range 0xf8000-0xfbfff could not be reserved
[    0.494625] system 00:0c: iomem range 0xfc000-0xfffff could not be reserved
[    0.494701] system 00:0c: iomem range 0xdbde0000-0xdbdfffff could not be reserved
[    0.494816] system 00:0c: iomem range 0xffff0000-0xffffffff could not be reserved
[    0.494932] system 00:0c: iomem range 0x0-0x9ffff could not be reserved
[    0.495007] system 00:0c: iomem range 0x100000-0xdbddffff could not be reserved
[    0.495122] system 00:0c: iomem range 0xdbef0000-0xdfeeffff has been reserved
[    0.495198] system 00:0c: iomem range 0xfec00000-0xfec00fff could not be reserved
[    0.495312] system 00:0c: iomem range 0xfee00000-0xfee00fff could not be reserved
[    0.495427] system 00:0c: iomem range 0xfff80000-0xfffeffff could not be reserved
[    0.500725] pci 0000:00:01.0: PCI bridge, secondary bus 0000:01
[    0.500801] pci 0000:00:01.0:   IO window: 0xe000-0xefff
[    0.500877] pci 0000:00:01.0:   MEM window: 0xfdd00000-0xfdefffff
[    0.500953] pci 0000:00:01.0:   PREFETCH window: 0x000000f8000000-0x000000fbffffff
[    0.501071] pci 0000:00:06.0: PCI bridge, secondary bus 0000:02
[    0.501146] pci 0000:00:06.0:   IO window: 0xd000-0xdfff
[    0.501222] pci 0000:00:06.0:   MEM window: 0xfda00000-0xfdafffff
[    0.501298] pci 0000:00:06.0:   PREFETCH window: 0x000000fdf00000-0x000000fdffffff
[    0.501413] pci 0000:00:14.4: PCI bridge, secondary bus 0000:03
[    0.501489] pci 0000:00:14.4:   IO window: 0xc000-0xcfff
[    0.501567] pci 0000:00:14.4:   MEM window: 0xfdc00000-0xfdcfffff
[    0.501644] pci 0000:00:14.4:   PREFETCH window: 0x000000fdb00000-0x000000fdbfffff
[    0.501771] pci 0000:00:06.0: setting latency timer to 64
[    0.501779] bus: 00 index 0 io port: [0, ffff]
[    0.501853] bus: 00 index 1 mmio: [0, ffffffffffffffff]
[    0.501928] bus: 01 index 0 io port: [e000, efff]
[    0.502002] bus: 01 index 1 mmio: [fdd00000, fdefffff]
[    0.502076] bus: 01 index 2 mmio: [f8000000, fbffffff]
[    0.502150] bus: 01 index 3 mmio: [0, 0]
[    0.502224] bus: 02 index 0 io port: [d000, dfff]
[    0.502298] bus: 02 index 1 mmio: [fda00000, fdafffff]
[    0.502372] bus: 02 index 2 mmio: [fdf00000, fdffffff]
[    0.502446] bus: 02 index 3 mmio: [0, 0]
[    0.502524] bus: 03 index 0 io port: [c000, cfff]
[    0.502599] bus: 03 index 1 mmio: [fdc00000, fdcfffff]
[    0.502673] bus: 03 index 2 mmio: [fdb00000, fdbfffff]
[    0.502748] bus: 03 index 3 io port: [0, ffff]
[    0.502822] bus: 03 index 4 mmio: [0, ffffffffffffffff]
[    0.502932] NET: Registered protocol family 2
[    0.600050] IP route cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.600961] TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
[    0.603237] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    0.603891] TCP: Hash tables configured (established 262144 bind 65536)
[    0.603969] TCP reno registered
[    0.630087] NET: Registered protocol family 1
[    0.631558] audit: initializing netlink socket (disabled)
[    0.631643] type=2000 audit(1226244422.630:1): initialized
[    0.632005] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[    0.632198] VFS: Disk quotas dquot_6.5.1
[    0.632289] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.632455] msgmni has been set to 7397
[    0.632553] async_tx: api initialized (sync-only)
[    0.632629] io scheduler noop registered
[    0.632702] io scheduler anticipatory registered
[    0.632776] io scheduler deadline registered
[    0.632862] io scheduler cfq registered (default)
[    0.780034] pci 0000:01:05.0: Boot video device
[    0.780181] pcieport-driver 0000:00:06.0: setting latency timer to 64
[    0.780200] pcieport-driver 0000:00:06.0: found MSI capability
[    0.780298] pci_express 0000:00:06.0:pcie00: allocate port service
[    0.780348] pci_express 0000:00:06.0:pcie03: allocate port service
[    0.780605] input: Power Button (FF) as /class/input/input0
[    0.780681] ACPI: Power Button (FF) [PWRF]
[    0.780849] input: Power Button (CM) as /class/input/input1
[    0.780924] ACPI: Power Button (CM) [PWRB]
[    0.871084] Real Time Clock Driver v1.12ac
[    0.871325] hpet_resources: 0xfed00000 is busy
[    0.871449] Non-volatile memory driver v1.2
[    0.871523] Linux agpgart interface v0.103
[    0.871597] Hangcheck: starting hangcheck timer 0.9.0 (tick is 180 seconds, margin is 60 seconds).
[    0.871712] Hangcheck: Using get_cycles().
[    0.871789] Serial: 8250/16550 driver4 ports, IRQ sharing enabled
[    1.150130] serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[    1.151081] 00:08: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
[    1.151434] Driver 'sd' needs updating - please use bus_type methods
[    1.151556] ahci 0000:00:11.0: version 3.0
[    1.151578] ahci 0000:00:11.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[    1.151842] ahci 0000:00:11.0: AHCI 0001.0100 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
[    1.151958] ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part
[    1.152656] scsi0 : ahci
[    1.152874] scsi1 : ahci
[    1.153031] scsi2 : ahci
[    1.153186] scsi3 : ahci
[    1.153342] scsi4 : ahci
[    1.153499] scsi5 : ahci
[    1.153634] ata1: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f100 irq 318
[    1.153750] ata2: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f180 irq 318
[    1.153866] ata3: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f200 irq 318
[    1.153981] ata4: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f280 irq 318
[    1.154097] ata5: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f300 irq 318
[    1.154212] ata6: SATA max UDMA/133 abar m1024@0xfe02f000 port 0xfe02f380 irq 318
[    1.682512] ata1: softreset failed (device not ready)
[    1.682593] ata1: failed due to HW bug, retry pmp=0
[    1.862521] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    1.868882] ata1.00: HPA detected: current 1953523055, native 1953525168
[    1.868961] ata1.00: ATA-7: SAMSUNG HD103UJ, 1AA01113, max UDMA7
[    1.869037] ata1.00: 1953523055 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    1.875463] ata1.00: configured for UDMA/133
[    2.242526] ata2: SATA link down (SStatus 0 SControl 300)
[    2.792511] ata3: softreset failed (device not ready)
[    2.792587] ata3: failed due to HW bug, retry pmp=0
[    2.972520] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    2.978923] ata3.00: HPA detected: current 1953523055, native 1953525168
[    2.979000] ata3.00: ATA-7: SAMSUNG HD103UJ, 1AA01113, max UDMA7
[    2.979075] ata3.00: 1953523055 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    2.985551] ata3.00: configured for UDMA/133
[    3.352525] ata4: SATA link down (SStatus 0 SControl 300)
[    3.722525] ata5: SATA link down (SStatus 0 SControl 300)
[    4.092525] ata6: SATA link down (SStatus 0 SControl 300)
[    4.112606] scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG HD103UJ  1AA0 PQ: 0 ANSI: 5
[    4.112875] sd 0:0:0:0: [sda] 1953523055 512-byte hardware sectors (1000204 MB)
[    4.113001] sd 0:0:0:0: [sda] Write Protect is off
[    4.113076] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    4.113098] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.113261] sd 0:0:0:0: [sda] 1953523055 512-byte hardware sectors (1000204 MB)
[    4.113385] sd 0:0:0:0: [sda] Write Protect is off
[    4.113460] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    4.113480] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.113596]  sda: sda1 sda2 sda3 sda4
[    4.124313] sd 0:0:0:0: [sda] Attached SCSI disk
[    4.124528] scsi 2:0:0:0: Direct-Access     ATA      SAMSUNG HD103UJ  1AA0 PQ: 0 ANSI: 5
[    4.124787] sd 2:0:0:0: [sdb] 1953523055 512-byte hardware sectors (1000204 MB)
[    4.124915] sd 2:0:0:0: [sdb] Write Protect is off
[    4.124992] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    4.125014] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.125176] sd 2:0:0:0: [sdb] 1953523055 512-byte hardware sectors (1000204 MB)
[    4.125302] sd 2:0:0:0: [sdb] Write Protect is off
[    4.125379] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    4.125398] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    4.125516]  sdb: sdb1 sdb2 sdb3 sdb4
[    4.129368] sd 2:0:0:0: [sdb] Attached SCSI disk
[    4.129649] PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
[    4.129725] PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
[    4.129972] serio: i8042 KBD port at 0x60,0x64 irq 1
[    4.130184] mice: PS/2 mouse device common for all mice
[    4.219707] input: AT Translated Set 2 keyboard as /class/input/input2
[    4.292615] md: raid1 personality registered for level 1
[    4.292690] md: raid10 personality registered for level 10
[    4.460014] raid6: int64x1   1717 MB/s
[    4.630006] raid6: int64x2   2319 MB/s
[    4.800017] raid6: int64x4   2172 MB/s
[    4.970004] raid6: int64x8   1878 MB/s
[    5.140001] raid6: sse2x1    2656 MB/s
[    5.310012] raid6: sse2x2    3526 MB/s
[    5.480003] raid6: sse2x4    3664 MB/s
[    5.480077] raid6: using algorithm sse2x4 (3664 MB/s)
[    5.480152] md: raid6 personality registered for level 6
[    5.480226] md: raid5 personality registered for level 5
[    5.483152] md: raid4 personality registered for level 4
[    5.483354] device-mapper: ioctl: 4.14.0-ioctl (2008-04-23) initialised: dm-devel@redhat.com
[    5.483535] TCP cubic registered
[    5.483609] Initializing XFRM netlink socket
[    5.483690] NET: Registered protocol family 17
[    5.483795] powernow-k8: Found 1 AMD Athlon(tm) X2 Dual Core Processor BE-2300 processors (2 cpu cores) (version 2.20.00)
[    5.483919] powernow-k8: ACPI Processor support is required for SMP systems but is absent. Please load the ACPI Processor module before starting this driver.
[    5.484054] powernow-k8: ACPI Processor support is required for SMP systems but is absent. Please load the ACPI Processor module before starting this driver.
[    5.484385] registered taskstats version 1
[    5.484779] md: Autodetecting RAID arrays.
[    5.500911] md: Scanned 2 and added 2 devices.
[    5.500985] md: autorun ...
[    5.501059] md: considering sdb1 ...
[    5.501138] md:  adding sdb1 ...
[    5.501215] md:  adding sda1 ...
[    5.501289] md: created md0
[    5.501363] md: bind<sda1>
[    5.501444] md: bind<sdb1>
[    5.501524] md: running: <sdb1><sda1>
[    5.501853] raid1: raid set md0 active with 2 out of 2 mirrors
[    5.501979] md: ... autorun DONE.
[    5.514774] kjournald starting.  Commit interval 5 seconds
[    5.514855] EXT3-fs: mounted filesystem with ordered data mode.
[    5.514936] VFS: Mounted root (ext3 filesystem) readonly.
[    5.515025] Freeing unused kernel memory: 268k freed
[    5.515215] Write protecting the kernel read-only data: 3436k
[    6.601962] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    6.602068] r8169 0000:02:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[    6.602158] r8169 0000:02:00.0: setting latency timer to 64
[    6.602383] eth0: RTL8168c/8111c at 0xffffc20000018000, 00:1f:d0:53:66:c4, XID 3c2000c0 IRQ 317
[    6.618466] parport_pc 00:09: reported by Plug and Play ACPI
[    6.618599] parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]
[    6.693110] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    6.693434] sd 2:0:0:0: Attached scsi generic sg1 type 0
[    6.706708] processor ACPI0007:00: registered as cooling_device0
[    6.706828] processor ACPI0007:01: registered as cooling_device1
[    6.724161] ehci_hcd 0000:00:12.2: PCI INT B -> GSI 17 (level, low) -> IRQ 17
[    6.724257] ehci_hcd 0000:00:12.2: EHCI Host Controller
[    6.724385] ehci_hcd 0000:00:12.2: new USB bus registered, assigned bus number 1
[    6.724551] ehci_hcd 0000:00:12.2: debug port 1
[    6.724650] ehci_hcd 0000:00:12.2: irq 17, io mem 0xfe02c000
[    6.740062] ehci_hcd 0000:00:12.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
[    6.740340] usb usb1: configuration #1 chosen from 1 choice
[    6.740445] hub 1-0:1.0: USB hub found
[    6.740528] hub 1-0:1.0: 6 ports detected
[    6.749810] ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver
[    6.777002] ppdev: user-space parallel port driver
[    6.960309] ehci_hcd 0000:00:13.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
[    6.960400] ehci_hcd 0000:00:13.2: EHCI Host Controller
[    6.960499] ehci_hcd 0000:00:13.2: new USB bus registered, assigned bus number 2
[    6.960657] ehci_hcd 0000:00:13.2: debug port 1
[    6.960754] ehci_hcd 0000:00:13.2: irq 19, io mem 0xfe029000
[    7.040014] ehci_hcd 0000:00:13.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
[    7.040279] usb usb2: configuration #1 chosen from 1 choice
[    7.040384] hub 2-0:1.0: USB hub found
[    7.040466] hub 2-0:1.0: 6 ports detected
[    7.150207] ohci_hcd 0000:00:12.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    7.150300] ohci_hcd 0000:00:12.0: OHCI Host Controller
[    7.150417] ohci_hcd 0000:00:12.0: new USB bus registered, assigned bus number 3
[    7.150574] ohci_hcd 0000:00:12.0: irq 16, io mem 0xfe02e000
[    7.216650] usb usb3: configuration #1 chosen from 1 choice
[    7.216757] hub 3-0:1.0: USB hub found
[    7.216841] hub 3-0:1.0: 3 ports detected
[    7.322761] ohci_hcd 0000:00:12.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    7.322853] ohci_hcd 0000:00:12.1: OHCI Host Controller
[    7.322953] ohci_hcd 0000:00:12.1: new USB bus registered, assigned bus number 4
[    7.323092] ohci_hcd 0000:00:12.1: irq 16, io mem 0xfe02d000
[    7.386700] usb usb4: configuration #1 chosen from 1 choice
[    7.386817] hub 4-0:1.0: USB hub found
[    7.386902] hub 4-0:1.0: 3 ports detected
[    7.602780] ohci_hcd 0000:00:13.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[    7.602874] ohci_hcd 0000:00:13.0: OHCI Host Controller
[    7.602984] ohci_hcd 0000:00:13.0: new USB bus registered, assigned bus number 5
[    7.603136] ohci_hcd 0000:00:13.0: irq 18, io mem 0xfe02b000
[    7.666670] usb usb5: configuration #1 chosen from 1 choice
[    7.666778] hub 5-0:1.0: USB hub found
[    7.666863] hub 5-0:1.0: 3 ports detected
[    7.772695] ohci_hcd 0000:00:13.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[    7.772784] ohci_hcd 0000:00:13.1: OHCI Host Controller
[    7.772882] ohci_hcd 0000:00:13.1: new USB bus registered, assigned bus number 6
[    7.773019] ohci_hcd 0000:00:13.1: irq 18, io mem 0xfe02a000
[    7.836651] usb usb6: configuration #1 chosen from 1 choice
[    7.836762] hub 6-0:1.0: USB hub found
[    7.836846] hub 6-0:1.0: 3 ports detected
[    7.902513] usb 4-1: new low speed USB device using ohci_hcd and address 2
[    7.943745] ohci_hcd 0000:00:14.5: PCI INT C -> GSI 18 (level, low) -> IRQ 18
[    7.943845] ohci_hcd 0000:00:14.5: OHCI Host Controller
[    7.943967] ohci_hcd 0000:00:14.5: new USB bus registered, assigned bus number 7
[    7.944109] ohci_hcd 0000:00:14.5: irq 18, io mem 0xfe028000
[    8.006678] usb usb7: configuration #1 chosen from 1 choice
[    8.006793] hub 7-0:1.0: USB hub found
[    8.006877] hub 7-0:1.0: 2 ports detected
[    8.086867] usb 4-1: configuration #1 chosen from 1 choice
[    8.350077] EXT3 FS on md0, internal journal
[    8.542056] it87: Found IT8718F chip at 0x228, revision 5
[    8.542143] it87: in3 is VCC (+5V)
[    8.562188] usbcore: registered new interface driver usbserial
[    8.562281] usbserial: USB Serial support registered for generic
[    8.562475] usbcore: registered new interface driver usbserial_generic
[    8.562553] usbserial: USB Serial Driver core
[    8.580004] usbserial: USB Serial support registered for DeLorme Earthmate USB
[    8.580135] usbserial: USB Serial support registered for HID->COM RS232 Adapter
[    8.580261] usbserial: USB Serial support registered for Nokia CA-42 V2 Adapter
[    8.580395] cypress 4-1:1.0: HID->COM RS232 Adapter converter detected
[    8.581889] usb 4-1: HID->COM RS232 Adapter converter now attached to ttyUSB0
[    8.581985] usbcore: registered new interface driver cypress
[    8.582061] cypress_m8: Cypress USB to Serial Driver v1.09
[    8.603590] md: md1 stopped.
[    8.661705] md: bind<sdb2>
[    8.661923] md: bind<sda2>
[    8.668856] raid5: device sda2 operational as raid disk 0
[    8.668931] raid5: device sdb2 operational as raid disk 1
[    8.669450] raid5: allocated 3218kB for md1
[    8.669526] raid5: raid level 5 set md1 active with 2 out of 3 devices, algorithm 2
[    8.669642] RAID5 conf printout:
[    8.669716]  --- rd:3 wd:2
[    8.669790]  disk 0, o:1, dev:sda2
[    8.669864]  disk 1, o:1, dev:sdb2
[    8.670568] md: md2 stopped.
[    8.718563] md: bind<sdb3>
[    8.718793] md: bind<sda3>
[    8.725965] raid10: raid set md2 active with 2 out of 3 devices
[    8.738709] md: md3 stopped.
[    8.772635] md: bind<sdb4>
[    8.782628] md: bind<sda4>
[    8.789594] raid5: device sda4 operational as raid disk 0
[    8.789669] raid5: device sdb4 operational as raid disk 1
[    8.790157] raid5: allocated 3218kB for md3
[    8.790232] raid5: raid level 5 set md3 active with 2 out of 3 devices, algorithm 2
[    8.790346] RAID5 conf printout:
[    8.790420]  --- rd:3 wd:2
[    8.790493]  disk 0, o:1, dev:sda4
[    8.790567]  disk 1, o:1, dev:sdb4
[    9.424724] kjournald starting.  Commit interval 300 seconds
[    9.424746] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
[    9.442815] EXT3 FS on dm-5, internal journal
[    9.442928] EXT3-fs: mounted filesystem with ordered data mode.
[    9.505202] kjournald starting.  Commit interval 300 seconds
[    9.532774] EXT3 FS on dm-6, internal journal
[    9.532885] EXT3-fs: mounted filesystem with ordered data mode.
[    9.563486] kjournald starting.  Commit interval 300 seconds
[    9.578137] EXT3 FS on dm-0, internal journal
[    9.578247] EXT3-fs: mounted filesystem with ordered data mode.
[    9.596801] kjournald starting.  Commit interval 300 seconds
[    9.598381] EXT3 FS on dm-1, internal journal
[    9.598492] EXT3-fs: mounted filesystem with ordered data mode.
[    9.610668] kjournald starting.  Commit interval 300 seconds
[    9.610810] EXT3 FS on dm-4, internal journal
[    9.610921] EXT3-fs: mounted filesystem with ordered data mode.
[    9.701946] kjournald starting.  Commit interval 300 seconds
[    9.701947] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
[    9.721344] EXT3 FS on dm-8, internal journal
[    9.721456] EXT3-fs: mounted filesystem with ordered data mode.
[   10.010326] Adding 1048568k swap on /dev/mapper/storage10-swap.  Priority:-1 extents:1 across:1048568k
[   17.363353] r8169: eth0: link up
[   17.363442] r8169: eth0: link up
[   23.839883] RPC: Registered udp transport module.
[   23.839963] RPC: Registered tcp transport module.
[   24.770802] NET: Registered protocol family 10
[   24.771482] lo: Disabled Privacy Extensions
[   34.962510] eth0: no IPv6 routers present
[  141.370364] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[  141.407514] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[  141.425114] NFSD: starting 90-second grace period
=================================================
$ cat .config | grep -v '^#'
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=m
CONFIG_HAVE_OPROFILE=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODVERSIONS=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_BLOCK_COMPAT=y

CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_CLASSIC_RCU=y

CONFIG_TICK_ONESHOT=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
CONFIG_MK8=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=2
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_VOLUNTARY=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MTRR=y
CONFIG_SECCOMP=y
CONFIG_HZ_100=y
CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

CONFIG_PM=y
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_BLACKLIST_YEAR=0
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=m

CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_POWERNOW_K8=y


CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y

CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_IA32_EMULATION=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
CONFIG_DEFAULT_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m

CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m

CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m
CONFIG_IP_DCCP_ACKVEC=y

CONFIG_IP_DCCP_CCID2=m
CONFIG_IP_DCCP_CCID3=m
CONFIG_IP_DCCP_CCID3_RTO=100
CONFIG_IP_DCCP_TFRC_LIB=m

CONFIG_IP_SCTP=m
CONFIG_SCTP_HMAC_MD5=y
CONFIG_TIPC=m
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_LLC=m
CONFIG_NET_SCHED=y

CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_INGRESS=m

CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y

CONFIG_FIB_RULES=y

CONFIG_IEEE80211=m
CONFIG_IEEE80211_CRYPT_WEP=m


CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y

CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
CONFIG_ATA_OVER_ETH=m
CONFIG_MISC_DEVICES=y
CONFIG_HAVE_IDE=y

CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m

CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_WAIT_SCAN=m

CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ATA=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_MD_RAID5_RESHAPE=y
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=y
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m


CONFIG_NETDEVICES=y
CONFIG_IFB=m
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_TUN=m
CONFIG_PHYLIB=m

CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_EEPRO100=m
CONFIG_E100=m
CONFIG_NE2K_PCI=m
CONFIG_8139CP=m
CONFIG_8139TOO=m
CONFIG_NETDEV_1000=y
CONFIG_DL2K=m
CONFIG_E1000=m
CONFIG_R8169=m
CONFIG_R8169_VLAN=y
CONFIG_SKY2=m
CONFIG_NETDEV_10000=y


CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
CONFIG_USB_NET_PLUSB=m
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_NET_ZAURUS=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPP_MPPE=m
CONFIG_PPPOE=m
CONFIG_SLHC=m
CONFIG_NETCONSOLE=m
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y

CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_POLLDEV=m

CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y

CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_VSXXXAA=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_TWIDJOY=m
CONFIG_JOYSTICK_JOYDUMP=m
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_MTOUCH=m
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_UINPUT=m

CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_GAMEPORT=m

CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y

CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_NVRAM=y
CONFIG_RTC=y
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_HANGCHECK_TIMER=y
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y







CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCA9539=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_POWER_SUPPLY=y
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
CONFIG_SENSORS_ABITUGURU=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_F71805F=m
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_HDAPS=m
CONFIG_THERMAL=m
CONFIG_WATCHDOG=y

CONFIG_SOFT_WATCHDOG=m



CONFIG_SSB_POSSIBLE=y



CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L2_COMMON=m
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_VIDEO_MEDIA=m

CONFIG_MEDIA_TUNER=m
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_VIDEO_V4L2=m
CONFIG_VIDEO_V4L1=m
CONFIG_VIDEO_CAPTURE_DRIVERS=y
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_V4L_USB_DRIVERS=y
CONFIG_RADIO_ADAPTERS=y
CONFIG_DAB=y

CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_BACKLIGHT_CLASS_DEVICE=m


CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_SOUND=m
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_RTCTIMER=m
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_VMASTER=y
CONFIG_SND_MPU401_UART=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
CONFIG_SND_MTPAV=m
CONFIG_SND_MPU401=m
CONFIG_SND_PCI=y
CONFIG_SND_HDA_INTEL=m
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_INTEL8X0=m
CONFIG_SND_USB=y
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HID_DEBUG=y

CONFIG_USB_HID=m
CONFIG_HID_FF=y
CONFIG_HID_PID=y
CONFIG_LOGITECH_FF=y
CONFIG_THRUSTMASTER_FF=y
CONFIG_USB_HIDDEV=y

CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y

CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y

CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_ISP116X_HCD=m
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
CONFIG_USB_SL811_HCD=m

CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m


CONFIG_USB_STORAGE=m
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y

CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m

CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP2101=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m

CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_LED=m
CONFIG_USB_CYPRESS_CY7C63=m
CONFIG_USB_CYTHERM=m
CONFIG_USB_IDMOUSE=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_LD=m

CONFIG_EDD=m
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y

CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_JBD=y
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_FS_POSIX_ACL=y
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_PRINT_QUOTA_WARNING=y
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m

CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=866
CONFIG_FAT_DEFAULT_IOCHARSET="koi8-r"

CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m

CONFIG_CRAMFS=m
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_RPCSEC_GSS_SPKM3=m
CONFIG_CIFS=m
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y

CONFIG_PARTITION_ADVANCED=y
CONFIG_MSDOS_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="koi8-r"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=y
CONFIG_NLS_KOI8_U=y
CONFIG_NLS_UTF8=y

CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_RODATA=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
CONFIG_DEFAULT_IO_DELAY_TYPE=0

CONFIG_XOR_BLOCKS=y
CONFIG_ASYNC_CORE=y
CONFIG_ASYNC_MEMCPY=y
CONFIG_ASYNC_XOR=y
CONFIG_CRYPTO=y

CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_AUTHENC=m


CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_ECB=m

CONFIG_CRYPTO_HMAC=y

CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m

CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m

CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_HW=y
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y

CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
=================================================



I'm wondering whether it's supposed to work this way. Should I provide more data or run more tests?

-- 
Please CC me, I'm not on the list.
Thanks,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-09 18:04 Slow file transfer speeds with CFQ IO scheduler in some cases Vitaly V. Bursov
@ 2008-11-09 18:30 ` Alexey Dobriyan
  2008-11-09 18:32   ` Vitaly V. Bursov
  2008-11-10 10:44 ` Jens Axboe
  1 sibling, 1 reply; 70+ messages in thread
From: Alexey Dobriyan @ 2008-11-09 18:30 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: linux-kernel, devel

On Sun, Nov 09, 2008 at 08:04:45PM +0200, Vitaly V. Bursov wrote:
> I'm building small server system with openvz kernel and have ran into
> some IO performance problems. Reading a single file via NFS delivers
> around 9 MB/s over gigabit network, but while reading, say, 2 different
> or same file 2 times at the same time I get >60MB/s.

OpenVZ kernels ship a heavily modified CFQ and, IIRC, this one-reader problem is
already fixed somewhere in their tree. Ask devel@openvz.org.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-09 18:30 ` Alexey Dobriyan
@ 2008-11-09 18:32   ` Vitaly V. Bursov
  0 siblings, 0 replies; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-09 18:32 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: linux-kernel, devel

Alexey Dobriyan wrote:
> On Sun, Nov 09, 2008 at 08:04:45PM +0200, Vitaly V. Bursov wrote:
>> I'm building small server system with openvz kernel and have ran into
>> some IO performance problems. Reading a single file via NFS delivers
>> around 9 MB/s over gigabit network, but while reading, say, 2 different
>> or same file 2 times at the same time I get >60MB/s.
> 
> openvz kernels have very changed CFQ and, IIRC, this one-reader problem is
> fixed somewhere. Ask devel@openvz.org.

That's why I checked the vanilla kernel as well. It performs somewhat better,
but it definitely shares the same issue...


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-09 18:04 Slow file transfer speeds with CFQ IO scheduler in some cases Vitaly V. Bursov
  2008-11-09 18:30 ` Alexey Dobriyan
@ 2008-11-10 10:44 ` Jens Axboe
  2008-11-10 13:51   ` Jeff Moyer
  1 sibling, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-10 10:44 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: linux-kernel

On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
> Hello,
> 
> I'm building small server system with openvz kernel and have ran into
> some IO performance problems. Reading a single file via NFS delivers
> around 9 MB/s over gigabit network, but while reading, say, 2 different
> or same file 2 times at the same time I get >60MB/s.
> 
> Changing IO scheduler to deadline or anticipatory fixes problem.
> 
> Tested kernels:
>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
>                  fast local reads)
>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
> 
> Vanilla performs better in worst case but I believe 40 is still low
> concerning test results below.

Can you check with this patch applied?

http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 10:44 ` Jens Axboe
@ 2008-11-10 13:51   ` Jeff Moyer
  2008-11-10 13:56     ` Jens Axboe
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-10 13:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
>> Hello,
>> 
>> I'm building small server system with openvz kernel and have ran into
>> some IO performance problems. Reading a single file via NFS delivers
>> around 9 MB/s over gigabit network, but while reading, say, 2 different
>> or same file 2 times at the same time I get >60MB/s.
>> 
>> Changing IO scheduler to deadline or anticipatory fixes problem.
>> 
>> Tested kernels:
>>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
>>                  fast local reads)
>>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
>> 
>> Vanilla performs better in worst case but I believe 40 is still low
>> concerning test results below.
>
> Can you check with this patch applied?
>
> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view

Funny, I was going to ask the same question.  ;)  The reason Jens wants
you to try this patch is that nfsd may be farming off the I/O requests
to different threads which are then performing interleaved I/O.  The
above patch tries to detect this and allow cooperating processes to get
disk time instead of waiting for the idle timeout.
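
To picture the pattern (a minimal, hypothetical userspace sketch -- this is
not the nfsd code, and the chunk size is made up): two threads splitting one
sequential stream end up each issuing every other 1MB chunk, so each thread's
own request stream has gaps and looks seeky to CFQ even though the combined
stream is perfectly sequential.

/* interleave.c - hypothetical illustration only, compile with -pthread */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)

static int fd;

static void *reader(void *arg)
{
        long id = (long) arg;   /* thread 0 reads chunks 0,2,4,...; thread 1 reads 1,3,5,... */
        char *buf = malloc(CHUNK);
        off_t chunk;

        for (chunk = id; ; chunk += 2)
                if (pread(fd, buf, CHUNK, chunk * CHUNK) <= 0)
                        break;
        free(buf);
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t t[2];

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        pthread_create(&t[0], NULL, reader, (void *) 0);
        pthread_create(&t[1], NULL, reader, (void *) 1);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        return 0;
}

With plain CFQ, each of those per-thread queues gets idled on in turn; the
patch is meant to notice that the other queue's next request sits right next
to the last completed one and switch over immediately.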

Cheers,

Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 13:51   ` Jeff Moyer
@ 2008-11-10 13:56     ` Jens Axboe
  2008-11-10 17:16       ` Vitaly V. Bursov
  2008-11-12 18:32       ` Jeff Moyer
  0 siblings, 2 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-10 13:56 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Mon, Nov 10 2008, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
> >> Hello,
> >> 
> >> I'm building small server system with openvz kernel and have ran into
> >> some IO performance problems. Reading a single file via NFS delivers
> >> around 9 MB/s over gigabit network, but while reading, say, 2 different
> >> or same file 2 times at the same time I get >60MB/s.
> >> 
> >> Changing IO scheduler to deadline or anticipatory fixes problem.
> >> 
> >> Tested kernels:
> >>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
> >>                  fast local reads)
> >>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
> >> 
> >> Vanilla performs better in worst case but I believe 40 is still low
> >> concerning test results below.
> >
> > Can you check with this patch applied?
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> 
> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> you to try this patch is that nfsd may be farming off the I/O requests
> to different threads which are then performing interleaved I/O.  The
> above patch tries to detect this and allow cooperating processes to get
> disk time instead of waiting for the idle timeout.

Precisely :-)

The only reason I haven't merged it yet is because of worry of extra
cost, but I'll throw some SSD love at it and see how it turns out.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 13:56     ` Jens Axboe
@ 2008-11-10 17:16       ` Vitaly V. Bursov
  2008-11-10 17:35         ` Jens Axboe
  2008-11-12 18:32       ` Jeff Moyer
  1 sibling, 1 reply; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-10 17:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel

Jens Axboe wrote:
> On Mon, Nov 10 2008, Jeff Moyer wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>>
>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>> you to try this patch is that nfsd may be farming off the I/O requests
>> to different threads which are then performing interleaved I/O.  The
>> above patch tries to detect this and allow cooperating processes to get
>> disk time instead of waiting for the idle timeout.
> 
> Precisely :-)
> 
> The only reason I haven't merged it yet is because of worry of extra
> cost, but I'll throw some SSD love at it and see how it turns out.
> 

Sorry, but I get "oops" same moment nfs read transfer starts.
I can get directory list via nfs, read files locally (not
carefully tested, though)

Dumps captured via netconsole, so these may not be completely accurate
but hopefully will give a hint.

Linux 2.6.27.5

======================================
[  618.526302] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  618.526676] IP: [<ffffffff81111f16>] rb_erase+0x124/0x290
[  618.526738] PGD 11b96c067 PUD 11b96d067 PMD 0
[  618.526856] Oops: 0000 [1] SMP
[  618.527020] CPU 1

[  618.527285] Modules linked in: nfsd auth_rpcgss exportfs netconsole ipv6 nfs lockd nfs_acl sunrpc cypress_m8 usbserial it87 hwmon_vid hwmon ppdev ehci_hcd ohci_hcd sg r8169 thermal parport_pc processor parport thermal_sys

[  618.528868] Pid: 2184, comm: md1_raid5 Not tainted 2.6.27.5 #2
[  618.528918] RIP: 0010:[<ffffffff81111f16>]  [<ffffffff81111f16>] rb_erase+0x124/0x290
[  618.529087] RSP: 0018:ffff88011d9c9a58  EFLAGS: 00010046
[  618.529136] RAX: ffff88011d3d7c01 RBX: ffff88011d3d7c00 RCX: 0000000000000000
[  618.529187] RDX: 000000002d59fabb RSI: ffff88011e44d430 RDI: 0000000000000000
[  618.529238] RBP: ffff88011e44d430 R08: 0000000000000000 R09: ffff88011c08d8b8
[  618.529289] R10: 000000002d59fdbb R11: 000000002d59fbbb R12: ffff88011e44d400
[  618.529339] R13: ffff88011e44d430 R14: ffff88011e44d400 R15: 0000000000800000
[  618.529390] FS:  00007f6c86182750(0000) GS:ffff88011f86abc0(0000) knlGS:0000000000000000
[  618.529480] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  618.529530] CR2: 0000000000000000 CR3: 000000011b9c6000 CR4: 00000000000006e0
[  618.529580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  618.529631] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  618.529682] Process md1_raid5 (pid: 2184, threadinfo ffff88011d9c8000, task ffff88011fa5d8e0)
[  618.529773] Stack:  ffff88011d3d7660 ffff88011d3d7660 ffffffff8110a0b9 ffff88011d3d7630
[  618.529895]  ffffffff8110a0ff ffff88011d3d7630 ffff88011e44d400 ffff880119a1cc48
[  618.530085]  ffff88011d3d7630 ffff88011e44d400 ffff880119a1cc48 ffff88011d3d7678

[  618.530154] Call Trace:
[  618.530252]  [<ffffffff8110a0b9>] ? rb_erase_init+0x9/0x17
[  618.530303]  [<ffffffff8110a0ff>] ? cfq_prio_tree_add+0x38/0xa8
[  618.530354]  [<ffffffff8110b12b>] ? cfq_add_rq_rb+0xb5/0xc8
[  618.530404]  [<ffffffff8110b198>] ? cfq_insert_request+0x5a/0x356
[  618.530457]  [<ffffffff811000a1>] ? elv_insert+0x14b/0x218
[  618.530510]  [<ffffffff810ab757>] ? bio_phys_segments+0xf/0x15
[  618.530561]  [<ffffffff811028dc>] ? __make_request+0x3b9/0x3eb
[  618.530612]  [<ffffffff8110120c>] ? generic_make_request+0x30b/0x346
[  618.530665]  [<ffffffff811baae4>] ? raid5_end_write_request+0x0/0xb8
[  618.530716]  [<ffffffff811b8ace>] ? ops_run_io+0x16a/0x1c1
[  618.530767]  [<ffffffff811ba524>] ? handle_stripe5+0x9b5/0x9d6
[  618.530818]  [<ffffffff811bbef8>] ? handle_stripe+0xc3a/0xc6a
[  618.530870]  [<ffffffff81253782>] ? thread_return+0x3a/0xaa
[  618.530921]  [<ffffffff811bc2be>] ? raid5d+0x396/0x3cd
[  618.530972]  [<ffffffff81253bc8>] ? schedule_timeout+0x1e/0xad
[  618.531023]  [<ffffffff811c715f>] ? md_thread+0xdd/0xf9
[  618.531075]  [<ffffffff81044f9c>] ? autoremove_wake_function+0x0/0x2e
[  618.531127]  [<ffffffff811c7082>] ? md_thread+0x0/0xf9
[  618.531177]  [<ffffffff81044e80>] ? kthread+0x47/0x73
[  618.531228]  [<ffffffff8102f867>] ? schedule_tail+0x28/0x60
[  618.531279]  [<ffffffff8100cda9>] ? child_rip+0xa/0x11
[  618.531330]  [<ffffffff81044e39>] ? kthread+0x0/0x73
[  618.531380]  [<ffffffff8100cd9f>] ? child_rip+0x0/0x11
[  618.531428]
[  618.531474]
[  618.531521] Code: eb 0a 48 89 4b 08 eb 04 48 89 4d 00 41 ff c8 0f 85 7f 01 00 00 e9 5d 01 00 00 48 8b 7b 10 48 39 cf 0f 85 a5 00 00 00 48 8b 7b 08 8b 07 a8 01 75 1a 48 83 c8 01 48 89 ee 48 89 07 48 83 23 fe

[  618.532199] RIP  [<ffffffff81111f16>] rb_erase+0x124/0x290
[  618.532510]  RSP <ffff88011d9c9a58>
[  618.532510] CR2: 0000000000000000
[  618.532510] ---[ end trace db64b81474e1505e ]---

======================================

[  334.271949] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[  334.272213] IP: [<ffffffff81111bbf>] __rb_rotate_left+0x7/0x5b
[  334.27 PGD 11c15d067 PUD 11c140067 PMD 0
[  334.272508] Oops: 0000 [1] SMP
[  334.272508] CPU 1

[  334.272508] Modules linked in: nfsd auth_rpcgss exportfs netconsole ipv6 nfs lockd nfs_acl sunrpc cypress_m8 usbserial it87 hwmon_vid hwmon ppdev parport_pc ehci_hcd ohci_hcd thermal parport sg processor r8169 thermal_sys

[  334.272508] Pid: 2185, comm: md1_raid5 Not tainted 2.6.27.5 #2
[  334.272508] RIP: 0010:[<ffffffff81111bbf>]  [<ffffffff81111bbf>] __rb_rotate_left+0x7/0x5b
[  334.272508] RSP: 0018:ffff88011d591a48  EFLAGS: 00010082
[  334.272508] RAX: ffff88011cc20a20 RBX: ffff88011cc20a20 RCX: 0000000000000000
[  334.272508] RDX: 0000000000000000 RSI: ffff88011f8bae30 RDI: ffff88011cc20a20
[  334.272508] RBP: ffff88011cc20a20 R08: ffff88011cc20a20 R09: ffff88011ac6b3f8
[  334.272508] R10: 000000002d5a237b R11: 000000002d5a30bb R12: ffff88011cc20a20
[  334.272508] R13: 0000000000000000 R14: ffff88011f8bae30 R15: 0000000000800010
[  334.272508] FS:  00007fb5925496f0(0000) GS:ffff88011f86abc0(0000) knlGS:0000000000000000
[  334.272508] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  334.272508] CR2: 0000000000000010 CR3: 000000011c18c000 CR4: 00000000000006e0
[  334.272508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  334.272508] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  334.272508] Process md1_raid5 (pid: 2185, threadinfo ffff88011d590000, task ffff88011fb12860)
[  334.272508] Stack:  ffffffff81111d20 ffff88011cc209f0 ffff88011cc20a20 ffff88011f8bae00
[  334.272508]  ffff88011f8bae30 ffff88011f8bae00 ffffffff8110a164 ffff88011cc209f0
[  334.272508]  ffff88011cc20a20 ffff88011f8bae30 ffff88011cc209f0 ffff88011f8bae00

[  334.272508] Call Trace:
[  334.272508]  [<ffffffff81111d20>] ? rb_insert_color+0xb2/0xda
[  334.272508]  [<ffffffff8110a164>] ? cfq_prio_tree_add+0x9d/0xa8
[  334.272508]  [<ffffffff8110b12b>] ? cfq_add_rq_rb+0xb5/0xc8
[  334.272508]  [<ffffffff8110b198>] ? cfq_insert_request+0x5a/0x356
[  334.272508]  [<ffffffff811000a1>] ? elv_insert+0x14b/0x218
[  334.272508]  [<ffffffff810ab757>] ? bio_phys_segments+0xf/0x15
[  334.272508]  [<ffffffff811028dc>] ? __make_request+0x3b9/0x3eb
[  334.272508]  [<ffffffff8110120c>] ? generic_make_request+0x30b/0x346
[  334.272508]  [<ffffffff811baae4>] ? raid5_end_write_request+0x0/0xb8
[  334.272508]  [<ffffffff811b8ace>] ? ops_run_io+0x16a/0x1c1
[  334.272508]  [<ffffffff811ba524>] ? handle_stripe5+0x9b5/0x9d6
[  334.272508]  [<ffffffff811bbef8>] ? handle_stripe+0xc3a/0xc6a
[  334.272508]  [<ffffffff810296e5>] ? pick_next_task_fair+0x8d/0x9c
[  334.272508]  [<ffffffff81253782>] ? thread_return+0x3a/0xaa
[  334.272508]  [<ffffffff811bc2be>] ? raid5d+0x396/0x3cd
[  334.272508]  [<ffffffff81253bc8>] ? schedule_timeout+0x1e/0xad
[  334.272508]  [<ffffffff811c715f>] ? md_thread+0xdd/0xf9
[  334.272508]  [<ffffffff81044f9c>] ? autoremove_wake_function+0x0/0x2e
[  334.272508]  [<ffffffff811c7082>] ? md_thread+0x0/0xf9
[  334.272508]  [<ffffffff81044e80>] ? kthread+0x47/0x73
[  334.272508]  [<ffffffff8102f867>] ? schedule_tail+0x28/0x60
[  334.272508]  [<ffffffff8100cda9>] ? child_rip+0xa/0x11
[  334.272508]  [<ffffffff81044e39>] ? kthread+0x0/0x73
[  334.272508]  [<ffffffff8100cd9f>] ? child_rip+0x0/0x11
[  334.272508]
[  334.272508]
[  334.272508] Code: 00 31 c0 eb 19 ff c0 48 89 ee 48 c7 c7 e8 39 44 81 89 43 08 e8 3a 33 14 00 b8 01 00 00 00 5a 5b 5d c3 90 90 48 8b 4f 08 4c 8b 07 8b 51 10 49 83 e0 fc 48 85 d2 48 89 57 08 74 0c 48 8b 02 83

[  334.272508] RIP  [<ffffffff81111bbf>] __rb_rotate_left+0x7/0x5b
[  334.272508]  RSP <ffff88011d591a48>
[  334.272508] CR2: 0000000000000010
[  334.272508] ---[ end trace f1bde8b7b9263879 ]---
======================================

-- 
Thanks,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 17:16       ` Vitaly V. Bursov
@ 2008-11-10 17:35         ` Jens Axboe
  2008-11-10 18:27           ` Vitaly V. Bursov
  0 siblings, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-10 17:35 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: Jeff Moyer, linux-kernel

On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> Jens Axboe wrote:
> > On Mon, Nov 10 2008, Jeff Moyer wrote:
> >> Jens Axboe <jens.axboe@oracle.com> writes:
> >>
> >>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> >> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> >> you to try this patch is that nfsd may be farming off the I/O requests
> >> to different threads which are then performing interleaved I/O.  The
> >> above patch tries to detect this and allow cooperating processes to get
> >> disk time instead of waiting for the idle timeout.
> > 
> > Precisely :-)
> > 
> > The only reason I haven't merged it yet is because of worry of extra
> > cost, but I'll throw some SSD love at it and see how it turns out.
> > 
> 
> Sorry, but I get "oops" same moment nfs read transfer starts.
> I can get directory list via nfs, read files locally (not
> carefully tested, though)
> 
> Dumps captured via netconsole, so these may not be completely accurate
> but hopefully will give a hint.

Interesting, strange how that hasn't triggered here. Or perhaps the
version that Jeff posted isn't the one I tried. Anyway, search for:

        RB_CLEAR_NODE(&cfqq->rb_node);

and add a

        RB_CLEAR_NODE(&cfqq->prio_node);

just below that. It's in cfq_find_alloc_queue(). I think that should fix
it.
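
For reference, the initialization in cfq_find_alloc_queue() should then end
up looking roughly like this (a sketch of 2.6.27's cfq-iosched.c with the
patch applied, not a verbatim quote):

                RB_CLEAR_NODE(&cfqq->rb_node);
                RB_CLEAR_NODE(&cfqq->prio_node);  /* keep the prio tree link in a known-empty state */
                INIT_LIST_HEAD(&cfqq->fifo);

The idea being that the !RB_EMPTY_NODE() checks in cfq_prio_tree_add() and
cfq_del_cfqq_rr() never act on an uninitialized prio_node.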

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 17:35         ` Jens Axboe
@ 2008-11-10 18:27           ` Vitaly V. Bursov
  2008-11-10 18:29             ` Jens Axboe
  2008-11-10 21:51             ` Jeff Moyer
  0 siblings, 2 replies; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-10 18:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel

Jens Axboe wrote:
> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
>> Jens Axboe wrote:
>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>>
>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>>>> you to try this patch is that nfsd may be farming off the I/O requests
>>>> to different threads which are then performing interleaved I/O.  The
>>>> above patch tries to detect this and allow cooperating processes to get
>>>> disk time instead of waiting for the idle timeout.
>>> Precisely :-)
>>>
>>> The only reason I haven't merged it yet is because of worry of extra
>>> cost, but I'll throw some SSD love at it and see how it turns out.
>>>
>> Sorry, but I get "oops" same moment nfs read transfer starts.
>> I can get directory list via nfs, read files locally (not
>> carefully tested, though)
>>
>> Dumps captured via netconsole, so these may not be completely accurate
>> but hopefully will give a hint.
> 
> Interesting, strange how that hasn't triggered here. Or perhaps the
> version that Jeff posted isn't the one I tried. Anyway, search for:
> 
>         RB_CLEAR_NODE(&cfqq->rb_node);
> 
> and add a
> 
>         RB_CLEAR_NODE(&cfqq->prio_node);
> 
> just below that. It's in cfq_find_alloc_queue(). I think that should fix
> it.
> 

Same problem.

I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
but It won't boot anymore with cfq with same error...

Switching cfq io scheduler at runtime (booting with "as") appears to work with
two parallel local dd reads.

But when NFS server starts up:

[  469.000105] BUG: unable to handle kernel
NULL pointer dereference
at 0000000000000000
[  469.000305] IP:
[<ffffffff81111f2a>] rb_erase+0x124/0x290
...

[  469.001905] Pid: 2296, comm: md1_raid5 Not tainted 2.6.27.5 #4
[  469.001982] RIP: 0010:[<ffffffff81111f2a>]
[<ffffffff81111f2a>] rb_erase+0x124/0x290
...
[  469.002509] Call Trace:
[  469.002509]  [<ffffffff8110a0b9>] ? rb_erase_init+0x9/0x17
[  469.002509]  [<ffffffff8110a0ff>] ? cfq_prio_tree_add+0x38/0xa8
[  469.002509]  [<ffffffff8110b13d>] ? cfq_add_rq_rb+0xb5/0xc8
[  469.002509]  [<ffffffff8110b1aa>] ? cfq_insert_request+0x5a/0x356
[  469.002509]  [<ffffffff811000a1>] ? elv_insert+0x14b/0x218
[  469.002509]  [<ffffffff810ab757>] ? bio_phys_segments+0xf/0x15
[  469.002509]  [<ffffffff811028dc>] ? __make_request+0x3b9/0x3eb
[  469.002509]  [<ffffffff8110120c>] ? generic_make_request+0x30b/0x346
[  469.002509]  [<ffffffff811baaf4>] ? raid5_end_write_request+0x0/0xb8
[  469.002509]  [<ffffffff811b8ade>] ? ops_run_io+0x16a/0x1c1
[  469.002509]  [<ffffffff811ba534>] ? handle_stripe5+0x9b5/0x9d6
[  469.002509]  [<ffffffff811bbf08>] ? handle_stripe+0xc3a/0xc6a
[  469.002509]  [<ffffffff810296e5>] ? pick_next_task_fair+0x8d/0x9c
[  469.002509]  [<ffffffff81253792>] ? thread_return+0x3a/0xaa
[  469.002509]  [<ffffffff811bc2ce>] ? raid5d+0x396/0x3cd
[  469.002509]  [<ffffffff81253bd8>] ? schedule_timeout+0x1e/0xad
[  469.002509]  [<ffffffff811c716f>] ? md_thread+0xdd/0xf9
[  469.002509]  [<ffffffff81044f9c>] ? autoremove_wake_function+0x0/0x2e
[  469.002509]  [<ffffffff811c7092>] ? md_thread+0x0/0xf9
[  469.002509]  [<ffffffff81044e80>] ? kthread+0x47/0x73
[  469.002509]  [<ffffffff8102f867>] ? schedule_tail+0x28/0x60
[  469.002509]  [<ffffffff8100cda9>] ? child_rip+0xa/0x11
[  469.002509]  [<ffffffff81044e39>] ? kthread+0x0/0x73
[  469.002509]  [<ffffffff8100cd9f>] ? child_rip+0x0/0x11
...
[  469.002509] RIP
[<ffffffff81111f2a>] rb_erase+0x124/0x290
[  469.002509]  RSP <ffff88011d4c7a58>
[  469.002509] CR2: 0000000000000000
[  469.002509] ---[ end trace acdef779aeb56048 ]---


"Best" result I got with NFS was
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    0,20    0,65    0,00   99,15

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              11,30     0,00    7,60    0,00   245,60     0,00    32,32     0,01    1,18   0,79   0,60
sdb              12,10     0,00    8,00    0,00   246,40     0,00    30,80     0,01    1,62   0,62   0,50

and it lasted around 30 seconds.

-- 
Thanks,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 18:27           ` Vitaly V. Bursov
@ 2008-11-10 18:29             ` Jens Axboe
  2008-11-10 18:39               ` Jeff Moyer
  2008-11-10 18:42               ` Jens Axboe
  2008-11-10 21:51             ` Jeff Moyer
  1 sibling, 2 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-10 18:29 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: Jeff Moyer, linux-kernel

On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> Jens Axboe wrote:
> > On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> >> Jens Axboe wrote:
> >>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> >>>> Jens Axboe <jens.axboe@oracle.com> writes:
> >>>>
> >>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> >>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> >>>> you to try this patch is that nfsd may be farming off the I/O requests
> >>>> to different threads which are then performing interleaved I/O.  The
> >>>> above patch tries to detect this and allow cooperating processes to get
> >>>> disk time instead of waiting for the idle timeout.
> >>> Precisely :-)
> >>>
> >>> The only reason I haven't merged it yet is because of worry of extra
> >>> cost, but I'll throw some SSD love at it and see how it turns out.
> >>>
> >> Sorry, but I get "oops" same moment nfs read transfer starts.
> >> I can get directory list via nfs, read files locally (not
> >> carefully tested, though)
> >>
> >> Dumps captured via netconsole, so these may not be completely accurate
> >> but hopefully will give a hint.
> > 
> > Interesting, strange how that hasn't triggered here. Or perhaps the
> > version that Jeff posted isn't the one I tried. Anyway, search for:
> > 
> >         RB_CLEAR_NODE(&cfqq->rb_node);
> > 
> > and add a
> > 
> >         RB_CLEAR_NODE(&cfqq->prio_node);
> > 
> > just below that. It's in cfq_find_alloc_queue(). I think that should fix
> > it.
> > 
> 
> Same problem.
> 
> I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> but It won't boot anymore with cfq with same error...
> 
> Switching cfq io scheduler at runtime (booting with "as") appears to work with
> two parallel local dd reads.
> 
> But when NFS server starts up:
> 
> [  469.000105] BUG: unable to handle kernel
> NULL pointer dereference
> at 0000000000000000
> [  469.000305] IP:
> [<ffffffff81111f2a>] rb_erase+0x124/0x290
> ...
> 
> [  469.001905] Pid: 2296, comm: md1_raid5 Not tainted 2.6.27.5 #4
> [  469.001982] RIP: 0010:[<ffffffff81111f2a>]
> [<ffffffff81111f2a>] rb_erase+0x124/0x290
> ...
> [  469.002509] Call Trace:
> [  469.002509]  [<ffffffff8110a0b9>] ? rb_erase_init+0x9/0x17
> [  469.002509]  [<ffffffff8110a0ff>] ? cfq_prio_tree_add+0x38/0xa8
> [  469.002509]  [<ffffffff8110b13d>] ? cfq_add_rq_rb+0xb5/0xc8
> [  469.002509]  [<ffffffff8110b1aa>] ? cfq_insert_request+0x5a/0x356
> [  469.002509]  [<ffffffff811000a1>] ? elv_insert+0x14b/0x218
> [  469.002509]  [<ffffffff810ab757>] ? bio_phys_segments+0xf/0x15
> [  469.002509]  [<ffffffff811028dc>] ? __make_request+0x3b9/0x3eb
> [  469.002509]  [<ffffffff8110120c>] ? generic_make_request+0x30b/0x346
> [  469.002509]  [<ffffffff811baaf4>] ? raid5_end_write_request+0x0/0xb8
> [  469.002509]  [<ffffffff811b8ade>] ? ops_run_io+0x16a/0x1c1
> [  469.002509]  [<ffffffff811ba534>] ? handle_stripe5+0x9b5/0x9d6
> [  469.002509]  [<ffffffff811bbf08>] ? handle_stripe+0xc3a/0xc6a
> [  469.002509]  [<ffffffff810296e5>] ? pick_next_task_fair+0x8d/0x9c
> [  469.002509]  [<ffffffff81253792>] ? thread_return+0x3a/0xaa
> [  469.002509]  [<ffffffff811bc2ce>] ? raid5d+0x396/0x3cd
> [  469.002509]  [<ffffffff81253bd8>] ? schedule_timeout+0x1e/0xad
> [  469.002509]  [<ffffffff811c716f>] ? md_thread+0xdd/0xf9
> [  469.002509]  [<ffffffff81044f9c>] ? autoremove_wake_function+0x0/0x2e
> [  469.002509]  [<ffffffff811c7092>] ? md_thread+0x0/0xf9
> [  469.002509]  [<ffffffff81044e80>] ? kthread+0x47/0x73
> [  469.002509]  [<ffffffff8102f867>] ? schedule_tail+0x28/0x60
> [  469.002509]  [<ffffffff8100cda9>] ? child_rip+0xa/0x11
> [  469.002509]  [<ffffffff81044e39>] ? kthread+0x0/0x73
> [  469.002509]  [<ffffffff8100cd9f>] ? child_rip+0x0/0x11
> ...
> [  469.002509] RIP
> [<ffffffff81111f2a>] rb_erase+0x124/0x290
> [  469.002509]  RSP <ffff88011d4c7a58>
> [  469.002509] CR2: 0000000000000000
> [  469.002509] ---[ end trace acdef779aeb56048 ]---
> 
> 
> "Best" result I got with NFS was
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,00    0,00    0,20    0,65    0,00   99,15
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda              11,30     0,00    7,60    0,00   245,60     0,00    32,32     0,01    1,18   0,79   0,60
> sdb              12,10     0,00    8,00    0,00   246,40     0,00    30,80     0,01    1,62   0,62   0,50
> 
> and it lasted around 30 seconds.

OK, I'll throw some NFS at this patch in the morning and do some
measurements as well, so it can get queued up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 18:29             ` Jens Axboe
@ 2008-11-10 18:39               ` Jeff Moyer
  2008-11-10 18:42               ` Jens Axboe
  1 sibling, 0 replies; 70+ messages in thread
From: Jeff Moyer @ 2008-11-10 18:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> OK, I'll throw some NFS at this patch in the morning and do some
> measurements as well, so it can get queued up.

I'll take a look at this this afternoon.

Cheers,

Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 18:29             ` Jens Axboe
  2008-11-10 18:39               ` Jeff Moyer
@ 2008-11-10 18:42               ` Jens Axboe
  1 sibling, 0 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-10 18:42 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: Jeff Moyer, linux-kernel

On Mon, Nov 10 2008, Jens Axboe wrote:
> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> > Jens Axboe wrote:
> > > On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> > >> Jens Axboe wrote:
> > >>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> > >>>> Jens Axboe <jens.axboe@oracle.com> writes:
> > >>>>
> > >>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> > >>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> > >>>> you to try this patch is that nfsd may be farming off the I/O requests
> > >>>> to different threads which are then performing interleaved I/O.  The
> > >>>> above patch tries to detect this and allow cooperating processes to get
> > >>>> disk time instead of waiting for the idle timeout.
> > >>> Precisely :-)
> > >>>
> > >>> The only reason I haven't merged it yet is because of worry of extra
> > >>> cost, but I'll throw some SSD love at it and see how it turns out.
> > >>>
> > >> Sorry, but I get "oops" same moment nfs read transfer starts.
> > >> I can get directory list via nfs, read files locally (not
> > >> carefully tested, though)
> > >>
> > >> Dumps captured via netconsole, so these may not be completely accurate
> > >> but hopefully will give a hint.
> > > 
> > > Interesting, strange how that hasn't triggered here. Or perhaps the
> > > version that Jeff posted isn't the one I tried. Anyway, search for:
> > > 
> > >         RB_CLEAR_NODE(&cfqq->rb_node);
> > > 
> > > and add a
> > > 
> > >         RB_CLEAR_NODE(&cfqq->prio_node);
> > > 
> > > just below that. It's in cfq_find_alloc_queue(). I think that should fix
> > > it.
> > > 
> > 
> > Same problem.
> > 
> > I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> > but It won't boot anymore with cfq with same error...
> > 
> > Switching cfq io scheduler at runtime (booting with "as") appears to work with
> > two parallel local dd reads.
> > 
> > But when NFS server starts up:
> > 
> > [  469.000105] BUG: unable to handle kernel
> > NULL pointer dereference
> > at 0000000000000000
> > [  469.000305] IP:
> > [<ffffffff81111f2a>] rb_erase+0x124/0x290
> > ...
> > 
> > [  469.001905] Pid: 2296, comm: md1_raid5 Not tainted 2.6.27.5 #4
> > [  469.001982] RIP: 0010:[<ffffffff81111f2a>]
> > [<ffffffff81111f2a>] rb_erase+0x124/0x290
> > ...
> > [  469.002509] Call Trace:
> > [  469.002509]  [<ffffffff8110a0b9>] ? rb_erase_init+0x9/0x17
> > [  469.002509]  [<ffffffff8110a0ff>] ? cfq_prio_tree_add+0x38/0xa8
> > [  469.002509]  [<ffffffff8110b13d>] ? cfq_add_rq_rb+0xb5/0xc8
> > [  469.002509]  [<ffffffff8110b1aa>] ? cfq_insert_request+0x5a/0x356
> > [  469.002509]  [<ffffffff811000a1>] ? elv_insert+0x14b/0x218
> > [  469.002509]  [<ffffffff810ab757>] ? bio_phys_segments+0xf/0x15
> > [  469.002509]  [<ffffffff811028dc>] ? __make_request+0x3b9/0x3eb
> > [  469.002509]  [<ffffffff8110120c>] ? generic_make_request+0x30b/0x346
> > [  469.002509]  [<ffffffff811baaf4>] ? raid5_end_write_request+0x0/0xb8
> > [  469.002509]  [<ffffffff811b8ade>] ? ops_run_io+0x16a/0x1c1
> > [  469.002509]  [<ffffffff811ba534>] ? handle_stripe5+0x9b5/0x9d6
> > [  469.002509]  [<ffffffff811bbf08>] ? handle_stripe+0xc3a/0xc6a
> > [  469.002509]  [<ffffffff810296e5>] ? pick_next_task_fair+0x8d/0x9c
> > [  469.002509]  [<ffffffff81253792>] ? thread_return+0x3a/0xaa
> > [  469.002509]  [<ffffffff811bc2ce>] ? raid5d+0x396/0x3cd
> > [  469.002509]  [<ffffffff81253bd8>] ? schedule_timeout+0x1e/0xad
> > [  469.002509]  [<ffffffff811c716f>] ? md_thread+0xdd/0xf9
> > [  469.002509]  [<ffffffff81044f9c>] ? autoremove_wake_function+0x0/0x2e
> > [  469.002509]  [<ffffffff811c7092>] ? md_thread+0x0/0xf9
> > [  469.002509]  [<ffffffff81044e80>] ? kthread+0x47/0x73
> > [  469.002509]  [<ffffffff8102f867>] ? schedule_tail+0x28/0x60
> > [  469.002509]  [<ffffffff8100cda9>] ? child_rip+0xa/0x11
> > [  469.002509]  [<ffffffff81044e39>] ? kthread+0x0/0x73
> > [  469.002509]  [<ffffffff8100cd9f>] ? child_rip+0x0/0x11
> > ...
> > [  469.002509] RIP
> > [<ffffffff81111f2a>] rb_erase+0x124/0x290
> > [  469.002509]  RSP <ffff88011d4c7a58>
> > [  469.002509] CR2: 0000000000000000
> > [  469.002509] ---[ end trace acdef779aeb56048 ]---
> > 
> > 
> > "Best" result I got with NFS was
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >            0,00    0,00    0,20    0,65    0,00   99,15
> > 
> > Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> > sda              11,30     0,00    7,60    0,00   245,60     0,00    32,32     0,01    1,18   0,79   0,60
> > sdb              12,10     0,00    8,00    0,00   246,40     0,00    30,80     0,01    1,62   0,62   0,50
> > 
> > and it lasted around 30 seconds.
> 
> OK, I'll throw some NFS at this patch in the morning and do some
> measurements as well, so it can get queued up.

I spotted a bug - if the ioprio of a process gets changed, it needs to
be repositioned in the cooperator tree or we'll end up doing erase on a
wrong root. Perhaps that is what is biting you here.
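
To make that concrete (an illustrative pseudo-flow only, reusing the names
from the patch; new_ioprio is just a stand-in, not real kernel code):

        /* the queue was linked under the ioprio it had at insert time */
        rb_insert_color(&cfqq->prio_node, &cfqd->prio_trees[cfqq->ioprio]);

        cfqq->ioprio = new_ioprio;      /* prio changes, node is not moved */

        /* the next erase indexes prio_trees[] with the *new* ioprio ... */
        rb_erase_init(&cfqq->prio_node, &cfqd->prio_trees[cfqq->ioprio]);
        /* ... so rb_erase() rebalances a root the node was never in, which
         * would explain the NULL derefs in rb_erase()/__rb_rotate_left() */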

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 18:27           ` Vitaly V. Bursov
  2008-11-10 18:29             ` Jens Axboe
@ 2008-11-10 21:51             ` Jeff Moyer
  2008-11-11  9:34               ` Jens Axboe
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-10 21:51 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: Jens Axboe, linux-kernel

"Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:

> Jens Axboe wrote:
>> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
>>> Jens Axboe wrote:
>>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>>>
>>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>>>>> you to try this patch is that nfsd may be farming off the I/O requests
>>>>> to different threads which are then performing interleaved I/O.  The
>>>>> above patch tries to detect this and allow cooperating processes to get
>>>>> disk time instead of waiting for the idle timeout.
>>>> Precisely :-)
>>>>
>>>> The only reason I haven't merged it yet is because of worry of extra
>>>> cost, but I'll throw some SSD love at it and see how it turns out.
>>>>
>>> Sorry, but I get "oops" same moment nfs read transfer starts.
>>> I can get directory list via nfs, read files locally (not
>>> carefully tested, though)
>>>
>>> Dumps captured via netconsole, so these may not be completely accurate
>>> but hopefully will give a hint.
>> 
>> Interesting, strange how that hasn't triggered here. Or perhaps the
>> version that Jeff posted isn't the one I tried. Anyway, search for:
>> 
>>         RB_CLEAR_NODE(&cfqq->rb_node);
>> 
>> and add a
>> 
>>         RB_CLEAR_NODE(&cfqq->prio_node);
>> 
>> just below that. It's in cfq_find_alloc_queue(). I think that should fix
>> it.
>> 
>
> Same problem.
>
> I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> but It won't boot anymore with cfq with same error...
>
> Switching cfq io scheduler at runtime (booting with "as") appears to work with
> two parallel local dd reads.

Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
are the results I see:

[root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
[root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
[root@maiden ~]# umount /mnt/megadeth/
[root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
[root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
[root@maiden ~]# umount /mnt/megadeth/

Here is the patch, with the suggestion from Jens to switch the cfqq to
the right priority tree when the priority is changed.

Cheers,

Jeff

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6a062ee..dd26cba 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -83,6 +83,12 @@ struct cfq_data {
 	 * rr list of queues with requests and the count of them
 	 */
 	struct cfq_rb_root service_tree;
+	/*
+	 * Each priority tree is sorted by next_request position.  These
+	 * trees are used when determining if two or more queues are
+	 * interleaving requests (see cfq_close_cooperator).
+	 */
+	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 	unsigned int busy_queues;
 
 	int rq_in_driver;
@@ -142,6 +148,8 @@ struct cfq_queue {
 	struct rb_node rb_node;
 	/* service_tree key */
 	unsigned long rb_key;
+	/* prio tree member */
+	struct rb_node prio_node;
 	/* sorted list of pending requests */
 	struct rb_root sort_list;
 	/* if fifo isn't expired, next request to serve */
@@ -415,13 +423,17 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
 	return NULL;
 }
 
+static void rb_erase_init(struct rb_node *n, struct rb_root *root)
+{
+	rb_erase(n, root);
+	RB_CLEAR_NODE(n);
+}
+
 static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
 {
 	if (root->left == n)
 		root->left = NULL;
-
-	rb_erase(n, &root->rb);
-	RB_CLEAR_NODE(n);
+	rb_erase_init(n, &root->rb);
 }
 
 /*
@@ -540,6 +552,62 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
 	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
 }
 
+static struct cfq_queue *cfq_prio_tree_lookup(struct cfq_data *cfqd,
+	int ioprio, sector_t sector, struct rb_node **ret_parent,
+	struct rb_node ***rb_link)
+{
+	struct rb_root *root = &cfqd->prio_trees[ioprio];
+	struct rb_node **p, *parent;
+	struct cfq_queue *cfqq = NULL;
+
+	parent = NULL;
+	p = &root->rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		cfqq = rb_entry(parent, struct cfq_queue, prio_node);
+
+		/*
+		 * Sort strictly based on sector.  Smallest to the left,
+		 * largest to the right.
+		 */
+		if (sector > cfqq->next_rq->sector)
+			n = &(*p)->rb_right;
+		else if (sector < cfqq->next_rq->sector)
+			n = &(*p)->rb_left;
+		else
+			break;
+		p = n;
+	}
+
+	*ret_parent = parent;
+	if (rb_link)
+		*rb_link = p;
+	return NULL;
+}
+
+static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	struct rb_root *root = &cfqd->prio_trees[cfqq->ioprio];
+	struct rb_node **p, *parent;
+	struct cfq_queue *__cfqq;
+
+	if (!RB_EMPTY_NODE(&cfqq->prio_node))
+		rb_erase_init(&cfqq->prio_node, root);
+
+	if (cfq_class_idle(cfqq))
+		return;
+	if (!cfqq->next_rq)
+		return;
+
+	__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->ioprio, cfqq->next_rq->sector, &parent, &p);
+	BUG_ON(__cfqq);
+
+	rb_link_node(&cfqq->prio_node, parent, p);
+	rb_insert_color(&cfqq->prio_node, root);
+}
+
 /*
  * Update cfqq's position in the service tree.
  */
@@ -548,8 +616,10 @@ static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	/*
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
-	if (cfq_cfqq_on_rr(cfqq))
+	if (cfq_cfqq_on_rr(cfqq)) {
 		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_prio_tree_add(cfqd, cfqq);
+	}
 }
 
 /*
@@ -578,6 +648,8 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	if (!RB_EMPTY_NODE(&cfqq->rb_node))
 		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (!RB_EMPTY_NODE(&cfqq->prio_node))
+		rb_erase_init(&cfqq->prio_node, &cfqd->prio_trees[cfqq->ioprio]);
 
 	BUG_ON(!cfqd->busy_queues);
 	cfqd->busy_queues--;
@@ -623,6 +695,9 @@ static void cfq_add_rq_rb(struct request *rq)
 	 * check if this request is a better next-serve candidate
 	 */
 	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
+	/* this changes the position in the priority tree */
+	cfq_prio_tree_add(cfqd, cfqq);
+
 	BUG_ON(!cfqq->next_rq);
 }
 
@@ -831,11 +906,11 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 /*
  * Get and set a new active queue for service.
  */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd)
+static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
+					      struct cfq_queue *cfqq)
 {
-	struct cfq_queue *cfqq;
-
-	cfqq = cfq_get_next_queue(cfqd);
+	if (!cfqq)
+		cfqq = cfq_get_next_queue(cfqd);
 	__cfq_set_active_queue(cfqd, cfqq);
 	return cfqq;
 }
@@ -859,15 +934,77 @@ static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 	return cfq_dist_from_last(cfqd, rq) <= cic->seek_mean;
 }
 
-static int cfq_close_cooperator(struct cfq_data *cfq_data,
-				struct cfq_queue *cfqq)
+static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
+				    struct cfq_queue *cur_cfqq)
 {
+	struct rb_root *root = &cfqd->prio_trees[cur_cfqq->ioprio];
+	struct rb_node *parent, *node;
+	struct cfq_queue *__cfqq;
+	sector_t sector = cfqd->last_position;
+
+	if (RB_EMPTY_ROOT(root))
+		return NULL;
+
+	/*
+	 * First, if we find a request starting at the end of the last
+	 * request, choose it.
+	 */
+	__cfqq = cfq_prio_tree_lookup(cfqd, cur_cfqq->ioprio,
+				      sector, &parent, NULL);
+	if (__cfqq)
+		return __cfqq;
+
+	/*
+	 * If the exact sector wasn't found, the parent of the NULL leaf
+	 * will contain the closest sector.
+	 */
+	__cfqq = rb_entry(parent, struct cfq_queue, prio_node);
+	if (cfq_rq_close(cfqd, __cfqq->next_rq))
+		return __cfqq;
+	if (__cfqq->next_rq->sector < sector)
+		node = rb_next(&__cfqq->prio_node);
+	else
+		node = rb_prev(&__cfqq->prio_node);
+	if (!node)
+		return NULL;
+	__cfqq = rb_entry(node, struct cfq_queue, prio_node);
+	if (cfq_rq_close(cfqd, __cfqq->next_rq))
+		return __cfqq;
+
+	return NULL;
+}
+
+/*
+ * cfqd - obvious
+ * cur_cfqq - passed in so that we don't decide that the current queue is
+ * 	      closely cooperating with itself.
+ *
+ * So, basically we're assuming that that cur_cfqq has dispatched at least
+ * one request, and that cfqd->last_position reflects a position on the disk
+ * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
+ * assumption.
+ */
+static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
+					      struct cfq_queue *cur_cfqq)
+{
+	struct cfq_queue *cfqq;
+
+	/*
+	 * A valid cfq_io_context is necessary to compare requests against
+	 * the seek_mean of the current cfqq.
+	 */
+	if (!cfqd->active_cic)
+		return NULL;
+
 	/*
 	 * We should notice if some of the queues are cooperating, eg
 	 * working closely on the same area of the disk. In that case,
 	 * we can group them together and don't waste time idling.
 	 */
-	return 0;
+	if ((cfqq = cfqq_close(cfqd, cur_cfqq)))
+		return cfqq;
+
+	return NULL;
 }
 
 #define CIC_SEEKY(cic) ((cic)->seek_mean > (8 * 1024))
@@ -908,13 +1045,6 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	/*
-	 * See if this prio level has a good candidate
-	 */
-	if (cfq_close_cooperator(cfqd, cfqq) &&
-	    (sample_valid(cic->ttime_samples) && cic->ttime_mean > 2))
-		return;
-
 	cfq_mark_cfqq_must_dispatch(cfqq);
 	cfq_mark_cfqq_wait_request(cfqq);
 
@@ -992,7 +1122,7 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  */
 static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cfqq, *new_cfqq = NULL;
 
 	cfqq = cfqd->active_queue;
 	if (!cfqq)
@@ -1012,6 +1142,15 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto keep_queue;
 
 	/*
+ 	 * If another queue has a request waiting within our mean seek
+ 	 * distance, let it run.  The expire code will check for close
+ 	 * cooperators and put the close queue at the front of the service
+ 	 * tree.
+ 	 */
+	if ((new_cfqq = cfq_close_cooperator(cfqd, cfqq)))
+		goto expire;
+
+	/*
 	 * No requests pending. If the active queue still has requests in
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
@@ -1025,7 +1164,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
-	cfqq = cfq_set_active_queue(cfqd);
+	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
 }
@@ -1466,6 +1605,7 @@ retry:
 		}
 
 		RB_CLEAR_NODE(&cfqq->rb_node);
+		RB_CLEAR_NODE(&cfqq->prio_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
 
 		atomic_set(&cfqq->ref, 0);
@@ -1949,7 +2089,18 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 			cfq_set_prio_slice(cfqd, cfqq);
 			cfq_clear_cfqq_slice_new(cfqq);
 		}
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (RB_EMPTY_ROOT(&cfqq->sort_list) &&
+		    !RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
+			if (cfq_close_cooperator(cfqd, cfqq))
+				cfq_slice_expired(cfqd, 1);
+		} else if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
 			cfq_slice_expired(cfqd, 1);
 		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
 			cfq_arm_slice_timer(cfqd);
@@ -1983,6 +2134,7 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
 		if (cfqq->ioprio != cfqq->org_ioprio)
 			cfqq->ioprio = cfqq->org_ioprio;
 	}
+	cfq_prio_tree_add(cfqq->cfqd, cfqq);
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 21:51             ` Jeff Moyer
@ 2008-11-11  9:34               ` Jens Axboe
  2008-11-11  9:35                 ` Jens Axboe
  0 siblings, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-11  9:34 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Mon, Nov 10 2008, Jeff Moyer wrote:
> "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
> 
> > Jens Axboe wrote:
> >> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> >>> Jens Axboe wrote:
> >>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> >>>>> Jens Axboe <jens.axboe@oracle.com> writes:
> >>>>>
> >>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> >>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> >>>>> you to try this patch is that nfsd may be farming off the I/O requests
> >>>>> to different threads which are then performing interleaved I/O.  The
> >>>>> above patch tries to detect this and allow cooperating processes to get
> >>>>> disk time instead of waiting for the idle timeout.
> >>>> Precisely :-)
> >>>>
> >>>> The only reason I haven't merged it yet is because of worry of extra
> >>>> cost, but I'll throw some SSD love at it and see how it turns out.
> >>>>
> >>> Sorry, but I get "oops" same moment nfs read transfer starts.
> >>> I can get directory list via nfs, read files locally (not
> >>> carefully tested, though)
> >>>
> >>> Dumps captured via netconsole, so these may not be completely accurate
> >>> but hopefully will give a hint.
> >> 
> >> Interesting, strange how that hasn't triggered here. Or perhaps the
> >> version that Jeff posted isn't the one I tried. Anyway, search for:
> >> 
> >>         RB_CLEAR_NODE(&cfqq->rb_node);
> >> 
> >> and add a
> >> 
> >>         RB_CLEAR_NODE(&cfqq->prio_node);
> >> 
> >> just below that. It's in cfq_find_alloc_queue(). I think that should fix
> >> it.
> >> 
> >
> > Same problem.
> >
> > I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> > but It won't boot anymore with cfq with same error...
> >
> > Switching cfq io scheduler at runtime (booting with "as") appears to work with
> > two parallel local dd reads.
> 
> Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
> are the results I see:
> 
> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
> [root@maiden ~]# umount /mnt/megadeth/
> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
> [root@maiden ~]# umount /mnt/megadeth/
> 
> Here is the patch, with the suggestion from Jens to switch the cfqq to
> the right priority tree when the priority is changed.

I don't see the issue here either. Vitaly, are you using any openvz
kernel patches? IIRC, they patch cfq so it could just be that your cfq
version is incompatible with Jeff's patch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11  9:34               ` Jens Axboe
@ 2008-11-11  9:35                 ` Jens Axboe
  2008-11-11 11:52                   ` Jens Axboe
  0 siblings, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-11  9:35 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Tue, Nov 11 2008, Jens Axboe wrote:
> On Mon, Nov 10 2008, Jeff Moyer wrote:
> > "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
> > 
> > > Jens Axboe wrote:
> > >> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> > >>> Jens Axboe wrote:
> > >>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> > >>>>> Jens Axboe <jens.axboe@oracle.com> writes:
> > >>>>>
> > >>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> > >>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> > >>>>> you to try this patch is that nfsd may be farming off the I/O requests
> > >>>>> to different threads which are then performing interleaved I/O.  The
> > >>>>> above patch tries to detect this and allow cooperating processes to get
> > >>>>> disk time instead of waiting for the idle timeout.
> > >>>> Precisely :-)
> > >>>>
> > >>>> The only reason I haven't merged it yet is because of worry of extra
> > >>>> cost, but I'll throw some SSD love at it and see how it turns out.
> > >>>>
> > >>> Sorry, but I get "oops" same moment nfs read transfer starts.
> > >>> I can get directory list via nfs, read files locally (not
> > >>> carefully tested, though)
> > >>>
> > >>> Dumps captured via netconsole, so these may not be completely accurate
> > >>> but hopefully will give a hint.
> > >> 
> > >> Interesting, strange how that hasn't triggered here. Or perhaps the
> > >> version that Jeff posted isn't the one I tried. Anyway, search for:
> > >> 
> > >>         RB_CLEAR_NODE(&cfqq->rb_node);
> > >> 
> > >> and add a
> > >> 
> > >>         RB_CLEAR_NODE(&cfqq->prio_node);
> > >> 
> > >> just below that. It's in cfq_find_alloc_queue(). I think that should fix
> > >> it.
> > >> 
> > >
> > > Same problem.
> > >
> > > I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> > > but It won't boot anymore with cfq with same error...
> > >
> > > Switching cfq io scheduler at runtime (booting with "as") appears to work with
> > > two parallel local dd reads.
> > 
> > Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
> > are the results I see:
> > 
> > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
> > [root@maiden ~]# umount /mnt/megadeth/
> > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
> > [root@maiden ~]# umount /mnt/megadeth/
> > 
> > Here is the patch, with the suggestion from Jens to switch the cfqq to
> > the right priority tree when the priority is changed.
> 
> I don't see the issue here either. Vitaly, are you using any openvz
> kernel patches? IIRC, they patch cfq so it could just be that your cfq
> version is incompatible with Jeff's patch.

Heh, got it to trigger about 3 seconds after sending that email! I'll
look more into it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11  9:35                 ` Jens Axboe
@ 2008-11-11 11:52                   ` Jens Axboe
  2008-11-11 16:48                     ` Jeff Moyer
  2008-11-11 16:53                     ` Vitaly V. Bursov
  0 siblings, 2 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-11 11:52 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Tue, Nov 11 2008, Jens Axboe wrote:
> On Tue, Nov 11 2008, Jens Axboe wrote:
> > On Mon, Nov 10 2008, Jeff Moyer wrote:
> > > "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
> > > 
> > > > Jens Axboe wrote:
> > > >> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> > > >>> Jens Axboe wrote:
> > > >>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> > > >>>>> Jens Axboe <jens.axboe@oracle.com> writes:
> > > >>>>>
> > > >>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> > > >>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> > > >>>>> you to try this patch is that nfsd may be farming off the I/O requests
> > > >>>>> to different threads which are then performing interleaved I/O.  The
> > > >>>>> above patch tries to detect this and allow cooperating processes to get
> > > >>>>> disk time instead of waiting for the idle timeout.
> > > >>>> Precisely :-)
> > > >>>>
> > > >>>> The only reason I haven't merged it yet is because of worry of extra
> > > >>>> cost, but I'll throw some SSD love at it and see how it turns out.
> > > >>>>
> > > >>> Sorry, but I get "oops" same moment nfs read transfer starts.
> > > >>> I can get directory list via nfs, read files locally (not
> > > >>> carefully tested, though)
> > > >>>
> > > >>> Dumps captured via netconsole, so these may not be completely accurate
> > > >>> but hopefully will give a hint.
> > > >> 
> > > >> Interesting, strange how that hasn't triggered here. Or perhaps the
> > > >> version that Jeff posted isn't the one I tried. Anyway, search for:
> > > >> 
> > > >>         RB_CLEAR_NODE(&cfqq->rb_node);
> > > >> 
> > > >> and add a
> > > >> 
> > > >>         RB_CLEAR_NODE(&cfqq->prio_node);
> > > >> 
> > > >> just below that. It's in cfq_find_alloc_queue(). I think that should fix
> > > >> it.
> > > >> 
> > > >
> > > > Same problem.
> > > >
> > > > I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> > > > but It won't boot anymore with cfq with same error...
> > > >
> > > > Switching cfq io scheduler at runtime (booting with "as") appears to work with
> > > > two parallel local dd reads.
> > > 
> > > Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
> > > are the results I see:
> > > 
> > > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> > > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> > > 1024+0 records in
> > > 1024+0 records out
> > > 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
> > > [root@maiden ~]# umount /mnt/megadeth/
> > > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> > > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> > > 1024+0 records in
> > > 1024+0 records out
> > > 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
> > > [root@maiden ~]# umount /mnt/megadeth/
> > > 
> > > Here is the patch, with the suggestion from Jens to switch the cfqq to
> > > the right priority tree when the priority is changed.
> > 
> > I don't see the issue here either. Vitaly, are you using any openvz
> > kernel patches? IIRC, they patch cfq so it could just be that your cfq
> > version is incompatible with Jeff's patch.
> 
> Heh, got it to trigger about 3 seconds after sending that email! I'll
> look more into it.

OK, found the issue. A few bugs there... cfq_prio_tree_lookup() doesn't
even return a hit, since it just breaks and returns NULL always. That
can cause cfq_prio_tree_add() to screw up the rbtree. The code to
correct on ioprio change wasn't correct either, I changed that as well.
New patch below, Vitaly can you give it a spin?
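
For anyone skimming, the lookup bug boils down to the pattern below (a
standalone userspace sketch of the idea, with made-up names, not the
cfq-iosched code itself): the walk locates the matching node but only
breaks out of the loop, so control falls through to the not-found return.

#include <stdio.h>

struct node {
	unsigned long long sector;	/* key, like cfqq->next_rq->sector */
	struct node *left, *right;
};

/* Buggy variant: the walk finds the node but only breaks out */
static struct node *lookup_buggy(struct node *root, unsigned long long sector)
{
	struct node *n = root;

	while (n) {
		if (sector > n->sector)
			n = n->right;
		else if (sector < n->sector)
			n = n->left;
		else
			break;		/* found it, but the hit is lost... */
	}
	return NULL;			/* ...because we always end up here */
}

/* Fixed variant: return the matching node straight from the walk */
static struct node *lookup_fixed(struct node *root, unsigned long long sector)
{
	struct node *n = root;

	while (n) {
		if (sector > n->sector)
			n = n->right;
		else if (sector < n->sector)
			n = n->left;
		else
			return n;	/* exact match */
	}
	return NULL;			/* genuinely not in the tree */
}

int main(void)
{
	struct node a = { 100, NULL, NULL };

	printf("buggy: %p\n", (void *)lookup_buggy(&a, 100));	/* (nil) */
	printf("fixed: %p\n", (void *)lookup_fixed(&a, 100));	/* &a */
	return 0;
}

The patch below does the equivalent of lookup_fixed(): return the
matching cfqq from inside the loop instead of breaking out.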

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6a062ee..b1f1b47 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -83,6 +83,12 @@ struct cfq_data {
 	 * rr list of queues with requests and the count of them
 	 */
 	struct cfq_rb_root service_tree;
+	/*
+	 * Each priority tree is sorted by next_request position.  These
+	 * trees are used when determining if two or more queues are
+	 * interleaving requests (see cfq_close_cooperator).
+	 */
+	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 	unsigned int busy_queues;
 
 	int rq_in_driver;
@@ -142,6 +148,9 @@ struct cfq_queue {
 	struct rb_node rb_node;
 	/* service_tree key */
 	unsigned long rb_key;
+	/* prio tree member */
+	struct rb_node prio_node;
+	struct rb_root *prio_root;
 	/* sorted list of pending requests */
 	struct rb_root sort_list;
 	/* if fifo isn't expired, next request to serve */
@@ -415,13 +424,17 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
 	return NULL;
 }
 
+static void rb_erase_init(struct rb_node *n, struct rb_root *root)
+{
+	rb_erase(n, root);
+	RB_CLEAR_NODE(n);
+}
+
 static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
 {
 	if (root->left == n)
 		root->left = NULL;
-
-	rb_erase(n, &root->rb);
-	RB_CLEAR_NODE(n);
+	rb_erase_init(n, &root->rb);
 }
 
 /*
@@ -540,6 +553,67 @@ static void cfq_service_tree_add(struct cfq_data *cfqd,
 	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
 }
 
+static struct cfq_queue *cfq_prio_tree_lookup(struct cfq_data *cfqd,
+	int ioprio, sector_t sector, struct rb_node **ret_parent,
+	struct rb_node ***rb_link)
+{
+	struct rb_root *root = &cfqd->prio_trees[ioprio];
+	struct rb_node **p, *parent;
+	struct cfq_queue *cfqq = NULL;
+
+	parent = NULL;
+	p = &root->rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		cfqq = rb_entry(parent, struct cfq_queue, prio_node);
+
+		/*
+		 * Sort strictly based on sector.  Smallest to the left,
+		 * largest to the right.
+		 */
+		if (sector > cfqq->next_rq->sector)
+			n = &(*p)->rb_right;
+		else if (sector < cfqq->next_rq->sector)
+			n = &(*p)->rb_left;
+		else
+			return cfqq;
+		p = n;
+	}
+
+	*ret_parent = parent;
+	if (rb_link)
+		*rb_link = p;
+	return NULL;
+}
+
+static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	struct rb_node **p, *parent;
+	struct cfq_queue *__cfqq;
+
+	if (cfqq->prio_root) {
+		rb_erase_init(&cfqq->prio_node, cfqq->prio_root);
+		cfqq->prio_root = NULL;
+	}
+
+	if (cfq_class_idle(cfqq))
+		return;
+	if (!cfqq->next_rq)
+		return;
+
+	/*
+	 * If an alias is found, don't bother adding this one
+	 */
+	__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->ioprio, cfqq->next_rq->sector, &parent, &p);
+	if (!__cfqq) {
+		cfqq->prio_root = &cfqd->prio_trees[cfqq->ioprio];
+		rb_link_node(&cfqq->prio_node, parent, p);
+		rb_insert_color(&cfqq->prio_node, cfqq->prio_root);
+	}
+}
+
 /*
  * Update cfqq's position in the service tree.
  */
@@ -548,8 +622,10 @@ static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	/*
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
-	if (cfq_cfqq_on_rr(cfqq))
+	if (cfq_cfqq_on_rr(cfqq)) {
 		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_prio_tree_add(cfqd, cfqq);
+	}
 }
 
 /*
@@ -578,6 +654,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	if (!RB_EMPTY_NODE(&cfqq->rb_node))
 		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (cfqq->prio_root) {
+		rb_erase_init(&cfqq->prio_node, cfqq->prio_root);
+		cfqq->prio_root = NULL;
+	}
 
 	BUG_ON(!cfqd->busy_queues);
 	cfqd->busy_queues--;
@@ -623,6 +703,9 @@ static void cfq_add_rq_rb(struct request *rq)
 	 * check if this request is a better next-serve candidate
 	 */
 	cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
+	/* this changes the position in the priority tree */
+	cfq_prio_tree_add(cfqd, cfqq);
+
 	BUG_ON(!cfqq->next_rq);
 }
 
@@ -831,11 +914,11 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 /*
  * Get and set a new active queue for service.
  */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd)
+static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
+					      struct cfq_queue *cfqq)
 {
-	struct cfq_queue *cfqq;
-
-	cfqq = cfq_get_next_queue(cfqd);
+	if (!cfqq)
+		cfqq = cfq_get_next_queue(cfqd);
 	__cfq_set_active_queue(cfqd, cfqq);
 	return cfqq;
 }
@@ -859,15 +942,78 @@ static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 	return cfq_dist_from_last(cfqd, rq) <= cic->seek_mean;
 }
 
-static int cfq_close_cooperator(struct cfq_data *cfq_data,
-				struct cfq_queue *cfqq)
+static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
+				    struct cfq_queue *cur_cfqq)
+{
+	struct rb_root *root = &cfqd->prio_trees[cur_cfqq->ioprio];
+	struct rb_node *parent, *node;
+	struct cfq_queue *__cfqq;
+	sector_t sector = cfqd->last_position;
+
+	if (RB_EMPTY_ROOT(root))
+		return NULL;
+
+	/*
+	 * First, if we find a request starting at the end of the last
+	 * request, choose it.
+	 */
+	__cfqq = cfq_prio_tree_lookup(cfqd, cur_cfqq->ioprio,
+				      sector, &parent, NULL);
+	if (__cfqq)
+		return __cfqq;
+
+	/*
+	 * If the exact sector wasn't found, the parent of the NULL leaf
+	 * will contain the closest sector.
+	 */
+	__cfqq = rb_entry(parent, struct cfq_queue, prio_node);
+	if (cfq_rq_close(cfqd, __cfqq->next_rq))
+		return __cfqq;
+	if (__cfqq->next_rq->sector < sector)
+		node = rb_next(&__cfqq->prio_node);
+	else
+		node = rb_prev(&__cfqq->prio_node);
+	if (!node)
+		return NULL;
+	__cfqq = rb_entry(node, struct cfq_queue, prio_node);
+	if (cfq_rq_close(cfqd, __cfqq->next_rq))
+		return __cfqq;
+
+	return NULL;
+}
+
+/*
+ * cfqd - obvious
+ * cur_cfqq - passed in so that we don't decide that the current queue is
+ * 	      closely cooperating with itself.
+ *
+ * So, basically we're assuming that cur_cfqq has dispatched at least
+ * one request, and that cfqd->last_position reflects a position on the disk
+ * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
+ * assumption.
+ */
+static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
+					      struct cfq_queue *cur_cfqq)
 {
+	struct cfq_queue *cfqq;
+
+	/*
+	 * A valid cfq_io_context is necessary to compare requests against
+	 * the seek_mean of the current cfqq.
+	 */
+	if (!cfqd->active_cic)
+		return NULL;
+
 	/*
 	 * We should notice if some of the queues are cooperating, eg
 	 * working closely on the same area of the disk. In that case,
 	 * we can group them together and don't waste time idling.
 	 */
-	return 0;
+	cfqq = cfqq_close(cfqd, cur_cfqq);
+	if (cfqq)
+		return cfqq;
+
+	return NULL;
 }
 
 #define CIC_SEEKY(cic) ((cic)->seek_mean > (8 * 1024))
@@ -908,13 +1054,6 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	/*
-	 * See if this prio level has a good candidate
-	 */
-	if (cfq_close_cooperator(cfqd, cfqq) &&
-	    (sample_valid(cic->ttime_samples) && cic->ttime_mean > 2))
-		return;
-
 	cfq_mark_cfqq_must_dispatch(cfqq);
 	cfq_mark_cfqq_wait_request(cfqq);
 
@@ -992,7 +1131,7 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
  */
 static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cfqq, *new_cfqq = NULL;
 
 	cfqq = cfqd->active_queue;
 	if (!cfqq)
@@ -1012,6 +1151,15 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto keep_queue;
 
 	/*
+ 	 * If another queue has a request waiting within our mean seek
+ 	 * distance, let it run.  The expire code will check for close
+ 	 * cooperators and put the close queue at the front of the service
+ 	 * tree.
+ 	 */
+	if ((new_cfqq = cfq_close_cooperator(cfqd, cfqq)))
+		goto expire;
+
+	/*
 	 * No requests pending. If the active queue still has requests in
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
@@ -1025,7 +1173,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
-	cfqq = cfq_set_active_queue(cfqd);
+	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
 }
@@ -1174,6 +1322,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 
 	cfq_log_cfqq(cfqd, cfqq, "put_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
+	BUG_ON(!RB_EMPTY_NODE(&cfqq->prio_node));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
 	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
@@ -1386,6 +1535,15 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 	}
 
 	/*
+	 * Reposition in prio_tree array, if we are already there
+	 */
+	if (cfqq->prio_root) {
+		rb_erase_init(&cfqq->prio_node, cfqq->prio_root);
+		cfqq->prio_root = NULL;
+		cfq_prio_tree_add(cfqq->cfqd, cfqq);
+	}
+
+	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
@@ -1466,6 +1624,7 @@ retry:
 		}
 
 		RB_CLEAR_NODE(&cfqq->rb_node);
+		RB_CLEAR_NODE(&cfqq->prio_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
 
 		atomic_set(&cfqq->ref, 0);
@@ -1949,7 +2108,18 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 			cfq_set_prio_slice(cfqd, cfqq);
 			cfq_clear_cfqq_slice_new(cfqq);
 		}
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (RB_EMPTY_ROOT(&cfqq->sort_list) &&
+		    !RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
+			if (cfq_close_cooperator(cfqd, cfqq))
+				cfq_slice_expired(cfqd, 1);
+		} else if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
 			cfq_slice_expired(cfqd, 1);
 		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
 			cfq_arm_slice_timer(cfqd);
@@ -2210,12 +2380,17 @@ static void cfq_exit_queue(elevator_t *e)
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
+	int i;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
 	cfqd->service_tree = CFQ_RB_ROOT;
+
+	for (i = 0; i < CFQ_PRIO_LISTS; i++)
+		cfqd->prio_trees[i] = RB_ROOT;
+
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
 	cfqd->queue = q;

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 11:52                   ` Jens Axboe
@ 2008-11-11 16:48                     ` Jeff Moyer
  2008-11-11 18:08                       ` Jens Axboe
  2008-11-11 16:53                     ` Vitaly V. Bursov
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-11 16:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Tue, Nov 11 2008, Jens Axboe wrote:
>> On Tue, Nov 11 2008, Jens Axboe wrote:
>> > On Mon, Nov 10 2008, Jeff Moyer wrote:
>> > > "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
>> > > 
>> > > > Jens Axboe wrote:
>> > > >> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
>> > > >>> Jens Axboe wrote:
>> > > >>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>> > > >>>>> Jens Axboe <jens.axboe@oracle.com> writes:
>> > > >>>>>
>> > > >>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>> > > >>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>> > > >>>>> you to try this patch is that nfsd may be farming off the I/O requests
>> > > >>>>> to different threads which are then performing interleaved I/O.  The
>> > > >>>>> above patch tries to detect this and allow cooperating processes to get
>> > > >>>>> disk time instead of waiting for the idle timeout.
>> > > >>>> Precisely :-)
>> > > >>>>
>> > > >>>> The only reason I haven't merged it yet is because of worry of extra
>> > > >>>> cost, but I'll throw some SSD love at it and see how it turns out.
>> > > >>>>
>> > > >>> Sorry, but I get "oops" same moment nfs read transfer starts.
>> > > >>> I can get directory list via nfs, read files locally (not
>> > > >>> carefully tested, though)
>> > > >>>
>> > > >>> Dumps captured via netconsole, so these may not be completely accurate
>> > > >>> but hopefully will give a hint.
>> > > >> 
>> > > >> Interesting, strange how that hasn't triggered here. Or perhaps the
>> > > >> version that Jeff posted isn't the one I tried. Anyway, search for:
>> > > >> 
>> > > >>         RB_CLEAR_NODE(&cfqq->rb_node);
>> > > >> 
>> > > >> and add a
>> > > >> 
>> > > >>         RB_CLEAR_NODE(&cfqq->prio_node);
>> > > >> 
>> > > >> just below that. It's in cfq_find_alloc_queue(). I think that should fix
>> > > >> it.
>> > > >> 
>> > > >
>> > > > Same problem.
>> > > >
>> > > > I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
>> > > > but It won't boot anymore with cfq with same error...
>> > > >
>> > > > Switching cfq io scheduler at runtime (booting with "as") appears to work with
>> > > > two parallel local dd reads.
>> > > 
>> > > Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
>> > > are the results I see:
>> > > 
>> > > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
>> > > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
>> > > 1024+0 records in
>> > > 1024+0 records out
>> > > 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
>> > > [root@maiden ~]# umount /mnt/megadeth/
>> > > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
>> > > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
>> > > 1024+0 records in
>> > > 1024+0 records out
>> > > 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
>> > > [root@maiden ~]# umount /mnt/megadeth/
>> > > 
>> > > Here is the patch, with the suggestion from Jens to switch the cfqq to
>> > > the right priority tree when the priority is changed.
>> > 
>> > I don't see the issue here either. Vitaly, are you using any openvz
>> > kernel patches? IIRC, they patch cfq so it could just be that your cfq
>> > version is incompatible with Jeff's patch.
>> 
>> Heh, got it to trigger about 3 seconds after sending that email! I'll
>> look more into it.
>
> OK, found the issue. A few bugs there... cfq_prio_tree_lookup() doesn't
> even return a hit, since it just breaks and returns NULL always. That
> can cause cfq_prio_tree_add() to screw up the rbtree. The code to
> correct on ioprio change wasn't correct either, I changed that as well.
> New patch below, Vitaly can you give it a spin?

Thanks for doing that!  Yeah, that was a stupid bug with the lookup
routine.  I don't know that I agree with you that the ioprio change code
was wrong.  I looked at all of the callers and that seemed the code path
that was used for I/O priority *changes*.  The initial creation was
already okay, wasn't it?

Anyway, I'll test this new version.

Thanks!

Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 11:52                   ` Jens Axboe
  2008-11-11 16:48                     ` Jeff Moyer
@ 2008-11-11 16:53                     ` Vitaly V. Bursov
  2008-11-11 18:06                       ` Jens Axboe
  1 sibling, 1 reply; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-11 16:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel

Jens Axboe wrote:
> On Tue, Nov 11 2008, Jens Axboe wrote:
>> On Tue, Nov 11 2008, Jens Axboe wrote:
>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>> "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
>>>>
>>>>> Jens Axboe wrote:
>>>>>> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
>>>>>>> Jens Axboe wrote:
>>>>>>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>>>>>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>>>>>>>
>>>>>>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>>>>>>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>>>>>>>>> you to try this patch is that nfsd may be farming off the I/O requests
>>>>>>>>> to different threads which are then performing interleaved I/O.  The
>>>>>>>>> above patch tries to detect this and allow cooperating processes to get
>>>>>>>>> disk time instead of waiting for the idle timeout.
>>>>>>>> Precisely :-)
>>>>>>>>
>>>>>>>> The only reason I haven't merged it yet is because of worry of extra
>>>>>>>> cost, but I'll throw some SSD love at it and see how it turns out.
>>>>>>>>
>>>>>>> Sorry, but I get "oops" same moment nfs read transfer starts.
>>>>>>> I can get directory list via nfs, read files locally (not
>>>>>>> carefully tested, though)
>>>>>>>
>>>>>>> Dumps captured via netconsole, so these may not be completely accurate
>>>>>>> but hopefully will give a hint.
>>>>>> Interesting, strange how that hasn't triggered here. Or perhaps the
>>>>>> version that Jeff posted isn't the one I tried. Anyway, search for:
>>>>>>
>>>>>>         RB_CLEAR_NODE(&cfqq->rb_node);
>>>>>>
>>>>>> and add a
>>>>>>
>>>>>>         RB_CLEAR_NODE(&cfqq->prio_node);
>>>>>>
>>>>>> just below that. It's in cfq_find_alloc_queue(). I think that should fix
>>>>>> it.
>>>>>>
>>>>> Same problem.
>>>>>
>>>>> I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
>>>>> but It won't boot anymore with cfq with same error...
>>>>>
>>>>> Switching cfq io scheduler at runtime (booting with "as") appears to work with
>>>>> two parallel local dd reads.
>>>> Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
>>>> are the results I see:
>>>>
>>>> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
>>>> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
>>>> 1024+0 records in
>>>> 1024+0 records out
>>>> 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
>>>> [root@maiden ~]# umount /mnt/megadeth/
>>>> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
>>>> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
>>>> 1024+0 records in
>>>> 1024+0 records out
>>>> 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
>>>> [root@maiden ~]# umount /mnt/megadeth/
>>>>
>>>> Here is the patch, with the suggestion from Jens to switch the cfqq to
>>>> the right priority tree when the priority is changed.
>>> I don't see the issue here either. Vitaly, are you using any openvz
>>> kernel patches? IIRC, they patch cfq so it could just be that your cfq
>>> version is incompatible with Jeff's patch.
>> Heh, got it to trigger about 3 seconds after sending that email! I'll
>> look more into it.
> 
> OK, found the issue. A few bugs there... cfq_prio_tree_lookup() doesn't
> even return a hit, since it just breaks and returns NULL always. That
> can cause cfq_prio_tree_add() to screw up the rbtree. The code to
> correct on ioprio change wasn't correct either, I changed that as well.
> New patch below, Vitaly can you give it a spin?
> 

No crashes so far. Transfer speed is quite good also.


NFS+deadline, file not cached:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   25,50   19,40    0,00   55,10

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            6648,80     0,00 1281,70    0,00 115179,20     0,00    89,86     5,35    4,18   0,35  45,20
sdb            6672,30     0,00 1257,00    0,00 115292,80     0,00    91,72     5,09    4,06   0,35  44,60



NFS+cfq, file not cached:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,05    0,00   25,30   23,95    0,00   50,70

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            6403,00     0,00 1089,90    0,00 108655,20     0,00    99,69     4,50    4,13   0,41  44,50
sdb            6394,90     0,00 1099,60    0,00 108639,20     0,00    98,80     4,53    4,12   0,39  42,50


Just for reference: 10 sec interval averages, gigabit network;
the lack of tcp/udp hardware checksumming may explain the high system cpu load.


Also, a few more tests (server has 4G RAM):

NFS+cfq, file not cached:
$ dd if=test of=/dev/null bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 24.9147 s, 84.2 MB/s

NFS+deadline, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 23.2999 s, 90.0 MB/s

file cached on server:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 21.9784 s, 95.4 MB/s


Local single dd read leads to 193 MB/s for deadline and
167 MB/s for cfq.

-- 
Thanks,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 16:53                     ` Vitaly V. Bursov
@ 2008-11-11 18:06                       ` Jens Axboe
  2008-11-11 19:36                         ` Jeff Moyer
  2008-11-11 19:42                         ` Vitaly V. Bursov
  0 siblings, 2 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-11 18:06 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: Jeff Moyer, linux-kernel

On Tue, Nov 11 2008, Vitaly V. Bursov wrote:
> Jens Axboe wrote:
> > On Tue, Nov 11 2008, Jens Axboe wrote:
> >> On Tue, Nov 11 2008, Jens Axboe wrote:
> >>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> >>>> "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
> >>>>
> >>>>> Jens Axboe wrote:
> >>>>>> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> >>>>>>> Jens Axboe wrote:
> >>>>>>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> >>>>>>>>> Jens Axboe <jens.axboe@oracle.com> writes:
> >>>>>>>>>
> >>>>>>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> >>>>>>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> >>>>>>>>> you to try this patch is that nfsd may be farming off the I/O requests
> >>>>>>>>> to different threads which are then performing interleaved I/O.  The
> >>>>>>>>> above patch tries to detect this and allow cooperating processes to get
> >>>>>>>>> disk time instead of waiting for the idle timeout.
> >>>>>>>> Precisely :-)
> >>>>>>>>
> >>>>>>>> The only reason I haven't merged it yet is because of worry of extra
> >>>>>>>> cost, but I'll throw some SSD love at it and see how it turns out.
> >>>>>>>>
> >>>>>>> Sorry, but I get "oops" same moment nfs read transfer starts.
> >>>>>>> I can get directory list via nfs, read files locally (not
> >>>>>>> carefully tested, though)
> >>>>>>>
> >>>>>>> Dumps captured via netconsole, so these may not be completely accurate
> >>>>>>> but hopefully will give a hint.
> >>>>>> Interesting, strange how that hasn't triggered here. Or perhaps the
> >>>>>> version that Jeff posted isn't the one I tried. Anyway, search for:
> >>>>>>
> >>>>>>         RB_CLEAR_NODE(&cfqq->rb_node);
> >>>>>>
> >>>>>> and add a
> >>>>>>
> >>>>>>         RB_CLEAR_NODE(&cfqq->prio_node);
> >>>>>>
> >>>>>> just below that. It's in cfq_find_alloc_queue(). I think that should fix
> >>>>>> it.
> >>>>>>
> >>>>> Same problem.
> >>>>>
> >>>>> I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> >>>>> but It won't boot anymore with cfq with same error...
> >>>>>
> >>>>> Switching cfq io scheduler at runtime (booting with "as") appears to work with
> >>>>> two parallel local dd reads.
> >>>> Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
> >>>> are the results I see:
> >>>>
> >>>> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> >>>> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> >>>> 1024+0 records in
> >>>> 1024+0 records out
> >>>> 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
> >>>> [root@maiden ~]# umount /mnt/megadeth/
> >>>> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> >>>> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> >>>> 1024+0 records in
> >>>> 1024+0 records out
> >>>> 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
> >>>> [root@maiden ~]# umount /mnt/megadeth/
> >>>>
> >>>> Here is the patch, with the suggestion from Jens to switch the cfqq to
> >>>> the right priority tree when the priority is changed.
> >>> I don't see the issue here either. Vitaly, are you using any openvz
> >>> kernel patches? IIRC, they patch cfq so it could just be that your cfq
> >>> version is incompatible with Jeff's patch.
> >> Heh, got it to trigger about 3 seconds after sending that email! I'll
> >> look more into it.
> > 
> > OK, found the issue. A few bugs there... cfq_prio_tree_lookup() doesn't
> > even return a hit, since it just breaks and returns NULL always. That
> > can cause cfq_prio_tree_add() to screw up the rbtree. The code to
> > correct on ioprio change wasn't correct either, I changed that as well.
> > New patch below, Vitaly can you give it a spin?
> > 
> 
> No crashes so far. Transfer speed is quite good also.
> 
> 
> NFS+deadline, file not cached:
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,00    0,00   25,50   19,40    0,00   55,10
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda            6648,80     0,00 1281,70    0,00 115179,20     0,00    89,86     5,35    4,18   0,35  45,20
> sdb            6672,30     0,00 1257,00    0,00 115292,80     0,00    91,72     5,09    4,06   0,35  44,60
> 
> 
> 
> NFS+cfq, file not cached:
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0,05    0,00   25,30   23,95    0,00   50,70
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda            6403,00     0,00 1089,90    0,00 108655,20     0,00    99,69     4,50    4,13   0,41  44,50
> sdb            6394,90     0,00 1099,60    0,00 108639,20     0,00    98,80     4,53    4,12   0,39  42,50
> 
> 
> Just for reference: 10 sec interval averages, gigabit network;
> the lack of tcp/udp hardware checksumming may explain the high system cpu load.
> 
> 
> Also, a few more tests (server has 4G RAM):
> 
> NFS+cfq, file not cached:
> $ dd if=test of=/dev/null bs=1M count=2000
> 2000+0 records in
> 2000+0 records out
> 2097152000 bytes (2.1 GB) copied, 24.9147 s, 84.2 MB/s
> 
> NFS+deadline, file not cached:
> 2000+0 records in
> 2000+0 records out
> 2097152000 bytes (2.1 GB) copied, 23.2999 s, 90.0 MB/s
> 
> file cached on server:
> 2000+0 records in
> 2000+0 records out
> 2097152000 bytes (2.1 GB) copied, 21.9784 s, 95.4 MB/s
> 
> 
> Local single dd read leads to 193 MB/s for deadline and
> 167 MB/s for cfq.

OK, that looks better. Can I talk you into just trying this little
patch, just to see what kind of performance that yields? Remove the cfq
patch first. I would have patched nfsd only, but this is just a quick'n
dirty.

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 8e7a7ce..3aacf48 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -92,7 +92,7 @@ static void create_kthread(struct kthread_create_info *create)
 	int pid;
 
 	/* We want our own signal handler (we take no signals by default). */
-	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
+	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | CLONE_IO | SIGCHLD);
 	if (pid < 0) {
 		create->result = ERR_PTR(pid);
 	} else {
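
For context, CLONE_IO makes the child share the parent's io_context, so
cfq accounts all the threads' I/O to a single context instead of idling
between them. A rough userspace analogue of the same flag (just a
sketch, not nfsd/kthread code; CLONE_IO is defined by hand here since
older libc headers lack it, value from include/linux/sched.h):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_IO
#define CLONE_IO	0x80000000	/* clone io context */
#endif

static int reader(void *arg)
{
	/* reads issued here are charged to the shared io_context */
	(void)arg;
	return 0;
}

int main(void)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	pid_t pid;

	if (!stack)
		return 1;

	/* child shares VM, files and, crucially, the I/O context */
	pid = clone(reader, stack + stack_size,
		    CLONE_VM | CLONE_FILES | CLONE_IO | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}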

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 16:48                     ` Jeff Moyer
@ 2008-11-11 18:08                       ` Jens Axboe
  0 siblings, 0 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-11 18:08 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Tue, Nov 11 2008, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Tue, Nov 11 2008, Jens Axboe wrote:
> >> On Tue, Nov 11 2008, Jens Axboe wrote:
> >> > On Mon, Nov 10 2008, Jeff Moyer wrote:
> >> > > "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
> >> > > 
> >> > > > Jens Axboe wrote:
> >> > > >> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
> >> > > >>> Jens Axboe wrote:
> >> > > >>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
> >> > > >>>>> Jens Axboe <jens.axboe@oracle.com> writes:
> >> > > >>>>>
> >> > > >>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> >> > > >>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> >> > > >>>>> you to try this patch is that nfsd may be farming off the I/O requests
> >> > > >>>>> to different threads which are then performing interleaved I/O.  The
> >> > > >>>>> above patch tries to detect this and allow cooperating processes to get
> >> > > >>>>> disk time instead of waiting for the idle timeout.
> >> > > >>>> Precisely :-)
> >> > > >>>>
> >> > > >>>> The only reason I haven't merged it yet is because of worry of extra
> >> > > >>>> cost, but I'll throw some SSD love at it and see how it turns out.
> >> > > >>>>
> >> > > >>> Sorry, but I get "oops" same moment nfs read transfer starts.
> >> > > >>> I can get directory list via nfs, read files locally (not
> >> > > >>> carefully tested, though)
> >> > > >>>
> >> > > >>> Dumps captured via netconsole, so these may not be completely accurate
> >> > > >>> but hopefully will give a hint.
> >> > > >> 
> >> > > >> Interesting, strange how that hasn't triggered here. Or perhaps the
> >> > > >> version that Jeff posted isn't the one I tried. Anyway, search for:
> >> > > >> 
> >> > > >>         RB_CLEAR_NODE(&cfqq->rb_node);
> >> > > >> 
> >> > > >> and add a
> >> > > >> 
> >> > > >>         RB_CLEAR_NODE(&cfqq->prio_node);
> >> > > >> 
> >> > > >> just below that. It's in cfq_find_alloc_queue(). I think that should fix
> >> > > >> it.
> >> > > >> 
> >> > > >
> >> > > > Same problem.
> >> > > >
> >> > > > I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
> >> > > > but It won't boot anymore with cfq with same error...
> >> > > >
> >> > > > Switching cfq io scheduler at runtime (booting with "as") appears to work with
> >> > > > two parallel local dd reads.
> >> > > 
> >> > > Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
> >> > > are the results I see:
> >> > > 
> >> > > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> >> > > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> >> > > 1024+0 records in
> >> > > 1024+0 records out
> >> > > 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
> >> > > [root@maiden ~]# umount /mnt/megadeth/
> >> > > [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
> >> > > [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
> >> > > 1024+0 records in
> >> > > 1024+0 records out
> >> > > 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
> >> > > [root@maiden ~]# umount /mnt/megadeth/
> >> > > 
> >> > > Here is the patch, with the suggestion from Jens to switch the cfqq to
> >> > > the right priority tree when the priority is changed.
> >> > 
> >> > I don't see the issue here either. Vitaly, are you using any openvz
> >> > kernel patches? IIRC, they patch cfq so it could just be that your cfq
> >> > version is incompatible with Jeff's patch.
> >> 
> >> Heh, got it to trigger about 3 seconds after sending that email! I'll
> >> look more into it.
> >
> > OK, found the issue. A few bugs there... cfq_prio_tree_lookup() doesn't
> > even return a hit, since it just breaks and returns NULL always. That
> > can cause cfq_prio_tree_add() to screw up the rbtree. The code to
> > correct on ioprio change wasn't correct either, I changed that as well.
> > New patch below, Vitaly can you give it a spin?
> 
> Thanks for doing that!  Yeah, that was a stupid bug with the lookup
> routine.  I don't know that I agree with you that the ioprio change code
> was wrong.  I looked at all of the callers and that seemed the code path
> that was used for I/O priority *changes*.  The initial creation was
> already okay, wasn't it?

You only did it in cfq_prio_boost(), you should go one down and do it
for all prio changes. cfq_init_prio_data() gets called to fix state up
lazily when it notices a prio change, either due to prio boost or
because someone ran ionice.
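
(For reference, the ionice side of that is just the ioprio_set()
syscall; a minimal sketch below, with the constants copied by hand from
include/linux/ioprio.h since glibc has no wrapper. cfq then picks the
new priority up lazily the next time it touches that task's queue.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_CLASS_BE		2

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();

	/* best-effort class, lowest priority level (7) within it */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7)) < 0) {
		perror("ioprio_set");
		return 1;
	}
	return 0;
}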

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 18:06                       ` Jens Axboe
@ 2008-11-11 19:36                         ` Jeff Moyer
  2008-11-11 21:41                           ` Jeff Layton
  2008-11-11 19:42                         ` Vitaly V. Bursov
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-11 19:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel, jlayton

Jens Axboe <jens.axboe@oracle.com> writes:

> OK, that looks better. Can I talk you into just trying this little
> patch, just to see what kind of performance that yields? Remove the cfq
> patch first. I would have patched nfsd only, but this is just a quick'n
> dirty.

I went ahead and gave it a shot.  The updated CFQ patch with no I/O
context sharing does about 40MB/s reading a 1GB file.  Backing that
patch out, and then adding the patch to share io_context's between
kthreads yields 45MB/s.

By the way, in looking at the copy_io function, I noticed what appears
to be a (minor) bug:

        if (clone_flags & CLONE_IO) {
                tsk->io_context = ioc_task_link(ioc);
                if (unlikely(!tsk->io_context))
                        return -ENOMEM;

According to comments in ioc_task_link, tsk->io_context == NULL means:
        /*
         * if ref count is zero, don't allow sharing (ioc is going away, it's
         * a race).
         */

It seems more appropriate to just create a new I/O context at this
point, don't you think?  (Sorry, I know it's off-topic!)

Cheers,

Jeff

diff --git a/kernel/fork.c b/kernel/fork.c
index f608356..483d95c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -723,10 +723,17 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 	 * Share io context with parent, if CLONE_IO is set
 	 */
 	if (clone_flags & CLONE_IO) {
+		/*
+		 * If ioc_task_link fails, it just means that we raced
+		 * with io context cleanup.  Continue on to allocate
+		 * a new context in this case.
+		 */
 		tsk->io_context = ioc_task_link(ioc);
-		if (unlikely(!tsk->io_context))
-			return -ENOMEM;
-	} else if (ioprio_valid(ioc->ioprio)) {
+		if (likely(tsk->io_context))
+			return 0;
+	}
+
+	if (ioprio_valid(ioc->ioprio)) {
 		tsk->io_context = alloc_io_context(GFP_KERNEL, -1);
 		if (unlikely(!tsk->io_context))
 			return -ENOMEM;

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 18:06                       ` Jens Axboe
  2008-11-11 19:36                         ` Jeff Moyer
@ 2008-11-11 19:42                         ` Vitaly V. Bursov
  1 sibling, 0 replies; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-11 19:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel

Jens Axboe wrote:
> On Tue, Nov 11 2008, Vitaly V. Bursov wrote:
>> Jens Axboe wrote:
>>> On Tue, Nov 11 2008, Jens Axboe wrote:
>>>> On Tue, Nov 11 2008, Jens Axboe wrote:
>>>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>>>> "Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:
>>>>>>
>>>>>>> Jens Axboe wrote:
>>>>>>>> On Mon, Nov 10 2008, Vitaly V. Bursov wrote:
>>>>>>>>> Jens Axboe wrote:
>>>>>>>>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>>>>>>>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>>>>>>>>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>>>>>>>>>>> you to try this patch is that nfsd may be farming off the I/O requests
>>>>>>>>>>> to different threads which are then performing interleaved I/O.  The
>>>>>>>>>>> above patch tries to detect this and allow cooperating processes to get
>>>>>>>>>>> disk time instead of waiting for the idle timeout.
>>>>>>>>>> Precisely :-)
>>>>>>>>>>
>>>>>>>>>> The only reason I haven't merged it yet is because of worry of extra
>>>>>>>>>> cost, but I'll throw some SSD love at it and see how it turns out.
>>>>>>>>>>
>>>>>>>>> Sorry, but I get "oops" same moment nfs read transfer starts.
>>>>>>>>> I can get directory list via nfs, read files locally (not
>>>>>>>>> carefully tested, though)
>>>>>>>>>
>>>>>>>>> Dumps captured via netconsole, so these may not be completely accurate
>>>>>>>>> but hopefully will give a hint.
>>>>>>>> Interesting, strange how that hasn't triggered here. Or perhaps the
>>>>>>>> version that Jeff posted isn't the one I tried. Anyway, search for:
>>>>>>>>
>>>>>>>>         RB_CLEAR_NODE(&cfqq->rb_node);
>>>>>>>>
>>>>>>>> and add a
>>>>>>>>
>>>>>>>>         RB_CLEAR_NODE(&cfqq->prio_node);
>>>>>>>>
>>>>>>>> just below that. It's in cfq_find_alloc_queue(). I think that should fix
>>>>>>>> it.
>>>>>>>>
>>>>>>> Same problem.
>>>>>>>
>>>>>>> I did make clean; make -j3; sync; on (2 times) patched kernel and it went OK
>>>>>>> but It won't boot anymore with cfq with same error...
>>>>>>>
>>>>>>> Switching cfq io scheduler at runtime (booting with "as") appears to work with
>>>>>>> two parallel local dd reads.
>>>>>> Strange, I can't reproduce a failure.  I'll keep trying.  For now, these
>>>>>> are the results I see:
>>>>>>
>>>>>> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
>>>>>> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
>>>>>> 1024+0 records in
>>>>>> 1024+0 records out
>>>>>> 1073741824 bytes (1.1 GB) copied, 26.8128 s, 40.0 MB/s
>>>>>> [root@maiden ~]# umount /mnt/megadeth/
>>>>>> [root@maiden ~]# mount megadeth:/export/cciss /mnt/megadeth/
>>>>>> [root@maiden ~]# dd if=/mnt/megadeth/file1 of=/dev/null bs=1M
>>>>>> 1024+0 records in
>>>>>> 1024+0 records out
>>>>>> 1073741824 bytes (1.1 GB) copied, 23.7025 s, 45.3 MB/s
>>>>>> [root@maiden ~]# umount /mnt/megadeth/
>>>>>>
>>>>>> Here is the patch, with the suggestion from Jens to switch the cfqq to
>>>>>> the right priority tree when the priority is changed.
>>>>> I don't see the issue here either. Vitaly, are you using any openvz
>>>>> kernel patches? IIRC, they patch cfq so it could just be that your cfq
>>>>> version is incompatible with Jeff's patch.
>>>> Heh, got it to trigger about 3 seconds after sending that email! I'll
>>>> look more into it.
>>> OK, found the issue. A few bugs there... cfq_prio_tree_lookup() doesn't
>>> even return a hit, since it just breaks and returns NULL always. That
>>> can cause cfq_prio_tree_add() to screw up the rbtree. The code to
>>> correct on ioprio change wasn't correct either, I changed that as well.
>>> New patch below, Vitaly can you give it a spin?
>>>
>> No crashes so far. Transfer speed is quite good also.
>>
>>
>> NFS+deadline, file not cached:
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0,00    0,00   25,50   19,40    0,00   55,10
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sda            6648,80     0,00 1281,70    0,00 115179,20     0,00    89,86     5,35    4,18   0,35  45,20
>> sdb            6672,30     0,00 1257,00    0,00 115292,80     0,00    91,72     5,09    4,06   0,35  44,60
>>
>>
>>
>> NFS+cfq, file not cached:
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            0,05    0,00   25,30   23,95    0,00   50,70
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sda            6403,00     0,00 1089,90    0,00 108655,20     0,00    99,69     4,50    4,13   0,41  44,50
>> sdb            6394,90     0,00 1099,60    0,00 108639,20     0,00    98,80     4,53    4,12   0,39  42,50
>>
>>
>> Just for reference: 10 sec interval averages, gigabit network;
>> the lack of tcp/udp hardware checksumming may explain the high system cpu load.
>>
>>
>> Also, a few more tests (server has 4G RAM):
>>
>> NFS+cfq, file not cached:
>> $ dd if=test of=/dev/null bs=1M count=2000
>> 2000+0 records in
>> 2000+0 records out
>> 2097152000 bytes (2.1 GB) copied, 24.9147 s, 84.2 MB/s
>>
>> NFS+deadline, file not cached:
>> 2000+0 records in
>> 2000+0 records out
>> 2097152000 bytes (2.1 GB) copied, 23.2999 s, 90.0 MB/s
>>
>> file cached on server:
>> 2000+0 records in
>> 2000+0 records out
>> 2097152000 bytes (2.1 GB) copied, 21.9784 s, 95.4 MB/s
>>
>>
>> Local single dd read leads to 193 MB/s for deadline and
>> 167 MB/s for cfq.
> 
> OK, that looks better. Can I talk you into just trying this little
> patch, just to see what kind of performance that yields? Remove the cfq
> patch first. I would have patched nfsd only, but this is just a quick'n
> dirty.
> 
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 8e7a7ce..3aacf48 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -92,7 +92,7 @@ static void create_kthread(struct kthread_create_info *create)
>  	int pid;
>  
>  	/* We want our own signal handler (we take no signals by default). */
> -	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
> +	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | CLONE_IO | SIGCHLD);
>  	if (pid < 0) {
>  		create->result = ERR_PTR(pid);
>  	} else {
> 

No patches:

iostat for nfs+cfq read
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    3,25   52,20    0,00   44,55

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            2820,10     0,00  452,40    0,00 47648,80     0,00   105,32     7,54   16,70   1,96  88,60
sdb            2818,60     0,00  453,90    0,00 47391,20     0,00   104,41     4,13    9,02   1,33  60,30

NFS+cfq, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 57.5762 s, 36.4 MB/s

NFS+deadline, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 23.6672 s, 88.6 MB/s

======================
Above patch applied:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    3,60   51,10    0,00   45,30

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            2805,80     0,00  446,20    0,00 47267,20     0,00   105,93     5,61   12,62   1,71  76,50
sdb            2803,90     0,00  448,50    0,00 47246,40     0,00   105,34     5,56   12,46   1,68  75,40


NFS+cfq, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 57.5903 s, 36.4 MB/s

NFS+deadline, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 23.46 s, 89.4 MB/s

======================
Both patches applied:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   22,95   24,65    0,00   52,40

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda            6504,60     0,00 1089,80    0,00 110359,20     0,00   101,27     4,67    4,29   0,40  43,50
sdb            6495,50     0,00 1097,50    0,00 110312,80     0,00   100,51     4,57    4,17   0,39  43,10


NFS+cfq, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 25.4477 s, 82.4 MB/s

NFS+deadline, file not cached:
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 23.1639 s, 90.5 MB/s

-- 
Thanks,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 19:36                         ` Jeff Moyer
@ 2008-11-11 21:41                           ` Jeff Layton
  2008-11-11 21:59                             ` Jeff Layton
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Layton @ 2008-11-11 21:41 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jens Axboe, Vitaly V. Bursov, linux-kernel, bfields

On Tue, 11 Nov 2008 14:36:07 -0500
Jeff Moyer <jmoyer@redhat.com> wrote:

> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > OK, that looks better. Can I talk you into just trying this little
> > patch, just to see what kind of performance that yields? Remove the cfq
> > patch first. I would have patched nfsd only, but this is just a quick'n
> > dirty.
> 
> I went ahead and gave it a shot.  The updated CFQ patch with no I/O
> context sharing does about 40MB/s reading a 1GB file.  Backing that
> patch out, and then adding the patch to share io_context's between
> kthreads yields 45MB/s.
> 

Here's a quick and dirty patch to make all of the nfsd's have the same
io_context. Comments appreciated -- I'm not that familiar with the IO
scheduling code. If this looks good, I'll clean it up, add some
comments and formally send it to Bruce.

----------------[snip]-------------------

>From dd15b19a0eab3e181a6f76f1421b97950e255b4b Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@redhat.com>
Date: Tue, 11 Nov 2008 15:43:15 -0500
Subject: [PATCH] knfsd: make all nfsd threads share an io_context

This apparently makes the I/O scheduler treat the threads as a group
which helps throughput when sequential I/O is multiplexed over several
nfsd's.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfsd/nfssvc.c |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 07e4f5d..6d87f74 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -22,6 +22,7 @@
 #include <linux/freezer.h>
 #include <linux/fs_struct.h>
 #include <linux/kthread.h>
+#include <linux/iocontext.h>
 
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/stats.h>
@@ -42,6 +43,7 @@ static int			nfsd(void *vrqstp);
 struct timeval			nfssvc_boot;
 static atomic_t			nfsd_busy;
 static unsigned long		nfsd_last_call;
+static struct io_context	*nfsd_io_context;
 static DEFINE_SPINLOCK(nfsd_call_lock);
 
 /*
@@ -173,6 +175,7 @@ static void nfsd_last_thread(struct svc_serv *serv)
 	nfsd_serv = NULL;
 	nfsd_racache_shutdown();
 	nfs4_state_shutdown();
+	nfsd_io_context = NULL;
 
 	printk(KERN_WARNING "nfsd: last server has exited, flushing export "
 			    "cache\n");
@@ -398,6 +401,28 @@ update_thread_usage(int busy_threads)
 }
 
 /*
+ * should be called while holding nfsd_mutex
+ */
+static void
+nfsd_set_io_context(void)
+{
+	int cpu, node;
+
+	if (!nfsd_io_context) {
+		cpu = get_cpu();
+		node = cpu_to_node(cpu);
+		put_cpu();
+
+		/*
+		 * get_io_context can return NULL if the alloc_context fails.
+		 * That's not technically fatal here, so we don't bother to
+		 * check for it.
+		 */
+		nfsd_io_context = get_io_context(GFP_KERNEL, node);
+	} else
+		copy_io_context(&current->io_context, &nfsd_io_context);
+}
+/*
  * This is the NFS server kernel thread
  */
 static int
@@ -410,6 +435,8 @@ nfsd(void *vrqstp)
 	/* Lock module and set up kernel thread */
 	mutex_lock(&nfsd_mutex);
 
+	nfsd_set_io_context();
+
 	/* At this point, the thread shares current->fs
 	 * with the init process. We need to create files with a
 	 * umask of 0 instead of init's umask. */
-- 
1.5.5.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 21:41                           ` Jeff Layton
@ 2008-11-11 21:59                             ` Jeff Layton
  2008-11-12 12:20                               ` Jens Axboe
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Layton @ 2008-11-11 21:59 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jens Axboe, Vitaly V. Bursov, linux-kernel, bfields

On Tue, 11 Nov 2008 16:41:04 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Tue, 11 Nov 2008 14:36:07 -0500
> Jeff Moyer <jmoyer@redhat.com> wrote:
> 
> > Jens Axboe <jens.axboe@oracle.com> writes:
> > 
> > > OK, that looks better. Can I talk you into just trying this little
> > > patch, just to see what kind of performance that yields? Remove the cfq
> > > patch first. I would have patched nfsd only, but this is just a quick'n
> > > dirty.
> > 
> > I went ahead and gave it a shot.  The updated CFQ patch with no I/O
> > context sharing does about 40MB/s reading a 1GB file.  Backing that
> > patch out, and then adding the patch to share io_context's between
> > kthreads yields 45MB/s.
> > 
> 
> Here's a quick and dirty patch to make all of the nfsd's have the same
> io_context. Comments appreciated -- I'm not that familiar with the IO
> scheduling code. If this looks good, I'll clean it up, add some
> comments and formally send it to Bruce.
> 

No sooner do I send it out than I find a bug. We need to eventually
put the io_context reference we get. This should be more correct:

----------------[snip]-------------------

>From d0ee67045a12c677883f77791c6f260588c7b41f Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@redhat.com>
Date: Tue, 11 Nov 2008 16:54:16 -0500
Subject: [PATCH] knfsd: make all nfsd threads share an io_context

This apparently makes the I/O scheduler treat the threads as a group
which helps throughput when sequential I/O is multiplexed over several
nfsd's.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfsd/nfssvc.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 07e4f5d..5cd99f9 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -22,6 +22,7 @@
 #include <linux/freezer.h>
 #include <linux/fs_struct.h>
 #include <linux/kthread.h>
+#include <linux/iocontext.h>
 
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/stats.h>
@@ -42,6 +43,7 @@ static int			nfsd(void *vrqstp);
 struct timeval			nfssvc_boot;
 static atomic_t			nfsd_busy;
 static unsigned long		nfsd_last_call;
+static struct io_context	*nfsd_io_context;
 static DEFINE_SPINLOCK(nfsd_call_lock);
 
 /*
@@ -173,6 +175,10 @@ static void nfsd_last_thread(struct svc_serv *serv)
 	nfsd_serv = NULL;
 	nfsd_racache_shutdown();
 	nfs4_state_shutdown();
+	if (nfsd_io_context) {
+		put_io_context(nfsd_io_context);
+		nfsd_io_context = NULL;
+	}
 
 	printk(KERN_WARNING "nfsd: last server has exited, flushing export "
 			    "cache\n");
@@ -398,6 +404,28 @@ update_thread_usage(int busy_threads)
 }
 
 /*
+ * should be called while holding nfsd_mutex
+ */
+static void
+nfsd_set_io_context(void)
+{
+	int cpu, node;
+
+	if (!nfsd_io_context) {
+		cpu = get_cpu();
+		node = cpu_to_node(cpu);
+		put_cpu();
+
+		/*
+		 * get_io_context can return NULL if the alloc_context fails.
+		 * That's not technically fatal here, so we don't bother to
+		 * check for it.
+		 */
+		nfsd_io_context = get_io_context(GFP_KERNEL, node);
+	} else
+		copy_io_context(&current->io_context, &nfsd_io_context);
+}
+/*
  * This is the NFS server kernel thread
  */
 static int
@@ -410,6 +438,8 @@ nfsd(void *vrqstp)
 	/* Lock module and set up kernel thread */
 	mutex_lock(&nfsd_mutex);
 
+	nfsd_set_io_context();
+
 	/* At this point, the thread shares current->fs
 	 * with the init process. We need to create files with a
 	 * umask of 0 instead of init's umask. */
-- 
1.5.5.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-11 21:59                             ` Jeff Layton
@ 2008-11-12 12:20                               ` Jens Axboe
  2008-11-12 12:45                                 ` Jeff Layton
  0 siblings, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-12 12:20 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel, bfields

On Tue, Nov 11 2008, Jeff Layton wrote:
> On Tue, 11 Nov 2008 16:41:04 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Tue, 11 Nov 2008 14:36:07 -0500
> > Jeff Moyer <jmoyer@redhat.com> wrote:
> > 
> > > Jens Axboe <jens.axboe@oracle.com> writes:
> > > 
> > > > OK, that looks better. Can I talk you into just trying this little
> > > > patch, just to see what kind of performance that yields? Remove the cfq
> > > > patch first. I would have patched nfsd only, but this is just a quick'n
> > > > dirty.
> > > 
> > > I went ahead and gave it a shot.  The updated CFQ patch with no I/O
> > > context sharing does about 40MB/s reading a 1GB file.  Backing that
> > > patch out, and then adding the patch to share io_context's between
> > > kthreads yields 45MB/s.
> > > 
> > 
> > Here's a quick and dirty patch to make all of the nfsd's have the same
> > io_context. Comments appreciated -- I'm not that familiar with the IO
> > scheduling code. If this looks good, I'll clean it up, add some
> > comments and formally send it to Bruce.
> > 
> 
> No sooner do I send it out than I find a bug. We need to eventually
> put the io_context reference we get. This should be more correct:

That sort of thing happens a lot, I can definitely sympathize with you
there :-)

> ----------------[snip]-------------------
> 
> From d0ee67045a12c677883f77791c6f260588c7b41f Mon Sep 17 00:00:00 2001
> From: Jeff Layton <jlayton@redhat.com>
> Date: Tue, 11 Nov 2008 16:54:16 -0500
> Subject: [PATCH] knfsd: make all nfsd threads share an io_context
> 
> This apparently makes the I/O scheduler treat the threads as a group
> which helps throughput when sequential I/O is multiplexed over several
> nfsd's.

That's a lot more nifty than my stupid CLONE_IO flag addition. Both are
only good for test purposes though.

It's a bit difficult to make this really mergeable. I don't know
anything about how nfsd manages its thread pool, but something more
appropriate would be an io context per client mount. That's still not
perfect as you could easily have more than one process doing simultaneous
IO on the client side, but it's a lot better.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-12 12:20                               ` Jens Axboe
@ 2008-11-12 12:45                                 ` Jeff Layton
  2008-11-12 12:54                                   ` Christoph Hellwig
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Layton @ 2008-11-12 12:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel, bfields

On Wed, 12 Nov 2008 13:20:35 +0100
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Tue, Nov 11 2008, Jeff Layton wrote:
> > On Tue, 11 Nov 2008 16:41:04 -0500
> > Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > > On Tue, 11 Nov 2008 14:36:07 -0500
> > > Jeff Moyer <jmoyer@redhat.com> wrote:
> > > 
> > > > Jens Axboe <jens.axboe@oracle.com> writes:
> > > > 
> > > > > OK, that looks better. Can I talk you into just trying this little
> > > > > patch, just to see what kind of performance that yields? Remove the cfq
> > > > > patch first. I would have patched nfsd only, but this is just a quick'n
> > > > > dirty.
> > > > 
> > > > I went ahead and gave it a shot.  The updated CFQ patch with no I/O
> > > > context sharing does about 40MB/s reading a 1GB file.  Backing that
> > > > patch out, and then adding the patch to share io_context's between
> > > > kthreads yields 45MB/s.
> > > > 
> > > 
> > > Here's a quick and dirty patch to make all of the nfsd's have the same
> > > io_context. Comments appreciated -- I'm not that familiar with the IO
> > > scheduling code. If this looks good, I'll clean it up, add some
> > > comments and formally send it to Bruce.
> > > 
> > 
> > No sooner do I send it out than I find a bug. We need to eventually
> > put the io_context reference we get. This should be more correct:
> 
> That sort of thing happens a lot, I can definitely sympathize with you
> there :-)
> 
> > ----------------[snip]-------------------
> > 
> > From d0ee67045a12c677883f77791c6f260588c7b41f Mon Sep 17 00:00:00 2001
> > From: Jeff Layton <jlayton@redhat.com>
> > Date: Tue, 11 Nov 2008 16:54:16 -0500
> > Subject: [PATCH] knfsd: make all nfsd threads share an io_context
> > 
> > This apparently makes the I/O scheduler treat the threads as a group
> > which helps throughput when sequential I/O is multiplexed over several
> > nfsd's.
> 
> That's a lot more nifty than my stupid CLONE_IO flag addition. Both are
> only good for test purposes though.
> 
> It's a bit difficult to make this really mergeable. I don't know
> anything about how nfsd manages its thread pool, but something more
> appropriate would be an io context per client mount. That's still not
> perfect as you could easily have more than one process doing simultaneous
> IO on the client side, but it's a lot better.
> 

Ahh Good point...I wasn't thinking about it the right way. I guess what
we really need is some way to tell that a series of I/O requests
originated from the same client thread. NFS isn't really conducive to
this...

We might be able to do something like that with NFSv4 though. Maybe an
io_context per state owner or something. That's probably not going to
be trivial to implement however...

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-12 12:45                                 ` Jeff Layton
@ 2008-11-12 12:54                                   ` Christoph Hellwig
  0 siblings, 0 replies; 70+ messages in thread
From: Christoph Hellwig @ 2008-11-12 12:54 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, bfields

On Wed, Nov 12, 2008 at 07:45:03AM -0500, Jeff Layton wrote:
> Ahh Good point...I wasn't thinking about it the right way. I guess what
> we really need is some way to tell that a series of I/O requests
> originated from the same client thread. NFS isn't really conducive to
> this...
> 
> We might be able to do something like that with NFSv4 though. Maybe an
> io_context per state owner or something. That's probably not going to
> be trivial to implement however...

Talk to Greg Banks, he's been thinking about and working on an open
files cache for quite a while already.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-10 13:56     ` Jens Axboe
  2008-11-10 17:16       ` Vitaly V. Bursov
@ 2008-11-12 18:32       ` Jeff Moyer
  2008-11-12 19:02         ` Jens Axboe
  2008-11-13  6:54         ` Vitaly V. Bursov
  1 sibling, 2 replies; 70+ messages in thread
From: Jeff Moyer @ 2008-11-12 18:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Mon, Nov 10 2008, Jeff Moyer wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>> 
>> > On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
>> >> Hello,
>> >> 
>> >> I'm building small server system with openvz kernel and have ran into
>> >> some IO performance problems. Reading a single file via NFS delivers
>> >> around 9 MB/s over gigabit network, but while reading, say, 2 different
>> >> or same file 2 times at the same time I get >60MB/s.
>> >> 
>> >> Changing IO scheduler to deadline or anticipatory fixes problem.
>> >> 
>> >> Tested kernels:
>> >>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
>> >>                  fast local reads)
>> >>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
>> >> 
>> >> Vanilla performs better in worst case but I believe 40 is still low
>> >> concerning test results below.
>> >
>> > Can you check with this patch applied?
>> >
>> > http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>> 
>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>> you to try this patch is that nfsd may be farming off the I/O requests
>> to different threads which are then performing interleaved I/O.  The
>> above patch tries to detect this and allow cooperating processes to get
>> disk time instead of waiting for the idle timeout.
>
> Precisely :-)
>
> The only reason I haven't merged it yet is because of worry of extra
> cost, but I'll throw some SSD love at it and see how it turns out.

OK, I'm not actually able to reproduce the horrible 9MB/s reported by
Vitaly.  Here are the numbers I see.

Single dd performing a cold cache read of a 1GB file from an
nfs server.  read_ahead_kb is 128 (the default) for all tests.
cfq-cc denotes that the cfq scheduler was patched with the close
cooperator patch.  All numbers are in MB/s.

nfsd threads|   1  |  2   |  4   |  8  
----------------------------------------
deadline    | 65.3 | 52.2 | 46.7 | 46.1
cfq         | 64.1 | 57.8 | 53.3 | 46.9
cfq-cc      | 65.7 | 55.8 | 52.1 | 40.3

So, in my configuration, cfq and deadline both degrade in performance as
the number of nfsd threads is increased.  The close cooperator patch
seems to hurt a bit more at 8 threads, instead of helping;  I'm not sure
why that is.

Now, the workload that showed most slowdown for cfq with respect to
other I/O schedulers was using dump(8) to backup a file system.  Here
are the numbers for that:

deadline  82241 kB/s
cfq	  34143 kB/s
cfq-cc    82241 kB/s

And a customer actually went to the trouble to write a test to approximate
dump(8)'s I/O patterns.  For that test, we also see a big speedup (as
expected):

deadline  87157 kB/s
cfq       20218 kB/s
cfq-cc    87056 kB/s

Jens, if you have any canned fio workloads that you use for regression
testing, please pass them my way and I'll give them a go on some SAN
storage.

Cheers,

Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-12 18:32       ` Jeff Moyer
@ 2008-11-12 19:02         ` Jens Axboe
  2008-11-13  8:51           ` Wu Fengguang
  2008-11-24 15:33           ` Jeff Moyer
  2008-11-13  6:54         ` Vitaly V. Bursov
  1 sibling, 2 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-12 19:02 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Wed, Nov 12 2008, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Mon, Nov 10 2008, Jeff Moyer wrote:
> >> Jens Axboe <jens.axboe@oracle.com> writes:
> >> 
> >> > On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
> >> >> Hello,
> >> >> 
> >> >> I'm building small server system with openvz kernel and have ran into
> >> >> some IO performance problems. Reading a single file via NFS delivers
> >> >> around 9 MB/s over gigabit network, but while reading, say, 2 different
> >> >> or same file 2 times at the same time I get >60MB/s.
> >> >> 
> >> >> Changing IO scheduler to deadline or anticipatory fixes problem.
> >> >> 
> >> >> Tested kernels:
> >> >>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
> >> >>                  fast local reads)
> >> >>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
> >> >> 
> >> >> Vanilla performs better in worst case but I believe 40 is still low
> >> >> concerning test results below.
> >> >
> >> > Can you check with this patch applied?
> >> >
> >> > http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
> >> 
> >> Funny, I was going to ask the same question.  ;)  The reason Jens wants
> >> you to try this patch is that nfsd may be farming off the I/O requests
> >> to different threads which are then performing interleaved I/O.  The
> >> above patch tries to detect this and allow cooperating processes to get
> >> disk time instead of waiting for the idle timeout.
> >
> > Precisely :-)
> >
> > The only reason I haven't merged it yet is because of worry of extra
> > cost, but I'll throw some SSD love at it and see how it turns out.
> 
> OK, I'm not actually able to reproduce the horrible 9MB/s reported by
> Vitaly.  Here are the numbers I see.
> 
> Single dd performing a cold cache read of a 1GB file from an
> nfs server.  read_ahead_kb is 128 (the default) for all tests.
> cfq-cc denotes that the cfq scheduler was patched with the close
> cooperator patch.  All numbers are in MB/s.
> 
> nfsd threads|   1  |  2   |  4   |  8  
> ----------------------------------------
> deadline    | 65.3 | 52.2 | 46.7 | 46.1
> cfq         | 64.1 | 57.8 | 53.3 | 46.9
> cfq-cc      | 65.7 | 55.8 | 52.1 | 40.3
> 
> So, in my configuration, cfq and deadline both degrade in performance as
> the number of nfsd threads is increased.  The close cooperator patch
> seems to hurt a bit more at 8 threads, instead of helping;  I'm not sure
> why that is.
> 
> Now, the workload that showed most slowdown for cfq with respect to
> other I/O schedulers was using dump(8) to backup a file system.  Here
> are the numbers for that:
> 
> deadline  82241 kB/s
> cfq	  34143 kB/s
> cfq-cc    82241 kB/s
> 
> And a customer actually went to the trouble to write a test to approximate
> dump(8)'s I/O patterns.  For that test, we also see a big speedup (as
> expected):
> 
> deadline  87157 kB/s
> cfq       20218 kB/s
> cfq-cc    87056 kB/s
> 
> Jens, if you have any canned fio workloads that you use for regression
> testing, please pass them my way and I'll give them a go on some SAN
> storage.

I already talked about this with Jeff on irc, but I guess I should post it
here as well.

nfsd aside (which does seem to have some different behaviour skewing the
results), the original patch came about because dump(8) has a really
stupid design that offloads IO to a number of processes. This basically
makes fairly sequential IO more random with CFQ, since each process gets
its own io context. My feeling is that we should fix dump instead of
introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
aware of any other good programs out there that would do something
similar, so I don't think there's a lot of merrit to spending cycles on
detecting cooperating processes.

Jeff will take a look at fixing dump instead, and I may have promised
him that santa will bring him something nice this year if he does (since
I'm sure it'll be painful on the eyes).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-12 18:32       ` Jeff Moyer
  2008-11-12 19:02         ` Jens Axboe
@ 2008-11-13  6:54         ` Vitaly V. Bursov
  2008-11-13 14:32           ` Jeff Moyer
  1 sibling, 1 reply; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-13  6:54 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jens Axboe, linux-kernel

Jeff Moyer writes:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>
>>>> On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
>>>>> Hello,
>>>>>
>>>>> I'm building small server system with openvz kernel and have ran into
>>>>> some IO performance problems. Reading a single file via NFS delivers
>>>>> around 9 MB/s over gigabit network, but while reading, say, 2 different
>>>>> or same file 2 times at the same time I get >60MB/s.
>>>>>
>>>>> Changing IO scheduler to deadline or anticipatory fixes problem.
>>>>>
>>>>> Tested kernels:
>>>>>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
>>>>>                  fast local reads)
>>>>>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
>>>>>
>>>>> Vanilla performs better in worst case but I believe 40 is still low
>>>>> concerning test results below.
>>>> Can you check with this patch applied?
>>>>
>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>>> you to try this patch is that nfsd may be farming off the I/O requests
>>> to different threads which are then performing interleaved I/O.  The
>>> above patch tries to detect this and allow cooperating processes to get
>>> disk time instead of waiting for the idle timeout.
>> Precisely :-)
>>
>> The only reason I haven't merged it yet is because of worry of extra
>> cost, but I'll throw some SSD love at it and see how it turns out.
> 
> OK, I'm not actually able to reproduce the horrible 9MB/s reported by
> Vitaly.  Here are the numbers I see.

It's the 2.6.18-openvz-rhel5 kernel that gives me 9MB/s, and with 2.6.27 I get ~40-50MB/s
instead of 80-90 MB/s, as there should be no bottlenecks except the network.

> Single dd performing a cold cache read of a 1GB file from an
> nfs server.  read_ahead_kb is 128 (the default) for all tests.
> cfq-cc denotes that the cfq scheduler was patched with the close
> cooperator patch.  All numbers are in MB/s.
> 
> nfsd threads|   1  |  2   |  4   |  8  
> ----------------------------------------
> deadline    | 65.3 | 52.2 | 46.7 | 46.1
> cfq         | 64.1 | 57.8 | 53.3 | 46.9
> cfq-cc      | 65.7 | 55.8 | 52.1 | 40.3
> 
> So, in my configuration, cfq and deadline both degrade in performance as
> the number of nfsd threads is increased.  The close cooperator patch
> seems to hurt a bit more at 8 threads, instead of helping;  I'm not sure
> why that is.

Interesting, I'll try changing the number of nfsd threads and see how it performs
on my setup. Setting it to 1 seems like a good idea for cfq and non-high-end
hardware.

I'll look into it this evening.

-- 
Regards,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-12 19:02         ` Jens Axboe
@ 2008-11-13  8:51           ` Wu Fengguang
  2008-11-13  8:54             ` Jens Axboe
                               ` (2 more replies)
  2008-11-24 15:33           ` Jeff Moyer
  1 sibling, 3 replies; 70+ messages in thread
From: Wu Fengguang @ 2008-11-13  8:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel

Hi all,

//Sorry for being late. 

On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
[...]
> I already talked about this with Jeff on irc, but I guess I should post it
> here as well.
> 
> nfsd aside (which does seem to have some different behaviour skewing the
> results), the original patch came about because dump(8) has a really
> stupid design that offloads IO to a number of processes. This basically
> makes fairly sequential IO more random with CFQ, since each process gets
> its own io context. My feeling is that we should fix dump instead of
> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
> aware of any other good programs out there that would do something
> similar, so I don't think there's a lot of merit to spending cycles on
> detecting cooperating processes.
> 
> Jeff will take a look at fixing dump instead, and I may have promised
> him that santa will bring him something nice this year if he does (since
> I'm sure it'll be painful on the eyes).

This could also be fixed at the VFS readahead level.

In fact I've seen many kinds of interleaved accesses:
- concurrently reading 40 files that are in fact hard links of one single file
- a backup tool that splits a big file into 8k chunks, and serve the
  {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
  chunks in another one
- a pool of NFSDs randomly serving some originally sequential read requests 
- now dump(8) seems to have some similar problem.

In summary there have been all kinds of efforts on trying to
parallelize I/O tasks, but unfortunately they can easily screw up the
sequential pattern. It may not be easily fixable for many of them.

It is however possible to detect most of these patterns at the
readahead layer and restore sequential I/Os, before they propagate
into the block layer and hurt performance.

Vitaly, if that's what you need, I can try to prepare a patch for testing out.

Thank you,
Fengguang


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-13  8:51           ` Wu Fengguang
@ 2008-11-13  8:54             ` Jens Axboe
  2008-11-14  1:36               ` Wu Fengguang
  2008-11-13 18:46             ` Vitaly V. Bursov
  2008-11-25 10:59             ` Vladislav Bolkhovitin
  2 siblings, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-13  8:54 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel

On Thu, Nov 13 2008, Wu Fengguang wrote:
> Hi all,
> 
> //Sorry for being late. 
> 
> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
> [...]
> > I already talked about this with Jeff on irc, but I guess I should post it
> > here as well.
> > 
> > nfsd aside (which does seem to have some different behaviour skewing the
> > results), the original patch came about because dump(8) has a really
> > stupid design that offloads IO to a number of processes. This basically
> > makes fairly sequential IO more random with CFQ, since each process gets
> > its own io context. My feeling is that we should fix dump instead of
> > introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
> > aware of any other good programs out there that would do something
> > similar, so I don't think there's a lot of merit to spending cycles on
> > detecting cooperating processes.
> > 
> > Jeff will take a look at fixing dump instead, and I may have promised
> > him that santa will bring him something nice this year if he does (since
> > I'm sure it'll be painful on the eyes).
> 
> This could also be fixed at the VFS readahead level.
> 
> In fact I've seen many kinds of interleaved accesses:
> - concurrently reading 40 files that are in fact hard links of one single file
> - a backup tool that splits a big file into 8k chunks, and serve the
>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>   chunks in another one
> - a pool of NFSDs randomly serving some originally sequential read requests 
> - now dump(8) seems to have some similar problem.
> 
> In summary there have been all kinds of efforts on trying to
> parallelize I/O tasks, but unfortunately they can easily screw up the
> sequential pattern. It may not be easily fixable for many of them.
> 
> It is however possible to detect most of these patterns at the
> readahead layer and restore sequential I/Os, before they propagate
> into the block layer and hurt performance.
> 
> Vitaly, if that's what you need, I can try to prepare a patch for
> testing out.

It's not easy. To really fix it, you have to get that sequential RA
pattern from just the single process. As soon as you spread the IO
between processes (eg N-1 aren't just getting cache hits), then you may
run into trouble on the IO scheduler side.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-13  6:54         ` Vitaly V. Bursov
@ 2008-11-13 14:32           ` Jeff Moyer
  2008-11-13 18:33             ` Vitaly V. Bursov
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-13 14:32 UTC (permalink / raw)
  To: Vitaly V. Bursov; +Cc: Jens Axboe, linux-kernel

"Vitaly V. Bursov" <vitalyb@telenet.dn.ua> writes:

> Jeff Moyer writes:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>> 
>>> On Mon, Nov 10 2008, Jeff Moyer wrote:
>>>> Jens Axboe <jens.axboe@oracle.com> writes:
>>>>
>>>>> On Sun, Nov 09 2008, Vitaly V. Bursov wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm building small server system with openvz kernel and have ran into
>>>>>> some IO performance problems. Reading a single file via NFS delivers
>>>>>> around 9 MB/s over gigabit network, but while reading, say, 2 different
>>>>>> or same file 2 times at the same time I get >60MB/s.
>>>>>>
>>>>>> Changing IO scheduler to deadline or anticipatory fixes problem.
>>>>>>
>>>>>> Tested kernels:
>>>>>>   OpenVZ RHEL5 028stab059.3 (9 MB/s with HZ=100, 20MB/s with HZ=1000
>>>>>>                  fast local reads)
>>>>>>   Vanilla 2.6.27.5 (40 MB/s with HZ=100, slow local reads)
>>>>>>
>>>>>> Vanilla performs better in worst case but I believe 40 is still low
>>>>>> concerning test results below.
>>>>> Can you check with this patch applied?
>>>>>
>>>>> http://bugzilla.kernel.org/attachment.cgi?id=18473&action=view
>>>> Funny, I was going to ask the same question.  ;)  The reason Jens wants
>>>> you to try this patch is that nfsd may be farming off the I/O requests
>>>> to different threads which are then performing interleaved I/O.  The
>>>> above patch tries to detect this and allow cooperating processes to get
>>>> disk time instead of waiting for the idle timeout.
>>> Precisely :-)
>>>
>>> The only reason I haven't merged it yet is because of worry of extra
>>> cost, but I'll throw some SSD love at it and see how it turns out.
>> 
>> OK, I'm not actually able to reproduce the horrible 9MB/s reported by
>> Vitaly.  Here are the numbers I see.
>
> It's the 2.6.18-openvz-rhel5 kernel that gives me 9MB/s, and with 2.6.27 I get ~40-50MB/s
> instead of 80-90 MB/s, as there should be no bottlenecks except the network.

Reading back through your original problem report, I'm not seeing what
your numbers were with deadline; you simply mentioned that it "fixed"
the problem.  Are you sure you'll get 80-90MB/s for this?  The local
disks in my configuration, when performing a dd on the server system,
can produce numbers around 85 MB/s, yet the NFS performance is around 65
MB/s (and this is a gigabit network).

>> Single dd performing a cold cache read of a 1GB file from an
>> nfs server.  read_ahead_kb is 128 (the default) for all tests.
>> cfq-cc denotes that the cfq scheduler was patched with the close
>> cooperator patch.  All numbers are in MB/s.
>> 
>> nfsd threads|   1  |  2   |  4   |  8  
>> ----------------------------------------
>> deadline    | 65.3 | 52.2 | 46.7 | 46.1
>> cfq         | 64.1 | 57.8 | 53.3 | 46.9
>> cfq-cc      | 65.7 | 55.8 | 52.1 | 40.3
>> 
>> So, in my configuration, cfq and deadline both degrade in performance as
>> the number of nfsd threads is increased.  The close cooperator patch
>> seems to hurt a bit more at 8 threads, instead of helping;  I'm not sure
>> why that is.
>
> Interesting, I'll try changing the number of nfsd threads and see how it performs
> on my setup. Setting it to 1 seems like a good idea for cfq and non-high-end
> hardware.

I think you're looking at this backwards.  I'm no nfsd tuning expert,
but I'm pretty sure that you would tune the number of threads based on
the number of active clients and the amount of memory on the server
(since each thread has to reserve memory for incoming requests).

> I'll look into it this evening.

The real reason I tried varying the number of nfsd threads was to show,
at least for CFQ, that spreading a sequential I/O load across multiple
threads would result in suboptimal performance.  What I found, instead,
was that it hurt performance for cfq *and* deadline (and that the close
cooperator patches did not help in this regard).  This tells me that
there is something else which is affecting the performance.  What that
something is I don't know, I think we'd have to take a closer look at
what's going on on the server to figure it out.

Cheers,

Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-13 14:32           ` Jeff Moyer
@ 2008-11-13 18:33             ` Vitaly V. Bursov
  0 siblings, 0 replies; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-13 18:33 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jens Axboe, linux-kernel

Jeff Moyer wrote:

>> It's the 2.6.18-openvz-rhel5 kernel that gives me 9MB/s, and with 2.6.27 I get ~40-50MB/s
>> instead of 80-90 MB/s, as there should be no bottlenecks except the network.
> 
> Reading back through your original problem report, I'm not seeing what
> your numbers were with deadline; you simply mentioned that it "fixed"
> the problem.  Are you sure you'll get 80-90MB/s for this?  The local
> disks in my configuration, when performing a dd on the server system,
> can produce numbers around 85 MB/s, yet the NFS performance is around 65
> MB/s (and this is a gigabit network).

I have a pair of 1TB HDDs, each able to deliver around 100MB/s for
sequential reads, and a PCI-E Ethernet adapter that does not share
a bus with the SATA controller.

2.6.18-openvz-rhel5, loopback mounted nfs with deadline gives
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 7.12014 s, 147 MB/s

and same system via network:
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.3197 s, 73.2 MB/s

and network+cfq:
dd if=samefile of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 59.3116 s, 17.7 MB/s

and network+file cached on server side:
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 12.4204 s, 84.4 MB/s

Well, 73 is still not 80, but much better than 17 (or,
even worse, 9).

I have 8 NFS threads by default here.

I got 17 MB/s here because of HZ=1000; also, 2.6.27 performed
better in every heavy-nfs-transfer test so far.

Changed network parameters:

net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576

net.ipv4.tcp_mem = 1048576 1048576 4194304
net.ipv4.tcp_rmem = 1048576 1048576 4194304
net.ipv4.tcp_wmem = 1048576 1048576 4194304

MTU: 6000

Sorry, I didn't mention these in original post.

>>> Single dd performing a cold cache read of a 1GB file from an
>>> nfs server.  read_ahead_kb is 128 (the default) for all tests.
>>> cfq-cc denotes that the cfq scheduler was patched with the close
>>> cooperator patch.  All numbers are in MB/s.
>>>
>>> nfsd threads|   1  |  2   |  4   |  8  
>>> ----------------------------------------
>>> deadline    | 65.3 | 52.2 | 46.7 | 46.1
>>> cfq         | 64.1 | 57.8 | 53.3 | 46.9
>>> cfq-cc      | 65.7 | 55.8 | 52.1 | 40.3
>>>
>>> So, in my configuration, cfq and deadline both degrade in performance as
>>> the number of nfsd threads is increased.  The close cooperator patch
>>> seems to hurt a bit more at 8 threads, instead of helping;  I'm not sure
>>> why that is.
>> Interesting, I'll try changing the number of nfsd threads and see how it performs
>> on my setup. Setting it to 1 seems like a good idea for cfq and non-high-end
>> hardware.
> 
> I think you're looking at this backwards.  I'm no nfsd tuning expert,
> but I'm pretty sure that you would tune the number of threads based on
> the number of active clients and the amount of memory on the server
> (since each thread has to reserve memory for incoming requests).

I understand this. It's just one of the parameters that completely
slipped out of my sight :)

>> I'll look into it this evening.
> 
> The real reason I tried varying the number of nfsd threads was to show,
> at least for CFQ, that spreading a sequential I/O load across multiple
> threads would result in suboptimal performance.  What I found, instead,
> was that it hurt performance for cfq *and* deadline (and that the close
> cooperator patches did not help in this regard).  This tells me that
> there is something else which is affecting the performance.  What that
> something is I don't know, I think we'd have to take a closer look at
> what's going on on the server to figure it out.
> 

I've tested it also...

loopback:
nfsd threads |  1 |  2 |  4 |  8 |  16
----------------------------------------
deadline-vz  |  97|  92| 128| 145| 148
deadline     | 145| 160| 173| 170| 150
cfq-cc       | 137| 150| 167| 157| 133
cfq          |  26|  28|  34|  38|  38

network:
nfsd threads |  1 |  2 |   4|   8|  16
----------------------------------------
deadline-vz  |  68|  69|  75|  73|  72
deadline     |  91|  89|  87|  88|  84
cfq-cc       |  91|  89|  88|  82|  74
cfq          |  25|  28|  32|  36|  34


deadline-vz - deadline with 2.6.18-openvz-rhel5 kernel
deadline, cfq, cfq-cc - linux-2.6.27.5

Yep, it's not as simple as I thought...

-- 
Regards,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-13  8:51           ` Wu Fengguang
  2008-11-13  8:54             ` Jens Axboe
@ 2008-11-13 18:46             ` Vitaly V. Bursov
  2008-11-25 10:59             ` Vladislav Bolkhovitin
  2 siblings, 0 replies; 70+ messages in thread
From: Vitaly V. Bursov @ 2008-11-13 18:46 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, linux-kernel

Wu Fengguang wrote:
> Hi all,
> 
> //Sorry for being late. 
> 
> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
> [...]
>> I already talked about this with Jeff on irc, but I guess I should post it
>> here as well.
>>
>> nfsd aside (which does seem to have some different behaviour skewing the
>> results), the original patch came about because dump(8) has a really
>> stupid design that offloads IO to a number of processes. This basically
>> makes fairly sequential IO more random with CFQ, since each process gets
>> its own io context. My feeling is that we should fix dump instead of
>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>> aware of any other good programs out there that would do something
>> similar, so I don't think there's a lot of merit to spending cycles on
>> detecting cooperating processes.
>>
>> Jeff will take a look at fixing dump instead, and I may have promised
>> him that santa will bring him something nice this year if he does (since
>> I'm sure it'll be painful on the eyes).
> 
> This could also be fixed at the VFS readahead level.
> 
> In fact I've seen many kinds of interleaved accesses:
> - concurrently reading 40 files that are in fact hard links of one single file
> - a backup tool that splits a big file into 8k chunks, and serve the
>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>   chunks in another one
> - a pool of NFSDs randomly serving some originally sequential read requests 
> - now dump(8) seems to have some similar problem.
> 
> In summary there have been all kinds of efforts on trying to
> parallelize I/O tasks, but unfortunately they can easily screw up the
> sequential pattern. It may not be easily fixable for many of them.
> 
> It is however possible to detect most of these patterns at the
> readahead layer and restore sequential I/Os, before they propagate
> into the block layer and hurt performance.
> 
> Vitaly, if that's what you need, I can try to prepare a patch for testing out.

Deadline scheduler should fit my needs, I believe. I can test a patch which
tries to resolve the issue or run some more tests, though.

-- 
Thanks,
Vitaly

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-13  8:54             ` Jens Axboe
@ 2008-11-14  1:36               ` Wu Fengguang
  2008-11-25 11:02                 ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2008-11-14  1:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel

On Thu, Nov 13, 2008 at 09:54:39AM +0100, Jens Axboe wrote:
> On Thu, Nov 13 2008, Wu Fengguang wrote:
> > Hi all,
> > 
> > //Sorry for being late. 
> > 
> > On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
> > [...]
> > > I already talked about this with Jeff on irc, but I guess I should post it
> > > here as well.
> > > 
> > > nfsd aside (which does seem to have some different behaviour skewing the
> > > results), the original patch came about because dump(8) has a really
> > > stupid design that offloads IO to a number of processes. This basically
> > > makes fairly sequential IO more random with CFQ, since each process gets
> > > its own io context. My feeling is that we should fix dump instead of
> > > introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
> > > aware of any other good programs out there that would do something
> > > similar, so I don't think there's a lot of merit to spending cycles on
> > > detecting cooperating processes.
> > > 
> > > Jeff will take a look at fixing dump instead, and I may have promised
> > > him that santa will bring him something nice this year if he does (since
> > > I'm sure it'll be painful on the eyes).
> > 
> > This could also be fixed at the VFS readahead level.
> > 
> > In fact I've seen many kinds of interleaved accesses:
> > - concurrently reading 40 files that are in fact hard links of one single file
> > - a backup tool that splits a big file into 8k chunks, and serve the
> >   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
> >   chunks in another one
> > - a pool of NFSDs randomly serving some originally sequential read requests 
> > - now dump(8) seems to have some similar problem.
> > 
> > In summary there have been all kinds of efforts on trying to
> > parallelize I/O tasks, but unfortunately they can easily screw up the
> > sequential pattern. It may not be easily fixable for many of them.
> > 
> > It is however possible to detect most of these patterns at the
> > readahead layer and restore sequential I/Os, before they propagate
> > into the block layer and hurt performance.
> > 
> > Vitaly, if that's what you need, I can try to prepare a patch for
> > testing out.
> 
> It's not easy. To really fix it, you have to get that sequential RA
> pattern from just the single process. As soon as you spread the IO
> between processes (eg N-1 aren't just getting cache hits), then you may
> run into trouble on the IO scheduler side.

Yes, it's not easy (or possible) to tell from file->f_ra all those
cooperative processes working on the same sequential stream, since
they will have different file->f_ra instances. In the case of NFSD,
the file->f_ra may well be all zeros.

Another scheme is to detect the sequential pattern via looking up
the page cache, which provides one single and consistent view of the
pages recently accessed. That makes sequential detection possible.

The cost will be one extra page cache lookup per random read.
If it's not acceptable, the corresponding code could be disabled
by default. 
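
Roughly, that extra check could look like the sketch below (untested,
and not an actual patch -- just the idea): on a read that looks random
for this file descriptor, peek at the page cache and see whether the
page right before the requested one is already cached. If it is, some
other thread/process has most likely been reading the same file
sequentially, so we can keep a large readahead window instead of
falling back to a minimal one.

#include <linux/pagemap.h>

static int probably_part_of_sequential_stream(struct address_space *mapping,
					      pgoff_t index)
{
	struct page *page;

	if (index == 0)
		return 0;

	/* one extra page cache lookup per seemingly-random read */
	page = find_get_page(mapping, index - 1);
	if (!page)
		return 0;

	page_cache_release(page);
	return 1;
}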

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-12 19:02         ` Jens Axboe
  2008-11-13  8:51           ` Wu Fengguang
@ 2008-11-24 15:33           ` Jeff Moyer
  2008-11-24 18:13             ` Jens Axboe
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-24 15:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> nfsd aside (which does seem to have some different behaviour skewing the
> results), the original patch came about because dump(8) has a really
> stupid design that offloads IO to a number of processes. This basically
> makes fairly sequential IO more random with CFQ, since each process gets
> its own io context. My feeling is that we should fix dump instead of
> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
> aware of any other good programs out there that would do something
> similar, so I don't think there's a lot of merit to spending cycles on
> detecting cooperating processes.
>
> Jeff will take a look at fixing dump instead, and I may have promised
> him that santa will bring him something nice this year if he does (since
> I'm sure it'll be painful on the eyes).

Sorry to drum up this topic once again, but we've recently run into
another instance where the close cooperator patch helps significantly.
The case is KVM using the virtio disk driver.  The host-side uses
posix_aio calls to issue I/O on behalf of the guest.  It's worth noting
that pthread_create does not pass CLONE_IO (at least that was my reading
of the code).  It is questionable whether it really should as that will
change the I/O scheduling dynamics.

So, Jens, what do you think?  Should we collect some performance numbers
to make sure that the close cooperator patch doesn't hurt the common
case?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-24 15:33           ` Jeff Moyer
@ 2008-11-24 18:13             ` Jens Axboe
  2008-11-24 18:50               ` Jeff Moyer
  0 siblings, 1 reply; 70+ messages in thread
From: Jens Axboe @ 2008-11-24 18:13 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Mon, Nov 24 2008, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > nfsd aside (which does seem to have some different behaviour skewing the
> > results), the original patch came about because dump(8) has a really
> > stupid design that offloads IO to a number of processes. This basically
> > makes fairly sequential IO more random with CFQ, since each process gets
> > its own io context. My feeling is that we should fix dump instead of
> > introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
> > aware of any other good programs out there that would do something
> > similar, so I don't think there's a lot of merit to spending cycles on
> > detecting cooperating processes.
> >
> > Jeff will take a look at fixing dump instead, and I may have promised
> > him that santa will bring him something nice this year if he does (since
> > I'm sure it'll be painful on the eyes).
> 
> Sorry to drum up this topic once again, but we've recently run into
> another instance where the close cooperator patch helps significantly.
> The case is KVM using the virtio disk driver.  The host-side uses
> posix_aio calls to issue I/O on behalf of the guest.  It's worth noting
> that pthread_create does not pass CLONE_IO (at least that was my reading
> of the code).  It is questionable whether it really should as that will
> change the I/O scheduling dynamics.
> 
> So, Jens, what do you think?  Should we collect some performance numbers
> to make sure that the close cooperator patch doesn't hurt the common
> case?

No, posix aio is a piece of crap on Linux/glibc so we want to be fixing
that instead. A quick fix is again to use CLONE_IO, though posix aio
needs more work than that. I told the qemu guys not to use posix aio a
long time ago since it does stink and doesn't perform well under any
circumstance... So I don't consider that a valid use case, there's a
reason that basically nobody is using posix aio.
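
For reference, opting into a shared io_context from userspace looks
roughly like this (untested sketch; CLONE_IO may need to be defined by
hand with older glibc headers):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000	/* clone io context */
#endif

static char worker_stack[64 * 1024];

static int worker(void *arg)
{
	/* I/O issued here is charged to the parent's io_context,
	 * so CFQ sees one stream instead of two unrelated ones */
	return 0;
}

int main(void)
{
	pid_t pid = clone(worker, worker_stack + sizeof(worker_stack),
			  CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_IO | SIGCHLD,
			  NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	return 0;
}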

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-24 18:13             ` Jens Axboe
@ 2008-11-24 18:50               ` Jeff Moyer
  2008-11-24 18:51                 ` Jens Axboe
  0 siblings, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-24 18:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Vitaly V. Bursov, linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Mon, Nov 24 2008, Jeff Moyer wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>> 
>> > nfsd aside (which does seem to have some different behaviour skewing the
>> > results), the original patch came about because dump(8) has a really
>> > stupid design that offloads IO to a number of processes. This basically
>> > makes fairly sequential IO more random with CFQ, since each process gets
>> > its own io context. My feeling is that we should fix dump instead of
>> > introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>> > aware of any other good programs out there that would do something
>> > similar, so I don't think there's a lot of merit to spending cycles on
>> > detecting cooperating processes.
>> >
>> > Jeff will take a look at fixing dump instead, and I may have promised
>> > him that santa will bring him something nice this year if he does (since
>> > I'm sure it'll be painful on the eyes).
>> 
>> Sorry to drum up this topic once again, but we've recently run into
>> another instance where the close cooperator patch helps significantly.
>> The case is KVM using the virtio disk driver.  The host-side uses
>> posix_aio calls to issue I/O on behalf of the guest.  It's worth noting
>> that pthread_create does not pass CLONE_IO (at least that was my reading
>> of the code).  It is questionable whether it really should as that will
>> change the I/O scheduling dynamics.
>> 
>> So, Jens, what do you think?  Should we collect some performance numbers
>> to make sure that the close cooperator patch doesn't hurt the common
>> case?
>
> No, posix aio is a piece of crap on Linux/glibc so we want to be fixing
> that instead. A quick fix is again to use CLONE_IO, though posix aio
> needs more work than that. I told the qemu guys not to use posix aio a
> long time ago since it does stink and doesn't perform well under any
> circumstance... So I don't consider that a valid use case, there's a
> reason that basically nobody is using posix aio.

It doesn't help that we never took in patches to the kernel that would
allow for a usable posix aio implementation, but I digress.

My question to you is how many use cases do we dismiss as broken before
recognizing that people actually do this, and that we should at least
try to detect and gracefully deal with it?  Is this too much to expect
from the default I/O scheduler?  Sorry to beat a dead horse, but folks
do view this as a regression, and they won't be changing their
applications, they'll be switching I/O schedulers to fix this.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-24 18:50               ` Jeff Moyer
@ 2008-11-24 18:51                 ` Jens Axboe
  0 siblings, 0 replies; 70+ messages in thread
From: Jens Axboe @ 2008-11-24 18:51 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Vitaly V. Bursov, linux-kernel

On Mon, Nov 24 2008, Jeff Moyer wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Mon, Nov 24 2008, Jeff Moyer wrote:
> >> Jens Axboe <jens.axboe@oracle.com> writes:
> >> 
> >> > nfsd aside (which does seem to have some different behaviour skewing the
> >> > results), the original patch came about because dump(8) has a really
> >> > stupid design that offloads IO to a number of processes. This basically
> >> > makes fairly sequential IO more random with CFQ, since each process gets
> >> > its own io context. My feeling is that we should fix dump instead of
> >> > introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
> >> > aware of any other good programs out there that would do something
> >> > similar, so I don't think there's a lot of merit to spending cycles on
> >> > detecting cooperating processes.
> >> >
> >> > Jeff will take a look at fixing dump instead, and I may have promised
> >> > him that santa will bring him something nice this year if he does (since
> >> > I'm sure it'll be painful on the eyes).
> >> 
> >> Sorry to drum up this topic once again, but we've recently run into
> >> another instance where the close cooperator patch helps significantly.
> >> The case is KVM using the virtio disk driver.  The host-side uses
> >> posix_aio calls to issue I/O on behalf of the guest.  It's worth noting
> >> that pthread_create does not pass CLONE_IO (at least that was my reading
> >> of the code).  It is questionable whether it really should as that will
> >> change the I/O scheduling dynamics.
> >> 
> >> So, Jens, what do you think?  Should we collect some performance numbers
> >> to make sure that the close cooperator patch doesn't hurt the common
> >> case?
> >
> > No, posix aio is a piece of crap on Linux/glibc so we want to be fixing
> > that instead. A quick fix is again to use CLONE_IO, though posix aio
> > needs more work than that. I told the qemu guys not to use posix aio a
> > long time ago since it does stink and doesn't perform well under any
> > circumstance... So I don't consider that a valid use case, there's a
> > reason that basically nobody is using posix aio.
> 
> It doesn't help that we never took in patches to the kernel that would
> allow for a usable posix aio implementation, but I digress.
> 
> My question to you is how many use cases do we dismiss as broken before
> recognizing that people actually do this, and that we should at least
> try to detect and gracefully deal with it?  Is this too much to expect
> from the default I/O scheduler?  Sorry to beat a dead horse, but folks
> do view this as a regression, and they won't be changing their
> applications, they'll be switching I/O schedulers to fix this.

Yes, I'm aware of that. If posix aio was in widespread use it would be
an issue, and it's really a shame that it sucks as much as it does. A
single case like dump is worth changing, if there was 1 or 2 other real
cases I'd say we'd have a real case for doing the coop checking.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-13  8:51           ` Wu Fengguang
  2008-11-13  8:54             ` Jens Axboe
  2008-11-13 18:46             ` Vitaly V. Bursov
@ 2008-11-25 10:59             ` Vladislav Bolkhovitin
  2008-11-25 11:30               ` Wu Fengguang
  2 siblings, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-25 10:59 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

Wu Fengguang wrote:
> Hi all,
> 
> //Sorry for being late. 
> 
> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
> [...]
>> I already talked about this with Jeff on irc, but I guess should post it
>> here as well.
>>
>> nfsd aside (which does seem to have some different behaviour skewing the
>> results), the original patch came about because dump(8) has a really
>> stupid design that offloads IO to a number of processes. This basically
>> makes fairly sequential IO more random with CFQ, since each process gets
>> its own io context. My feeling is that we should fix dump instead of
>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>> aware of any other good programs out there that would do something
>> similar, so I don't think there's a lot of merrit to spending cycles on
>> detecting cooperating processes.
>>
>> Jeff will take a look at fixing dump instead, and I may have promised
>> him that santa will bring him something nice this year if he does (since
>> I'm sure it'll be painful on the eyes).
> 
> This could also be fixed at the VFS readahead level.
> 
> In fact I've seen many kinds of interleaved accesses:
> - concurrently reading 40 files that are in fact hard links of one single file
> - a backup tool that splits a big file into 8k chunks, and serve the
>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>   chunks in another one
> - a pool of NFSDs randomly serving some originally sequential read requests 
> - now dump(8) seems to have some similar problem.
> 
> In summary there have been all kinds of efforts on trying to
> parallelize I/O tasks, but unfortunately they can easily screw up the
> sequential pattern. It may not be easily fixable for many of them.
> 
> It is however possible to detect most of these patterns at the
> readahead layer and restore sequential I/Os, before they propagate
> into the block layer and hurt performance.

I believe this would be the most effective way to go, especially when the 
data delivery path to the original client has its own latency that depends 
on the amount of transferred data, as is the case for a remote NFS mount 
doing synchronous sequential reads. In that case it is essential for 
performance to keep both links (local to the storage and network to the 
client) busy and transferring data simultaneously. Since the reads are 
synchronous, the only way to achieve that is to perform enough read ahead 
on the server to cover the network link latency. Otherwise you end up with 
only half of the possible throughput.

However, on the one hand the server has to have a pool of threads/processes 
to perform well, while on the other hand the current read ahead code doesn't 
detect very well that those threads/processes are doing a joint sequential 
read, so the read ahead window gets smaller, hence the overall read 
performance drops considerably too.
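
For a sense of scale (rough, assumed numbers, not measurements): to keep a 
~110 MB/s gigabit link busy when each synchronous read sees about 1 ms of 
network turnaround, the server needs at least 110 MB/s * 1 ms ~= 110 KB 
read ahead and in flight at any moment; with a smaller window one of the 
two links inevitably sits idle part of the time and throughput drops 
accordingly.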

> Vitaly, if that's what you need, I can try to prepare a patch for testing out.

I can test it with the SCST SCSI target subsystem (http://scst.sf.net). SCST 
needs such a feature very much, otherwise it can't reach the full backstorage 
read speed. The maximum I can see is about ~80MB/s from a ~130MB/s 15K RPM 
disk over a 1Gbps iSCSI link (the maximum possible is ~110MB/s).

> Thank you,
> Fengguang
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-14  1:36               ` Wu Fengguang
@ 2008-11-25 11:02                 ` Vladislav Bolkhovitin
  2008-11-25 11:25                   ` Wu Fengguang
  2008-11-25 15:21                   ` Jeff Moyer
  0 siblings, 2 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-25 11:02 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel



Wu Fengguang wrote:
> On Thu, Nov 13, 2008 at 09:54:39AM +0100, Jens Axboe wrote:
>> On Thu, Nov 13 2008, Wu Fengguang wrote:
>>> Hi all,
>>>
>>> //Sorry for being late. 
>>>
>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>> [...]
>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>> here as well.
>>>>
>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>> results), the original patch came about because dump(8) has a really
>>>> stupid design that offloads IO to a number of processes. This basically
>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>> its own io context. My feeling is that we should fix dump instead of
>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>> aware of any other good programs out there that would do something
>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>> detecting cooperating processes.
>>>>
>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>> him that santa will bring him something nice this year if he does (since
>>>> I'm sure it'll be painful on the eyes).
>>> This could also be fixed at the VFS readahead level.
>>>
>>> In fact I've seen many kinds of interleaved accesses:
>>> - concurrently reading 40 files that are in fact hard links of one single file
>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>   chunks in another one
>>> - a pool of NFSDs randomly serving some originally sequential read requests 
>>> - now dump(8) seems to have some similar problem.
>>>
>>> In summary there have been all kinds of efforts on trying to
>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>> sequential pattern. It may not be easily fixable for many of them.
>>>
>>> It is however possible to detect most of these patterns at the
>>> readahead layer and restore sequential I/Os, before they propagate
>>> into the block layer and hurt performance.
>>>
>>> Vitaly, if that's what you need, I can try to prepare a patch for
>>> testing out.
>> It's not easy. To really fix it, you have to get that sequential RA
>> pattern from just the single process. As soon as you spread the IO
>> between processes (eg N-1 aren't just getting cache hits), then you may
>> run into trouble on the IO scheduler side.
> 
> Yes, it's not easy(or possible) to tell from file->f_ra all those
> cooperative processes working on the same sequential stream, since
> they will have different file->f_ra instances. In the case of NFSD,
> the file->f_ra may well be all zeros.
> 
> Another scheme is to detect the sequential pattern via looking up
> the page cache, which provides one single and consistent view of the
> pages recently accessed. That makes sequential detection possible.
> 
> The cost will be one extra page cache lookup per random read.
> If it's not acceptable, the corresponding code could be disabled
> by default. 

I think this would be the best and the simplest way to go. Since in 
most cases the data from the cache is later copied to user space anyway, 
one more page cache lookup should be negligible.
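
Just to illustrate the scheme being discussed (a rough sketch of the idea, 
not the actual patch): on a read that file->f_ra classifies as random, probe 
the page cache for the pages immediately preceding the requested one; a long 
run of cached history means some reader - possibly another thread or 
process - has recently gone through them, so the read can be treated as part 
of a sequential stream and keep a proper readahead window:

#include <linux/pagemap.h>

/*
 * Count how many pages immediately before @index are already cached.
 * A long run of cached history suggests an interleaved sequential
 * stream even when the per-file readahead state says "random".
 */
static unsigned long probe_cached_history(struct address_space *mapping,
					  pgoff_t index, unsigned long max)
{
	unsigned long count;

	for (count = 0; count < max && index; count++) {
		struct page *page = find_get_page(mapping, index - 1);

		if (!page)
			break;			/* hole: history ends here */
		page_cache_release(page);
		index--;
	}
	return count;	/* large count: restore a big readahead window */
}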

> Thanks,
> Fengguang
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 11:02                 ` Vladislav Bolkhovitin
@ 2008-11-25 11:25                   ` Wu Fengguang
  2008-11-25 15:21                   ` Jeff Moyer
  1 sibling, 0 replies; 70+ messages in thread
From: Wu Fengguang @ 2008-11-25 11:25 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

On Tue, Nov 25, 2008 at 02:02:45PM +0300, Vladislav Bolkhovitin wrote:
>
>
> Wu Fengguang wrote:
>> On Thu, Nov 13, 2008 at 09:54:39AM +0100, Jens Axboe wrote:
>>> On Thu, Nov 13 2008, Wu Fengguang wrote:
>>>> Hi all,
>>>>
>>>> //Sorry for being late. 
>>>>
>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>> [...]
>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>> here as well.
>>>>>
>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>> results), the original patch came about because dump(8) has a really
>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>> aware of any other good programs out there that would do something
>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>> detecting cooperating processes.
>>>>>
>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>> him that santa will bring him something nice this year if he does (since
>>>>> I'm sure it'll be painful on the eyes).
>>>> This could also be fixed at the VFS readahead level.
>>>>
>>>> In fact I've seen many kinds of interleaved accesses:
>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>   chunks in another one
>>>> - a pool of NFSDs randomly serving some originally sequential read 
>>>> requests - now dump(8) seems to have some similar problem.
>>>>
>>>> In summary there have been all kinds of efforts on trying to
>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>
>>>> It is however possible to detect most of these patterns at the
>>>> readahead layer and restore sequential I/Os, before they propagate
>>>> into the block layer and hurt performance.
>>>>
>>>> Vitaly, if that's what you need, I can try to prepare a patch for
>>>> testing out.
>>> It's not easy. To really fix it, you have to get that sequential RA
>>> pattern from just the single process. As soon as you spread the IO
>>> between processes (eg N-1 aren't just getting cache hits), then you may
>>> run into trouble on the IO scheduler side.
>>
>> Yes, it's not easy(or possible) to tell from file->f_ra all those
>> cooperative processes working on the same sequential stream, since
>> they will have different file->f_ra instances. In the case of NFSD,
>> the file->f_ra may well be all zeros.
>>
>> Another scheme is to detect the sequential pattern via looking up
>> the page cache, which provides one single and consistent view of the
>> pages recently accessed. That makes sequential detection possible.
>>
>> The cost will be one extra page cache lookup per random read.
>> If it's not acceptable, the corresponding code could be disabled
>> by default. 
>
> I think, this should be the best and the simplest way to go. Since in  
> most case data from the cache should be later copied to user, one more  
> page cache lookup should be negligible.

After the initial proposal, two merits of implementing cooperative
sequential I/O detection in the readahead code come to mind:

1) readahead can make larger (hence more efficient) I/O requests
2) the page cache lookup trick eliminates the overhead of extra rbtrees.

So I would definitely like to try it out :-)

Thank you,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 10:59             ` Vladislav Bolkhovitin
@ 2008-11-25 11:30               ` Wu Fengguang
  2008-11-25 11:41                 ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2008-11-25 11:30 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang wrote:
>> Hi all,
>>
>> //Sorry for being late. 
>>
>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>> [...]
>>> I already talked about this with Jeff on irc, but I guess should post it
>>> here as well.
>>>
>>> nfsd aside (which does seem to have some different behaviour skewing the
>>> results), the original patch came about because dump(8) has a really
>>> stupid design that offloads IO to a number of processes. This basically
>>> makes fairly sequential IO more random with CFQ, since each process gets
>>> its own io context. My feeling is that we should fix dump instead of
>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>> aware of any other good programs out there that would do something
>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>> detecting cooperating processes.
>>>
>>> Jeff will take a look at fixing dump instead, and I may have promised
>>> him that santa will bring him something nice this year if he does (since
>>> I'm sure it'll be painful on the eyes).
>>
>> This could also be fixed at the VFS readahead level.
>>
>> In fact I've seen many kinds of interleaved accesses:
>> - concurrently reading 40 files that are in fact hard links of one single file
>> - a backup tool that splits a big file into 8k chunks, and serve the
>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>   chunks in another one
>> - a pool of NFSDs randomly serving some originally sequential read 
>> requests - now dump(8) seems to have some similar problem.
>>
>> In summary there have been all kinds of efforts on trying to
>> parallelize I/O tasks, but unfortunately they can easily screw up the
>> sequential pattern. It may not be easily fixable for many of them.
>>
>> It is however possible to detect most of these patterns at the
>> readahead layer and restore sequential I/Os, before they propagate
>> into the block layer and hurt performance.
>
> I believe this would be the most effective way to go, especially in case  
> if data delivery path to the original client has its own latency  
> depended from the amount of transferred data as it is in the case of  
> remote NFS mount, which does synchronous sequential reads. In this case  
> it is essential for performance to make both links (local to the storage  
> and network to the client) be always busy and transfer data  
> simultaneously. Since the reads are synchronous, the only way to achieve  
> that is perform read ahead on the server sufficient to cover the network  
> link latency. Otherwise you would end up with only half of possible  
> throughput.
>
> However, from one side, server has to have a pool of threads/processes  
> to perform well, but, from other side, current read ahead code doesn't  
> detect too well that those threads/processes are doing joint sequential  
> read, so the read ahead window gets smaller, hence the overall read  
> performance gets considerably smaller too.
>
>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>
> I can test it with SCST SCSI target sybsystem (http://scst.sf.net). SCST  
> needs such feature very much, otherwise it can't get full backstorage  
> read speed. The maximum I can see is about ~80MB/s from ~130MB/s 15K RPM  
> disk over 1Gbps iSCSI link (maximum possible is ~110MB/s).

Thank you very much!

BTW, do you mean that the SCSI system (or its applications) exhibits
similar behaviour that the current readahead code cannot handle well?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 11:30               ` Wu Fengguang
@ 2008-11-25 11:41                 ` Vladislav Bolkhovitin
  2008-11-25 11:49                   ` Wu Fengguang
  0 siblings, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-25 11:41 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

Wu Fengguang wrote:
> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>> Wu Fengguang wrote:
>>> Hi all,
>>>
>>> //Sorry for being late. 
>>>
>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>> [...]
>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>> here as well.
>>>>
>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>> results), the original patch came about because dump(8) has a really
>>>> stupid design that offloads IO to a number of processes. This basically
>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>> its own io context. My feeling is that we should fix dump instead of
>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>> aware of any other good programs out there that would do something
>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>> detecting cooperating processes.
>>>>
>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>> him that santa will bring him something nice this year if he does (since
>>>> I'm sure it'll be painful on the eyes).
>>> This could also be fixed at the VFS readahead level.
>>>
>>> In fact I've seen many kinds of interleaved accesses:
>>> - concurrently reading 40 files that are in fact hard links of one single file
>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>   chunks in another one
>>> - a pool of NFSDs randomly serving some originally sequential read 
>>> requests - now dump(8) seems to have some similar problem.
>>>
>>> In summary there have been all kinds of efforts on trying to
>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>> sequential pattern. It may not be easily fixable for many of them.
>>>
>>> It is however possible to detect most of these patterns at the
>>> readahead layer and restore sequential I/Os, before they propagate
>>> into the block layer and hurt performance.
>> I believe this would be the most effective way to go, especially in case  
>> if data delivery path to the original client has its own latency  
>> depended from the amount of transferred data as it is in the case of  
>> remote NFS mount, which does synchronous sequential reads. In this case  
>> it is essential for performance to make both links (local to the storage  
>> and network to the client) be always busy and transfer data  
>> simultaneously. Since the reads are synchronous, the only way to achieve  
>> that is perform read ahead on the server sufficient to cover the network  
>> link latency. Otherwise you would end up with only half of possible  
>> throughput.
>>
>> However, from one side, server has to have a pool of threads/processes  
>> to perform well, but, from other side, current read ahead code doesn't  
>> detect too well that those threads/processes are doing joint sequential  
>> read, so the read ahead window gets smaller, hence the overall read  
>> performance gets considerably smaller too.
>>
>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>> I can test it with SCST SCSI target sybsystem (http://scst.sf.net). SCST  
>> needs such feature very much, otherwise it can't get full backstorage  
>> read speed. The maximum I can see is about ~80MB/s from ~130MB/s 15K RPM  
>> disk over 1Gbps iSCSI link (maximum possible is ~110MB/s).
> 
> Thank you very much!
> 
> BTW, do you implicate that the SCSI system (or its applications) has
> similar behaviors that the current readahead code cannot handle well?

No. The SCSI target subsystem is not the same as the SCSI initiator 
subsystem, which is usually called simply the SCSI (sub)system. A SCSI 
target is a SCSI server. It has about as much in common with a SCSI 
initiator as there is, e.g., between Apache (an HTTP server) and Firefox 
(an HTTP client).

> Thanks,
> Fengguang
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 11:41                 ` Vladislav Bolkhovitin
@ 2008-11-25 11:49                   ` Wu Fengguang
  2008-11-25 12:03                     ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2008-11-25 11:49 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang wrote:
>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>> Wu Fengguang wrote:
>>>> Hi all,
>>>>
>>>> //Sorry for being late. 
>>>>
>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>> [...]
>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>> here as well.
>>>>>
>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>> results), the original patch came about because dump(8) has a really
>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>> aware of any other good programs out there that would do something
>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>> detecting cooperating processes.
>>>>>
>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>> him that santa will bring him something nice this year if he does (since
>>>>> I'm sure it'll be painful on the eyes).
>>>> This could also be fixed at the VFS readahead level.
>>>>
>>>> In fact I've seen many kinds of interleaved accesses:
>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>   chunks in another one
>>>> - a pool of NFSDs randomly serving some originally sequential read  
>>>> requests - now dump(8) seems to have some similar problem.
>>>>
>>>> In summary there have been all kinds of efforts on trying to
>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>
>>>> It is however possible to detect most of these patterns at the
>>>> readahead layer and restore sequential I/Os, before they propagate
>>>> into the block layer and hurt performance.
>>> I believe this would be the most effective way to go, especially in 
>>> case  if data delivery path to the original client has its own 
>>> latency  depended from the amount of transferred data as it is in the 
>>> case of  remote NFS mount, which does synchronous sequential reads. 
>>> In this case  it is essential for performance to make both links 
>>> (local to the storage  and network to the client) be always busy and 
>>> transfer data  simultaneously. Since the reads are synchronous, the 
>>> only way to achieve  that is perform read ahead on the server 
>>> sufficient to cover the network  link latency. Otherwise you would 
>>> end up with only half of possible  throughput.
>>>
>>> However, from one side, server has to have a pool of 
>>> threads/processes  to perform well, but, from other side, current 
>>> read ahead code doesn't  detect too well that those threads/processes 
>>> are doing joint sequential  read, so the read ahead window gets 
>>> smaller, hence the overall read  performance gets considerably 
>>> smaller too.
>>>
>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>> I can test it with SCST SCSI target sybsystem (http://scst.sf.net). 
>>> SCST  needs such feature very much, otherwise it can't get full 
>>> backstorage  read speed. The maximum I can see is about ~80MB/s from 
>>> ~130MB/s 15K RPM  disk over 1Gbps iSCSI link (maximum possible is 
>>> ~110MB/s).
>>
>> Thank you very much!
>>
>> BTW, do you implicate that the SCSI system (or its applications) has
>> similar behaviors that the current readahead code cannot handle well?
>
> No. SCSI target subsystem is not the same as SCSI initiator subsystem,  
> which usually called simply SCSI (sub)system. SCSI target is a SCSI  
> server. It has the same amount of common with SCSI initiator as there  
> is, e.g., between Apache (HTTP server) and Firefox (HTTP client).

Got it. So the SCSI server will split and spread the sequential IO of one
single file across cooperating threads? I'm trying to understand why the
proposed page cache context based readahead would help a SCSI server.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 11:49                   ` Wu Fengguang
@ 2008-11-25 12:03                     ` Vladislav Bolkhovitin
  2008-11-25 12:09                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-25 12:03 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

Wu Fengguang wrote:
> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>> Wu Fengguang wrote:
>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>> Wu Fengguang wrote:
>>>>> Hi all,
>>>>>
>>>>> //Sorry for being late. 
>>>>>
>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>> [...]
>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>> here as well.
>>>>>>
>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>> results), the original patch came about because dump(8) has a really
>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>> aware of any other good programs out there that would do something
>>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>>> detecting cooperating processes.
>>>>>>
>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>> I'm sure it'll be painful on the eyes).
>>>>> This could also be fixed at the VFS readahead level.
>>>>>
>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>   chunks in another one
>>>>> - a pool of NFSDs randomly serving some originally sequential read  
>>>>> requests - now dump(8) seems to have some similar problem.
>>>>>
>>>>> In summary there have been all kinds of efforts on trying to
>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>
>>>>> It is however possible to detect most of these patterns at the
>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>> into the block layer and hurt performance.
>>>> I believe this would be the most effective way to go, especially in 
>>>> case  if data delivery path to the original client has its own 
>>>> latency  depended from the amount of transferred data as it is in the 
>>>> case of  remote NFS mount, which does synchronous sequential reads. 
>>>> In this case  it is essential for performance to make both links 
>>>> (local to the storage  and network to the client) be always busy and 
>>>> transfer data  simultaneously. Since the reads are synchronous, the 
>>>> only way to achieve  that is perform read ahead on the server 
>>>> sufficient to cover the network  link latency. Otherwise you would 
>>>> end up with only half of possible  throughput.
>>>>
>>>> However, from one side, server has to have a pool of 
>>>> threads/processes  to perform well, but, from other side, current 
>>>> read ahead code doesn't  detect too well that those threads/processes 
>>>> are doing joint sequential  read, so the read ahead window gets 
>>>> smaller, hence the overall read  performance gets considerably 
>>>> smaller too.
>>>>
>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>> I can test it with SCST SCSI target sybsystem (http://scst.sf.net). 
>>>> SCST  needs such feature very much, otherwise it can't get full 
>>>> backstorage  read speed. The maximum I can see is about ~80MB/s from 
>>>> ~130MB/s 15K RPM  disk over 1Gbps iSCSI link (maximum possible is 
>>>> ~110MB/s).
>>> Thank you very much!
>>>
>>> BTW, do you implicate that the SCSI system (or its applications) has
>>> similar behaviors that the current readahead code cannot handle well?
>> No. SCSI target subsystem is not the same as SCSI initiator subsystem,  
>> which usually called simply SCSI (sub)system. SCSI target is a SCSI  
>> server. It has the same amount of common with SCSI initiator as there  
>> is, e.g., between Apache (HTTP server) and Firefox (HTTP client).
> 
> Got it. So the SCSI server will split&spread sequential IO of one
> single file to cooperative threads?

Yes. It has to do so, because Linux doesn't have asynchronous cached IO and 
a client can queue several tens of commands at a time. Then, even on 
sequential IO with one command at a time, the CPU scheduler comes into play 
and spreads those commands over those threads, so the read ahead window gets 
too small to cover the external link latency and keep both links filled with 
data, and that uncovered latency kills throughput.

> I'm trying to understand why the
> proposed page cache context based readahead would help a SCSI server.
> 
> Thanks,
> Fengguang
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 12:03                     ` Vladislav Bolkhovitin
@ 2008-11-25 12:09                       ` Vladislav Bolkhovitin
  2008-11-25 12:15                         ` Wu Fengguang
  0 siblings, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-25 12:09 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

Vladislav Bolkhovitin wrote:
> Wu Fengguang wrote:
>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>> Wu Fengguang wrote:
>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>> Wu Fengguang wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> //Sorry for being late. 
>>>>>>
>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>> [...]
>>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>>> here as well.
>>>>>>>
>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>> aware of any other good programs out there that would do something
>>>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>>>> detecting cooperating processes.
>>>>>>>
>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>
>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>   chunks in another one
>>>>>> - a pool of NFSDs randomly serving some originally sequential read  
>>>>>> requests - now dump(8) seems to have some similar problem.
>>>>>>
>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>
>>>>>> It is however possible to detect most of these patterns at the
>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>> into the block layer and hurt performance.
>>>>> I believe this would be the most effective way to go, especially in 
>>>>> case  if data delivery path to the original client has its own 
>>>>> latency  depended from the amount of transferred data as it is in the 
>>>>> case of  remote NFS mount, which does synchronous sequential reads. 
>>>>> In this case  it is essential for performance to make both links 
>>>>> (local to the storage  and network to the client) be always busy and 
>>>>> transfer data  simultaneously. Since the reads are synchronous, the 
>>>>> only way to achieve  that is perform read ahead on the server 
>>>>> sufficient to cover the network  link latency. Otherwise you would 
>>>>> end up with only half of possible  throughput.
>>>>>
>>>>> However, from one side, server has to have a pool of 
>>>>> threads/processes  to perform well, but, from other side, current 
>>>>> read ahead code doesn't  detect too well that those threads/processes 
>>>>> are doing joint sequential  read, so the read ahead window gets 
>>>>> smaller, hence the overall read  performance gets considerably 
>>>>> smaller too.
>>>>>
>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>> I can test it with SCST SCSI target sybsystem (http://scst.sf.net). 
>>>>> SCST  needs such feature very much, otherwise it can't get full 
>>>>> backstorage  read speed. The maximum I can see is about ~80MB/s from 
>>>>> ~130MB/s 15K RPM  disk over 1Gbps iSCSI link (maximum possible is 
>>>>> ~110MB/s).
>>>> Thank you very much!
>>>>
>>>> BTW, do you implicate that the SCSI system (or its applications) has
>>>> similar behaviors that the current readahead code cannot handle well?
>>> No. SCSI target subsystem is not the same as SCSI initiator subsystem,  
>>> which usually called simply SCSI (sub)system. SCSI target is a SCSI  
>>> server. It has the same amount of common with SCSI initiator as there  
>>> is, e.g., between Apache (HTTP server) and Firefox (HTTP client).
>> Got it. So the SCSI server will split&spread sequential IO of one
>> single file to cooperative threads?
> 
> Yes. It has to do so, because Linux doesn't have async. cached IO and a 
> client can queue several tens of commands at time. Then, on the 
> sequential IO with 1 command at time, CPU scheduler comes to play and 
> spreads those commands over those threads, so read ahead gets too small 
> to cover the external link latency and fill both links with data, so 
> that uncovered latency kills throughput.

Additionally, if the uncovered external link latency is too large, one more 
factor becomes noticeable: storage rotational latency. If the next unread 
sector isn't read in time, the server has to wait a full rotation before it 
starts receiving data for the next block, which decreases the resulting 
throughput even further.
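
For example (rough numbers, assuming the 15K RPM disk mentioned earlier): 
one full rotation at 15,000 RPM takes 60 s / 15000 = 4 ms, so every missed 
sector can cost up to 4 ms of waiting; at ~130 MB/s that is roughly 0.5 MB 
of transfer lost per miss.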

>> I'm trying to understand why the
>> proposed page cache context based readahead would help a SCSI server.
>>
>> Thanks,
>> Fengguang
>>
> 
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 12:09                       ` Vladislav Bolkhovitin
@ 2008-11-25 12:15                         ` Wu Fengguang
  2008-11-27 17:46                           ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2008-11-25 12:15 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel

On Tue, Nov 25, 2008 at 03:09:12PM +0300, Vladislav Bolkhovitin wrote:
> Vladislav Bolkhovitin wrote:
>> Wu Fengguang wrote:
>>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>>> Wu Fengguang wrote:
>>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>>> Wu Fengguang wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> //Sorry for being late. 
>>>>>>>
>>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>>> [...]
>>>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>>>> here as well.
>>>>>>>>
>>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>>> aware of any other good programs out there that would do something
>>>>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>>>>> detecting cooperating processes.
>>>>>>>>
>>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>>
>>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>>   chunks in another one
>>>>>>> - a pool of NFSDs randomly serving some originally sequential 
>>>>>>> read  requests - now dump(8) seems to have some similar 
>>>>>>> problem.
>>>>>>>
>>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>>
>>>>>>> It is however possible to detect most of these patterns at the
>>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>>> into the block layer and hurt performance.
>>>>>> I believe this would be the most effective way to go, 
>>>>>> especially in case  if data delivery path to the original 
>>>>>> client has its own latency  depended from the amount of 
>>>>>> transferred data as it is in the case of  remote NFS mount, 
>>>>>> which does synchronous sequential reads. In this case  it is 
>>>>>> essential for performance to make both links (local to the 
>>>>>> storage  and network to the client) be always busy and  
>>>>>> transfer data  simultaneously. Since the reads are synchronous, 
>>>>>> the only way to achieve  that is perform read ahead on the 
>>>>>> server sufficient to cover the network  link latency. Otherwise 
>>>>>> you would end up with only half of possible  throughput.
>>>>>>
>>>>>> However, from one side, server has to have a pool of  
>>>>>> threads/processes  to perform well, but, from other side, 
>>>>>> current read ahead code doesn't  detect too well that those 
>>>>>> threads/processes are doing joint sequential  read, so the read 
>>>>>> ahead window gets smaller, hence the overall read  performance 
>>>>>> gets considerably smaller too.
>>>>>>
>>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>>> I can test it with SCST SCSI target sybsystem 
>>>>>> (http://scst.sf.net). SCST  needs such feature very much, 
>>>>>> otherwise it can't get full backstorage  read speed. The 
>>>>>> maximum I can see is about ~80MB/s from ~130MB/s 15K RPM  disk 
>>>>>> over 1Gbps iSCSI link (maximum possible is ~110MB/s).
>>>>> Thank you very much!
>>>>>
>>>>> BTW, do you implicate that the SCSI system (or its applications) has
>>>>> similar behaviors that the current readahead code cannot handle well?
>>>> No. SCSI target subsystem is not the same as SCSI initiator 
>>>> subsystem,  which usually called simply SCSI (sub)system. SCSI 
>>>> target is a SCSI  server. It has the same amount of common with 
>>>> SCSI initiator as there  is, e.g., between Apache (HTTP server) and 
>>>> Firefox (HTTP client).
>>> Got it. So the SCSI server will split&spread sequential IO of one
>>> single file to cooperative threads?
>>
>> Yes. It has to do so, because Linux doesn't have async. cached IO and a 
>> client can queue several tens of commands at time. Then, on the  
>> sequential IO with 1 command at time, CPU scheduler comes to play and  
>> spreads those commands over those threads, so read ahead gets too small 
>> to cover the external link latency and fill both links with data, so  
>> that uncovered latency kills throughput.
>
> Additionally, if the uncovered external link latency is too large, one  
> more factor is getting noticeable: storage rotation latency. If the next  
> unread sector is missed to be read at time, server has to wait a full  
> rotation to start receiving data for the next block, which even more  
> decreases the resulting throughput.

Thank you for the details. I've been working slowly on the idea, and
should be able to send you a patch in the next one or two days.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 11:02                 ` Vladislav Bolkhovitin
  2008-11-25 11:25                   ` Wu Fengguang
@ 2008-11-25 15:21                   ` Jeff Moyer
  2008-11-25 16:17                     ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 70+ messages in thread
From: Jeff Moyer @ 2008-11-25 15:21 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Wu Fengguang, Jens Axboe, Vitaly V. Bursov, linux-kernel

Vladislav Bolkhovitin <vst@vlnb.net> writes:

> Wu Fengguang wrote:
>> Another scheme is to detect the sequential pattern via looking up
>> the page cache, which provides one single and consistent view of the
>> pages recently accessed. That makes sequential detection possible.
>>
>> The cost will be one extra page cache lookup per random read.
>> If it's not acceptable, the corresponding code could be disabled
>> by default. 
>
> I think, this should be the best and the simplest way to go. Since in
> most case data from the cache should be later copied to user, one more
> page cache lookup should be negligible.

I haven't thought about your suggestion in any detail, but it seems to
me that you will not handle O_DIRECT I/O with this scheme.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 15:21                   ` Jeff Moyer
@ 2008-11-25 16:17                     ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-25 16:17 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Vladislav Bolkhovitin, Wu Fengguang, Jens Axboe,
	Vitaly V. Bursov, linux-kernel

Jeff Moyer wrote:
> Vladislav Bolkhovitin <vst@vlnb.net> writes:
> 
>> Wu Fengguang wrote:
>>> Another scheme is to detect the sequential pattern via looking up
>>> the page cache, which provides one single and consistent view of the
>>> pages recently accessed. That makes sequential detection possible.
>>>
>>> The cost will be one extra page cache lookup per random read.
>>> If it's not acceptable, the corresponding code could be disabled
>>> by default. 
>> I think, this should be the best and the simplest way to go. Since in
>> most case data from the cache should be later copied to user, one more
>> page cache lookup should be negligible.
> 
> I haven't thought about your suggestion in any detail, but it seems to
> me that you will not handle O_DIRECT I/O with this scheme.

And it shouldn't. Anyone using O_DIRECT is supposed to understand and accept 
that none of the features provided by the page cache, including read ahead, 
will be available. Hence, if necessary, they must implement their own read 
ahead procedures.

> Cheers,
> Jeff

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-25 12:15                         ` Wu Fengguang
@ 2008-11-27 17:46                           ` Vladislav Bolkhovitin
  2008-11-28  0:48                             ` Wu Fengguang
  0 siblings, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2008-11-27 17:46 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel


Wu Fengguang wrote:
> On Tue, Nov 25, 2008 at 03:09:12PM +0300, Vladislav Bolkhovitin wrote:
>> Vladislav Bolkhovitin wrote:
>>> Wu Fengguang wrote:
>>>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>>>> Wu Fengguang wrote:
>>>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>>>> Wu Fengguang wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> //Sorry for being late. 
>>>>>>>>
>>>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>>>> [...]
>>>>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>>>>> here as well.
>>>>>>>>>
>>>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>>>> aware of any other good programs out there that would do something
>>>>>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>>>>>> detecting cooperating processes.
>>>>>>>>>
>>>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>>>
>>>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>>>   chunks in another one
>>>>>>>> - a pool of NFSDs randomly serving some originally sequential 
>>>>>>>> read  requests - now dump(8) seems to have some similar 
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>>>
>>>>>>>> It is however possible to detect most of these patterns at the
>>>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>>>> into the block layer and hurt performance.
>>>>>>> I believe this would be the most effective way to go, 
>>>>>>> especially in case  if data delivery path to the original 
>>>>>>> client has its own latency  depended from the amount of 
>>>>>>> transferred data as it is in the case of  remote NFS mount, 
>>>>>>> which does synchronous sequential reads. In this case  it is 
>>>>>>> essential for performance to make both links (local to the 
>>>>>>> storage  and network to the client) be always busy and  
>>>>>>> transfer data  simultaneously. Since the reads are synchronous, 
>>>>>>> the only way to achieve  that is perform read ahead on the 
>>>>>>> server sufficient to cover the network  link latency. Otherwise 
>>>>>>> you would end up with only half of possible  throughput.
>>>>>>>
>>>>>>> However, from one side, server has to have a pool of  
>>>>>>> threads/processes  to perform well, but, from other side, 
>>>>>>> current read ahead code doesn't  detect too well that those 
>>>>>>> threads/processes are doing joint sequential  read, so the read 
>>>>>>> ahead window gets smaller, hence the overall read  performance 
>>>>>>> gets considerably smaller too.
>>>>>>>
>>>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>>>> I can test it with SCST SCSI target sybsystem 
>>>>>>> (http://scst.sf.net). SCST  needs such feature very much, 
>>>>>>> otherwise it can't get full backstorage  read speed. The 
>>>>>>> maximum I can see is about ~80MB/s from ~130MB/s 15K RPM  disk 
>>>>>>> over 1Gbps iSCSI link (maximum possible is ~110MB/s).
>>>>>> Thank you very much!
>>>>>>
>>>>>> BTW, do you implicate that the SCSI system (or its applications) has
>>>>>> similar behaviors that the current readahead code cannot handle well?
>>>>> No. SCSI target subsystem is not the same as SCSI initiator 
>>>>> subsystem,  which usually called simply SCSI (sub)system. SCSI 
>>>>> target is a SCSI  server. It has the same amount of common with 
>>>>> SCSI initiator as there  is, e.g., between Apache (HTTP server) and 
>>>>> Firefox (HTTP client).
>>>> Got it. So the SCSI server will split&spread sequential IO of one
>>>> single file to cooperative threads?
>>> Yes. It has to do so, because Linux doesn't have async. cached IO and a 
>>> client can queue several tens of commands at time. Then, on the  
>>> sequential IO with 1 command at time, CPU scheduler comes to play and  
>>> spreads those commands over those threads, so read ahead gets too small 
>>> to cover the external link latency and fill both links with data, so  
>>> that uncovered latency kills throughput.
>> Additionally, if the uncovered external link latency is too large, one  
>> more factor is getting noticeable: storage rotation latency. If the next  
>> unread sector is missed to be read at time, server has to wait a full  
>> rotation to start receiving data for the next block, which even more  
>> decreases the resulting throughput.
> 
> Thank you for the details. I've been working slowly on the idea, and
> should be able to send you a patch in the next one or two days.

Actually, there's one more thing which should have been mentioned. It is 
possible that remote clients have several sequential read streams at a time, 
together with some "noise" of random requests. A good read-ahead subsystem 
should handle such a case by keeping big read-ahead windows for the 
sequential streams while doing no read ahead at all for the random requests, 
all at the same time.

Currently, on such workloads read ahead will be completely disabled for all 
the requests. Hence, there is a possibility here to improve performance by a 
factor of 3-5 or even more by making the workload more linear.

I guess such functionality could be implemented by keeping several 
read-ahead contexts not in the file handle as now, but in the inode, or even 
in task_struct similarly to io_context, or even as part of struct 
io_context. Those contexts would then be recycled in, e.g., a round-robin 
manner (see the sketch below). I have some basic thoughts in this area and 
would be glad to share them if you are interested.

Going further, ultimately it would be great to provide some way to assign 
each read-ahead stream its own IO context, because on the remote side such 
streams would in most cases correspond to different programs reading data 
in parallel. This should allow performance to increase even more.
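
Roughly what I have in mind, as a sketch only (the struct file_ra_state 
fields are the existing ones, everything else here is hypothetical):

#include <linux/fs.h>

/*
 * A small pool of readahead states kept per inode instead of per file
 * handle.  An incoming read is matched against the existing windows: a
 * hit keeps that stream's large window alive, a miss recycles an entry
 * round-robin, and purely random reads never build a window up.
 */
#define RA_STREAMS 4

struct ra_stream_pool {
	struct file_ra_state	streams[RA_STREAMS];
	unsigned int		next_victim;
};

static struct file_ra_state *ra_pick_stream(struct ra_stream_pool *pool,
					    pgoff_t index)
{
	unsigned int i;

	for (i = 0; i < RA_STREAMS; i++) {
		struct file_ra_state *ra = &pool->streams[i];

		/* close to an existing window: same sequential stream */
		if (index >= ra->start && index <= ra->start + ra->size)
			return ra;
	}
	/* no match: hand out the next entry for a (possibly) new stream */
	return &pool->streams[pool->next_victim++ % RA_STREAMS];
}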

> Thanks,
> Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-27 17:46                           ` Vladislav Bolkhovitin
@ 2008-11-28  0:48                             ` Wu Fengguang
  2009-02-12 18:35                               ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2008-11-28  0:48 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 12051 bytes --]

On Thu, Nov 27, 2008 at 08:46:35PM +0300, Vladislav Bolkhovitin wrote:
>
> Wu Fengguang wrote:
>> On Tue, Nov 25, 2008 at 03:09:12PM +0300, Vladislav Bolkhovitin wrote:
>>> Vladislav Bolkhovitin wrote:
>>>> Wu Fengguang wrote:
>>>>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>>>>> Wu Fengguang wrote:
>>>>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>>>>> Wu Fengguang wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> //Sorry for being late. 
>>>>>>>>>
>>>>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>>>>> [...]
>>>>>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>>>>>> here as well.
>>>>>>>>>>
>>>>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>>>>> aware of any other good programs out there that would do something
>>>>>>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>>>>>>> detecting cooperating processes.
>>>>>>>>>>
>>>>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>>>>
>>>>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>>>>>   {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>>>>   chunks in another one
>>>>>>>>> - a pool of NFSDs randomly serving some originally 
>>>>>>>>> sequential read  requests - now dump(8) seems to have 
>>>>>>>>> some similar problem.
>>>>>>>>>
>>>>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>>>>
>>>>>>>>> It is however possible to detect most of these patterns at the
>>>>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>>>>> into the block layer and hurt performance.
>>>>>>>> I believe this would be the most effective way to go,
>>>>>>>> especially in the case where the data delivery path to the original
>>>>>>>> client has its own latency that depends on the amount of
>>>>>>>> transferred data, as it is in the case of a remote NFS mount,
>>>>>>>> which does synchronous sequential reads. In this case it
>>>>>>>> is essential for performance to keep both links (local to
>>>>>>>> the storage and network to the client) always busy and
>>>>>>>> transferring data simultaneously. Since the reads are
>>>>>>>> synchronous, the only way to achieve that is to perform read
>>>>>>>> ahead on the server sufficient to cover the network link
>>>>>>>> latency. Otherwise you would end up with only half of the
>>>>>>>> possible throughput.
>>>>>>>>
>>>>>>>> However, on one hand, the server has to have a pool of
>>>>>>>> threads/processes to perform well, but, on the other hand, the
>>>>>>>> current read ahead code doesn't detect too well that those
>>>>>>>> threads/processes are doing a joint sequential read, so the
>>>>>>>> read ahead window gets smaller, and hence the overall read
>>>>>>>> performance gets considerably smaller too.
>>>>>>>>
>>>>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>>>>> I can test it with the SCST SCSI target subsystem
>>>>>>>> (http://scst.sf.net). SCST needs such a feature very much,
>>>>>>>> otherwise it can't get the full backstorage read speed. The
>>>>>>>> maximum I can see is about ~80MB/s from a ~130MB/s 15K RPM
>>>>>>>> disk over a 1Gbps iSCSI link (the maximum possible is ~110MB/s).
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> BTW, do you mean that the SCSI system (or its applications) has
>>>>>>> similar behaviors that the current readahead code cannot handle well?
>>>>>> No. The SCSI target subsystem is not the same as the SCSI initiator
>>>>>> subsystem, which is usually called simply the SCSI (sub)system. A SCSI
>>>>>> target is a SCSI server. It has about as much in common with a
>>>>>> SCSI initiator as there is, e.g., between Apache (an HTTP server)
>>>>>> and Firefox (an HTTP client).
>>>>> Got it. So the SCSI server will split&spread sequential IO of one
>>>>> single file to cooperative threads?
>>>> Yes. It has to do so, because Linux doesn't have async cached IO
>>>> and a client can queue several tens of commands at a time. Then, on
>>>> sequential IO with 1 command at a time, the CPU scheduler comes into
>>>> play and spreads those commands over those threads, so read ahead
>>>> gets too small to cover the external link latency and fill both
>>>> links with data, so that uncovered latency kills throughput.
>>> Additionally, if the uncovered external link latency is too large,
>>> one more factor becomes noticeable: storage rotation latency. If
>>> the next unread sector is not read in time, the server has to
>>> wait a full rotation to start receiving data for the next block,
>>> which decreases the resulting throughput even more.
>
>> Thank you for the details. I've been working slowly on the idea, and
>> should be able to send you a patch in the next one or two days.
>
> Actually, there's one more thing which should have been mentioned. It
> is possible that remote clients have several sequential read streams at
> a time, together with some "noise" of random requests. A good read-ahead
> subsystem should handle such a case by keeping big read-ahead windows for
> the sequential streams and not doing any read ahead for the random
> requests. And all at the same time.
>
> Currently, on such workloads read ahead will be completely disabled for
> all the requests. Hence, there is a possibility here to improve
> performance by 3-5 times or even more by making the workload more linear.

Are you sure? I'd expect such a mixed-sequential-random pattern to be
handled by the current readahead code pretty well: sequential ones
will get large readahead and random ones won't get readahead at all.

Attached is the context readahead patch plus a kernel module for
readahead tracing and accounting, which will hopefully help clarify the
read patterns and readahead behaviors on production workloads. It is
based on 2.6.27 for your convenience, but also applies to 2.6.28.

The patch is not targeted for code review, but if anyone is interested,
you can take a look at try_context_readahead(). This is the only newly
introduced readahead policy; the rest is code refactoring
and tracing facilities.

The newly introduced context readahead policy is disabled by default.
To enable it:
        echo 1 > /sys/block/sda/queue/context_readahead
I'm not sure for now whether this parameter will be a long term one, or
whether the context readahead policy should be enabled unconditionally.

The readahead accounting stats can be viewed by
        mount -t debugfs none /sys/kernel/debug
        cat /sys/kernel/debug/readahead/stats
The numbers can be reset by
        echo > /sys/kernel/debug/readahead/stats

Here is a sample output from my desktop:

% cat /sys/kernel/debug/readahead/stats
pattern         count sync_count  eof_count       size async_size     actual
none                0          0          0          0          0          0
initial0         3009       3009       2033          5          4          2
initial            35         35          0          5          4          3
subsequent       1294        240        827         52         49         26
marker            220          0        109         54         53         29
trail               0          0          0          0          0          0
oversize            0          0          0          0          0          0
reverse             0          0          0          0          0          0
stride              0          0          0          0          0          0
thrash              0          0          0          0          0          0
mmap             2833       2833       1379        142          0         47
fadvise             7          7          7          0          0         40
random           7621       7621         69          1          0          1
all             15019      13745       4424         33          5         12

The readahead/read tracing messages are disabled by default.
To enable them:
        echo 1 > /sys/kernel/debug/readahead/trace_enable
        echo 1 > /sys/kernel/debug/readahead/read_jprobes
They (especially the latter one) will generate a lot of printk messages like:

[  828.151013] readahead-initial0(pid=4644(zsh), dev=00:10(0:10), ino=351452(whoami), req=0+1, ra=0+4-3, async=0) = 4
[  828.167853] readahead-mmap(pid=4644(whoami), dev=00:10(0:10), ino=351452(whoami), req=0+0, ra=0+60-0, async=0) = 3
[  828.195652] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=115569(zsh_prompt), req=0+128, ra=0+120-60, async=0) = 3
[  828.225081] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=342086(.zsh_history), req=0+128, ra=0+120-60, async=0) = 4

[  964.471450] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=0, count=128)
[  964.471544] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=64, count=448)
[  964.471575] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=512, count=28)
[  964.472659] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=0, count=128)
[  964.473431] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=64, count=336)
[  964.475639] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383010(libc-2.7.so), pos=0, count=832)
[  964.479037] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=0, count=524288)
[  964.479166] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=2586, count=524288)

So please enable them only when necessary.

My recommendation for the double readahead in NFS clients and NFS servers
is to keep the client side readahead size small and the server side one large.
For example, 128K-512K/1M-2M (more for RAID). The NFS client side readahead size
is not directly tunable, but setting rsize to a small value does the trick.
Currently the NFS magic is readahead_size=N*rsize. The default numbers in my
2.6.28 kernel are rsize=512k, N=15, readahead_size=7680k. The latter is
obviously way too large.
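
As a rough illustration of that relation (readahead_size = N*rsize with N=15
assumed as above): mounting with rsize=16k gives the client a 15*16k = 240k
readahead window, and rsize=32k gives 480k, both within the 128K-512K range
suggested above, while the server side readahead can be raised independently
to 1M-2M (or more for RAID) via /sys/block/<dev>/queue/read_ahead_kb or
"blockdev --setra".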

Thanks,
Fengguang

> I guess such functionality can be done by keeping several read-ahead
> contexts not in the file handle as now, but in the inode, or even in
> task_struct similarly to io_context. Or even as a part of struct io_context.
> Then those contexts would be rotated in, e.g., a round-robin manner. I have
> some basic thoughts in this area and would be glad to share them, if you
> are interested.
>
> Going further, ultimately, it would be great to somehow provide a
> capability to assign each read-ahead stream its own IO context,
> because on the remote side in most cases such streams would correspond
> to different programs reading data in parallel. This should allow
> performance to increase even more.
>
>> Thanks,
>> Fengguang
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>

[-- Attachment #2: readahead-context-plus-tracing-module-2.6.27.patch --]
[-- Type: text/x-diff, Size: 25609 bytes --]

---
 block/blk-sysfs.c           |   25 +
 include/linux/backing-dev.h |    1 
 include/linux/fs.h          |   44 +++
 include/linux/radix-tree.h  |    2 
 lib/radix-tree.c            |   37 ++
 mm/Kconfig                  |   18 +
 mm/Makefile                 |    1 
 mm/filemap.c                |    2 
 mm/readahead.c              |  166 +++++++++--
 mm/readahead_trace.c        |  486 ++++++++++++++++++++++++++++++++++
 10 files changed, 754 insertions(+), 28 deletions(-)

--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -16,6 +16,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
+#include <linux/marker.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -204,6 +205,10 @@ int force_page_cache_readahead(struct ad
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
 	}
+
+	trace_mark(readahead_fadvise, "%p %p %lu %lu %d",
+			mapping, filp, offset, nr_to_read, ret);
+
 	return ret;
 }
 
@@ -217,10 +222,18 @@ int force_page_cache_readahead(struct ad
 int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read)
 {
+	int actual;
+
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+	actual = __do_page_cache_readahead(mapping, filp, offset,
+								nr_to_read, 0);
+
+	trace_mark(readahead_mmap, "%p %p %lu %lu %d",
+			mapping, filp, offset, nr_to_read, actual);
+
+	return actual;
 }
 
 /*
@@ -248,14 +261,21 @@ subsys_initcall(readahead_init);
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
-static unsigned long ra_submit(struct file_ra_state *ra,
-		       struct address_space *mapping, struct file *filp)
+static unsigned long ra_submit(struct address_space *mapping,
+			       struct file *filp,
+			       pgoff_t offset,
+			       unsigned long req_size,
+			       struct file_ra_state *ra,
+			       int async)
 {
 	int actual;
 
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	trace_mark(readahead_generic, "%p %p %lu %lu %p %d %d",
+			mapping, filp, offset, req_size, ra, actual, async);
+
 	return actual;
 }
 
@@ -337,6 +357,60 @@ static unsigned long get_next_ra_size(st
  */
 
 /*
+ * Count continuously cached pages from @offset-1 to @offset-@max,
+ * this count is a conservative estimation of
+ * 	- length of the sequential read sequence, or
+ * 	- thrashing threshold in memory tight systems
+ */
+static unsigned long count_history_pages(struct address_space *mapping,
+					 struct file_ra_state *ra,
+					 pgoff_t offset, unsigned long max)
+{
+	pgoff_t head;
+
+	rcu_read_lock();
+	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	rcu_read_unlock();
+
+	return offset - 1 - head;
+}
+
+/*
+ * page cache context based read-ahead
+ */
+static int try_context_readahead(struct address_space *mapping,
+				 struct file_ra_state *ra,
+				 pgoff_t offset,
+				 unsigned int req_size,
+				 unsigned int max)
+{
+	unsigned long size;
+
+	size = count_history_pages(mapping, ra, offset, max);
+
+	/*
+	 * no history pages:
+	 * it could be a random read
+	 */
+	if (!size)
+		return 0;
+
+	/*
+	 * starts from beginning of file:
+	 * it is a strong indication of long-run stream (or whole-file-read)
+	 */
+	if (size >= offset)
+		size *= 2;
+
+	ra_set_pattern(ra, RA_PATTERN_HAS_TRAIL);
+	ra->start = offset;
+	ra->size = get_init_ra_size(size + req_size, max);
+	ra->async_size = ra->size;
+
+	return 1;
+}
+
+/*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
 static unsigned long
@@ -345,34 +419,30 @@ ondemand_readahead(struct address_space 
 		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
-	int	max = ra->ra_pages;	/* max readahead pages */
-	pgoff_t prev_offset;
-	int	sequential;
+	unsigned long	max = ra->ra_pages;	/* max readahead pages */
+	pgoff_t prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
+
+	/*
+	 * start of file
+	 */
+	if (!offset) {
+		ra_set_pattern(ra, RA_PATTERN_INITIAL0);
+		goto initial_readahead;
+	}
 
 	/*
 	 * It's the expected callback offset, assume sequential access.
 	 * Ramp up sizes, and push forward the readahead window.
 	 */
-	if (offset && (offset == (ra->start + ra->size - ra->async_size) ||
-			offset == (ra->start + ra->size))) {
+	if ((offset == (ra->start + ra->size - ra->async_size) ||
+	     offset == (ra->start + ra->size))) {
+		ra_set_pattern(ra, RA_PATTERN_SUBSEQUENT);
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max);
 		ra->async_size = ra->size;
 		goto readit;
 	}
 
-	prev_offset = ra->prev_pos >> PAGE_CACHE_SHIFT;
-	sequential = offset - prev_offset <= 1UL || req_size > max;
-
-	/*
-	 * Standalone, small read.
-	 * Read as is, and do not pollute the readahead state.
-	 */
-	if (!hit_readahead_marker && !sequential) {
-		return __do_page_cache_readahead(mapping, filp,
-						offset, req_size, 0);
-	}
-
 	/*
 	 * Hit a marked page without valid readahead state.
 	 * E.g. interleaved reads.
@@ -389,6 +459,7 @@ ondemand_readahead(struct address_space 
 		if (!start || start - offset > max)
 			return 0;
 
+		ra_set_pattern(ra, RA_PATTERN_HIT_MARKER);
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
 		ra->size = get_next_ra_size(ra, max);
@@ -397,18 +468,59 @@ ondemand_readahead(struct address_space 
 	}
 
 	/*
-	 * It may be one of
-	 * 	- first read on start of file
-	 * 	- sequential cache miss
-	 * 	- oversize random read
-	 * Start readahead for it.
+	 * oversize read
+	 */
+	if (req_size > max) {
+		ra_set_pattern(ra, RA_PATTERN_OVERSIZE);
+		goto initial_readahead;
+	}
+
+	/*
+	 * sequential cache miss
+	 */
+	if (offset - prev_offset <= 1UL) {
+		ra_set_pattern(ra, RA_PATTERN_INITIAL);
+		goto initial_readahead;
+	}
+
+	/*
+	 * Query the page cache and look for the traces(cached history pages)
+	 * that a sequential stream would leave behind.
+	 */
+	if (mapping->backing_dev_info->context_readahead != 0 &&
+		try_context_readahead(mapping, ra, offset, req_size, max))
+		goto readit;
+
+	/*
+	 * Standalone, small read.
+	 * Read as is, and do not pollute the readahead state.
 	 */
+	ra_set_pattern(ra, RA_PATTERN_RANDOM);
+	ra->start = offset;
+	ra->size = req_size;
+	ra->async_size = 0;
+	goto readit;
+
+initial_readahead:
 	ra->start = offset;
 	ra->size = get_init_ra_size(req_size, max);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	if (req_size <= max / 2)
+		ra->async_size = ra->size - req_size;
+	else
+		ra->async_size = ra->size;
 
 readit:
-	return ra_submit(ra, mapping, filp);
+	/*
+	 * Will this read will hit the readahead marker made by itself?
+	 */
+	if (offset + min(req_size, max) >
+			ra->start + ra->size - ra->async_size) {
+		ra->async_size = get_next_ra_size(ra, max);
+		ra->size += ra->async_size;
+	}
+
+	return ra_submit(mapping, filp, offset, req_size, ra,
+			 hit_readahead_marker);
 }
 
 /**
--- mm.orig/block/blk-sysfs.c
+++ mm/block/blk-sysfs.c
@@ -75,6 +75,24 @@ queue_requests_store(struct request_queu
 	return ret;
 }
 
+static ssize_t queue_cxt_ra_show(struct request_queue *q, char *page)
+{
+	unsigned int cra = q->backing_dev_info.context_readahead;
+
+	return queue_var_show(cra, (page));
+}
+
+static ssize_t
+queue_cxt_ra_store(struct request_queue *q, const char *page, size_t count)
+{
+	unsigned long cra;
+	ssize_t ret = queue_var_store(&cra, page, count);
+
+	q->backing_dev_info.context_readahead = cra;
+
+	return ret;
+}
+
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
 	int ra_kb = q->backing_dev_info.ra_pages << (PAGE_CACHE_SHIFT - 10);
@@ -163,6 +181,12 @@ static struct queue_sysfs_entry queue_re
 	.store = queue_requests_store,
 };
 
+static struct queue_sysfs_entry queue_cxt_ra_entry = {
+	.attr = {.name = "context_readahead", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_cxt_ra_show,
+	.store = queue_cxt_ra_store,
+};
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -200,6 +224,7 @@ static struct queue_sysfs_entry queue_no
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
+	&queue_cxt_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,
--- mm.orig/include/linux/backing-dev.h
+++ mm/include/linux/backing-dev.h
@@ -41,6 +41,7 @@ enum bdi_stat_item {
 
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
+	int context_readahead;	/* enable page cache context readahead */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
--- mm.orig/lib/radix-tree.c
+++ mm/lib/radix-tree.c
@@ -665,6 +665,43 @@ unsigned long radix_tree_next_hole(struc
 }
 EXPORT_SYMBOL(radix_tree_next_hole);
 
+/**
+ *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
+ *	@root:		tree root
+ *	@index:		index key
+ *	@max_scan:	maximum range to search
+ *
+ *	Search backwards in the range [max(index-max_scan+1, 0), index]
+ *	for the first hole.
+ *
+ *	Returns: the index of the hole if found, otherwise returns an index
+ *	outside of the set specified (in which case 'index - return >= max_scan'
+ *	will be true). In rare cases of wrap-around, LONG_MAX will be returned.
+ *
+ *	radix_tree_next_hole may be called under rcu_read_lock. However, like
+ *	radix_tree_gang_lookup, this will not atomically search a snapshot of
+ *	the tree at a single point in time. For example, if a hole is created
+ *	at index 10, then subsequently a hole is created at index 5,
+ *	radix_tree_prev_hole covering both indexes may return 5 if called under
+ *	rcu_read_lock.
+ */
+unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
+				   unsigned long index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(root, index))
+			break;
+		index--;
+		if (index == LONG_MAX)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(radix_tree_prev_hole);
+
 static unsigned int
 __lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
--- mm.orig/include/linux/radix-tree.h
+++ mm/include/linux/radix-tree.h
@@ -167,6 +167,8 @@ radix_tree_gang_lookup_slot(struct radix
 			unsigned long first_index, unsigned int max_items);
 unsigned long radix_tree_next_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
--- mm.orig/include/linux/fs.h
+++ mm/include/linux/fs.h
@@ -789,11 +789,55 @@ struct file_ra_state {
 					   there are only # of pages ahead */
 
 	unsigned int ra_pages;		/* Maximum readahead window */
+	unsigned int flags;
 	int mmap_miss;			/* Cache miss stat for mmap accesses */
 	loff_t prev_pos;		/* Cache last read() position */
 };
 
 /*
+ * Which policy makes decision to do the current read-ahead IO?
+ */
+#define RA_PATTERN_SHIFT	0
+#define RA_PATTERN_MASK		0xf
+
+enum readahead_pattern {
+	RA_PATTERN_NONE,
+	RA_PATTERN_INITIAL0,	/* start of file */
+	RA_PATTERN_INITIAL,
+	RA_PATTERN_SUBSEQUENT,
+	RA_PATTERN_HIT_MARKER,
+	RA_PATTERN_HAS_TRAIL,
+	RA_PATTERN_OVERSIZE,
+	RA_PATTERN_REVERSE,
+	RA_PATTERN_STRIDE,
+	RA_PATTERN_THRASH,
+	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_FADVISE,
+	RA_PATTERN_RANDOM,
+	RA_PATTERN_ALL,		/* for collecting stats */
+	RA_PATTERN_MAX
+};
+
+static inline int ra_pattern(struct file_ra_state *ra)
+{
+	int pattern = (ra->flags >> RA_PATTERN_SHIFT) & RA_PATTERN_MASK;
+
+	if (pattern > RA_PATTERN_MAX)
+		pattern = RA_PATTERN_NONE;
+
+	return pattern;
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static inline void ra_set_pattern(struct file_ra_state *ra, int pattern)
+{
+	ra->flags = (ra->flags & ~RA_PATTERN_MASK) |
+			(pattern << RA_PATTERN_SHIFT);
+}
+
+/*
  * Check if @index falls in the readahead windows.
  */
 static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
--- /dev/null
+++ mm/mm/readahead_trace.c
@@ -0,0 +1,486 @@
+/*
+ * Readahead I/O tracing and accounting
+ *
+ * Copyright 2008 Wu Fengguang, Intel Corporation
+ * Subject to the GNU Public License, version 2 or later.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/marker.h>
+#include <linux/kprobes.h>
+#include <linux/pagemap.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <asm/atomic.h>
+#include <linux/stringify.h>
+#include <linux/version.h>
+
+int option_read_jprobes;
+int option_trace_enable;
+u32 option_trace_pid;
+u32 option_trace_ino;
+
+struct readahead_trace_probe {
+	const char *name;
+	const char *format;
+	marker_probe_func *probe_func;
+	int pattern;
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,
+	RA_ACCOUNT_EOF,
+	RA_ACCOUNT_SYNC,
+	/* number of pages */
+	RA_ACCOUNT_SIZE,
+	RA_ACCOUNT_ASIZE,
+	RA_ACCOUNT_ACTUAL,
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+static const char * const ra_pattern_names[] = {
+	[RA_PATTERN_NONE]		= "none",
+	[RA_PATTERN_INITIAL0]		= "initial0",
+	[RA_PATTERN_INITIAL]		= "initial",
+	[RA_PATTERN_SUBSEQUENT]		= "subsequent",
+	[RA_PATTERN_HIT_MARKER]		= "marker",
+	[RA_PATTERN_HAS_TRAIL]		= "trail",
+	[RA_PATTERN_OVERSIZE]		= "oversize",
+	[RA_PATTERN_REVERSE]		= "reverse",
+	[RA_PATTERN_STRIDE]		= "stride",
+	[RA_PATTERN_THRASH]		= "thrash",
+	[RA_PATTERN_MMAP_AROUND]	= "mmap",
+	[RA_PATTERN_FADVISE]		= "fadvise",
+	[RA_PATTERN_RANDOM]		= "random",
+	[RA_PATTERN_ALL]		= "all",
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+/*
+ * readahead tracing
+ */
+
+static void do_readahead_stats(struct address_space *mapping,
+			       struct file_ra_state *ra,
+			       int actual,
+			       int async,
+			       int pattern)
+{
+	pgoff_t lastpage = (i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT;
+
+	ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+	ra_stats[pattern][RA_ACCOUNT_SIZE] += ra->size;
+	ra_stats[pattern][RA_ACCOUNT_ASIZE] += ra->async_size;
+	ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+	if (ra->start + ra->size > lastpage)
+		ra_stats[pattern][RA_ACCOUNT_EOF]++;
+	if (!async)
+		ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+}
+
+static void do_readahead_trace(struct address_space *mapping,
+			       struct file *filp,
+			       pgoff_t offset,
+			       unsigned long req_size,
+			       struct file_ra_state *ra,
+			       int actual,
+			       int async)
+{
+	int pattern = ra_pattern(ra);
+
+	BUILD_BUG_ON(ARRAY_SIZE(ra_pattern_names) != RA_PATTERN_MAX);
+
+	do_readahead_stats(mapping, ra, actual, async, pattern);
+	do_readahead_stats(mapping, ra, actual, async, RA_PATTERN_ALL);
+
+	if (!option_trace_enable)
+		return;
+	if (option_trace_pid && option_trace_pid != current->pid)
+		return;
+	if (option_trace_ino && option_trace_ino != mapping->host->i_ino)
+		return;
+
+	printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
+		"ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
+		ra_pattern_names[pattern],
+		current->pid, current->comm,
+		MAJOR(mapping->host->i_sb->s_dev),
+		MINOR(mapping->host->i_sb->s_dev),
+		mapping->host->i_sb->s_id,
+		mapping->host->i_ino,
+		filp->f_path.dentry->d_name.name,
+		offset, req_size,
+		ra->start, ra->size, ra->async_size,
+		async,
+		actual);
+}
+
+
+static void readahead_trace(void *readahead_trace_probe, void *call_data,
+			    const char *format, va_list *args)
+{
+	struct address_space *mapping;
+	struct file *filp;
+	pgoff_t offset;
+	unsigned long req_size;
+	struct file_ra_state *ra;
+	int actual;
+	int async;
+
+	mapping  = va_arg(*args, typeof(mapping));
+	filp     = va_arg(*args, typeof(filp));
+	offset   = va_arg(*args, typeof(offset));
+	req_size = va_arg(*args, typeof(req_size));
+	ra       = va_arg(*args, typeof(ra));
+	actual   = va_arg(*args, typeof(actual));
+	async    = va_arg(*args, typeof(async));
+
+	do_readahead_trace(mapping, filp, offset, req_size, ra, actual, async);
+}
+
+static void plain_readahead_trace(void *probe_private, void *call_data,
+				  const char *format, va_list *args)
+{
+	struct readahead_trace_probe *probe = probe_private;
+	struct address_space *mapping;
+	struct file *filp;
+	struct file_ra_state ra = {0};
+	int actual;
+
+	mapping  = va_arg(*args, typeof(mapping));
+	filp     = va_arg(*args, typeof(filp));
+	ra.start = va_arg(*args, typeof(ra.start));
+	ra.size  = va_arg(*args, typeof(ra.size));
+	actual   = va_arg(*args, typeof(actual));
+
+	ra_set_pattern(&ra, probe->pattern);
+
+	do_readahead_trace(mapping, filp, 0, 0, &ra, actual, 0);
+}
+
+/*
+ * read tracing
+ */
+
+static void read_trace(const char func_name[],
+		       struct file *file,
+		       loff_t pos, size_t count)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	if (!option_trace_enable)
+		return;
+	if (option_trace_pid && option_trace_pid != current->pid)
+		return;
+	if (option_trace_ino && option_trace_ino != inode->i_ino)
+		return;
+
+	printk(KERN_DEBUG "%s(pid=%d(%s), dev=%02x:%02x(%s), ino=%lu(%s), "
+			"pos=%llu, count=%lu)\n",
+			func_name,
+			current->pid, current->comm,
+			MAJOR(inode->i_sb->s_dev),
+			MINOR(inode->i_sb->s_dev),
+			inode->i_sb->s_id,
+			inode->i_ino,
+			file->f_path.dentry->d_name.name,
+			pos, count);
+}
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
+static ssize_t jdo_generic_file_splice_read(struct file *in,
+					    loff_t *ppos,
+					    struct pipe_inode_info *pipe,
+					    size_t len,
+					    unsigned int flags)
+{
+	read_trace("generic_file_splice_read", in, *ppos, len);
+	jprobe_return();
+	return 0;
+}
+#endif
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,25)
+static void jdo_do_generic_file_read(struct file *filp,
+				     loff_t *ppos,
+				     read_descriptor_t *desc,
+				     read_actor_t actor)
+{
+	read_trace("do_generic_file_read", filp, *ppos, desc->count);
+	jprobe_return();
+}
+#else
+static void jdo_do_generic_mapping_read(struct address_space *mapping,
+					struct file_ra_state *_ra,
+					struct file *filp,
+					loff_t *ppos,
+					read_descriptor_t *desc,
+					read_actor_t actor)
+{
+	read_trace("do_generic_mapping_read", filp, *ppos, desc->count);
+	jprobe_return();
+}
+#endif
+
+/*
+ * jprobes
+ */
+
+#define jprobe_entry(func)			\
+{						\
+	.entry = jdo_##func,			\
+	.kp.symbol_name = __stringify(func)	\
+}
+
+static struct jprobe jprobe_array[] = {
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,25)
+	jprobe_entry(do_generic_file_read),
+#else
+	jprobe_entry(do_generic_mapping_read),
+#endif
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,24)
+	jprobe_entry(generic_file_splice_read),
+#endif
+};
+
+static void insert_jprobes(void)
+{
+	int err;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(jprobe_array); i++) {
+		err = register_jprobe(jprobe_array + i);
+		if (err)
+			printk(KERN_ERR "unable to register probe %s\n",
+					jprobe_array[i].kp.symbol_name);
+	}
+}
+
+static void remove_jprobes(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(jprobe_array); i++)
+		unregister_jprobe(jprobe_array + i);
+}
+
+/*
+ * marker probes
+ */
+
+static struct readahead_trace_probe probe_array[] =
+{
+	{
+		.name = "readahead_generic",
+		.format = "%p %p %lu %lu %p %d %d",
+		.probe_func = readahead_trace,
+	},
+	{
+		.name = "readahead_mmap",
+		.format = "%p %p %lu %lu %d",
+		.probe_func = plain_readahead_trace,
+		.pattern = RA_PATTERN_MMAP_AROUND,
+	},
+	{
+		.name = "readahead_fadvise",
+		.format = "%p %p %lu %lu %d",
+		.probe_func = plain_readahead_trace,
+		.pattern = RA_PATTERN_FADVISE,
+	},
+};
+
+static void insert_marker(void)
+{
+	int err;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(probe_array); i++) {
+		err = marker_probe_register(probe_array[i].name,
+					    probe_array[i].format,
+					    probe_array[i].probe_func,
+					    probe_array + i);
+		if (err)
+			printk(KERN_ERR "unable to register probe %s\n",
+					probe_array[i].name);
+	}
+}
+
+static void remove_marker(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(probe_array); i++)
+		marker_probe_unregister(probe_array[i].name,
+					probe_array[i].probe_func,
+					probe_array + i);
+
+	synchronize_sched();
+}
+
+/*
+ * file operations for readahead_stats
+ */
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	unsigned long i;
+	unsigned long count;
+
+	seq_printf(s, "%-10s %10s %10s %10s %10s %10s %10s\n",
+			"pattern",
+			"count", "sync_count", "eof_count",
+			"size", "async_size", "actual");
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		count = ra_stats[i][RA_ACCOUNT_COUNT];
+		if (count == 0)
+			count = 1;
+
+		seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu %10lu\n",
+				ra_pattern_names[i],
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_SIZE]   / count,
+				ra_stats[i][RA_ACCOUNT_ASIZE]  / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / count);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	memset(ra_stats, 0, sizeof(ra_stats));
+	return size;
+}
+
+static struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/*
+ * file operations for read_jprobes
+ */
+
+static int read_jprobes_set(void *data, u64 val)
+{
+	if (val && !option_read_jprobes)
+		insert_jprobes();
+	else if (!val && option_read_jprobes)
+		remove_jprobes();
+
+	*(int *)data = val;
+	WARN_ON(data != &option_read_jprobes);
+
+	return 0;
+}
+static int read_jprobes_get(void *data, u64 *val)
+{
+	*val = *(int *)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(read_jprobes_fops,
+			read_jprobes_get, read_jprobes_set, "%llu\n");
+
+/*
+ * debugfs entries: readahead/stats
+ */
+
+struct readahead_debug {
+	struct dentry *root;
+	struct dentry *stats;
+	struct dentry *trace_enable;
+	struct dentry *trace_pid;
+	struct dentry *read_jprobes;
+};
+static struct readahead_debug debug;
+
+static void remove_debugfs(void)
+{
+	debugfs_remove(debug.read_jprobes);
+	debugfs_remove(debug.trace_enable);
+	debugfs_remove(debug.trace_pid);
+	debugfs_remove(debug.stats);
+	debugfs_remove(debug.root);
+}
+
+static int create_debugfs(void)
+{
+	debug.root = debugfs_create_dir("readahead", NULL);
+	if (!debug.root)
+		goto error_debugfs;
+
+	debug.stats = debugfs_create_file("stats", 0644, debug.root,
+					  NULL, &readahead_stats_fops);
+	if (!debug.stats)
+		goto error_debugfs;
+
+	debug.trace_enable = debugfs_create_bool("trace_enable", 0644,
+						 debug.root,
+						 &option_trace_enable);
+	if (!debug.trace_enable)
+		goto error_debugfs;
+
+	debug.trace_pid = debugfs_create_u32("trace_pid", 0644, debug.root,
+					     &option_trace_pid);
+	if (!debug.trace_pid)
+		goto error_debugfs;
+
+	debug.read_jprobes = debugfs_create_file("read_jprobes", 0644,
+						 debug.root,
+						 &option_read_jprobes,
+						 &read_jprobes_fops);
+	if (!debug.read_jprobes)
+		goto error_debugfs;
+
+	return 0;
+
+error_debugfs:
+	printk(KERN_ERR "readahead: failed to create debugfs directory\n");
+	return -ENOMEM;
+}
+
+/*
+ * module init/exit
+ */
+
+static int __init readahead_probe_init(void)
+{
+	create_debugfs();
+	insert_marker();
+	return 0;
+}
+
+static void __exit readahead_probe_exit(void)
+{
+	remove_jprobes();
+	remove_marker();
+	remove_debugfs();
+}
+
+module_init(readahead_probe_init);
+module_exit(readahead_probe_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Wu Fengguang");
+MODULE_DESCRIPTION("readahead tracing and accounting");
--- mm.orig/mm/Kconfig
+++ mm/mm/Kconfig
@@ -208,3 +208,21 @@ config VIRT_TO_BUS
 
 config MMU_NOTIFIER
 	bool
+
+config READAHEAD_TRACE
+	tristate "Readahead tracing"
+	select MARKER
+	select KPROBES
+	select DEBUGFS
+	help
+	 This module injects code to show detailed readahead traces and do
+	 readahead events accounting.
+
+	 To actually get the data:
+
+	 # mount -t debugfs none /sys/kernel/debug
+
+	 After that you can do the following:
+
+	 # cat /sys/kernel/debug/readahead/stats     # check counters
+	 # echo > /sys/kernel/debug/readahead/stats  # reset counters
--- mm.orig/mm/Makefile
+++ mm/mm/Makefile
@@ -35,3 +35,4 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
 
+obj-$(CONFIG_READAHEAD_TRACE) += readahead_trace.o
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -982,7 +982,7 @@ static void shrink_readahead_size_eio(st
  * This is really ugly. But the goto's actually try to clarify some
  * of the logic when it comes to error handling etc.
  */
-static void do_generic_file_read(struct file *filp, loff_t *ppos,
+void do_generic_file_read(struct file *filp, loff_t *ppos,
 		read_descriptor_t *desc, read_actor_t actor)
 {
 	struct address_space *mapping = filp->f_mapping;

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2008-11-28  0:48                             ` Wu Fengguang
@ 2009-02-12 18:35                               ` Vladislav Bolkhovitin
  2009-02-13  1:57                                 ` Wu Fengguang
  2009-02-17 19:01                                 ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-02-12 18:35 UTC (permalink / raw)
  To: Wu Fengguang, Jens Axboe
  Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 8524 bytes --]

Wu Fengguang, on 11/28/2008 03:48 AM wrote:
>> Actually, there's one more thing which should have been mentioned. It
>> is possible that remote clients have several sequential read streams at
>> a time, together with some "noise" of random requests. A good read-ahead
>> subsystem should handle such a case by keeping big read-ahead windows for
>> the sequential streams and not doing any read ahead for the random
>> requests. And all at the same time.
>>
>> Currently, on such workloads read ahead will be completely disabled for
>> all the requests. Hence, there is a possibility here to improve
>> performance by 3-5 times or even more by making the workload more linear.
> 
> Are you sure? I'd expect such a mixed-sequential-random pattern to be
> handled by the current readahead code pretty well: sequential ones
> will get large readahead and random ones won't get readahead at all.

No, sorry, my data was outdated. I rechecked and it works quite well now.

> Attached is the context readahead patch plus a kernel module for
> readahead tracing and accounting, which will hopefully help clarify the
> read patterns and readahead behaviors on production workloads. It is
> based on 2.6.27 for your convenience, but also applies to 2.6.28.
> 
> The patch is not targeted for code review, but if anyone is interested,
> you can take a look at try_context_readahead(). This is the only newly
> introduced readahead policy; the rest is code refactoring
> and tracing facilities.
> 
> The newly introduced context readahead policy is disabled by default.
> To enable it:
>         echo 1 > /sys/block/sda/queue/context_readahead
> I'm not sure for now whether this parameter will be a long term one, or
> whether the context readahead policy should be enabled unconditionally.
> 
> The readahead accounting stats can be viewed by
>         mount -t debugfs none /sys/kernel/debug
>         cat /sys/kernel/debug/readahead/stats
> The numbers can be reset by
>         echo > /sys/kernel/debug/readahead/stats
> 
> Here is a sample output from my desktop:
> 
> % cat /sys/kernel/debug/readahead/stats
> pattern         count sync_count  eof_count       size async_size     actual
> none                0          0          0          0          0          0
> initial0         3009       3009       2033          5          4          2
> initial            35         35          0          5          4          3
> subsequent       1294        240        827         52         49         26
> marker            220          0        109         54         53         29
> trail               0          0          0          0          0          0
> oversize            0          0          0          0          0          0
> reverse             0          0          0          0          0          0
> stride              0          0          0          0          0          0
> thrash              0          0          0          0          0          0
> mmap             2833       2833       1379        142          0         47
> fadvise             7          7          7          0          0         40
> random           7621       7621         69          1          0          1
> all             15019      13745       4424         33          5         12
> 
> The readahead/read tracing messages are disabled by default.
> To enable them:
>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>         echo 1 > /sys/kernel/debug/readahead/read_jprobes
> They (especially the latter one) will generate a lot of printk messages like:
> 
> [  828.151013] readahead-initial0(pid=4644(zsh), dev=00:10(0:10), ino=351452(whoami), req=0+1, ra=0+4-3, async=0) = 4
> [  828.167853] readahead-mmap(pid=4644(whoami), dev=00:10(0:10), ino=351452(whoami), req=0+0, ra=0+60-0, async=0) = 3
> [  828.195652] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=115569(zsh_prompt), req=0+128, ra=0+120-60, async=0) = 3
> [  828.225081] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=342086(.zsh_history), req=0+128, ra=0+120-60, async=0) = 4
> 
> [  964.471450] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=0, count=128)
> [  964.471544] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=64, count=448)
> [  964.471575] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=512, count=28)
> [  964.472659] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=0, count=128)
> [  964.473431] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=64, count=336)
> [  964.475639] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383010(libc-2.7.so), pos=0, count=832)
> [  964.479037] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=0, count=524288)
> [  964.479166] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=2586, count=524288)
> 
> So please enable them only when necessary.
> 
> My recommendation for the double readahead in NFS clients and NFS servers
> is to keep the client side readahead size small and the server side one large.
> For example, 128K-512K/1M-2M (more for RAID). The NFS client side readahead size
> is not directly tunable, but setting rsize to a small value does the trick.
> Currently the NFS magic is readahead_size=N*rsize. The default numbers in my
> 2.6.28 kernel are rsize=512k, N=15, readahead_size=7680k. The latter is
> obviously way too large.

Sorry for such a huge delay. There were many other activities I had to 
do before + I had to be sure I didn't miss anything.

We didn't use NFS, we used SCST (http://scst.sourceforge.net) with the 
iSCSI-SCST target driver. It has an architecture similar to NFS, where N 
threads (N=5 in this case) handle IO from remote initiators (clients) 
coming from the wire using the iSCSI protocol. In addition, SCST has a patch 
called export_alloc_io_context (see 
http://lkml.org/lkml/2008/12/10/282), which allows the IO threads to 
queue IO using a single IO context, so we can see if context RA can 
replace grouping IO threads in a single IO context.
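
For reference, here is a minimal sketch of what grouping kernel I/O threads
into one IO context looks like. It is not the actual SCST code: the helper
names are made up, while alloc_io_context() and ioc_task_link() are the
2.6.27-era kernel primitives, and it assumes alloc_io_context() is exported
to modules as the export_alloc_io_context patch does:

#include <linux/blkdev.h>
#include <linux/iocontext.h>
#include <linux/sched.h>

static struct io_context *shared_ioc;	/* one context for all I/O threads */

static int shared_ioc_init(void)
{
	shared_ioc = alloc_io_context(GFP_KERNEL, -1);
	return shared_ioc ? 0 : -ENOMEM;
}

/* run at the start of each I/O kernel thread, which has no io_context yet */
static void shared_ioc_attach(void)
{
	ioc_task_link(shared_ioc);		/* take a reference */
	current->io_context = shared_ioc;	/* CFQ now sees one request stream */
}

With every thread attached to the same io_context, CFQ treats their requests
as a single stream rather than as several competing ones.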

Unfortunately, the results are negative. We found neither any advantage 
of context RA over the current RA implementation, nor any possibility for 
context RA to replace grouping IO threads in a single IO context.

The setup on the target (server) was the following: 2 SATA drives grouped in 
an md RAID-0 with an average local read throughput of ~120MB/s ("dd if=/dev/zero 
of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB) 
copied, 177,742 s, 118 MB/s"). The md device was partitioned into 3 
partitions. The first partition was 10% of the space at the beginning of the 
device, the last partition was 10% of the space at the end of the device, 
and the middle one was the rest of the space in between. Then the first and 
the last partitions were exported to the initiator (client). 
They were /dev/sdb and /dev/sdc on it, respectively.

Then 4 2.6.27.12 kernels were built:

1. With all SCST patches

2. With all SCST patches, except export_alloc_io_context

3. With all SCST patches + context RA patch

4. With all SCST patches, except export_alloc_io_context, but with 
context RA patch.

Memory on both initiator and target was limited to 512MB. Link was 1GbE.

Then for those kernels the following tests were run:

1. dd if=/dev/sdb of=/dev/null bs=64K count=80000

2. dd if=/dev/sdc of=/dev/null bs=64K count=80000

3. while true; do dd if=/dev/sdc of=/dev/null bs=64K; done and, 
simultaneously, dd if=/dev/sdb of=/dev/null bs=64K count=80000. Results 
from the latter dd were recorded. This test allowed us to see how well 
simultaneous reads are handled.

You can find the results in the attachment.

You can see that context RA doesn't improve anything, while grouping IO 
in a single IO context provides almost 100% improvement in throughput.

An additional interesting observation is how badly simultaneous read IO 
streams are handled if they aren't grouped in the corresponding IO 
contexts. In test 3 the result was as low as 4(!) MB/s. Wu, Jens, do you 
have any explanation for this? Why do the inner tracks get such a big preference?

Another thing looks suspicious to me. If simultaneous read IO streams 
are sent and they are grouped in the corresponding IO contexts, dd from 
sdb gets only 20MB/s throughput. Considering that a single stream from it 
gets about 100MB/s, shouldn't that value be at least 30-35MB/s? Is it the 
same issue as above, but with a smaller impact?

Thanks,
Vlad

[-- Attachment #2: cfq-scheduler.txt --]
[-- Type: text/plain, Size: 14159 bytes --]

kernel: 2.6.27.12-all_patches - all scst patches


#cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
#cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

#free
             total       used       free     shared    buffers     cached
Mem:        508180     502448       5732          0     430704      27740
-/+ buffers/cache:      44004     464176
Swap:            0          0          0


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 102 MB/s
	b) 102 MB/s
	c) 102 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 74,3 MB/s
	b) 74,5 MB/s
	c) 74,4 MB/s

Run at the same time:		
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 21,6 MB/s
	b) 22,8 MB/s
	c) 24,1 MB/s
	d) 23,1 MB/s
	
------------------------------------------------------------------------
kernel: 2.6.27.12-except_export_alloc - all scst patches, except export_alloc_io_context.patch

# cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 48,6 MB/s
	b) 49,2 MB/s
	c) 48,9 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 48,3 MB/s
	b) 48,5 MB/s
	c) 47,9 MB/s

Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 4,2 MB/s
	b) 3,9 MB/s
	c) 4,1 MB/s


	
---------------------------------------------------------------
kernel: uname -r 2.6.27.12-readahead - all scst patches + readahead-context.patch
default /dev/md0 readahead = 512
linux-4dtq:~ #
linux-4dtq:~ #
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # blockdev --setra 1024 /dev/sda
linux-4dtq:~ # blockdev --setra 1024 /dev/sdb
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --getra /dev/md0
256
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
  3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # blockdev --getra /dev/md0
512
linux-4dtq:~ # free
             total       used       free     shared    buffers     cached
Mem:        508168     104480     403688          0       4516      64632
-/+ buffers/cache:      35332     472836
Swap:            0          0          0


	
scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048


#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 96,9 MB/s
	b) 101 MB/s
	c) 101 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 73,9 MB/s
	b) 74,1 MB/s
	c) 73,3 MB/s

Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 22,1 MB/s
	b) 21,6 MB/s
	c) 21,1 MB/s

	
	
	
-----------------------------------------------------------------------------------
kernel: uname -r 2.6.27.12-readahead - all scst patches + readahead-context.patch
set: /dev/md0 readahead = 1024

linux-4dtq:~ # blockdev --setra 1024 /dev/sdb
linux-4dtq:~ # blockdev --setra 1024 /dev/sda
linux-4dtq:~ # blockdev --setra 1024 /dev/md0
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
  3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # blockdev --getra /dev/md0
1024
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a)  102 MB/s
	b)  100 MB/s
	c)  104 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a)  74,1 MB/s
	b)  73,7 MB/s
	c)  74,0 MB/s

Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a)  22,0 MB/s
	b)  21,5 MB/s
	c)  22,9 MB/s

------------------------------------------------------------------------
kernel: uname -r 2.6.27.12-readahead - all scst patches + readahead-context.patch
set: /dev/md0 readahead = 2048

linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # blockdev --setra 1024 /dev/sda
linux-4dtq:~ # blockdev --setra 1024 /dev/sdb
linux-4dtq:~ # blockdev --setra 2048 /dev/md0
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
  3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --getra /dev/md0
2048


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 103 MB/s
	b) 102 MB/s
	c) 102 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 73,7 MB/s
	b) 74,1 MB/s
	c) 74,3 MB/s

Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 20,1 MB/s
	b) 21,5 MB/s
	c) 21,8 MB/s

---------------------------------------------------
Kernel uname -r: 2.6.27.12-readahead
readahead set to: 4M

linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
linux-4dtq:~ # blockdev --getra /dev/sdb
4096
linux-4dtq:~ # blockdev --getra /dev/sda
4096
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
  3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 99,4 MB/s
	b) 96,4 MB/s
	c) 101 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 73,5 MB/s
	b) 74,3 MB/s
	c) 73,9 MB/s

Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 21,0 MB/s
	b) 19,8 MB/s
	c) 22,4 MB/s

------------------------------------------------------------------------
kernel: uname -r 2.6.27.12-except_export+readahead - all scst patches except export_alloc_io_context.patch + readahead-context.patch
default /dev/md0 readahead = 512

linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # blockdev --getra /dev/sda
1024
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # blockdev --getra /dev/sdb
1024
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # blockdev --getra /dev/md0
512
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:/.kernels/scst-kernel-4/scstadmin # cat /sys/block/sda/queue/context_readahead
1


scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 47,7 MB/s
	b) 47,6 MB/s
	c) 47,6 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 47,6 MB/s
	b) 47,2 MB/s
	c) 46,8 MB/s

Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 3,5 MB/s
	b) 3,1 MB/s
	c) 3,5 MB/s

-----------------------------------------	
kernel: uname -r: 2.6.27.12-except_export+readahead
readahead set to: 4M

linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1
linux-4dtq:~ # blockdev --getra /dev/sda
4096
linux-4dtq:~ # blockdev --getra /dev/sdb
4096
linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
  3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ #

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst: Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 49,1 MB/s
	b) 49,0 MB/s
	c) 48,7 MB/s
#dd if=/dev/sdc of=/dev/null bs=64K count=80000
	a) 47,9 MB/s
	b) 46,6 MB/s
	c) 47,2 MB/s
Run at the same time:	
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 3,6 MB/s
	b) 3,4 MB/s
	c) 3,6 MB/s


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-12 18:35                               ` Vladislav Bolkhovitin
@ 2009-02-13  1:57                                 ` Wu Fengguang
  2009-02-13 20:08                                   ` Vladislav Bolkhovitin
  2009-02-17 19:01                                   ` Vladislav Bolkhovitin
  2009-02-17 19:01                                 ` Vladislav Bolkhovitin
  1 sibling, 2 replies; 70+ messages in thread
From: Wu Fengguang @ 2009-02-13  1:57 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
> Sorry for such a huge delay. There were many other activities I had to  
> do before + I had to be sure I didn't miss anything.
>
> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with  
> iSCSI-SCST target driver. It has similar to NFS architecture, where N  
> threads (N=5 in this case) handle IO from remote initiators (clients)  
> coming from wire using iSCSI protocol. In addition, SCST has patch  
> called export_alloc_io_context (see  
> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads  
> queue IO using single IO context, so we can see if context RA can  
> replace grouping IO threads in single IO context.
>
> Unfortunately, the results are negative. We find neither any advantages  
> of context RA over current RA implementation, nor possibility for  
> context RA to replace grouping IO threads in single IO context.
>
> Setup on the target (server) was the following. 2 SATA drives grouped in  
> md RAID-0 with average local read throughput ~120MB/s ("dd if=/dev/zero  
> of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB)  
> copied, 177,742 s, 118 MB/s"). The md device was partitioned on 3  
> partitions. The first partition was 10% of space in the beginning of the  
> device, the last partition was 10% of space in the end of the device,  
> the middle one was the rest in the middle of the space them. Then the  
> first and the last partitions were exported to the initiator (client).  
> They were /dev/sdb and /dev/sdc on it correspondingly.

Vladislav, Thank you for the benchmarks! I'm very interested in
optimizing your workload and figuring out what happens underneath.

Are the client and server two standalone boxes connected by GBE?

When you set readahead sizes in the benchmarks, you are setting them
in the server side? I.e. "linux-4dtq" is the SCST server? What's the
client side readahead size?

It would help a lot to debug readahead if you can provide the
server side readahead stats and trace log for the worst case.
This will automatically answer the above questions as well as disclose
the micro-behavior of readahead:

        mount -t debugfs none /sys/kernel/debug

        echo > /sys/kernel/debug/readahead/stats # reset counters
        # do benchmark
        cat /sys/kernel/debug/readahead/stats

        echo 1 > /sys/kernel/debug/readahead/trace_enable
        # do micro-benchmark, i.e. run the same benchmark for a short time
        echo 0 > /sys/kernel/debug/readahead/trace_enable
        dmesg

The above readahead trace should help find out how the client side
sequential reads convert into server side random reads, and how we can
prevent that.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-13  1:57                                 ` Wu Fengguang
@ 2009-02-13 20:08                                   ` Vladislav Bolkhovitin
  2009-02-16  2:34                                     ` Wu Fengguang
  2009-02-17 19:01                                   ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-02-13 20:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

Wu Fengguang, on 02/13/2009 04:57 AM wrote:
> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>> Sorry for such a huge delay. There were many other activities I had to  
>> do before + I had to be sure I didn't miss anything.
>>
>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with  
>> iSCSI-SCST target driver. It has similar to NFS architecture, where N  
>> threads (N=5 in this case) handle IO from remote initiators (clients)  
>> coming from wire using iSCSI protocol. In addition, SCST has patch  
>> called export_alloc_io_context (see  
>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads  
>> queue IO using single IO context, so we can see if context RA can  
>> replace grouping IO threads in single IO context.
>>
>> Unfortunately, the results are negative. We find neither any advantages  
>> of context RA over current RA implementation, nor possibility for  
>> context RA to replace grouping IO threads in single IO context.
>>
>> Setup on the target (server) was the following. 2 SATA drives grouped in  
>> md RAID-0 with average local read throughput ~120MB/s ("dd if=/dev/zero  
>> of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB)  
>> copied, 177,742 s, 118 MB/s"). The md device was partitioned on 3  
>> partitions. The first partition was 10% of space in the beginning of the  
>> device, the last partition was 10% of space in the end of the device,  
>> the middle one was the rest in the middle of the space them. Then the  
>> first and the last partitions were exported to the initiator (client).  
>> They were /dev/sdb and /dev/sdc on it correspondingly.
> 
> Vladislav, Thank you for the benchmarks! I'm very interested in
> optimizing your workload and figuring out what happens underneath.
> 
> Are the client and server two standalone boxes connected by GBE?

Yes, they are directly connected using GbE.

> When you set readahead sizes in the benchmarks, you are setting them
> in the server side? I.e. "linux-4dtq" is the SCST server?

Yes, it's the server. On the client all the parameters were left default.

> What's the
> client side readahead size?

Default, i.e. 128K

> It would help a lot to debug readahead if you can provide the
> server side readahead stats and trace log for the worst case.
> This will automatically answer the above questions as well as disclose
> the micro-behavior of readahead:
> 
>         mount -t debugfs none /sys/kernel/debug
> 
>         echo > /sys/kernel/debug/readahead/stats # reset counters
>         # do benchmark
>         cat /sys/kernel/debug/readahead/stats
> 
>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>         # do micro-benchmark, i.e. run the same benchmark for a short time
>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>         dmesg
> 
> The above readahead trace should help find out how the client side
> sequential reads convert into server side random reads, and how we can
> prevent that.

We will do it as soon as we have a free window on that system.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-13 20:08                                   ` Vladislav Bolkhovitin
@ 2009-02-16  2:34                                     ` Wu Fengguang
  2009-02-17 19:03                                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2009-02-16  2:34 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

On Fri, Feb 13, 2009 at 11:08:25PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>> Sorry for such a huge delay. There were many other activities I had 
>>> to  do before + I had to be sure I didn't miss anything.
>>>
>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with   
>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N 
>>>  threads (N=5 in this case) handle IO from remote initiators 
>>> (clients)  coming from wire using iSCSI protocol. In addition, SCST 
>>> has patch  called export_alloc_io_context (see   
>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads 
>>>  queue IO using single IO context, so we can see if context RA can   
>>> replace grouping IO threads in single IO context.
>>>
>>> Unfortunately, the results are negative. We find neither any 
>>> advantages  of context RA over current RA implementation, nor 
>>> possibility for  context RA to replace grouping IO threads in single 
>>> IO context.
>>>
>>> Setup on the target (server) was the following. 2 SATA drives grouped 
>>> in  md RAID-0 with average local read throughput ~120MB/s ("dd 
>>> if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs "20971520000 
>>> bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md device was 
>>> partitioned on 3  partitions. The first partition was 10% of space in 
>>> the beginning of the  device, the last partition was 10% of space in 
>>> the end of the device,  the middle one was the rest in the middle of 
>>> the space them. Then the  first and the last partitions were exported 
>>> to the initiator (client).  They were /dev/sdb and /dev/sdc on it 
>>> correspondingly.
>>
>> Vladislav, Thank you for the benchmarks! I'm very interested in
>> optimizing your workload and figuring out what happens underneath.
>>
>> Are the client and server two standalone boxes connected by GBE?
>
> Yes, they directly connected using GbE.
>
>> When you set readahead sizes in the benchmarks, you are setting them
>> in the server side? I.e. "linux-4dtq" is the SCST server?
>
> Yes, it's the server. On the client all the parameters were left default.
>
>> What's the
>> client side readahead size?
>
> Default, i.e. 128K
>
>> It would help a lot to debug readahead if you can provide the
>> server side readahead stats and trace log for the worst case.
>> This will automatically answer the above questions as well as disclose
>> the micro-behavior of readahead:
>>
>>         mount -t debugfs none /sys/kernel/debug
>>
>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>         # do benchmark
>>         cat /sys/kernel/debug/readahead/stats
>>
>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>         dmesg
>>
>> The above readahead trace should help find out how the client side
>> sequential reads convert into server side random reads, and how we can
>> prevent that.
>
> We will do it as soon as we have a free window on that system.

Thank you. For NFS, the client side read/readahead requests will be
split into units of rsize which will be served by a pool of nfsd
concurrently and possibly out of order. Does SCST have the same
process? If so, what's the rsize value for your SCST benchmarks?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-12 18:35                               ` Vladislav Bolkhovitin
  2009-02-13  1:57                                 ` Wu Fengguang
@ 2009-02-17 19:01                                 ` Vladislav Bolkhovitin
  2009-02-19  1:38                                   ` Wu Fengguang
  1 sibling, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-02-17 19:01 UTC (permalink / raw)
  To: Wu Fengguang, Jens Axboe
  Cc: Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

Vladislav Bolkhovitin, on 02/12/2009 09:35 PM wrote:
> Additional interesting observation is how badly simultaneous read IO 
> streams are handled, if they aren't grouped in the corresponding IO 
> contexts. In test 3 the result was as low as 4(!)MB/s. Wu, Jens, do you 
> have any explanation on this? Why the inner tracks have so big preference?

I realized there is another explanation: access becomes almost
completely random. I checked, and it is true. Here is a sample
"iostat -x 3" output on the server:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.57   26.62    0.00   72.81

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda             710.00     0.00   47.33    0.00  6058.67     0.00   128.00     0.12    2.59   2.51  11.87
sdb             710.00     0.00   47.33    0.00  6058.67     0.00   128.00     3.99   84.34  21.13 100.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00 1514.67    0.00 12117.33     0.00     8.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  757.33    0.00  6058.67     0.00     8.00    32.87   43.41   1.32 100.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  757.33    0.00  6058.67     0.00     8.00    32.79   43.33   1.32 100.00

Thanks,
Vlad


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-13  1:57                                 ` Wu Fengguang
  2009-02-13 20:08                                   ` Vladislav Bolkhovitin
@ 2009-02-17 19:01                                   ` Vladislav Bolkhovitin
  2009-02-19  2:05                                     ` Wu Fengguang
  1 sibling, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-02-17 19:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 2983 bytes --]

Wu Fengguang, on 02/13/2009 04:57 AM wrote:
> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>> Sorry for such a huge delay. There were many other activities I had to  
>> do before + I had to be sure I didn't miss anything.
>>
>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with  
>> iSCSI-SCST target driver. It has similar to NFS architecture, where N  
>> threads (N=5 in this case) handle IO from remote initiators (clients)  
>> coming from wire using iSCSI protocol. In addition, SCST has patch  
>> called export_alloc_io_context (see  
>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads  
>> queue IO using single IO context, so we can see if context RA can  
>> replace grouping IO threads in single IO context.
>>
>> Unfortunately, the results are negative. We find neither any advantages  
>> of context RA over current RA implementation, nor possibility for  
>> context RA to replace grouping IO threads in single IO context.
>>
>> Setup on the target (server) was the following. 2 SATA drives grouped in  
>> md RAID-0 with average local read throughput ~120MB/s ("dd if=/dev/zero  
>> of=/dev/md0 bs=1M count=20000" outputs "20971520000 bytes (21 GB)  
>> copied, 177,742 s, 118 MB/s"). The md device was partitioned on 3  
>> partitions. The first partition was 10% of space in the beginning of the  
>> device, the last partition was 10% of space in the end of the device,  
>> the middle one was the rest in the middle of the space them. Then the  
>> first and the last partitions were exported to the initiator (client).  
>> They were /dev/sdb and /dev/sdc on it correspondingly.
> 
> Vladislav, Thank you for the benchmarks! I'm very interested in
> optimizing your workload and figuring out what happens underneath.
> 
> Are the client and server two standalone boxes connected by GBE?
> 
> When you set readahead sizes in the benchmarks, you are setting them
> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
> client side readahead size?
> 
> It would help a lot to debug readahead if you can provide the
> server side readahead stats and trace log for the worst case.
> This will automatically answer the above questions as well as disclose
> the micro-behavior of readahead:
> 
>         mount -t debugfs none /sys/kernel/debug
> 
>         echo > /sys/kernel/debug/readahead/stats # reset counters
>         # do benchmark
>         cat /sys/kernel/debug/readahead/stats
> 
>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>         # do micro-benchmark, i.e. run the same benchmark for a short time
>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>         dmesg
> 
> The above readahead trace should help find out how the client side
> sequential reads convert into server side random reads, and how we can
> prevent that.

See attached. Could you comment on the logs, please, so that I will also
be able to read them in the future?

Thank you,
Vlad


[-- Attachment #2: RA-debug.zip --]
[-- Type: application/zip, Size: 18593 bytes --]

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-16  2:34                                     ` Wu Fengguang
@ 2009-02-17 19:03                                       ` Vladislav Bolkhovitin
  2009-02-18 18:14                                         ` Vladislav Bolkhovitin
  2009-02-19  1:35                                         ` Wu Fengguang
  0 siblings, 2 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-02-17 19:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

Wu Fengguang, on 02/16/2009 05:34 AM wrote:
> On Fri, Feb 13, 2009 at 11:08:25PM +0300, Vladislav Bolkhovitin wrote:
>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>> Sorry for such a huge delay. There were many other activities I had 
>>>> to  do before + I had to be sure I didn't miss anything.
>>>>
>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with   
>>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N 
>>>>  threads (N=5 in this case) handle IO from remote initiators 
>>>> (clients)  coming from wire using iSCSI protocol. In addition, SCST 
>>>> has patch  called export_alloc_io_context (see   
>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads 
>>>>  queue IO using single IO context, so we can see if context RA can   
>>>> replace grouping IO threads in single IO context.
>>>>
>>>> Unfortunately, the results are negative. We find neither any 
>>>> advantages  of context RA over current RA implementation, nor 
>>>> possibility for  context RA to replace grouping IO threads in single 
>>>> IO context.
>>>>
>>>> Setup on the target (server) was the following. 2 SATA drives grouped 
>>>> in  md RAID-0 with average local read throughput ~120MB/s ("dd 
>>>> if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs "20971520000 
>>>> bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md device was 
>>>> partitioned on 3  partitions. The first partition was 10% of space in 
>>>> the beginning of the  device, the last partition was 10% of space in 
>>>> the end of the device,  the middle one was the rest in the middle of 
>>>> the space them. Then the  first and the last partitions were exported 
>>>> to the initiator (client).  They were /dev/sdb and /dev/sdc on it 
>>>> correspondingly.
>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>> optimizing your workload and figuring out what happens underneath.
>>>
>>> Are the client and server two standalone boxes connected by GBE?
>> Yes, they directly connected using GbE.
>>
>>> When you set readahead sizes in the benchmarks, you are setting them
>>> in the server side? I.e. "linux-4dtq" is the SCST server?
>> Yes, it's the server. On the client all the parameters were left default.
>>
>>> What's the
>>> client side readahead size?
>> Default, i.e. 128K
>>
>>> It would help a lot to debug readahead if you can provide the
>>> server side readahead stats and trace log for the worst case.
>>> This will automatically answer the above questions as well as disclose
>>> the micro-behavior of readahead:
>>>
>>>         mount -t debugfs none /sys/kernel/debug
>>>
>>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>>         # do benchmark
>>>         cat /sys/kernel/debug/readahead/stats
>>>
>>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>>         dmesg
>>>
>>> The above readahead trace should help find out how the client side
>>> sequential reads convert into server side random reads, and how we can
>>> prevent that.
>> We will do it as soon as we have a free window on that system.
> 
> Thank you. For NFS, the client side read/readahead requests will be
> split into units of rsize which will be served by a pool of nfsd
> concurrently and possibly out of order. Does SCST have the same
> process? If so, what's the rsize value for your SCST benchmarks?

No, there is no such splitting in SCST. The client sees raw SCSI disks
from the server, and whatever the client sends is passed directly and in
full size by the server to its backstorage using a regular buffered
read() (fd->f_op->aio_read() followed by
wait_on_retry_sync_kiocb()/wait_on_sync_kiocb(), to be precise).
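
(For readers not familiar with that path, here is a rough sketch of the
pattern, modeled on the do_sync_read()-style code of 2.6.27-era kernels;
it is an illustration only, not the actual SCST code.)

#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/uio.h>
#include <linux/errno.h>

/*
 * Rough sketch of the read path described above: submit a buffered read
 * through f_op->aio_read() and wait for it synchronously.
 */
static ssize_t sync_buffered_read(struct file *filp, char __user *buf,
				  size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = *ppos;
	kiocb.ki_left = len;

	for (;;) {
		ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
		if (ret != -EIOCBRETRY)
			break;
		/* Retry condition hit: wait until the request can be resubmitted. */
		wait_on_retry_sync_kiocb(&kiocb);
	}

	if (ret == -EIOCBQUEUED)
		ret = wait_on_sync_kiocb(&kiocb);	/* block until completion */

	*ppos = kiocb.ki_pos;
	return ret;
}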

Thanks,
Vlad


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-17 19:03                                       ` Vladislav Bolkhovitin
@ 2009-02-18 18:14                                         ` Vladislav Bolkhovitin
  2009-02-19  1:35                                         ` Wu Fengguang
  1 sibling, 0 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-02-18 18:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 4161 bytes --]

Vladislav Bolkhovitin, on 02/17/2009 10:03 PM wrote:
> Wu Fengguang, on 02/16/2009 05:34 AM wrote:
>> On Fri, Feb 13, 2009 at 11:08:25PM +0300, Vladislav Bolkhovitin wrote:
>>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>>> Sorry for such a huge delay. There were many other activities I had 
>>>>> to  do before + I had to be sure I didn't miss anything.
>>>>>
>>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with   
>>>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N 
>>>>>  threads (N=5 in this case) handle IO from remote initiators 
>>>>> (clients)  coming from wire using iSCSI protocol. In addition, SCST 
>>>>> has patch  called export_alloc_io_context (see   
>>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads 
>>>>>  queue IO using single IO context, so we can see if context RA can   
>>>>> replace grouping IO threads in single IO context.
>>>>>
>>>>> Unfortunately, the results are negative. We find neither any 
>>>>> advantages  of context RA over current RA implementation, nor 
>>>>> possibility for  context RA to replace grouping IO threads in single 
>>>>> IO context.
>>>>>
>>>>> Setup on the target (server) was the following. 2 SATA drives grouped 
>>>>> in  md RAID-0 with average local read throughput ~120MB/s ("dd 
>>>>> if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs "20971520000 
>>>>> bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md device was 
>>>>> partitioned on 3  partitions. The first partition was 10% of space in 
>>>>> the beginning of the  device, the last partition was 10% of space in 
>>>>> the end of the device,  the middle one was the rest in the middle of 
>>>>> the space them. Then the  first and the last partitions were exported 
>>>>> to the initiator (client).  They were /dev/sdb and /dev/sdc on it 
>>>>> correspondingly.
>>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>>> optimizing your workload and figuring out what happens underneath.
>>>>
>>>> Are the client and server two standalone boxes connected by GBE?
>>> Yes, they directly connected using GbE.
>>>
>>>> When you set readahead sizes in the benchmarks, you are setting them
>>>> in the server side? I.e. "linux-4dtq" is the SCST server?
>>> Yes, it's the server. On the client all the parameters were left default.
>>>
>>>> What's the
>>>> client side readahead size?
>>> Default, i.e. 128K
>>>
>>>> It would help a lot to debug readahead if you can provide the
>>>> server side readahead stats and trace log for the worst case.
>>>> This will automatically answer the above questions as well as disclose
>>>> the micro-behavior of readahead:
>>>>
>>>>         mount -t debugfs none /sys/kernel/debug
>>>>
>>>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>>>         # do benchmark
>>>>         cat /sys/kernel/debug/readahead/stats
>>>>
>>>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>>>         dmesg
>>>>
>>>> The above readahead trace should help find out how the client side
>>>> sequential reads convert into server side random reads, and how we can
>>>> prevent that.
>>> We will do it as soon as we have a free window on that system.
>> Thank you. For NFS, the client side read/readahead requests will be
>> split into units of rsize which will be served by a pool of nfsd
>> concurrently and possibly out of order. Does SCST have the same
>> process? If so, what's the rsize value for your SCST benchmarks?
> 
> No, there is no such splitting in SCST. Client sees raw SCSI disks from 
> server and what client sends is directly and in full size sent by the 
> server to its backstorage using regular buffered read() 
> (fd->f_op->aio_read() followed by 
> wait_on_retry_sync_kiocb()/wait_on_sync_kiocb() to be precise).

Update. We ran the same tests with deadline I/O scheduler and had 
roughly the same results as with CFQ, see attachment.

> Thanks,
> Vlad
> 
> 


[-- Attachment #2: 2.6.27.12-except_export+readahead-4M-deadline.txt --]
[-- Type: text/plain, Size: 2395 bytes --]

linux-4dtq:~ # uname -r
2.6.27.12-except_export+readahead
-scheduler = deadline
- RA = 4M


linux-4dtq:~ # free
             total       used       free     shared    buffers     cached
Mem:        508168     111288     396880          0       4476      62648
-/+ buffers/cache:      44164     464004
Swap:            0          0          0


linux-4dtq:~ # echo deadline > /sys/block/sdb/queue/scheduler
linux-4dtq:~ # echo deadline > /sys/block/sda/queue/scheduler
linux-4dtq:~ # cat /sys/block/sdb/queue/scheduler
noop anticipatory [deadline] cfq
linux-4dtq:~ # cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq

linux-4dtq:~ # echo 1 > /sys/block/sda/queue/context_readahead
linux-4dtq:~ # echo 1 > /sys/block/sdb/queue/context_readahead
linux-4dtq:~ # cat /sys/block/sdb/queue/context_readahead
1
linux-4dtq:~ # cat /sys/block/sda/queue/context_readahead
1

linux-4dtq:~ # blockdev --setra 4096 /dev/sda
linux-4dtq:~ # blockdev --setra 4096 /dev/sdb
linux-4dtq:~ # blockdev --getra /dev/sdb
4096
linux-4dtq:~ # blockdev --getra /dev/sda
4096

linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]
mdadm: /dev/md/0 has been started with 2 drives.
linux-4dtq:~ # vgchange -a y
  3 logical volume(s) in volume group "raid" now active
linux-4dtq:~ # lvs
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  1st  raid -wi-a-  46.00G
  2nd  raid -wi-a- 374.00G
  3rd  raid -wi-a-  46.00G

scst: Using security group "Default" for initiator "iqn.1996-04.de.suse:01:aadab8bc4be5"
iscsi-scst:		Negotiated parameters: InitialR2T No, ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 262144, MaxXmitDataSegmentLength 131072,
iscsi-scst:     MaxBurstLength 1048576, FirstBurstLength 262144, DefaultTime2Wait 2, DefaultTime2Retain 0,
iscsi-scst:     MaxOutstandingR2T 1, DataPDUInOrder Yes, DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
iscsi-scst:     HeaderDigest None, DataDigest None, OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048

1) dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 54,1 MB/s
b) 55,6 MB/s
c) 54,3 MB/s

2) dd if=/dev/sdc of=/dev/null bs=64K count=80000
a) 71,3 MB/s
b) 73,8 MB/s
c) 72,7 MB/s

3)Run at the same time:	
while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,3 MB/s
b) 5.0 MB/s 
c) 5.2 MB/s

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-17 19:03                                       ` Vladislav Bolkhovitin
  2009-02-18 18:14                                         ` Vladislav Bolkhovitin
@ 2009-02-19  1:35                                         ` Wu Fengguang
  1 sibling, 0 replies; 70+ messages in thread
From: Wu Fengguang @ 2009-02-19  1:35 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

On Tue, Feb 17, 2009 at 10:03:23PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/16/2009 05:34 AM wrote:
>> On Fri, Feb 13, 2009 at 11:08:25PM +0300, Vladislav Bolkhovitin wrote:
>>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>>> Sorry for such a huge delay. There were many other activities I 
>>>>> had to  do before + I had to be sure I didn't miss anything.
>>>>>
>>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) 
>>>>> with   iSCSI-SCST target driver. It has similar to NFS 
>>>>> architecture, where N  threads (N=5 in this case) handle IO from 
>>>>> remote initiators (clients)  coming from wire using iSCSI 
>>>>> protocol. In addition, SCST has patch  called 
>>>>> export_alloc_io_context (see    
>>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO 
>>>>> threads  queue IO using single IO context, so we can see if 
>>>>> context RA can   replace grouping IO threads in single IO 
>>>>> context.
>>>>>
>>>>> Unfortunately, the results are negative. We find neither any  
>>>>> advantages  of context RA over current RA implementation, nor  
>>>>> possibility for  context RA to replace grouping IO threads in 
>>>>> single IO context.
>>>>>
>>>>> Setup on the target (server) was the following. 2 SATA drives 
>>>>> grouped in  md RAID-0 with average local read throughput ~120MB/s 
>>>>> ("dd if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs 
>>>>> "20971520000 bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md 
>>>>> device was partitioned on 3  partitions. The first partition was 
>>>>> 10% of space in the beginning of the  device, the last partition 
>>>>> was 10% of space in the end of the device,  the middle one was 
>>>>> the rest in the middle of the space them. Then the  first and the 
>>>>> last partitions were exported to the initiator (client).  They 
>>>>> were /dev/sdb and /dev/sdc on it correspondingly.
>>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>>> optimizing your workload and figuring out what happens underneath.
>>>>
>>>> Are the client and server two standalone boxes connected by GBE?
>>> Yes, they directly connected using GbE.
>>>
>>>> When you set readahead sizes in the benchmarks, you are setting them
>>>> in the server side? I.e. "linux-4dtq" is the SCST server?
>>> Yes, it's the server. On the client all the parameters were left default.
>>>
>>>> What's the
>>>> client side readahead size?
>>> Default, i.e. 128K
>>>
>>>> It would help a lot to debug readahead if you can provide the
>>>> server side readahead stats and trace log for the worst case.
>>>> This will automatically answer the above questions as well as disclose
>>>> the micro-behavior of readahead:
>>>>
>>>>         mount -t debugfs none /sys/kernel/debug
>>>>
>>>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>>>         # do benchmark
>>>>         cat /sys/kernel/debug/readahead/stats
>>>>
>>>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>>>         dmesg
>>>>
>>>> The above readahead trace should help find out how the client side
>>>> sequential reads convert into server side random reads, and how we can
>>>> prevent that.
>>> We will do it as soon as we have a free window on that system.
>>
>> Thank you. For NFS, the client side read/readahead requests will be
>> split into units of rsize which will be served by a pool of nfsd
>> concurrently and possibly out of order. Does SCST have the same
>> process? If so, what's the rsize value for your SCST benchmarks?
>
> No, there is no such splitting in SCST. Client sees raw SCSI disks from  
> server and what client sends is directly and in full size sent by the  
> server to its backstorage using regular buffered read()  
> (fd->f_op->aio_read() followed by  
> wait_on_retry_sync_kiocb()/wait_on_sync_kiocb() to be precise).

Then it's weird that the server is seeing 1-page sized read requests:

        readahead-marker(pid=3844(vdiskd4_4), dev=00:02(bdev), ino=0(raid-3rd), req=9160+1, ra=9192+32-32, async=1) = 32
        readahead-marker(pid=3842(vdiskd4_2), dev=00:02(bdev), ino=0(raid-3rd), req=9192+1, ra=9224+32-32, async=1) = 32
        readahead-marker(pid=3841(vdiskd4_1), dev=00:02(bdev), ino=0(raid-3rd), req=9224+1, ra=9256+32-32, async=1) = 32
        readahead-marker(pid=3844(vdiskd4_4), dev=00:02(bdev), ino=0(raid-3rd), req=9256+1, ra=9288+32-32, async=1) = 32

Here the first line means a 32-page readahead I/O was submitted for a
1-page read request.

The 1-page read size only adds overheads to CPU/NIC, but not disk I/O.
The trace shows that readahead is doing a good job, however the
readahead size is the default 128K, not 2M. That's a big problem.
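
(Side note for reading these lines: a tiny user-space sketch that pulls
the req/ra fields out of one of the trace lines above. The field layout
is the one printed by the readahead debug patch; the parsing code itself
is only an illustration.)

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* One line copied from the trace above. */
	const char *line =
		"readahead-marker(pid=3844(vdiskd4_4), dev=00:02(bdev), "
		"ino=0(raid-3rd), req=9160+1, ra=9192+32-32, async=1) = 32";
	unsigned long req_off, req_size, ra_start;
	int ra_size, ra_async, async, actual;
	const char *p = strstr(line, "req=");

	if (p && sscanf(p, "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
			&req_off, &req_size, &ra_start, &ra_size,
			&ra_async, &async, &actual) == 7)
		printf("read of %lu page(s) at offset %lu triggered a "
		       "%d-page readahead at %lu (%d pages submitted)\n",
		       req_size, req_off, ra_size, ra_start, actual);
	return 0;
}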

The command "blockdev --setra 4096 /dev/sda" takes no effect at all.
Maybe you should put that command after mdadm? i.e.

        linux-4dtq:~ # mdadm --assemble /dev/md0 /dev/sd[ab]

        linux-4dtq:~ # blockdev --setra 4096 /dev/sda
        linux-4dtq:~ # blockdev --setra 4096 /dev/sdb

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-17 19:01                                 ` Vladislav Bolkhovitin
@ 2009-02-19  1:38                                   ` Wu Fengguang
  0 siblings, 0 replies; 70+ messages in thread
From: Wu Fengguang @ 2009-02-19  1:38 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

On Tue, Feb 17, 2009 at 10:01:13PM +0300, Vladislav Bolkhovitin wrote:
> Vladislav Bolkhovitin, on 02/12/2009 09:35 PM wrote:
>> Additional interesting observation is how badly simultaneous read IO  
>> streams are handled, if they aren't grouped in the corresponding IO  
>> contexts. In test 3 the result was as low as 4(!)MB/s. Wu, Jens, do you 
>> have any explanation on this? Why the inner tracks have so big 
>> preference?
>
> I realized there is another explanation: access becomes almost
> completely random. I checked, and it is true. Here is a sample
> "iostat -x 3" output on the server:

Yes it's all about 64K sized reads. Is this the stripe size of md0?

Thanks,
Fengguang

> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.57   26.62    0.00   72.81
>
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda             710.00     0.00   47.33    0.00  6058.67     0.00   128.00     0.12    2.59   2.51  11.87
> sdb             710.00     0.00   47.33    0.00  6058.67     0.00   128.00     3.99   84.34  21.13 100.00
> sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdc1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdc2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> md0               0.00     0.00 1514.67    0.00 12117.33     0.00     8.00     0.00    0.00   0.00   0.00
> dm-0              0.00     0.00  757.33    0.00  6058.67     0.00     8.00    32.87   43.41   1.32 100.00
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> dm-2              0.00     0.00  757.33    0.00  6058.67     0.00     8.00    32.79   43.33   1.32 100.00
>
> Thanks,
> Vlad

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-17 19:01                                   ` Vladislav Bolkhovitin
@ 2009-02-19  2:05                                     ` Wu Fengguang
  2009-03-19 17:44                                       ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2009-02-19  2:05 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel, linux-nfs

On Tue, Feb 17, 2009 at 10:01:40PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>> Sorry for such a huge delay. There were many other activities I had 
>>> to  do before + I had to be sure I didn't miss anything.
>>>
>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with   
>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N 
>>>  threads (N=5 in this case) handle IO from remote initiators 
>>> (clients)  coming from wire using iSCSI protocol. In addition, SCST 
>>> has patch  called export_alloc_io_context (see   
>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads 
>>>  queue IO using single IO context, so we can see if context RA can   
>>> replace grouping IO threads in single IO context.
>>>
>>> Unfortunately, the results are negative. We find neither any 
>>> advantages  of context RA over current RA implementation, nor 
>>> possibility for  context RA to replace grouping IO threads in single 
>>> IO context.
>>>
>>> Setup on the target (server) was the following. 2 SATA drives grouped 
>>> in  md RAID-0 with average local read throughput ~120MB/s ("dd 
>>> if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs "20971520000 
>>> bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md device was 
>>> partitioned on 3  partitions. The first partition was 10% of space in 
>>> the beginning of the  device, the last partition was 10% of space in 
>>> the end of the device,  the middle one was the rest in the middle of 
>>> the space them. Then the  first and the last partitions were exported 
>>> to the initiator (client).  They were /dev/sdb and /dev/sdc on it 
>>> correspondingly.
>>
>> Vladislav, Thank you for the benchmarks! I'm very interested in
>> optimizing your workload and figuring out what happens underneath.
>>
>> Are the client and server two standalone boxes connected by GBE?
>>
>> When you set readahead sizes in the benchmarks, you are setting them
>> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
>> client side readahead size?
>>
>> It would help a lot to debug readahead if you can provide the
>> server side readahead stats and trace log for the worst case.
>> This will automatically answer the above questions as well as disclose
>> the micro-behavior of readahead:
>>
>>         mount -t debugfs none /sys/kernel/debug
>>
>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>         # do benchmark
>>         cat /sys/kernel/debug/readahead/stats
>>
>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>         dmesg
>>
>> The above readahead trace should help find out how the client side
>> sequential reads convert into server side random reads, and how we can
>> prevent that.
>
> See attached. Could you comment the logs, please, so I will also be able  
> to read them in the future?

Vladislav, thank you for the logs!

The printk format for the following lines is:

+       printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
+               "ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
+               ra_pattern_names[pattern],
+               current->pid, current->comm,
+               MAJOR(mapping->host->i_sb->s_dev),
+               MINOR(mapping->host->i_sb->s_dev),
+               mapping->host->i_sb->s_id,
+               mapping->host->i_ino,
+               filp->f_path.dentry->d_name.name,
+               offset, req_size,
+               ra->start, ra->size, ra->async_size,
+               async,
+               actual);

readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10596+1, ra=10628+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10628+1, ra=10660+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10660+1, ra=10692+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10692+1, ra=10724+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10724+1, ra=10756+32-32, async=1) = 32
readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10756+1, ra=10788+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10788+1, ra=10820+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10820+1, ra=10852+32-32, async=1) = 32
readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10852+1, ra=10884+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10884+1, ra=10916+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10916+1, ra=10948+32-32, async=1) = 32
readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=10948+1, ra=10980+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10980+1, ra=11012+32-32, async=1) = 32
readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=11012+1, ra=11044+32-32, async=1) = 32
readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11044+1, ra=11076+32-32, async=1) = 32
readahead-subsequent(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11076+1, ra=11108+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11108+1, ra=11140+32-32, async=1) = 32
readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11140+1, ra=11172+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11172+1, ra=11204+32-32, async=1) = 32
readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11204+1, ra=11236+32-32, async=1) = 32
readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11236+1, ra=11268+32-32, async=1) = 32
readahead-subsequent(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11268+1, ra=11300+32-32, async=1) = 32
readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11300+1, ra=11332+32-32, async=1) = 32
readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11332+1, ra=11364+32-32, async=1) = 32
readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11364+1, ra=11396+32-32, async=1) = 32
readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11396+1, ra=11428+32-32, async=1) = 32

The above trace shows that the readahead logic is working pretty well;
however, the max readahead size (32 pages) is way too small. This can
also be confirmed in the following table, where the average readahead
request size/async_size and the actual readahead I/O size are all 30.

linux-4dtq:/ # cat /sys/kernel/debug/readahead/stats
pattern         count sync_count  eof_count       size async_size     actual
none                0          0          0          0          0          0
initial0           71         71         41          4          3          2
initial            23         23          0          4          3          4
subsequent       3845          4         21         31         31         31
marker           4222          0          1         31         31         31
trail               0          0          0          0          0          0
oversize            0          0          0          0          0          0
reverse             0          0          0          0          0          0
stride              0          0          0          0          0          0
thrash              0          0          0          0          0          0
mmap              135        135         15         32          0         17
fadvise           180        180        180          0          0          1
random             23         23          2          1          0          1
all              8499        436        260         30         30         30
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^

I suspect that your main performance problem comes from the small read/readahead size.
If larger values are used, even the vanilla 2.6.27 kernel will perform well.
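
(A quick sanity check of the sizes above, assuming 4 KiB pages and
remembering that blockdev --setra/--getra count 512-byte sectors; this
is only an illustration, not part of the measurements:)

#include <stdio.h>

int main(void)
{
	const unsigned int page_kib = 4;	/* assumed page size */
	const unsigned int sector_bytes = 512;	/* unit used by blockdev --setra/--getra */

	/* The stats above show ~30-32 page readahead windows. */
	printf("32 pages     = %u KiB (the default readahead)\n", 32 * page_kib);

	/* "blockdev --setra 4096" asks for 4096 sectors. */
	printf("4096 sectors = %u KiB = %u pages\n",
	       4096 * sector_bytes / 1024,
	       4096 * sector_bytes / 1024 / page_kib);
	return 0;
}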

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-02-19  2:05                                     ` Wu Fengguang
@ 2009-03-19 17:44                                       ` Vladislav Bolkhovitin
  2009-03-20  8:53                                         ` Vladislav Bolkhovitin
  2009-03-23  1:42                                         ` Wu Fengguang
  0 siblings, 2 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-03-19 17:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel,
	linux-nfs, lukasz.jurewicz

[-- Attachment #1: Type: text/plain, Size: 22830 bytes --]

Wu Fengguang, on 02/19/2009 05:05 AM wrote:
> On Tue, Feb 17, 2009 at 10:01:40PM +0300, Vladislav Bolkhovitin wrote:
>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>> Sorry for such a huge delay. There were many other activities I had 
>>>> to  do before + I had to be sure I didn't miss anything.
>>>>
>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) with   
>>>> iSCSI-SCST target driver. It has similar to NFS architecture, where N 
>>>>  threads (N=5 in this case) handle IO from remote initiators 
>>>> (clients)  coming from wire using iSCSI protocol. In addition, SCST 
>>>> has patch  called export_alloc_io_context (see   
>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO threads 
>>>>  queue IO using single IO context, so we can see if context RA can   
>>>> replace grouping IO threads in single IO context.
>>>>
>>>> Unfortunately, the results are negative. We find neither any 
>>>> advantages  of context RA over current RA implementation, nor 
>>>> possibility for  context RA to replace grouping IO threads in single 
>>>> IO context.
>>>>
>>>> Setup on the target (server) was the following. 2 SATA drives grouped 
>>>> in  md RAID-0 with average local read throughput ~120MB/s ("dd 
>>>> if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs "20971520000 
>>>> bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md device was 
>>>> partitioned on 3  partitions. The first partition was 10% of space in 
>>>> the beginning of the  device, the last partition was 10% of space in 
>>>> the end of the device,  the middle one was the rest in the middle of 
>>>> the space them. Then the  first and the last partitions were exported 
>>>> to the initiator (client).  They were /dev/sdb and /dev/sdc on it 
>>>> correspondingly.
>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>> optimizing your workload and figuring out what happens underneath.
>>>
>>> Are the client and server two standalone boxes connected by GBE?
>>>
>>> When you set readahead sizes in the benchmarks, you are setting them
>>> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
>>> client side readahead size?
>>>
>>> It would help a lot to debug readahead if you can provide the
>>> server side readahead stats and trace log for the worst case.
>>> This will automatically answer the above questions as well as disclose
>>> the micro-behavior of readahead:
>>>
>>>         mount -t debugfs none /sys/kernel/debug
>>>
>>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>>         # do benchmark
>>>         cat /sys/kernel/debug/readahead/stats
>>>
>>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>>         dmesg
>>>
>>> The above readahead trace should help find out how the client side
>>> sequential reads convert into server side random reads, and how we can
>>> prevent that.
>> See attached. Could you comment the logs, please, so I will also be able  
>> to read them in the future?
> 
> Vladislav, thank you for the logs!
> 
> The printk format for the following lines is:
> 
> +       printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
> +               "ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
> +               ra_pattern_names[pattern],
> +               current->pid, current->comm,
> +               MAJOR(mapping->host->i_sb->s_dev),
> +               MINOR(mapping->host->i_sb->s_dev),
> +               mapping->host->i_sb->s_id,
> +               mapping->host->i_ino,
> +               filp->f_path.dentry->d_name.name,
> +               offset, req_size,
> +               ra->start, ra->size, ra->async_size,
> +               async,
> +               actual);
> 
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10596+1, ra=10628+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10628+1, ra=10660+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10660+1, ra=10692+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10692+1, ra=10724+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10724+1, ra=10756+32-32, async=1) = 32
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10756+1, ra=10788+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10788+1, ra=10820+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10820+1, ra=10852+32-32, async=1) = 32
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10852+1, ra=10884+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10884+1, ra=10916+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10916+1, ra=10948+32-32, async=1) = 32
> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=10948+1, ra=10980+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10980+1, ra=11012+32-32, async=1) = 32
> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=11012+1, ra=11044+32-32, async=1) = 32
> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11044+1, ra=11076+32-32, async=1) = 32
> readahead-subsequent(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11076+1, ra=11108+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11108+1, ra=11140+32-32, async=1) = 32
> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11140+1, ra=11172+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11172+1, ra=11204+32-32, async=1) = 32
> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11204+1, ra=11236+32-32, async=1) = 32
> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11236+1, ra=11268+32-32, async=1) = 32
> readahead-subsequent(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11268+1, ra=11300+32-32, async=1) = 32
> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11300+1, ra=11332+32-32, async=1) = 32
> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11332+1, ra=11364+32-32, async=1) = 32
> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11364+1, ra=11396+32-32, async=1) = 32
> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11396+1, ra=11428+32-32, async=1) = 32
> 
> The above trace shows that the readahead logic is working pretty well,
> however the max readahead size(32 pages) is way too small. This can
> also be confirmed in the following table, where the average readahead
> request size/async_size and actual readahead I/O size are all 30.
> 
> linux-4dtq:/ # cat /sys/kernel/debug/readahead/stats
> pattern         count sync_count  eof_count       size async_size     actual
> none                0          0          0          0          0          0
> initial0           71         71         41          4          3          2
> initial            23         23          0          4          3          4
> subsequent       3845          4         21         31         31         31
> marker           4222          0          1         31         31         31
> trail               0          0          0          0          0          0
> oversize            0          0          0          0          0          0
> reverse             0          0          0          0          0          0
> stride              0          0          0          0          0          0
> thrash              0          0          0          0          0          0
> mmap              135        135         15         32          0         17
> fadvise           180        180        180          0          0          1
> random             23         23          2          1          0          1
> all              8499        436        260         30         30         30
>                                                     ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> I suspect that your main performance problem comes from the small read/readahead size.
> If larger values are used, even the vanilla 2.6.27 kernel will perform well.

Yes, it was a misconfiguration on our side: the readahead size was not 
set correctly on all devices. In the correct configuration, context-based 
RA shows a consistent advantage over the current vanilla algorithm, but 
not as much as I would expect. It still performs considerably worse than 
the case when all the IO threads work in the same IO context. As a 
reminder, our setup and tests are described in http://lkml.org/lkml/2009/2/12/277.
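
Just for reference, the RA size in question is the usual per-device 
readahead setting, which needs to be set consistently on every device 
involved. For example (the device names are only placeholders for our 
setup):

# blockdev --setra 2048 /dev/md0
# echo 1024 > /sys/block/sda/queue/read_ahead_kb

Both knobs set the same per-device value: --setra takes 512-byte sectors 
(2048 sectors = 1024K), read_ahead_kb takes kilobytes, and 
"blockdev --getra" reads the current value back.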

Here are the conclusions from the tests:

  1. Making all IO threads work in the same IO context with CFQ (vanilla 
RA and default RA size) brings nearly 100% link utilization on single 
stream reads (100MB/s) and with deadline about 50% (50MB/s), i.e. there 
is a 100% improvement of CFQ over deadline. With 2 read streams CFQ has 
an even bigger advantage: >400% (23MB/s vs 5MB/s).

  2. All IO threads work in different IO contexts. With vanilla RA and 
the default RA size, CFQ performs 50% worse (48MB/s), even worse than 
deadline.

  3. All IO threads work in different IO contexts. With the default RA 
size, context RA brings a 40% single stream improvement with deadline 
(71MB/s vs 51MB/s) and no improvement with CFQ (48MB/s).

  4. All IO threads work in different IO contexts. With higher RA sizes 
there is a stable 6% improvement of context RA over vanilla RA with CFQ, 
starting from 20%. Deadline performs similarly. In parallel reads the 
improvement is bigger: 30% at 4M RA size with deadline (39MB/s vs 27MB/s).

  5. All IO threads work in different IO contexts. The best performance 
is achieved with an RA maximum size of 4M for both RA algorithms, but 
starting from size 1M it grows very slowly.

  6. Unexpected result. In the case when all IO threads work in the same 
IO context with CFQ, increasing the RA size *decreases* throughput. I 
think this is because RA requests are performed as single big READ 
requests, while requests coming from remote clients are much smaller in 
size (up to 256K), so while the data read by RA is being transferred to 
the remote client at 100MB/s, the backstorage media gets rotated a bit, 
so the next read request must wait for the rotation latency (~0.1ms on 
7200RPM). This conforms well with (3) above, where context RA has a 40% 
advantage over vanilla RA with the default RA size, but a much smaller 
one with higher RA sizes.

Bottom line IMHO conclusions:

1. Context RA should be considered, after additional examination, to 
replace the current RA algorithm in the kernel.

2. It would be better to increase the default RA size to 1024K.

*AND* one of the following:

3.1. All RA requests should be split into smaller requests with size up 
to 256K, which should not be merged with any other request.

OR

3.2. New RA requests should be sent before the previous one has 
completed, so the storage device doesn't rotate so far that a full 
rotation is needed to serve the next request.

I like suggestion 3.1 a lot more, since it should be simple to implement 
and has the following 2 positive side effects:

1. It would allow minimizing the negative effect of a higher RA size on 
I/O latency by allowing CFQ to switch to requests that have been waiting 
too long, when necessary.

2. It would allow better request pipelining, which is very important to 
minimize uplink latency for synchronous requests (i.e. with only one IO 
request at a time, the next request is issued when the previous one has 
completed). You can see at http://www.3ware.com/kb/article.aspx?id=11050 
that for maximum performance 3ware recommends setting max_sectors_kb as 
low as *64K* with a 16MB RA. This maximizes command pipelining, and the 
suggestion really works, improving throughput by 50-100%!

Here are the raw numbers. I also attached the context RA debug output 
for the 2MB RA size case for your viewing pleasure.

--------------------------------------------------------------------

Performance baseline: all IO threads work in the same IO context, 
current vanilla RA, default RA size:

CFQ scheduler:

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 102 MB/s
         b) 102 MB/s
         c) 102 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 21,6 MB/s
         b) 22,8 MB/s
         c) 24,1 MB/s
         d) 23,1 MB/s

Deadline scheduler:

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 51,1 MB/s
         b) 51,4 MB/s
         c) 51,1 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 4,7 MB/s
         b) 4,6 MB/s
         c) 4,8 MB/s

--------------------------------------------------------------------

RA performance baseline: all IO threads work in different IO contexts, 
current vanilla RA, default RA size:

CFQ:

#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 48,6 MB/s
         b) 49,2 MB/s
         c) 48,9 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 4,2 MB/s
         b) 3,9 MB/s
         c) 4,1 MB/s

Deadline:

1) dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 53,2 MB/s
         b) 51,8 MB/s
         c) 51,6 MB/s

Run at the same time:
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 5,1 MB/s
         b) 4,6 MB/s
         c) 4,8 MB/s

--------------------------------------------------------------------

Context RA, all IO threads work in different IO contexts, default RA size:

CFQ:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 47,9 MB/s
         b) 48,2 MB/s
         c) 48,1 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 3,5 MB/s
         b) 3,6 MB/s
         c) 3,8 MB/s

Deadline:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 72,4 MB/s
         b) 68,3 MB/s
         c) 71,3 MB/s

Run at the same time:
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
         a) 4,3 MB/s
         b) 5,0 MB/s
         c) 4,8 MB/s

--------------------------------------------------------------------

Vanilla RA, all IO threads work in different IO contexts, various RA sizes:

CFQ:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 60,5 MB/s
	b) 59,3 MB/s
	c) 59,7 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 9,4 MB/s
	b) 9,4 MB/s
	c) 9,1 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 74,7 MB/s
	b) 73,2 MB/s
	c) 74,1 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 13,7 MB/s
	b) 13,6 MB/s
	c) 13,1 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 76,7 MB/s
	b) 76,8 MB/s
	c) 76,6 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 21,8 MB/s
	b) 22,1 MB/s
	c) 20,3 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 80,8 MB/s
	b) 80,8 MB/s
	c) 80,3 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 29,6 MB/s
	b) 29,4 MB/s
	c) 27,2 MB/s

=== Deadline:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 68,4 MB/s
	b) 67,0 MB/s
	c) 67,6 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 8,8 MB/s
	b) 8,9 MB/s
	c) 8,7 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 81,0 MB/s
	b) 82,4 MB/s
	c) 81,7 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 13,5 MB/s
	b) 13,1 MB/s
	c) 12,9 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 81,1 MB/s
	b) 80,1 MB/s
	c) 81,8 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 21,9 MB/s
	b) 20,7 MB/s
	c) 21,3 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 83,1 MB/s
	b) 82,7 MB/s
	c) 82,9 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 27,9 MB/s
	b) 23,5 MB/s
	c) 27,6 MB/s

--------------------------------------------------------------------

Context RA, all IO threads work in different IO contexts, various RA sizes:

CFQ:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 63,7 MB/s
	b) 63,5 MB/s
	c) 62,8 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 7,1 MB/s
	b) 6,7 MB/s
	c) 7,0 MB/s	
	d) 6,9 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 81,1 MB/s
	b) 81,8 MB/s
	c)  MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 14,1 MB/s
	b) 14,0 MB/s
	c) 14,1 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 81,6 MB/s
	b) 81,4 MB/s
	c) 86,0 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 22,3 MB/s
	b) 21,5 MB/s
	c) 21,7 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 83,1 MB/s
	b) 83,5 MB/s
	c) 82,9 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 32,8 MB/s
	b) 32,7 MB/s
	c) 30,2 MB/s

=== Deadline:

RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 68,8 MB/s
	b) 68,9 MB/s
	c) 69,0 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 8,7 MB/s
	b) 9,0 MB/s
	c) 8,9 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 83,5 MB/s
	b) 83,1 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 14,0 MB/s
	b) 13,9 MB/s
	c) 13,8 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 82,6 MB/s
	b) 82,4 MB/s
	c) 81,9 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 21,9 MB/s
	b) 23,1 MB/s
	c) 17,8 MB/s	
	d) 21,1 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 84,5 MB/s
	b) 83,7 MB/s
	c) 83,8 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 39,9 MB/s
	b) 39,5 MB/s
	c) 38,4 MB/s

--------------------------------------------------------------------

all IO threads work in the same IO context, context RA, various RA sizes:

=== CFQ:

--- RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 86,4 MB/s
	b) 87,9 MB/s
	c) 86,7 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 17,8 MB/s
	b) 18,3 MB/s
	c) 17,7 MB/s

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 83,3 MB/s
	b) 81,6 MB/s
	c) 81,9 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 22,1 MB/s
	b) 21,5 MB/s
	c) 21,2 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 81,1 MB/s
	b) 81,0 MB/s
	c) 81,6 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 22,2 MB/s
	b) 20,2 MB/s
	c) 20,9 MB/s

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 83,4 MB/s
	b) 82,8 MB/s
	c) 83,3 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 22,6 MB/s
	b) 23,4 MB/s
	c) 21,8 MB/s

=== Deadline:

--- RA 512K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 70,0 MB/s
	b) 70,7 MB/s
	c) 69,7 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 9,1 MB/s
	b) 8,3 MB/s
	c) 8,4 MB/s	

--- RA 1024K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 84,3 MB/s
	b) 83,2 MB/s
	c) 83,3 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 13,9 MB/s
	b) 13,1 MB/s
	c) 13,4 MB/s

--- RA 2048K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 82,6 MB/s
	b) 82,1 MB/s
	c) 82,3 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 21,6 MB/s
	b) 22,4 MB/s
	c) 21,3 MB/s	

--- RA 4096K:

dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 83,8 MB/s
	b) 83,8 MB/s
	c) 83,1 MB/s

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 39,5 MB/s
	b) 39,6 MB/s
	c) 37,0 MB/s

Thanks,
Vlad

[-- Attachment #2: ra.tar.bz2 --]
[-- Type: application/x-bzip, Size: 26317 bytes --]

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-03-19 17:44                                       ` Vladislav Bolkhovitin
@ 2009-03-20  8:53                                         ` Vladislav Bolkhovitin
  2009-03-23  1:42                                         ` Wu Fengguang
  1 sibling, 0 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-03-20  8:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel,
	linux-nfs, lukasz.jurewicz

Vladislav Bolkhovitin, on 03/19/2009 08:44 PM wrote:
>   6. Unexpected result. In case, when all IO threads work in the same IO 
> context with CFQ increasing RA size *decreases* throughput. I think this 
> is, because RA requests performed as single big READ requests, while 
> requests coming from remote clients are much smaller in size (up to 
> 256K), so, when the read by RA data transferred to the remote client on 
> 100MB/s speed, the backstorage media gets rotated a bit, so the next 
> read request must wait the rotation latency (~0.1ms on 7200RPM).
                                                ^^^^^^^^^^^^^^^^^
Oops, it's RP*M*, hence the full rotation latency is in 60 times more, 
i.e. 8.3 ms.

Sorry,
Vlad


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-03-19 17:44                                       ` Vladislav Bolkhovitin
  2009-03-20  8:53                                         ` Vladislav Bolkhovitin
@ 2009-03-23  1:42                                         ` Wu Fengguang
  2009-04-21 18:18                                           ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2009-03-23  1:42 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel,
	linux-nfs, lukasz.jurewicz

On Thu, Mar 19, 2009 at 08:44:13PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/19/2009 05:05 AM wrote:
>> On Tue, Feb 17, 2009 at 10:01:40PM +0300, Vladislav Bolkhovitin wrote:
>>> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
>>>> On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>>>>> Sorry for such a huge delay. There were many other activities I 
>>>>> had to  do before + I had to be sure I didn't miss anything.
>>>>>
>>>>> We didn't use NFS, we used SCST (http://scst.sourceforge.net) 
>>>>> with   iSCSI-SCST target driver. It has similar to NFS 
>>>>> architecture, where N  threads (N=5 in this case) handle IO from 
>>>>> remote initiators (clients)  coming from wire using iSCSI 
>>>>> protocol. In addition, SCST has patch  called 
>>>>> export_alloc_io_context (see    
>>>>> http://lkml.org/lkml/2008/12/10/282), which allows for the IO 
>>>>> threads  queue IO using single IO context, so we can see if 
>>>>> context RA can   replace grouping IO threads in single IO 
>>>>> context.
>>>>>
>>>>> Unfortunately, the results are negative. We find neither any  
>>>>> advantages  of context RA over current RA implementation, nor  
>>>>> possibility for  context RA to replace grouping IO threads in 
>>>>> single IO context.
>>>>>
>>>>> Setup on the target (server) was the following. 2 SATA drives 
>>>>> grouped in  md RAID-0 with average local read throughput ~120MB/s 
>>>>> ("dd if=/dev/zero  of=/dev/md0 bs=1M count=20000" outputs 
>>>>> "20971520000 bytes (21 GB)  copied, 177,742 s, 118 MB/s"). The md 
>>>>> device was partitioned on 3  partitions. The first partition was 
>>>>> 10% of space in the beginning of the  device, the last partition 
>>>>> was 10% of space in the end of the device,  the middle one was 
>>>>> the rest in the middle of the space them. Then the  first and the 
>>>>> last partitions were exported to the initiator (client).  They 
>>>>> were /dev/sdb and /dev/sdc on it correspondingly.
>>>> Vladislav, Thank you for the benchmarks! I'm very interested in
>>>> optimizing your workload and figuring out what happens underneath.
>>>>
>>>> Are the client and server two standalone boxes connected by GBE?
>>>>
>>>> When you set readahead sizes in the benchmarks, you are setting them
>>>> in the server side? I.e. "linux-4dtq" is the SCST server? What's the
>>>> client side readahead size?
>>>>
>>>> It would help a lot to debug readahead if you can provide the
>>>> server side readahead stats and trace log for the worst case.
>>>> This will automatically answer the above questions as well as disclose
>>>> the micro-behavior of readahead:
>>>>
>>>>         mount -t debugfs none /sys/kernel/debug
>>>>
>>>>         echo > /sys/kernel/debug/readahead/stats # reset counters
>>>>         # do benchmark
>>>>         cat /sys/kernel/debug/readahead/stats
>>>>
>>>>         echo 1 > /sys/kernel/debug/readahead/trace_enable
>>>>         # do micro-benchmark, i.e. run the same benchmark for a short time
>>>>         echo 0 > /sys/kernel/debug/readahead/trace_enable
>>>>         dmesg
>>>>
>>>> The above readahead trace should help find out how the client side
>>>> sequential reads convert into server side random reads, and how we can
>>>> prevent that.
>>> See attached. Could you comment the logs, please, so I will also be 
>>> able  to read them in the future?
>>
>> Vladislav, thank you for the logs!
>>
>> The printk format for the following lines is:
>>
>> +       printk(KERN_DEBUG "readahead-%s(pid=%d(%s), dev=%02x:%02x(%s), "
>> +               "ino=%lu(%s), req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d\n",
>> +               ra_pattern_names[pattern],
>> +               current->pid, current->comm,
>> +               MAJOR(mapping->host->i_sb->s_dev),
>> +               MINOR(mapping->host->i_sb->s_dev),
>> +               mapping->host->i_sb->s_id,
>> +               mapping->host->i_ino,
>> +               filp->f_path.dentry->d_name.name,
>> +               offset, req_size,
>> +               ra->start, ra->size, ra->async_size,
>> +               async,
>> +               actual);
>>
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10596+1, ra=10628+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10628+1, ra=10660+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10660+1, ra=10692+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10692+1, ra=10724+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10724+1, ra=10756+32-32, async=1) = 32
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10756+1, ra=10788+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10788+1, ra=10820+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=10820+1, ra=10852+32-32, async=1) = 32
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=10852+1, ra=10884+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10884+1, ra=10916+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=10916+1, ra=10948+32-32, async=1) = 32
>> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=10948+1, ra=10980+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=10980+1, ra=11012+32-32, async=1) = 32
>> readahead-marker(pid=3838(vdiskd3_3), dev=00:02(bdev), ino=0(raid-1st), req=11012+1, ra=11044+32-32, async=1) = 32
>> readahead-marker(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11044+1, ra=11076+32-32, async=1) = 32
>> readahead-subsequent(pid=3836(vdiskd3_1), dev=00:02(bdev), ino=0(raid-1st), req=11076+1, ra=11108+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11108+1, ra=11140+32-32, async=1) = 32
>> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11140+1, ra=11172+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11172+1, ra=11204+32-32, async=1) = 32
>> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11204+1, ra=11236+32-32, async=1) = 32
>> readahead-marker(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11236+1, ra=11268+32-32, async=1) = 32
>> readahead-subsequent(pid=3837(vdiskd3_2), dev=00:02(bdev), ino=0(raid-1st), req=11268+1, ra=11300+32-32, async=1) = 32
>> readahead-marker(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11300+1, ra=11332+32-32, async=1) = 32
>> readahead-subsequent(pid=3835(vdiskd3_0), dev=00:02(bdev), ino=0(raid-1st), req=11332+1, ra=11364+32-32, async=1) = 32
>> readahead-marker(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11364+1, ra=11396+32-32, async=1) = 32
>> readahead-subsequent(pid=3839(vdiskd3_4), dev=00:02(bdev), ino=0(raid-1st), req=11396+1, ra=11428+32-32, async=1) = 32
>>
>> The above trace shows that the readahead logic is working pretty well,
>> however the max readahead size(32 pages) is way too small. This can
>> also be confirmed in the following table, where the average readahead
>> request size/async_size and actual readahead I/O size are all 30.
>>
>> linux-4dtq:/ # cat /sys/kernel/debug/readahead/stats
>> pattern         count sync_count  eof_count       size async_size     actual
>> none                0          0          0          0          0          0
>> initial0           71         71         41          4          3          2
>> initial            23         23          0          4          3          4
>> subsequent       3845          4         21         31         31         31
>> marker           4222          0          1         31         31         31
>> trail               0          0          0          0          0          0
>> oversize            0          0          0          0          0          0
>> reverse             0          0          0          0          0          0
>> stride              0          0          0          0          0          0
>> thrash              0          0          0          0          0          0
>> mmap              135        135         15         32          0         17
>> fadvise           180        180        180          0          0          1
>> random             23         23          2          1          0          1
>> all              8499        436        260         30         30         30
>>                                                     ^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> I suspect that your main performance problem comes from the small read/readahead size.
>> If larger values are used, even the vanilla 2.6.27 kernel will perform well.
>
> Yes, it was misconfiguration on our side: readahead size was not set  
> correctly on all devices. In the correct configuration context based RA  
> shows constant advantage over the current vanilla algorithm, but not as  
> much as I would expect. It still performs considerably worse, than in  
> case when all the IO threads work in the same IO context. To remind, our  
> setup and tests described in http://lkml.org/lkml/2009/2/12/277.

Vladislav, thank you very much for the great efforts and details!

> Here are the conclusions from tests:
>
>  1. Making all IO threads work in the same IO context with CFQ (vanilla  
> RA and default RA size) brings near 100% link utilization on single  
> stream reads (100MB/s) and with deadline about 50% (50MB/s). I.e. there  
> is 100% improvement of CFQ over deadline. With 2 read streams CFQ has  
> ever more advantage: >400% (23MB/s vs 5MB/s).

The ideal 2-stream throughput should be >60MB/s, so I guess there is
still room for improvement beyond CFQ's 23MB/s?

>  2. All IO threads work in different IO contexts. With vanilla RA and  
> default RA size CFQ performs 50% worse (48MB/s), even worse than 
> deadline.
>
>  3. All IO threads work in different IO contexts. With default RA size  
> context RA brings on single stream 40% improvement with deadline (71MB/s  
> vs 51MB/s), no improvement with cfq (48MB/s).
>
>  4. All IO threads work in different IO contexts. With higher RA sizes  
> there is stable 6% improvement with context RA over vanilla RA with CFQ  
> starting from 20%. Deadline performs similarly. In parallel reads  
> improvement is bigger: 30% on 4M RA size with deadline (39MB/s vs 27MB/s)

Your attached readahead trace shows that context RA is submitting
perfect 256-page async readahead IOs. (However the readahead-subsequent
cache hits are puzzling.)

The vanilla RA detects concurrent streams in a passive/opportunistic way.
The context RA works in an active/guaranteed way. It's also better at
serving NFS-style cooperative IO processes. And your SCST workload
looks very much like NFS.

The one fact I cannot understand is that SCST seems to be breaking up the
client side 64K reads into server side 4K reads (above the readahead layer).
But I remember you told me that SCST doesn't do NFS rsize style split-ups.
Is this a bug? The 4K read size is too small to be CPU/network friendly...
Where are the split-up and re-assembly done? On the client side or
internal to the server?

>  5. All IO threads work in different IO contexts. The best performance  
> achieved with RA maximum size 4M on both RA algorithms, but starting  
> from size 1M it starts growing very slowly.

True.

>  6. Unexpected result. In case, when ll IO threads work in the same IO  
> context with CFQ increasing RA size *decreases* throughput. I think this  
> is, because RA requests performed as single big READ requests, while  
> requests coming from remote clients are much smaller in size (up to  
> 256K), so, when the read by RA data transferred to the remote client on  
> 100MB/s speed, the backstorage media gets rotated a bit, so the next  
> read request must wait the rotation latency (~0.1ms on 7200RPM). This is  
> well conforms with (3) above, when context RA has 40% advantage over  
> vanilla RA with default RA, but much smaller with higher RA.

Maybe. But the readahead IOs (as shown by the trace) are _async_ ones...

> Bottom line IMHO conclusions:
>
> 1. Context RA should be considered after additional examination to  
> replace current RA algorithm in the kernel

That's my plan to push context RA to mainline. And thank you very much
for providing and testing out a real world application for it!

> 2. It would be better to increase default RA size to 1024K

Increasing the default RA size is a long-standing wish. However, I have a
vague feeling that it would be better to first make the lower layers
smarter about max_sectors_kb-granularity request splitting and batching.

> *AND* one of the following:
>
> 3.1. All RA requests should be split in smaller requests with size up to  
> 256K, which should not be merged with any other request

Are you referring to max_sectors_kb?

What's your max_sectors_kb and nr_requests? Something like

        grep -r . /sys/block/sda/queue/

> OR
>
> 3.2. New RA requests should be sent before the previous one completed to  
> don't let the storage device rotate too far to need full rotation to  
> serve the next request.

Linus has an mmap readahead cleanup patch that can do this. It
basically replaces a {find_lock_page(); readahead();} sequence with
{find_get_page(); readahead(); lock_page();}.

I'll try to push that patch into mainline.

> I like suggestion 3.1 a lot more, since it should be simple to implement  
> and has the following 2 positive side effects:
>
> 1. It would allow to minimize negative effect of higher RA size on the  
> I/O delay latency by allowing CFQ to switch to too long waiting  
> requests, when necessary.
>
> 2. It would allow better requests pipelining, which is very important to  
> minimize uplink latency for synchronous requests (i.e. with only one IO  
> request at time, next request issued, when the previous one completed).  
> You can see in http://www.3ware.com/kb/article.aspx?id=11050 that 3ware  
> recommends for maximum performance set max_sectors_kb as low as *64K*  
> with 16MB RA. It allows to maximize serving commands pipelining. And  
> this suggestion really works allowing to improve throughput in 50-100%!
>
> Here are the raw numbers. I also attached context RA debug output for  
> 2MB RA size case for your viewing pleasure.

Thank you, it helped a lot!

(Can I wish a CONFIG_PRINTK_TIME=y next time? :-)

Thanks,
Fengguang

> --------------------------------------------------------------------
>
> Performance baseline: all IO threads work in the same IO context,  
> current vanilla RA, default RA size:
>
> CFQ scheduler:
>
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 102 MB/s
>         b) 102 MB/s
>         c) 102 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 21,6 MB/s
>         b) 22,8 MB/s
>         c) 24,1 MB/s
>         d) 23,1 MB/s
>
> Deadline scheduler:
>
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 51,1 MB/s
>         b) 51,4 MB/s
>         c) 51,1 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 4,7 MB/s
>         b) 4,6 MB/s
>         c) 4,8 MB/s
>
> --------------------------------------------------------------------
>
> RA performance baseline: all IO threads work in different IO contexts,  
> current vanilla RA, default RA size:
>
> CFQ:
>
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 48,6 MB/s
>         b) 49,2 MB/s
>         c) 48,9 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 4,2 MB/s
>         b) 3,9 MB/s
>         c) 4,1 MB/s
>
> Deadline:
>
> 1) dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 53,2 MB/s
>         b) 51,8 MB/s
>         c) 51,6 MB/s
>
> Run at the same time:
> #while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> #dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 5,1 MB/s
>         b) 4,6 MB/s
>         c) 4,8 MB/s
>
> --------------------------------------------------------------------
>
> Context RA, all IO threads work in different IO contexts, default RA size:
>
> CFQ:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 47,9 MB/s
>         b) 48,2 MB/s
>         c) 48,1 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 3,5 MB/s
>         b) 3,6 MB/s
>         c) 3,8 MB/s
>
> Deadline:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 72,4 MB/s
>         b) 68,3 MB/s
>         c) 71,3 MB/s
>
> Run at the same time:
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>         a) 4,3 MB/s
>         b) 5,0 MB/s
>         c) 4,8 MB/s
>
> --------------------------------------------------------------------
>
> Vanilla RA, all IO threads work in different IO contexts, various RA sizes:
>
> CFQ:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 60,5 MB/s
> 	b) 59,3 MB/s
> 	c) 59,7 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 9,4 MB/s
> 	b) 9,4 MB/s
> 	c) 9,1 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 74,7 MB/s
> 	b) 73,2 MB/s
> 	c) 74,1 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 13,7 MB/s
> 	b) 13,6 MB/s
> 	c) 13,1 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 76,7 MB/s
> 	b) 76,8 MB/s
> 	c) 76,6 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 21,8 MB/s
> 	b) 22,1 MB/s
> 	c) 20,3 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 80,8 MB/s
> 	b) 80.8 MB/s
> 	c) 80,3 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 29,6 MB/s
> 	b) 29,4 MB/s
> 	c) 27,2 MB/s
>
> === Deadline:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 68,4 MB/s
> 	b) 67,0 MB/s
> 	c) 67,6 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 8,8 MB/s
> 	b) 8,9 MB/s
> 	c) 8,7 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 81,0 MB/s
> 	b) 82,4 MB/s
> 	c) 81,7 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 13,5 MB/s
> 	b) 13,1 MB/s
> 	c) 12,9 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 81,1 MB/s
> 	b) 80,1 MB/s
> 	c) 81,8 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 21,9 MB/s
> 	b) 20,7 MB/s
> 	c) 21,3 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 83,1 MB/s
> 	b) 82,7 MB/s
> 	c) 82,9 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 27,9 MB/s
> 	b) 23,5 MB/s
> 	c) 27,6 MB/s
>
> --------------------------------------------------------------------
>
> Context RA, all IO threads work in different IO contexts, various RA sizes:
>
> CFQ:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 63,7 MB/s
> 	b) 63,5 MB/s
> 	c) 62,8 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 7,1 MB/s
> 	b) 6,7 MB/s
> 	c) 7,0 MB/s	
> 	d) 6,9 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 81,1 MB/s
> 	b) 81,8 MB/s
> 	c)  MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 14,1 MB/s
> 	b) 14,0 MB/s
> 	c) 14,1 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 81,6 MB/s
> 	b) 81,4 MB/s
> 	c) 86,0 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 22,3 MB/s
> 	b) 21,5 MB/s
> 	c) 21,7 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 83,1 MB/s
> 	b) 83,5 MB/s
> 	c) 82,9 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 32,8 MB/s
> 	b) 32,7 MB/s
> 	c) 30,2 MB/s
>
> === Deadline:
>
> RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 68,8 MB/s
> 	b) 68,9 MB/s
> 	c) 69,0 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 8,7 MB/s
> 	b) 9,0 MB/s
> 	c) 8,9 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 83,5 MB/s
> 	b) 83,1 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 14,0 MB/s
> 	b) 13.9 MB/s
> 	c) 13,8 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 82,6 MB/s
> 	b) 82,4 MB/s
> 	c) 81,9 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 21,9 MB/s
> 	b) 23,1 MB/s
> 	c) 17,8 MB/s	
> 	d) 21,1 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 84,5 MB/s
> 	b) 83,7 MB/s
> 	c) 83,8 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 39,9 MB/s
> 	b) 39,5 MB/s
> 	c) 38,4 MB/s
>
> --------------------------------------------------------------------
>
> all IO threads work in the same IO context, context RA, various RA sizes:
>
> === CFQ:
>
> --- RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 86,4 MB/s
> 	b) 87,9 MB/s
> 	c) 86,7 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 17,8 MB/s
> 	b) 18,3 MB/s
> 	c) 17,7 MB/s
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 83,3 MB/s
> 	b) 81,6 MB/s
> 	c) 81,9 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 22,1 MB/s
> 	b) 21,5 MB/s
> 	c) 21,2 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 81,1 MB/s
> 	b) 81,0 MB/s
> 	c) 81,6 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 22,2 MB/s
> 	b) 20,2 MB/s
> 	c) 20,9 MB/s
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 83,4 MB/s
> 	b) 82,8 MB/s
> 	c) 83,3 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 22,6 MB/s
> 	b) 23,4 MB/s
> 	c) 21,8 MB/s
>
> === Deadline:
>
> --- RA 512K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 70,0 MB/s
> 	b) 70,7 MB/s
> 	c) 69,7 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 9,1 MB/s
> 	b) 8,3 MB/s
> 	c) 8,4 MB/s	
>
> --- RA 1024K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 84,3 MB/s
> 	b) 83,2 MB/s
> 	c) 83,3 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 13,9 MB/s
> 	b) 13,1 MB/s
> 	c) 13,4 MB/s
>
> --- RA 2048K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 82,6 MB/s
> 	b) 82,1 MB/s
> 	c) 82,3 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 21,6 MB/s
> 	b) 22,4 MB/s
> 	c) 21,3 MB/s	
>
> --- RA 4096K:
>
> dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 83,8 MB/s
> 	b) 83,8 MB/s
> 	c) 83,1 MB/s
>
> Run at the same time:		
> linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 39,5 MB/s
> 	b) 39,6 MB/s
> 	c) 37,0 MB/s
>
> Thanks,
> Vlad



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-03-23  1:42                                         ` Wu Fengguang
@ 2009-04-21 18:18                                           ` Vladislav Bolkhovitin
  2009-04-24  8:43                                             ` Wu Fengguang
  0 siblings, 1 reply; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-04-21 18:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel,
	linux-nfs, lukasz.jurewicz

[-- Attachment #1: Type: text/plain, Size: 9445 bytes --]

Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>> Here are the conclusions from tests:
>>
>>  1. Making all IO threads work in the same IO context with CFQ (vanilla  
>> RA and default RA size) brings near 100% link utilization on single  
>> stream reads (100MB/s) and with deadline about 50% (50MB/s). I.e. there  
>> is 100% improvement of CFQ over deadline. With 2 read streams CFQ has  
>> ever more advantage: >400% (23MB/s vs 5MB/s).
> 
> The ideal 2-stream throughput should be >60MB/s, so I guess there are
> still room of improvements for the CFQ's 23MB/s?

Yes, plenty. But, I think, not in CFQ, but in readahead. With RA 4096K 
we were able to get ~40MB/s, see the previous e-mail and below.

> The one fact I cannot understand is that SCST seems to breaking up the
> client side 64K reads into server side 4K reads(above readahead layer).
> But I remember you told me that SCST don't do NFS rsize style split-ups.
> Is this a bug? The 4K read size is too small to be CPU/network friendly...
> Where are the split-up and re-assemble done? On the client side or
> internal to the server?

This is on the client's side. See the target's log in the attachment. 
Here is the summary of command data sizes that came to the server for "dd 
if=/dev/sdb of=/dev/null bs=64K count=200" run on the client:

4K                       11
8K                       0
16K                      0
32K                      0
64K                      0
128K                     81
256K                     8
512K                     0
1024K                    0
2048K                    0
4096K                    0

There are way too many 4K requests. Apparently, the request submission 
path isn't optimal.

Actually, this is another question I wanted to raise from the very 
beginning.

>>  6. Unexpected result. In case, when ll IO threads work in the same IO  
>> context with CFQ increasing RA size *decreases* throughput. I think this  
>> is, because RA requests performed as single big READ requests, while  
>> requests coming from remote clients are much smaller in size (up to  
>> 256K), so, when the read by RA data transferred to the remote client on  
>> 100MB/s speed, the backstorage media gets rotated a bit, so the next  
>> read request must wait the rotation latency (~0.1ms on 7200RPM). This is  
>> well conforms with (3) above, when context RA has 40% advantage over  
>> vanilla RA with default RA, but much smaller with higher RA.
> 
> Maybe. But the readahead IOs (as showed by the trace) are _async_ ones...

That doesn't matter, because a new request from the client won't come 
until all data for the previous one has been transferred to it. And that 
transfer is done at a very *finite* speed.

>> Bottom line IMHO conclusions:
>>
>> 1. Context RA should be considered after additional examination to  
>> replace current RA algorithm in the kernel
> 
> That's my plan to push context RA to mainline. And thank you very much
> for providing and testing out a real world application for it!

You're welcome!

>> 2. It would be better to increase default RA size to 1024K
> 
> That's a long wish to increase the default RA size. However I have a
> vague feeling that it would be better to first make the lower layers
> more smart on max_sectors_kb granularity request splitting and batching.

Can you elaborate more on that, please?

>> *AND* one of the following:
>>
>> 3.1. All RA requests should be split in smaller requests with size up to  
>> 256K, which should not be merged with any other request
> 
> Are you referring to max_sectors_kb?

Yes

> What's your max_sectors_kb and nr_requests? Something like
> 
>         grep -r . /sys/block/sda/queue/

Default: 512 and 128 respectively.

>> OR
>>
>> 3.2. New RA requests should be sent before the previous one completed to  
>> don't let the storage device rotate too far to need full rotation to  
>> serve the next request.
> 
> Linus has a mmap readahead cleanup patch that can do this. It
> basically replaces a {find_lock_page(); readahead();} sequence into
> {find_get_page(); readahead(); lock_page();}.
> 
> I'll try to push that patch into mainline.

Good!

>> I like suggestion 3.1 a lot more, since it should be simple to implement  
>> and has the following 2 positive side effects:
>>
>> 1. It would allow to minimize negative effect of higher RA size on the  
>> I/O delay latency by allowing CFQ to switch to too long waiting  
>> requests, when necessary.
>>
>> 2. It would allow better requests pipelining, which is very important to  
>> minimize uplink latency for synchronous requests (i.e. with only one IO  
>> request at time, next request issued, when the previous one completed).  
>> You can see in http://www.3ware.com/kb/article.aspx?id=11050 that 3ware  
>> recommends for maximum performance set max_sectors_kb as low as *64K*  
>> with 16MB RA. It allows to maximize serving commands pipelining. And  
>> this suggestion really works allowing to improve throughput in 50-100%!

It seems I should elaborate more on this. The case when the client is 
remote has a fundamental difference from the case when the client is 
local, for which Linux is currently optimized. When the client is local, 
data is delivered to it from the page cache at a virtually infinite 
speed. But when the client is remote, data is delivered to it from the 
server's cache at a *finite* speed. In our case this speed is about the 
same as the speed of reading data into the cache from the storage. This 
has the following consequences:

1. Data for any READ request is at first transferred from the storage to 
the cache, then from the cache to the client. If those transfers are 
done purely sequentially without overlapping, i.e. without any 
readahead, the resulting throughput T can be found from the equation: 
1/T = 1/Tlocal + 1/Tremote, where Tlocal and Tremote are the throughputs 
of the local (i.e. from the storage) and remote links. In the case when 
Tlocal ~= Tremote, T ~= Tremote/2. Quite an unexpected result, right? ;) 
(See the short derivation after point 2 below.)

2. If data transfers on the local and remote links aren't coordinated, 
it is possible that only one link transfers data at any given time. From 
(1) above you can calculate that the percentage of this "idle" time is 
the percentage of lost throughput. I.e. to get the maximum throughput, 
both links should transfer data as simultaneously as possible. For our 
case, when Tlocal ~= Tremote, both links should be busy all the time. 
Moreover, it is possible that the local transfer has finished, but 
during the remote transfer the storage media rotated too far, so the 
next request will have to wait for a full rotation to finish (i.e. 
several ms of lost bandwidth).

Thus, to get the maximum possible throughput, we need to maximize the 
simultaneous load on both the local and remote links. It can be done by 
using the well-known pipelining technique. For that the client should 
read the same amount of data at once, but those reads should be split 
into smaller chunks, like 64K at a time. This approach looks like it 
goes against the "conventional wisdom" that a bigger request means 
bigger throughput, but in fact it doesn't, because the same (big) amount 
of data is read at a time. A bigger count of smaller requests puts a 
more simultaneous load on both links participating in the data 
transfers. In fact, even if the client is local, in most cases there is 
a second data transfer link: it's in the storage. This is especially 
true for RAID controllers. Guess why 3ware recommends setting 
max_sectors_kb to 64K and increasing RA in the above link? ;)

Of course, max_sectors_kb should be decreased only for smart devices, 
which allow >1 outstanding request at a time, i.e. for all modern 
SCSI/SAS/SATA/iSCSI/FC/etc. drives.

There is an objection against having too many outstanding requests at a 
time: latency. But since the overall size of all requests remains 
unchanged, this objection isn't relevant to this proposal. There is the 
same latency-related objection against increasing RA, but with many 
small individual RA requests it isn't relevant either.

We did some measurements to support this proposal. They were done only 
with the deadline scheduler to make the picture clearer, and with 
context RA. The tests were the same as before.

--- Baseline, all default:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
          a) 51,1 MB/s
          b) 51,4 MB/s
          c) 51,1 MB/s

Run at the same time:
# while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
          a) 4,7 MB/s
          b) 4,6 MB/s
          c) 4,8 MB/s

--- Client - all default, on the server max_sectors_kb set to 64K:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
     - 100 MB/s
     - 100 MB/s
     - 102 MB/s

# while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
     - 5,2 MB/s
     - 5,3 MB/s
     - 4,2 MB/s

100% and 8% improvement compared to the baseline.

 From the previous e-mail you can see that with 4096K RA

# while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 39,9 MB/s
	b) 39,5 MB/s
	c) 38,4 MB/s

I.e. there is a 760% improvement over the baseline.

Thus, I believe that for all devices supporting queue depths >1, 
max_sectors_kb should be set by default to 64K (or maybe to 128K, but 
not more), and the default RA increased to at least 1M, better 2-4M.
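
For a single exported device, applying the proposed defaults would look 
something like this (sdX is only a placeholder for the backstorage 
device; 2048 KB stands for the suggested 2M RA):

# echo 64 > /sys/block/sdX/queue/max_sectors_kb
# echo 2048 > /sys/block/sdX/queue/read_ahead_kb

i.e. requests capped at 64K with a much larger readahead window on top.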

> (Can I wish a CONFIG_PRINTK_TIME=y next time? :-)

Sure

Thanks,
Vlad


[-- Attachment #2: req_split.log.bz2 --]
[-- Type: application/x-bzip, Size: 5683 bytes --]

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-04-21 18:18                                           ` Vladislav Bolkhovitin
@ 2009-04-24  8:43                                             ` Wu Fengguang
  2009-05-12 18:13                                               ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 70+ messages in thread
From: Wu Fengguang @ 2009-04-24  8:43 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel,
	linux-nfs, lukasz.jurewicz

On Tue, Apr 21, 2009 at 10:18:25PM +0400, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>>> Here are the conclusions from tests:
>>>
>>>  1. Making all IO threads work in the same IO context with CFQ 
>>> (vanilla  RA and default RA size) brings near 100% link utilization 
>>> on single  stream reads (100MB/s) and with deadline about 50% 
>>> (50MB/s). I.e. there  is 100% improvement of CFQ over deadline. With 
>>> 2 read streams CFQ has  ever more advantage: >400% (23MB/s vs 5MB/s).
>>
>> The ideal 2-stream throughput should be >60MB/s, so I guess there are
>> still room of improvements for the CFQ's 23MB/s?
>
> Yes, plenty. But, I think, not in CFQ, but in readahead. With RA 4096K  
> we were able to get ~40MB/s, see the previous e-mail and below.

Why do you think it's in readahead? The readahead accounting/tracing data
you provided in a previous email shows context readahead working just fine:

        On initiator run benchmark:
        dd if=/dev/sdb of=/dev/null bs=64K count=8000
        dd if=/dev/sdc of=/dev/null bs=64K count=8000

        linux-4dtq:/.kernels/scst-kernel-4 # cat /sys/kernel/debug/readahead/stats
        pattern         count sync_count  eof_count       size async_size     actual
        initial0            4          4          1          4          3          3
        subsequent        288          0          0        470        251        251
        marker            722          0          0        255        255        255
        mmap                2          2          2         32          0         12
        fadvise             2          2          2          0          0         69
        all              1018          8          5        314        252        252

The last line says the server side does 1018 readaheads with average
size 1008K, which is pretty close to the max readahead size 1024K.

Raw performance numbers are not enough here. The readahead size and
the actual IO size, and possibly the readahead traces and IO traces,
are the most vivid and ultimate way to judge who's behaving wrong
and to provide a hint toward a solution.
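
If blktrace is available on the server, the IO trace part can be 
collected with something like the following (it needs debugfs mounted as 
above, and the device name is only an example):

        blktrace -d /dev/sda -o - | blkparse -i -

which prints the per-request trace events (queue, dispatch, completion) 
together with the starting sector and size, so the request sizes actually 
hitting the disks become visible.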

>> The one fact I cannot understand is that SCST seems to breaking up the
>> client side 64K reads into server side 4K reads(above readahead layer).
>> But I remember you told me that SCST don't do NFS rsize style split-ups.
>> Is this a bug? The 4K read size is too small to be CPU/network friendly...
>> Where are the split-up and re-assemble done? On the client side or
>> internal to the server?
>
> This is on the client's side. See the target's log in the attachment.  
>
> Here is the summary of commands data sizes, came to the server, for "dd  
> if=/dev/sdb of=/dev/null bs=64K count=200" ran on the client:
>
> 4K                       11
> 8K                       0
> 16K                      0
> 32K                      0
> 64K                      0
> 128K                     81
> 256K                     8
> 512K                     0
> 1024K                    0
> 2048K                    0
> 4096K                    0
>
> There's a way too many 4K requests. Apparently, the requests submission  
> path isn't optimal.

So it's the client side that splits the 128K (default sized) readahead IO
into 4-256K SCST requests that are sent over the network?

It looks good enough since it's mostly 128K requests. However this is
still not in line with previous data:

        req=0+1, ra=0+4-3, async=0) = 4
        req=1+1, ra=4+16-16, async=1) = 16
        req=4+1, ra=20+32-32, async=1) = 32
        req=20+1, ra=52+64-64, async=1) = 64
        req=52+1, ra=116+128-128, async=1) = 128
        req=116+1, ra=244+256-256, async=1) = 256
        req=244+1, ra=500+256-256, async=1) = 256
        req=500+1, ra=756+256-256, async=1) = 256
        req=756+1, ra=1012+256-256, async=1) = 256
        req=1012+1, ra=1268+256-256, async=1) = 256
        req=1268+1, ra=1268+512-256, async=1) = 256
        req=1524+1, ra=1780+256-256, async=1) = 256
        req=1780+1, ra=2036+256-256, async=1) = 256
        req=2036+1, ra=2292+256-256, async=1) = 256
        req=2292+1, ra=2548+256-256, async=1) = 256
        req=2548+1, ra=2804+256-256, async=1) = 256
        req=2804+1, ra=2804+512-256, async=1) = 256
        req=3060+1, ra=3060+512-256, async=1) = 256
        req=3316+1, ra=3572+256-256, async=1) = 256
        req=3572+1, ra=3828+256-256, async=1) = 256
        req=3828+1, ra=4084+256-256, async=1) = 256
        req=4084+1, ra=4340+256-256, async=1) = 256
        req=4340+1, ra=4596+256-256, async=1) = 256
        req=4596+1, ra=4596+512-256, async=1) = 256
        req=4852+1, ra=5108+256-256, async=1) = 256
        req=5108+1, ra=5108+512-256, async=1) = 256
        [and 480 more lines of pattern "req=*+1,...= 256")

Basically, the server side ondemand_readahead()
        - perceived almost *all* read requests as 1-page ones
        - *always* submitted 256-page readahead IO
          (except during the initial size ramp up process)

Maybe the runtime condition is somehow different for the above and the
below traces.

        [  457.003661] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.008686] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.010915] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
        [  457.015238] [4209]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.015419] [4211]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
        [  457.021696] [4208]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.024205] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
        [  457.028455] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.028695] [4210]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
        [  457.033565] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.035750] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.037323] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.041132] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.043251] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.045455] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.047949] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.051463] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.052136] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.057734] [4207]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.058007] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
        [  457.063185] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.063404] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
        [  457.068749] [4211]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
        [  457.069007] [4211]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
        [  457.071339] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.075250] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.077416] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.079892] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
        [  457.080492] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)

There is an interesting pattern in the above trace: it tends to be
32-page aligned. An unaligned 1-page read will always be followed by a
31-page or 63-page read, which again makes it aligned to 32-page
boundaries :-)

> Actually, this is another question I wanted to rise from the very  
> beginning.
>
>>>  6. Unexpected result. In case, when ll IO threads work in the same 
>>> IO  context with CFQ increasing RA size *decreases* throughput. I 
>>> think this  is, because RA requests performed as single big READ 
>>> requests, while  requests coming from remote clients are much smaller 
>>> in size (up to  256K), so, when the read by RA data transferred to 
>>> the remote client on  100MB/s speed, the backstorage media gets 
>>> rotated a bit, so the next  read request must wait the rotation 
>>> latency (~0.1ms on 7200RPM). This is  well conforms with (3) above, 
>>> when context RA has 40% advantage over  vanilla RA with default RA, 
>>> but much smaller with higher RA.
>>
>> Maybe. But the readahead IOs (as showed by the trace) are _async_ ones...
>
> That doesn't matter, because new request from the client won't come  
> until all data for the previous one transferred to it. And that transfer  
> is done on a very *finite* speed.

Async readahead does matter. The readahead pipeline is all you need to
*fully* fill the storage/network channels (with good readahead size).

The client side request time doesn't matter (i.e. it is not too late to
impact performance).

Let's look at the default case, where
        - client readahead size = 128K
        - server readahead size = 128K
        - no split of request size by SCST or max_sectors_kb
and assume
        - application read size = huge
        - application read bw = infinite
        - disk bw = network bw = limited

Imagine three 128K chunks in the file:

         chunk A                             chunk B                         chunk C
================================--------------------------------________________________________
application read blocked here   client side readahead IO        server side readahead IO
(ongoing network IO)            (ongoing network IO)            (ongoing disk IO)

Normally the application can read and consume data very fast.
So in most time, it will be blocked somewhere for network IO.
Let's assume it is blocking at the first page of chunk A.

Just before the application blocks on chunk A, it will trigger a
(client side) readahead request for chunk B, which in turn will
trigger a (server side) readahead request for chunk C.

When disk bw = network bw, it _will_ quickly enter a steady state, in
which the network transfer of B and disk read of C proceed concurrently.

The below tables demonstrate the steps into this pipelined steady state.
(we ignore disk seek time for simplicity)

1) no request merging in SCST and block layer:
        time    0       3       5       6       7
        client  ab      ab      ABcd    ABCde   ABCDef
        server  abc     ABC     ABCde   ABCDef  ABCDEfg
        net transfers   ...     AB      C       D
        disk transfers  ABC     ..      D       E

2) disk merge, no SCST merge:
        time    0       3       5       6       7       8       9
        client  ab      ab      ABcd    ABCde   ABCde   ABCDef  ABCDEfg
        server  abc     ABC     ABCde   ABCdef  ABCDEf  ABCDEFg ABCDEFGh
        net transfers   ...     AB      C       .       D       E
        disk transfers  ABC     ..              DE      F       G

3) application (limited) read speed = 2 * disk bw:
        time    0       3       5       5.5     6       6.5     7
        client  ab      ab      AB      ABc     ABcd    ABCd    ABCde
        server  abc     ABC     ABC     ABCd    ABCDe   ABCDe   ABCDEf
        net transfers   ...     AB      .               C       D
        disk transfers  ABC     ..      .               D       E

Legends:
- lower case 'a': request for chunk A submitted, but IO has not completed
- upper case 'A': request for chunk A submitted, and IO has completed
- dot '.': net/disk idled for one time slot

Annotations for case (2):
T0: single network IO request for chunk A&B, or AB in short;
    single disk IO request for chunk ABC.
T3: disk IO for ABC completes; network idle
T5: disk idle; network IO for AB completes;
    application proceeds quickly to B and then C,
    which triggers two network IO requests, one for C and another for D;
    which triggers one disk IO request for DE(requests for them come
    close and hence get merged into one).
T6: disk busy(half way in DE); network completes request C;
    which quickly triggers network request for E and disk request for F.
T7: disk completes DE; network idle before that.
T8: disk completes F; network completes D;
    which in turn lead to network request for F and disk request for G.
    ==> pipeline established!

Cases (2)/(3) are the more realistic ones, where two requests that come
close in time will be merged in the block layer, but not on the SCST
client side.

For all three cases, we start with a single large request for several
chunks, and quickly converge to a steady state: a pipeline of
single-chunk disk/network requests.
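
A rough back-of-the-envelope illustration of the same point, assuming
disk bw = network bw and one time unit per chunk per link (pure
arithmetic, not numbers taken from the traces above):

        # serialized: each chunk is read from disk, then sent over the network
        # pipelined:  disk read of chunk N+1 overlaps the network transfer of chunk N
        $ chunks=100
        $ echo "serialized: $((chunks * 2)) units, pipelined: $((1 + chunks)) units"
        serialized: 200 units, pipelined: 101 units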

>>> Bottom line IMHO conclusions:
>>>
>>> 1. Context RA should be considered after additional examination to   
>>> replace current RA algorithm in the kernel
>>
>> That's my plan to push context RA to mainline. And thank you very much
>> for providing and testing out a real world application for it!
>
> You're welcome!

Good news: context readahead is now queued in the -mm tree :-)

>>> 2. It would be better to increase default RA size to 1024K
>>
>> That's a long wish to increase the default RA size. However I have a
>> vague feeling that it would be better to first make the lower layers
>> more smart on max_sectors_kb granularity request splitting and batching.
>
> Can you elaborate more on that, please?

Not exactly the above words. But the basic idea is to reduce latency
on sync IO:
- the readahead algorithm classifies its IO requests into sync/async ones;
- use the SATA/SCSI priority bit for sync/async requests
Each of them could introduce some problems though...

>>> *AND* one of the following:
>>>
>>> 3.1. All RA requests should be split in smaller requests with size up 
>>> to  256K, which should not be merged with any other request
>>
>> Are you referring to max_sectors_kb?
>
> Yes
>
>> What's your max_sectors_kb and nr_requests? Something like
>>
>>         grep -r . /sys/block/sda/queue/
>
> Default: 512 and 128 correspondingly.

OK.

>>> OR
>>>
>>> 3.2. New RA requests should be sent before the previous one completed 
>>> to  don't let the storage device rotate too far to need full rotation 
>>> to  serve the next request.
>>
>> Linus has a mmap readahead cleanup patch that can do this. It
>> basically replaces a {find_lock_page(); readahead();} sequence into
>> {find_get_page(); readahead(); lock_page();}.
>>
>> I'll try to push that patch into mainline.
>
> Good!

Good news 2: mmap readahead is also sitting in the -mm tree :-)

>>> I like suggestion 3.1 a lot more, since it should be simple to 
>>> implement  and has the following 2 positive side effects:
>>>
>>> 1. It would allow to minimize negative effect of higher RA size on 
>>> the  I/O delay latency by allowing CFQ to switch to too long waiting  
>>> requests, when necessary.
>>>
>>> 2. It would allow better requests pipelining, which is very important 
>>> to  minimize uplink latency for synchronous requests (i.e. with only 
>>> one IO  request at time, next request issued, when the previous one 
>>> completed).  You can see in 
>>> http://www.3ware.com/kb/article.aspx?id=11050 that 3ware  recommends 
>>> for maximum performance set max_sectors_kb as low as *64K*  with 16MB 
>>> RA. It allows to maximize serving commands pipelining. And  this 
>>> suggestion really works allowing to improve throughput in 50-100%!
>
> Seems I should elaborate more on this. Case, when client is remote has a  
> fundamental difference from the case, when client is local, for which  
> Linux currently optimized. When client is local data delivered to it  
> from the page cache with a virtually infinite speed. But when client is  
> remote data delivered to it from the server's cache on a *finite* speed.  
> In our case this speed is about the same as speed of reading data to the  
> cache from the storage. It has the following consequences:
>
> 1. Data for any READ request at first transferred from the storage to  
> the cache, then from the cache to the client. If those transfers are  
> done purely sequentially without overlapping, i.e. without any  
> readahead, resulting throughput T can be found from equation: 1/T =  
> 1/Tlocal + 1/Tremote, where Tlocal and Tremote are throughputs of the  
> local (i.e. from the storage) and remote links. In case, when Tlocal ~=  
> Tremote, T ~= Tremote/2. Quite unexpected result, right? ;)

Fortunately I can quickly absorb your idea ;-) I can recall vividly
that, when downloading files on a pretty old kernel, the network IO and
writeback worked by turns instead of in a pipelined way. Hmm, it's an
interesting behavior that deserves more research :-)

> 2. If data transfers on the local and remote links aren't coordinated,  
> it is possible that only one link transfers data at some time. From the  
> (1) above you can calculate that % of this "idle" time is % of the lost  
> throughput. I.e. to get the maximum throughput both links should  
> transfer data as simultaneous as possible. For our case, when Tlocal ~=  
> Tremote, both links should be all the time busy. Moreover, it is  
> possible that the local transfer finished, but during the remote  
> transfer the storage media rotated too far, so for the next request it  
> will be needed to wait the full rotation to finish (i.e. several ms of  
> lost bandwidth).

This is merely a possible state. Can you prove that it is in fact a
*stable* one?

> Thus, to get the maximum possible throughput, we need to maximize  
> simultaneous load of both local and remote links. It can be done by  
> using well known pipelining technique. For that client should read the  
> same amount of data at once, but those read should be split on smaller  
> chunks, like 64K at time. This approach looks being against the  
> "conventional wisdom", saying that bigger request means bigger  
> throughput, but, in fact, it doesn't, because the same (big) amount of  
> data are read at time. Bigger count of smaller requests will make more  
> simultaneous load on both participating in the data transfers links. In  
> fact, even if client is local, in most cases there is a second data  
> transfer link. It's in the storage. This is especially true for RAID  
> controllers. Guess, why 3ware recommends to set max_sectors_kb to 64K  
> and increase RA in the above link? ;)

Sure 64K is good for 3ware - hw raid stripe sizes are typically small.
I wonder if there is a 'stripe size' concept for multi-channel SSDs,
and their typical values :-)

> Of course, max_sectors_kb should be decreased only for smart devices,  
> which allow >1 outstanding requests at time, i.e. for all modern  
> SCSI/SAS/SATA/iSCSI/FC/etc. drives.

But I guess the gain of a tiny max_sectors_kb is more marginal for the
single-spindle case. Just a guess.

> There is an objection against having too many outstanding requests at  
> time. This is latency. But, since overall size of all requests remains  
> unchanged, this objection isn't relevant in this proposal. There is the  
> same latency-related objection against increasing RA. But many small  
> individual RA requests it isn't relevant as well.
>
> We did some measurements to support the this proposal. They were done  
> only with deadline scheduler to make the picture clearer. They were done  
> with context RA. The tests were the same as before.
>
> --- Baseline, all default:
>
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>          a) 51,1 MB/s
>          b) 51,4 MB/s
>          c) 51,1 MB/s
>
> Run at the same time:
> # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>          a) 4,7 MB/s
>          b) 4,6 MB/s
>          c) 4,8 MB/s
>
> --- Client - all default, on the server max_sectors_kb set to 64K:
>
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>     - 100 MB/s
>     - 100 MB/s
>     - 102 MB/s
>
> # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>     - 5,2 MB/s
>     - 5,3 MB/s
>     - 4,2 MB/s
>
> 100% and 8% improvement comparing to the baseline.

They are definitely encouraging numbers. The 64K max_sectors_kb is
obviously doing good here. However.. How do you know it's essentially
a pipeline issue?  What exactly happened behind the scenes?

> From the previous e-mail you can see that with 4096K RA
>
> # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
> 	a) 39,9 MB/s
> 	b) 39,5 MB/s
> 	c) 38,4 MB/s
>
> I.e. there is 760% improvement over the baseline.

Which baseline? You are comparing 128K/4MB RA sizes, under the default
max_sectors_kb?

> Thus, I believe, that for all devices, supporting queue depths >1,  
> max_sectors_kb should be set by default to 64K (or to 128K, maybe, but  
> not more), and default RA increased to at least 1M, better 2-4M.
>
>> (Can I wish a CONFIG_PRINTK_TIME=y next time? :-)
>
> Sure

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: Slow file transfer speeds with CFQ IO scheduler in some cases
  2009-04-24  8:43                                             ` Wu Fengguang
@ 2009-05-12 18:13                                               ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 70+ messages in thread
From: Vladislav Bolkhovitin @ 2009-05-12 18:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jens Axboe, Jeff Moyer, Vitaly V. Bursov, linux-kernel,
	linux-nfs, lukasz.jurewicz

Wu Fengguang, on 04/24/2009 12:43 PM wrote:
> On Tue, Apr 21, 2009 at 10:18:25PM +0400, Vladislav Bolkhovitin wrote:
>> Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>>>> Here are the conclusions from tests:
>>>>
>>>>  1. Making all IO threads work in the same IO context with CFQ 
>>>> (vanilla  RA and default RA size) brings near 100% link utilization 
>>>> on single  stream reads (100MB/s) and with deadline about 50% 
>>>> (50MB/s). I.e. there  is 100% improvement of CFQ over deadline. With 
>>>> 2 read streams CFQ has  ever more advantage: >400% (23MB/s vs 5MB/s).
>>> The ideal 2-stream throughput should be >60MB/s, so I guess there are
>>> still room of improvements for the CFQ's 23MB/s?
>> Yes, plenty. But, I think, not in CFQ, but in readahead. With RA 4096K  
>> we were able to get ~40MB/s, see the previous e-mail and below.
> 
> Why do you think it's in readahead?

See the previous results with the deadline scheduler. Deadline was
chosen because it doesn't influence the stream of commands with
additional delays, as CFQ does.

Default RA size:

Run at the same time:
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
          a) 4,7 MB/s
          b) 4,6 MB/s
          c) 4,8 MB/s

RA 4096K:

Run at the same time:		
linux-4dtq:~ # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
linux-4dtq:~ # dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 39,5 MB/s
	b) 39,6 MB/s
	c) 37,0 MB/s

 >700% improvement is quite impressive, isn't it?

> The readahead account/tracing data
> you provided in a previous email shows context readahead to work just fine:
> 
>         On initiator run benchmark:
>         dd if=/dev/sdb of=/dev/null bs=64K count=8000
>         dd if=/dev/sdc of=/dev/null bs=64K count=8000
> 
>         linux-4dtq:/.kernels/scst-kernel-4 # cat /sys/kernel/debug/readahead/stats
>         pattern         count sync_count  eof_count       size async_size     actual
>         initial0            4          4          1          4          3          3
>         subsequent        288          0          0        470        251        251
>         marker            722          0          0        255        255        255
>         mmap                2          2          2         32          0         12
>         fadvise             2          2          2          0          0         69
>         all              1018          8          5        314        252        252
> 
> The last line says the server side does 1018 readaheads with average
> size 1008K, which is pretty close to the max readahead size 1024K.
> 
> Raw performance numbers are not enough here. The readahead size and
> the actual IO size, and possibly the readahead traces and IO traces
> are the most vivid and ultimate way to judge who's behaving wrong
> and provide the hint to a solution.

Well, RA can work well, but its _size_ can still be far from good. The
current default of 128K is simply much too small. This is why I propose
to increase it to at least 2M (better 4M) and to decrease max_sectors_kb
for tagged queuing devices to at most 128K (better 64K) to minimize the
possible latency impact.

BTW, Linus recently also found out that "Doing a "echo 64 > 
max_sectors_kb" does seem to make my experience nicer. At no really 
noticeable downside in throughput that I can see" 
(http://lwn.net/Articles/328381/)
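
(For completeness, both knobs can be changed at runtime per device; sdb
is just an example device name, and the values are the ones proposed
above:)

# echo 4096 > /sys/block/sdb/queue/read_ahead_kb    # RA = 4M
# echo 64 > /sys/block/sdb/queue/max_sectors_kb     # max request size = 64K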

>>> The one fact I cannot understand is that SCST seems to breaking up the
>>> client side 64K reads into server side 4K reads(above readahead layer).
>>> But I remember you told me that SCST don't do NFS rsize style split-ups.
>>> Is this a bug? The 4K read size is too small to be CPU/network friendly...
>>> Where are the split-up and re-assemble done? On the client side or
>>> internal to the server?
>> This is on the client's side. See the target's log in the attachment.  
>>
>> Here is the summary of commands data sizes, came to the server, for "dd  
>> if=/dev/sdb of=/dev/null bs=64K count=200" ran on the client:
>>
>> 4K                       11
>> 8K                       0
>> 16K                      0
>> 32K                      0
>> 64K                      0
>> 128K                     81
>> 256K                     8
>> 512K                     0
>> 1024K                    0
>> 2048K                    0
>> 4096K                    0
>>
>> There's a way too many 4K requests. Apparently, the requests submission  
>> path isn't optimal.
> 
> So it's the client side that splits 128K (default sized) readahead IO
> into 4-256K SCST requests that are sent over the network?
> 
> It looks good enough since it's mostly 128K requests. However this is
> still not in line with previous data:
> 
>         req=0+1, ra=0+4-3, async=0) = 4
>         req=1+1, ra=4+16-16, async=1) = 16
>         req=4+1, ra=20+32-32, async=1) = 32
>         req=20+1, ra=52+64-64, async=1) = 64
>         req=52+1, ra=116+128-128, async=1) = 128
>         req=116+1, ra=244+256-256, async=1) = 256
>         req=244+1, ra=500+256-256, async=1) = 256
>         req=500+1, ra=756+256-256, async=1) = 256
>         req=756+1, ra=1012+256-256, async=1) = 256
>         req=1012+1, ra=1268+256-256, async=1) = 256
>         req=1268+1, ra=1268+512-256, async=1) = 256
>         req=1524+1, ra=1780+256-256, async=1) = 256
>         req=1780+1, ra=2036+256-256, async=1) = 256
>         req=2036+1, ra=2292+256-256, async=1) = 256
>         req=2292+1, ra=2548+256-256, async=1) = 256
>         req=2548+1, ra=2804+256-256, async=1) = 256
>         req=2804+1, ra=2804+512-256, async=1) = 256
>         req=3060+1, ra=3060+512-256, async=1) = 256
>         req=3316+1, ra=3572+256-256, async=1) = 256
>         req=3572+1, ra=3828+256-256, async=1) = 256
>         req=3828+1, ra=4084+256-256, async=1) = 256
>         req=4084+1, ra=4340+256-256, async=1) = 256
>         req=4340+1, ra=4596+256-256, async=1) = 256
>         req=4596+1, ra=4596+512-256, async=1) = 256
>         req=4852+1, ra=5108+256-256, async=1) = 256
>         req=5108+1, ra=5108+512-256, async=1) = 256
>         [and 480 more lines of pattern "req=*+1,...= 256")
> 
> Basically, the server side ondemand_readahead()
>         - perceived almost *all* read requests as 1-page ones
>         - *always* submitted 256-page readahead IO
>           (except during the initial size ramp up process)
> 
> Maybe the runtime condition is somehow different for the above and the
> below traces.

I took the above numbers on my local system with a 15K RPM parallel SCSI
drive. Also, I forgot to note that, for instance, 128K means "requests 
with sizes between 64K and 128K".

>         [  457.003661] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.008686] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.010915] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
>         [  457.015238] [4209]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.015419] [4211]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
>         [  457.021696] [4208]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.024205] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
>         [  457.028455] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.028695] [4210]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
>         [  457.033565] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.035750] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.037323] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.041132] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.043251] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.045455] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.047949] [4211]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.051463] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.052136] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.057734] [4207]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.058007] [4207]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
>         [  457.063185] [4210]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.063404] [4210]: vdisk_exec_read:2198:reading(iv_count 63, full_len 258048)
>         [  457.068749] [4211]: vdisk_exec_read:2198:reading(iv_count 1, full_len 4096)
>         [  457.069007] [4211]: vdisk_exec_read:2198:reading(iv_count 31, full_len 126976)
>         [  457.071339] [4208]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.075250] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.077416] [4207]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.079892] [4209]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
>         [  457.080492] [4210]: vdisk_exec_read:2198:reading(iv_count 32, full_len 131072)
> 
> There is an interesting pattern in the above trace: it tends to be
> 32-page aligned. An unaligned 1-page read will always be followed by a
> 31-page or 63-page read, which again makes it aligned to 32-page
> boundaries :-)

Frankly, I don't fully understand why it is so. I only see that Linux
generates requests with widely spread sizes, from 4K up to double the RA
size. I guess this is because of the interaction between how requests
are submitted and merged and the queue plugging/unplugging.

If you need me to provide any data/logs to investigate this, just let
me know. I can give you a full list of the SCSI commands coming from the
client, for instance.

Actually, you can download SCST and iSCSI-SCST from 
http://scst.sourceforge.net/downloads.html and try yourself. Setup is 
quite simple, see http://scst.sourceforge.net/iscsi-scst-howto.txt.

To get the list of incoming commands in the kernel log, run the
following after SCST is loaded:

# echo "add scsi" >/proc/scsi_tgt/trace_level

The distribution of request sizes can be found in /proc/scsi_tgt/sgv,
and the count of outstanding commands at any time in
/proc/scsi_tgt/sessions.
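
For example (paths as above; the exact output format depends on the SCST
version):

# cat /proc/scsi_tgt/sgv        # request size distribution
# cat /proc/scsi_tgt/sessions   # outstanding commands per session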

>> Actually, this is another question I wanted to rise from the very  
>> beginning.
>>
>>>>  6. Unexpected result. In case, when ll IO threads work in the same 
>>>> IO  context with CFQ increasing RA size *decreases* throughput. I 
>>>> think this  is, because RA requests performed as single big READ 
>>>> requests, while  requests coming from remote clients are much smaller 
>>>> in size (up to  256K), so, when the read by RA data transferred to 
>>>> the remote client on  100MB/s speed, the backstorage media gets 
>>>> rotated a bit, so the next  read request must wait the rotation 
>>>> latency (~0.1ms on 7200RPM). This is  well conforms with (3) above, 
>>>> when context RA has 40% advantage over  vanilla RA with default RA, 
>>>> but much smaller with higher RA.
>>> Maybe. But the readahead IOs (as showed by the trace) are _async_ ones...
>> That doesn't matter, because new request from the client won't come  
>> until all data for the previous one transferred to it. And that transfer  
>> is done on a very *finite* speed.
> 
> Async readahead does matter. The readahead pipeline is all you need to
> *fully* fill the storage/network channels (with good readahead size).
> 
> The client side request time doesn't matter (i.e. it is not too late to
> impact performance).
> 
> Let's look at the default case, where
>         - client readahead size = 128K
>         - server readahead size = 128K
>         - no split of request size by SCST or max_sectors_kb
> and assume
>         - application read size = huge
>         - application read bw = infinite
>         - disk bw = network bw = limited
> 
> Imagine three 128K chunks in the file:
> 
>          chunk A                             chunk B                         chunk C
> ================================--------------------------------________________________________
> application read blocked here   client side readahead IO        server side readahead IO
> (ongoing network IO)            (ongoing network IO)            (ongoing disk IO)
> 
> Normally the application can read and consume data very fast.
> So in most time, it will be blocked somewhere for network IO.
> Let's assume it is blocking at the first page of chunk A.
> 
> Just before the application blocks on chunk A, it will trigger a
> (client side) readahead request for chunk B, which in turn will
> trigger a (server side) readahead request for chunk C.
> 
> When disk bw = network bw, it _will_ quickly enter a steady state, in
> which the network transfer of B and disk read of C proceed concurrently.
> 
> The below tables demonstrate the steps into this pipelined steady state.
> (we ignore disk seek time for simplicity)
> 
> 1) no request merging in SCST and block layer:
>         time    0       3       5       6       7
>         client  ab      ab      ABcd    ABCde   ABCDef
>         server  abc     ABC     ABCde   ABCDef  ABCDEfg
>         net transfers   ...     AB      C       D
>         disk transfers  ABC     ..      D       E
> 
> 2) disk merge, no SCST merge:
>         time    0       3       5       6       7       8       9
>         client  ab      ab      ABcd    ABCde   ABCde   ABCDef  ABCDEfg
>         server  abc     ABC     ABCde   ABCdef  ABCDEf  ABCDEFg ABCDEFGh
>         net transfers   ...     AB      C       .       D       E
>         disk transfers  ABC     ..              DE      F       G
> 
> 3) application (limited) read speed = 2 * disk bw:
>         time    0       3       5       5.5     6       6.5     7
>         client  ab      ab      AB      ABc     ABcd    ABCd    ABCde
>         server  abc     ABC     ABC     ABCd    ABCDe   ABCDe   ABCDEf
>         net transfers   ...     AB      .               C       D
>         disk transfers  ABC     ..      .               D       E
> 
> Legends:
> - lower case 'a': request for chunk A submitted, but IO has not completed
> - upper case 'A': request for chunk A submitted, and IO has completed
> - dot '.': net/disk idled for one time slot
> 
> Annotations for case (2):
> T0: single network IO request for chunk A&B, or AB in short;
>     single disk IO request for chunk ABC.
> T3: disk IO for ABC completes; network idle
> T5: disk idle; network IO for AB completes;
>     application proceeds quickly to B and then C,
>     which triggers two network IO requests, one for C and another for D;
>     which triggers one disk IO request for DE(requests for them come
>     close and hence get merged into one).
> T6: disk busy(half way in DE); network completes request C;
>     which quickly triggers network request for E and disk request for F.
> T7: disk completes DE; network idle before that.
> T8: disk completes F; network completes D;
>     which in turn lead to network request for F and disk request for G.
>     ==> pipeline established!
> 
> Cases (2)/(3) are the more realistic ones, where two requests that come
> close in time will be merged in the block layer, but not on the SCST
> client side.
> 
> For all three cases, we start with a single large request for several
> chunks, and quickly converge to a steady state: a pipeline of
> single-chunk disk/network requests.

I have two things to say here:

1. Why do you think B and C can't be merged by the block layer on the
server? The default max_sectors_kb of 512K perfectly allows that. In
practice I see it happen all the time:

1.1. Distribution of request sizes after a number of "dd if=/dev/sdX
of=/dev/null bs=512K count=2000" calls with the deadline IO scheduler on
both sides:

4K                       216
8K                       60
16K                      77
32K                      138
64K                      200
128K                     11276
256K                     18474
 >=512K                   0

Look how many 256K requests there are: the majority.

1.2. I can see how many requests are sent by the client at a time. In
most cases this number is 1, and only sometimes 2.

1.3. I made some measurements on my local single-drive system, capable
of 130MB/s reads, with the deadline IO scheduler on both sides and a
*single IO thread* on the server:

# dd if=/dev/sdc of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 16.6582 seconds, 62.9 MB/s

This is about 60% of the maximum possible bandwidth. So, definitely,
pipelining rather doesn't work here.

But with max_sectors_kb set to 32K on the client, the situation gets much better:

# dd if=/dev/sdc of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 10.602 seconds, 98.9 MB/s

Request size distribution:

4K                       7570
8K                       755
16K                      1334
32K                      31478
 >=64K                    0

At any time there are up to 8 outstanding requests, 5 on average.

Here pipelining definitely does work. 57% improvement.
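
(Without SCST's /proc statistics roughly the same picture can be watched
on the server with plain iostat, which is generic tooling, e.g.:)

# iostat -x sdc 1    # avgrq-sz is in 512-byte sectors, avgqu-sz ~ outstanding requests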

2. In practice, there are many cases when there are more than 2 "pipes"
with limited bandwidth. The simplest example is when a data digest is
used with iSCSI. Then there is a data digest calculation (CRC32C) "pipe"
on both the client and the server, so for maximum throughput the
pipelining depth should be at least 4. See the measurements (the setup
is the same as above):

Default:

# dd if=/dev/sde of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 21.9438 seconds, 47.8 MB/s

max_sectors_kb 32K:

# dd if=/dev/sde of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 15.4442 seconds, 67.9 MB/s

48% improvement.

NULLIO (i.e. without disk access):

Default:

# dd if=/dev/sdh of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 16.0533 seconds, 65.3 MB/s

max_sectors_kb 32K:

# dd if=/dev/sdh of=/dev/null bs=512K count=2000
2000+0 records in
2000+0 records out
1048576000 bytes (1.0 GB) copied, 12.6053 seconds, 83.2 MB/s

28% improvement.

Impressive difference, isn't it?

Another example where deep pipelining is needed is hardware RAID cards, 
like 3ware (see below).

>>>> Bottom line IMHO conclusions:
>>>>
>>>> 1. Context RA should be considered after additional examination to   
>>>> replace current RA algorithm in the kernel
>>> That's my plan to push context RA to mainline. And thank you very much
>>> for providing and testing out a real world application for it!
>> You're welcome!
> 
> Good news: context readahead is now queued in the -mm tree :-)

Good!

>>>> 2. It would be better to increase default RA size to 1024K
>>> That's a long wish to increase the default RA size. However I have a
>>> vague feeling that it would be better to first make the lower layers
>>> more smart on max_sectors_kb granularity request splitting and batching.
>> Can you elaborate more on that, please?
> 
> Not exactly the above words. But the basic idea is to reduce latency
> on sync IO:
> - the readahead algorithm classifies its IO requests into sync/async ones;
> - use the SATA/SCSI priority bit for sync/async requests

Hmm, I can't recall SCSI having any priority bit. And if it had one,
what would it be used for?

>> 1. Data for any READ request at first transferred from the storage to  
>> the cache, then from the cache to the client. If those transfers are  
>> done purely sequentially without overlapping, i.e. without any  
>> readahead, resulting throughput T can be found from equation: 1/T =  
>> 1/Tlocal + 1/Tremote, where Tlocal and Tremote are throughputs of the  
>> local (i.e. from the storage) and remote links. In case, when Tlocal ~=  
>> Tremote, T ~= Tremote/2. Quite unexpected result, right? ;)
> 
> Fortunately I can quickly absorb your idea ;-) I can recall vividly
> that, when downloading files on a pretty old kernel, the network IO and
> writeback worked by turns instead of in a pipelined way. Hmm, it's an
> interesting behavior that deserves more research :-)

Yes, definitely. I feel that, if I wanted, I could get a PhD or two in
this area :-)

>> 2. If data transfers on the local and remote links aren't coordinated,  
>> it is possible that only one link transfers data at some time. From the  
>> (1) above you can calculate that % of this "idle" time is % of the lost  
>> throughput. I.e. to get the maximum throughput both links should  
>> transfer data as simultaneous as possible. For our case, when Tlocal ~=  
>> Tremote, both links should be all the time busy. Moreover, it is  
>> possible that the local transfer finished, but during the remote  
>> transfer the storage media rotated too far, so for the next request it  
>> will be needed to wait the full rotation to finish (i.e. several ms of  
>> lost bandwidth).
> 
> This is merely a possible state. Can you prove that it is in fact a
> *stable* one?

Yes, this is a rarely possible state, because all modern HDDs have at
least a 2MB RA buffer. But it is still possible, especially with
old/cheap HDDs with a small built-in buffer.
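
(As a side note, for ATA/SATA drives the size of the built-in buffer is
reported by standard hdparm, e.g.:)

# hdparm -I /dev/sdb | grep -i buffer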

>> Thus, to get the maximum possible throughput, we need to maximize  
>> simultaneous load of both local and remote links. It can be done by  
>> using well known pipelining technique. For that client should read the  
>> same amount of data at once, but those read should be split on smaller  
>> chunks, like 64K at time. This approach looks being against the  
>> "conventional wisdom", saying that bigger request means bigger  
>> throughput, but, in fact, it doesn't, because the same (big) amount of  
>> data are read at time. Bigger count of smaller requests will make more  
>> simultaneous load on both participating in the data transfers links. In  
>> fact, even if client is local, in most cases there is a second data  
>> transfer link. It's in the storage. This is especially true for RAID  
>> controllers. Guess, why 3ware recommends to set max_sectors_kb to 64K  
>> and increase RA in the above link? ;)
> 
> Sure 64K is good for 3ware - hw raid stripe sizes are typically small.
> I wonder if there is a 'stripe size' concept for multi-channel SSDs,
> and their typical values :-)

No, I don't think it's related to the stripe sizes in any way. I
believe it's rather related to the fact that those cards are physical
RAIDs, i.e. they have a built-in CPU and big internal memory (hundreds
of MBs), hence additional internal data transfer links, and hence need
deeper pipelining to work at full performance.

>> There is an objection against having too many outstanding requests at  
>> time. This is latency. But, since overall size of all requests remains  
>> unchanged, this objection isn't relevant in this proposal. There is the  
>> same latency-related objection against increasing RA. But many small  
>> individual RA requests it isn't relevant as well.
>>
>> We did some measurements to support the this proposal. They were done  
>> only with deadline scheduler to make the picture clearer. They were done  
>> with context RA. The tests were the same as before.
>>
>> --- Baseline, all default:
>>
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>>          a) 51,1 MB/s
>>          b) 51,4 MB/s
>>          c) 51,1 MB/s
>>
>> Run at the same time:
>> # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>>          a) 4,7 MB/s
>>          b) 4,6 MB/s
>>          c) 4,8 MB/s
>>
>> --- Client - all default, on the server max_sectors_kb set to 64K:
>>
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>>     - 100 MB/s
>>     - 100 MB/s
>>     - 102 MB/s
>>
>> # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>>     - 5,2 MB/s
>>     - 5,3 MB/s
>>     - 4,2 MB/s
>>
>> 100% and 8% improvement comparing to the baseline.
> 
> They are definitely encouraging numbers. The 64K max_sectors_kb is
> obviously doing good here. However.. How do you know it's essentially
> a pipeline issue?

Yes, see above.

>> From the previous e-mail you can see that with 4096K RA
>>
>> # while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
>> # dd if=/dev/sdb of=/dev/null bs=64K count=80000
>> 	a) 39,9 MB/s
>> 	b) 39,5 MB/s
>> 	c) 38,4 MB/s
>>
>> I.e. there is 760% improvement over the baseline.
> 
> Which baseline?

Search "baseline" word in one of my previous e-mails (one with the 
measurement results). For your convenience it is:

Default RA size with deadline:

Run at the same time:
#while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
#dd if=/dev/sdb of=/dev/null bs=64K count=80000
          a) 4,7 MB/s
          b) 4,6 MB/s
          c) 4,8 MB/s


> You are comparing 128K/4MB RA sizes, under the default
> max_sectors_kb?

Yes.

Vlad

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2009-05-12 18:14 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-11-09 18:04 Slow file transfer speeds with CFQ IO scheduler in some cases Vitaly V. Bursov
2008-11-09 18:30 ` Alexey Dobriyan
2008-11-09 18:32   ` Vitaly V. Bursov
2008-11-10 10:44 ` Jens Axboe
2008-11-10 13:51   ` Jeff Moyer
2008-11-10 13:56     ` Jens Axboe
2008-11-10 17:16       ` Vitaly V. Bursov
2008-11-10 17:35         ` Jens Axboe
2008-11-10 18:27           ` Vitaly V. Bursov
2008-11-10 18:29             ` Jens Axboe
2008-11-10 18:39               ` Jeff Moyer
2008-11-10 18:42               ` Jens Axboe
2008-11-10 21:51             ` Jeff Moyer
2008-11-11  9:34               ` Jens Axboe
2008-11-11  9:35                 ` Jens Axboe
2008-11-11 11:52                   ` Jens Axboe
2008-11-11 16:48                     ` Jeff Moyer
2008-11-11 18:08                       ` Jens Axboe
2008-11-11 16:53                     ` Vitaly V. Bursov
2008-11-11 18:06                       ` Jens Axboe
2008-11-11 19:36                         ` Jeff Moyer
2008-11-11 21:41                           ` Jeff Layton
2008-11-11 21:59                             ` Jeff Layton
2008-11-12 12:20                               ` Jens Axboe
2008-11-12 12:45                                 ` Jeff Layton
2008-11-12 12:54                                   ` Christoph Hellwig
2008-11-11 19:42                         ` Vitaly V. Bursov
2008-11-12 18:32       ` Jeff Moyer
2008-11-12 19:02         ` Jens Axboe
2008-11-13  8:51           ` Wu Fengguang
2008-11-13  8:54             ` Jens Axboe
2008-11-14  1:36               ` Wu Fengguang
2008-11-25 11:02                 ` Vladislav Bolkhovitin
2008-11-25 11:25                   ` Wu Fengguang
2008-11-25 15:21                   ` Jeff Moyer
2008-11-25 16:17                     ` Vladislav Bolkhovitin
2008-11-13 18:46             ` Vitaly V. Bursov
2008-11-25 10:59             ` Vladislav Bolkhovitin
2008-11-25 11:30               ` Wu Fengguang
2008-11-25 11:41                 ` Vladislav Bolkhovitin
2008-11-25 11:49                   ` Wu Fengguang
2008-11-25 12:03                     ` Vladislav Bolkhovitin
2008-11-25 12:09                       ` Vladislav Bolkhovitin
2008-11-25 12:15                         ` Wu Fengguang
2008-11-27 17:46                           ` Vladislav Bolkhovitin
2008-11-28  0:48                             ` Wu Fengguang
2009-02-12 18:35                               ` Vladislav Bolkhovitin
2009-02-13  1:57                                 ` Wu Fengguang
2009-02-13 20:08                                   ` Vladislav Bolkhovitin
2009-02-16  2:34                                     ` Wu Fengguang
2009-02-17 19:03                                       ` Vladislav Bolkhovitin
2009-02-18 18:14                                         ` Vladislav Bolkhovitin
2009-02-19  1:35                                         ` Wu Fengguang
2009-02-17 19:01                                   ` Vladislav Bolkhovitin
2009-02-19  2:05                                     ` Wu Fengguang
2009-03-19 17:44                                       ` Vladislav Bolkhovitin
2009-03-20  8:53                                         ` Vladislav Bolkhovitin
2009-03-23  1:42                                         ` Wu Fengguang
2009-04-21 18:18                                           ` Vladislav Bolkhovitin
2009-04-24  8:43                                             ` Wu Fengguang
2009-05-12 18:13                                               ` Vladislav Bolkhovitin
2009-02-17 19:01                                 ` Vladislav Bolkhovitin
2009-02-19  1:38                                   ` Wu Fengguang
2008-11-24 15:33           ` Jeff Moyer
2008-11-24 18:13             ` Jens Axboe
2008-11-24 18:50               ` Jeff Moyer
2008-11-24 18:51                 ` Jens Axboe
2008-11-13  6:54         ` Vitaly V. Bursov
2008-11-13 14:32           ` Jeff Moyer
2008-11-13 18:33             ` Vitaly V. Bursov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).