From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: akpm@linux-foundation.org, Ingo Molnar <mingo@elte.hu>,
    linux-kernel@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>,
    Rusty Russell <rusty@rustcorp.com.au>
Subject: [patch 6/7] Immediate Values - Documentation
Date: Sat, 02 Feb 2008 16:08:34 -0500
Message-ID: <20080202211207.428937604@polymtl.ca>
In-Reply-To: 20080202210828.840735763@polymtl.ca

[-- Attachment #1: immediate-values-documentation.patch --]
[-- Type: text/plain, Size: 8867 bytes --]

Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
---
 Documentation/immediate.txt | 221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt 2008-02-01 07:42:01.000000000 -0500
@@ -0,0 +1,221 @@
+                Using the Immediate Values
+
+                    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+Immediate values are used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure specializes in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(imv_read(var))) as the condition surrounding the code.
+The smallest data type required for the test (an 8-bit char) is preferred, since
+some architectures, such as powerpc, only allow up to 16-bit immediate values.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+Then, in the body of a function:
+
+Use imv_set(this_immediate, value) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to give the immediate an initial static value by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
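A minimal sketch of the usage described in the "Usage" section, written as a
userspace C fragment so it can be compiled outside the kernel. The macro
definitions are stand-ins assumed for illustration: they model only the
global-variable fallback that the text attributes to _imv_read(), not the
instruction-stream patching, and the two-argument form of imv_set() is an
assumption. trace_enabled, events_traced and handle_packet() are hypothetical
names; unlikely() is defined here with the GCC/Clang __builtin_expect builtin.

```c
/*
 * Userspace stand-ins for the macros named in the text. These are
 * assumptions for illustration: they model only a plain global-variable
 * fallback, not patching of the instruction stream.
 */
#define DEFINE_IMV(type, name)  type name##__imv
#define _imv_read(name)         (name##__imv)
#define imv_read(name)          _imv_read(name)
#define imv_set(name, value)    ((void)(name##__imv = (value)))

/* Branch hint, assuming a GCC- or Clang-compatible compiler. */
#define unlikely(x)             __builtin_expect(!!(x), 0)

/* Rarely updated, frequently read flag; char keeps the immediate small. */
DEFINE_IMV(char, trace_enabled);

/* Hypothetical counter incremented by the normally dormant code. */
static unsigned long events_traced;

/* Hot path: the dormant code is guarded by the immediate value. */
static int handle_packet(int len)
{
	if (unlikely(imv_read(trace_enabled)))
		events_traced++;
	return len;
}
```

Calling imv_set(trace_enabled, 1) once makes every later handle_packet() call
take the guarded branch. With the real infrastructure, the value tested would
sit in the instruction stream instead of being loaded from memory.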
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 marker sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to up to a 0.4% slowdown with
+the instruction stream based test compared to up to a 29% slowdown with the
+memory load based test on a system with high memory pressure.
+
+
+
+  * Markers impact under heavy memory load
+
+This test runs a kernel with my LTTng instrumentation set and
+generates memory pressure (from userspace) by trashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+It is done in user-space, so there are some delays due to
+incoming IRQs and to the scheduler.
+(UP 2.6.22-rc6-mm1 kernel, task with -20 nice level)
+
+My first set of results, linear cache trashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" trash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 trashing
+surrounding a getppid() call.
+
+- Markers compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+markers than with markers. I explain it by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+markers are compiled in or out. It seems that compiling in the markers
+aligns the function's data better in this case.
+
+But since the interesting comparison is between the immediate values and
+global variables based markers, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based markers (2 markers) add 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with fewer iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4.
+Here I run getppid with syscall trace inactive, comparing the case with memory
+pressure and without memory pressure. (Sorry, my system is not set up to
+execute syscall_trace this time, but it will make the point anyway.)
+
+No memory pressure
+Reading timestamps: 150.92 cycles, std dev. 1.01 cycles
+getppid: 1462.09 cycles, std dev. 18.87 cycles
+
+With memory pressure
+Reading timestamps: 578.22 cycles, std dev. 269.51 cycles
+getppid: 17113.33 cycles, std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+  644.09 cycles, std dev. 11.39 cycles
+L1 cache hit based branch:
+  88.16 cycles, std dev. 1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09±11.39 - 88.16±1.35 = 555.93±11.46 cycles
+
+Getppid without memory pressure:
+1462.09±18.87 - 150.92±1.01 = 1311.17±18.90 cycles
+
+Getppid with memory pressure:
+17113.33±1655.92 - 578.22±269.51 = 16535.11±1677.71 cycles
+
+Therefore, if we add 2 markers not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * 555.93±11.46 = 1111.86±22.92 cycles
+
+Therefore,
+
+1111.86±22.92 / 16535.11±1677.71 = 0.0672
+ relative error: sqrt(((22.92/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1035
+ absolute error: 0.1035 * 0.0672 = 0.0070
+
+Therefore: 0.0672±0.0070 * 100% = 6.72±0.70 %
+
+We can therefore affirm that adding 2 markers to getppid, on a system with high
+memory pressure, would have a performance hit of at least 6.0% on the system
+call time, all within the uncertainty limits of these tests. The same applies to
+other kernel code paths. The smaller those code paths are, the higher the
+impact ratio will be.
+
+Therefore, not only is it interesting to use the immediate values to dynamically
+activate dormant code such as the markers, but I think they should also be
+considered as a replacement for many of the "read-mostly" static variables.
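The error estimate above can be re-derived with a short C sketch. It assumes
independent measurements, so that errors add in quadrature for a difference
and relative errors add in quadrature for a ratio, and it feeds in only the
raw means and standard deviations quoted in the text. struct meas, the meas_*
helpers and marker_overhead() are hypothetical names, and a small Newton
iteration stands in for sqrt() so the fragment does not depend on libm.

```c
/* A measurement with its standard deviation, both in the same unit. */
struct meas {
	double val;
	double err;
};

/* Square root by Newton iteration; returns 0 for non-positive input. */
static double sq_root(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	if (x <= 0.0)
		return 0.0;
	for (i = 0; i < 60; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/* Difference of two independent measurements: errors add in quadrature. */
static struct meas meas_sub(struct meas a, struct meas b)
{
	struct meas r = { a.val - b.val,
			  sq_root(a.err * a.err + b.err * b.err) };
	return r;
}

/* Scaling by an exact positive constant scales the error by the same factor. */
static struct meas meas_scale(double k, struct meas a)
{
	struct meas r = { k * a.val, k * a.err };
	return r;
}

/* Ratio of two positive measurements: relative errors add in quadrature. */
static struct meas meas_div(struct meas a, struct meas b)
{
	double ra = a.err / a.val;
	double rb = b.err / b.val;
	double ratio = a.val / b.val;
	struct meas r = { ratio, ratio * sq_root(ra * ra + rb * rb) };
	return r;
}

/* Relative cost of 2 memory-read based markers on getppid under pressure. */
static struct meas marker_overhead(void)
{
	struct meas mem_read  = { 644.09, 11.39 };
	struct meas l1_read   = { 88.16, 1.35 };
	struct meas getppid   = { 17113.33, 1655.92 };
	struct meas timestamp = { 578.22, 269.51 };

	struct meas two_reads = meas_scale(2.0, meas_sub(mem_read, l1_read));
	struct meas syscall   = meas_sub(getppid, timestamp);

	return meas_div(two_reads, syscall);
}
```

Multiplying both fields of the value returned by marker_overhead() by 100
gives the overhead as a percentage of the system call time.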
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68