LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch 0/8] Immediate Values (now with merged x86 support)
@ 2007-11-13 18:58 Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 1/8] Immediate Values - Architecture Independent Code Mathieu Desnoyers
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel

Hi Andrew,

Here is the latest version of the Immediate Values. It supports x86 as a single
architecture. I also removed the *_early API which could have been confusing to
developers : choosing the right algorithm is now done internally by keeping
track of where we are in the kernel boot process.

The patchset depends on :
"Instrumentation menu removal"
and
"Text Edit Lock".

It is based on 2.6.24-rc2-git3 and applies in this order :

#Immediate Values
#for -mm
immediate-values-architecture-independent-code.patch
immediate-values-kconfig-embedded.patch
immediate-values-move-kprobes-x86-restore-interrupt-to-kdebug-h.patch
add-asm-compat-to-x86.patch
immediate-values-x86-optimization.patch
immediate-values-powerpc-optimization.patch
immediate-values-documentation.patch
#
profiling-use-immediate-values.patch

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 1/8] Immediate Values - Architecture Independent Code
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 2/8] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers, Rusty Russell

[-- Attachment #1: immediate-values-architecture-independent-code.patch --]
[-- Type: text/plain, Size: 16652 bytes --]

Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _immediate_read() version, which uses standard global
variables, and optimized per architecture immediate_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__immediate" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure :   getppid() takes  1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

Changelog:

- section __immediate is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0.  So
  the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
  that.
- Does not update tainted module's immediate values.
- remove immediate_*_t types, add DECLARE_IMMEDIATE() and DEFINE_IMMEDIATE().
  - immediate_read(&var) becomes immediate_read(var) because of this.
- Adding a new EXPORT_IMMEDIATE_SYMBOL(_GPL).
- remove immediate_if(). Should use if (unlikely(immediate_read(var))) instead.
  - Wait until we have gcc support before we add the immediate_if macro, since
    its form may have to change.
- Document immediate_set_early(). This design is chosen over immediate_init()
  because:
    - We can mark the function __init so it is freed after boot.
    - Module code patching can also be done by the "normal" function. This is
      preferred, even if it can be slightly slower, because update performance
      does not matter.
    - immediate_init() could confuse users because we can actually "initialize"
      an immediate value to a certain value, which will be used as initial
      value. e.g.:
        static DEFINE_IMMEDIATE(int, var) = 10;
- Dont't declare the __immediate section in vmlinux.lds.h, just put the content
  in the rodata section.
- Simplify interface : remove immediate_set_early, keep track of kernel boot
  status internally.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
---
 include/asm-generic/vmlinux.lds.h |    4 +
 include/linux/immediate.h         |   83 ++++++++++++++++++++++++++++++++++++
 include/linux/module.h            |   16 +++++++
 init/main.c                       |    8 +++
 kernel/Makefile                   |    1 
 kernel/immediate.c                |   86 ++++++++++++++++++++++++++++++++++++++
 kernel/module.c                   |   50 +++++++++++++++++++++-
 7 files changed, 247 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/linux/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/immediate.h	2007-11-13 09:46:32.000000000 -0500
@@ -0,0 +1,83 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+#include <asm/immediate.h>
+
+/**
+ * immediate_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define immediate_set(name, i)						\
+	do {								\
+		name##__immediate = (i);				\
+		core_immediate_update();				\
+		module_immediate_update();				\
+	} while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_immediate_update(void);
+extern void immediate_update_range(const struct __immediate *begin,
+	const struct __immediate *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define immediate_read(name)		_immediate_read(name)
+
+/**
+ * immediate_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define immediate_set(name, i)		(name##__immediate = (i))
+
+static inline void core_immediate_update(void) { }
+static inline void module_immediate_update(void) { }
+
+#endif
+
+#define DECLARE_IMMEDIATE(type, name) extern __typeof__(type) name##__immediate
+#define DEFINE_IMMEDIATE(type, name)  __typeof__(type) name##__immediate
+
+#define EXPORT_IMMEDIATE_SYMBOL(name) EXPORT_SYMBOL(name##__immediate)
+#define EXPORT_IMMEDIATE_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__immediate)
+
+/**
+ * _immediate_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _immediate_read(name)		(name##__immediate)
+
+#endif
Index: linux-2.6-lttng/include/linux/module.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/module.h	2007-11-13 09:41:47.000000000 -0500
+++ linux-2.6-lttng/include/linux/module.h	2007-11-13 09:46:32.000000000 -0500
@@ -15,6 +15,7 @@
 #include <linux/stringify.h>
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
+#include <linux/immediate.h>
 #include <linux/marker.h>
 #include <asm/local.h>
 
@@ -355,6 +356,10 @@ struct module
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
 	char *args;
+#ifdef CONFIG_IMMEDIATE
+	const struct __immediate *immediate;
+	unsigned int num_immediate;
+#endif
 #ifdef CONFIG_MARKERS
 	struct marker *markers;
 	unsigned int num_markers;
@@ -464,6 +469,9 @@ extern void print_modules(void);
 
 extern void module_update_markers(struct module *probe_module, int *refcount);
 
+extern void _module_immediate_update(void);
+extern void module_immediate_update(void);
+
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -568,6 +576,14 @@ static inline void module_update_markers
 {
 }
 
+static inline void _module_immediate_update(void)
+{
+}
+
+static inline void module_immediate_update(void)
+{
+}
+
 #endif /* CONFIG_MODULES */
 
 struct device_driver;
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2007-11-13 09:41:47.000000000 -0500
+++ linux-2.6-lttng/kernel/module.c	2007-11-13 09:48:06.000000000 -0500
@@ -33,6 +33,7 @@
 #include <linux/cpu.h>
 #include <linux/moduleparam.h>
 #include <linux/errno.h>
+#include <linux/immediate.h>
 #include <linux/err.h>
 #include <linux/vermagic.h>
 #include <linux/notifier.h>
@@ -1673,6 +1674,7 @@ static struct module *load_module(void _
 	unsigned int unusedcrcindex;
 	unsigned int unusedgplindex;
 	unsigned int unusedgplcrcindex;
+	unsigned int immediateindex;
 	unsigned int markersindex;
 	unsigned int markersstringsindex;
 	struct module *mod;
@@ -1771,6 +1773,7 @@ static struct module *load_module(void _
 #ifdef ARCH_UNWIND_SECTION_NAME
 	unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME);
 #endif
+	immediateindex = find_sec(hdr, sechdrs, secstrings, "__immediate");
 
 	/* Don't keep modinfo section */
 	sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
@@ -1922,6 +1925,11 @@ static struct module *load_module(void _
 	mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr;
 	if (gplfuturecrcindex)
 		mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr;
+#ifdef CONFIG_IMMEDIATE
+	mod->immediate = (void *)sechdrs[immediateindex].sh_addr;
+	mod->num_immediate =
+		sechdrs[immediateindex].sh_size / sizeof(*mod->immediate);
+#endif
 
 	mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr;
 	if (unusedcrcindex)
@@ -1989,11 +1997,16 @@ static struct module *load_module(void _
 
 	add_kallsyms(mod, sechdrs, symindex, strindex, secstrings);
 
+	if (!mod->taints) {
 #ifdef CONFIG_MARKERS
-	if (!mod->taints)
 		marker_update_probe_range(mod->markers,
 			mod->markers + mod->num_markers, NULL, NULL);
 #endif
+#ifdef CONFIG_IMMEDIATE
+		immediate_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+#endif
+	}
 	err = module_finalize(hdr, sechdrs, mod);
 	if (err < 0)
 		goto cleanup;
@@ -2600,3 +2613,38 @@ void module_update_markers(struct module
 	mutex_unlock(&module_mutex);
 }
 #endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_immediate_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_immediate_update(void)
+{
+	struct module *mod;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		immediate_update_range(mod->immediate,
+			mod->immediate + mod->num_immediate);
+	}
+}
+EXPORT_SYMBOL_GPL(_module_immediate_update);
+
+/**
+ * module_immediate_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_immediate_update(void)
+{
+	mutex_lock(&module_mutex);
+	_module_immediate_update();
+	mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_immediate_update);
+#endif
Index: linux-2.6-lttng/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/kernel/immediate.c	2007-11-13 09:46:32.000000000 -0500
@@ -0,0 +1,86 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int immediate_early_boot_complete;
+
+extern const struct __immediate __start___immediate[];
+extern const struct __immediate __stop___immediate[];
+
+/*
+ * immediate_mutex nests inside module_mutex. immediate_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(immediate_mutex);
+
+/**
+ * immediate_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Sets a range of immediates to a enabled state : set the enable bit.
+ */
+void immediate_update_range(const struct __immediate *begin,
+		const struct __immediate *end)
+{
+	const struct __immediate *iter;
+	int ret;
+	if (immediate_early_boot_complete) {
+		for (iter = begin; iter < end; iter++) {
+			mutex_lock(&immediate_mutex);
+			kernel_text_lock();
+			ret = arch_immediate_update(iter);
+			kernel_text_unlock();
+			if (ret)
+				printk(KERN_WARNING
+					"Invalid immediate value. "
+					"Variable at %p, "
+					"instruction at %p, size %lu\n",
+					(void *)iter->immediate,
+					(void *)iter->var, iter->size);
+			mutex_unlock(&immediate_mutex);
+		}
+	} else
+		for (iter = begin; iter < end; iter++)
+			arch_immediate_update_early(iter);
+}
+EXPORT_SYMBOL_GPL(immediate_update_range);
+
+/**
+ * immediate_update - update all immediate values in the kernel
+ * @lock: should a module_mutex be taken ?
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_immediate_update(void)
+{
+	/* Core kernel immediates */
+	immediate_update_range(__start___immediate, __stop___immediate);
+}
+EXPORT_SYMBOL_GPL(core_immediate_update);
+
+void __init immediate_init_complete(void)
+{
+	immediate_early_boot_complete = 1;
+}
Index: linux-2.6-lttng/init/main.c
===================================================================
--- linux-2.6-lttng.orig/init/main.c	2007-11-13 09:25:29.000000000 -0500
+++ linux-2.6-lttng/init/main.c	2007-11-13 09:46:32.000000000 -0500
@@ -57,6 +57,7 @@
 #include <linux/device.h>
 #include <linux/kthread.h>
 #include <linux/sched.h>
+#include <linux/immediate.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -101,6 +102,11 @@ static inline void mark_rodata_ro(void) 
 #ifdef CONFIG_TC
 extern void tc_init(void);
 #endif
+#ifdef CONFIG_IMMEDIATE
+extern void immediate_init_complete(void);
+#else
+static inline void immediate_init_complete(void) { }
+#endif
 
 enum system_states system_state;
 EXPORT_SYMBOL(system_state);
@@ -518,6 +524,7 @@ asmlinkage void __init start_kernel(void
 	unwind_init();
 	lockdep_init();
 	cgroup_init_early();
+	core_immediate_update();
 
 	local_irq_disable();
 	early_boot_irqs_off();
@@ -639,6 +646,7 @@ asmlinkage void __init start_kernel(void
 	cpuset_init();
 	taskstats_init_early();
 	delayacct_init();
+	immediate_init_complete();
 
 	check_bugs();
 
Index: linux-2.6-lttng/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/kernel/Makefile	2007-11-13 09:25:29.000000000 -0500
+++ linux-2.6-lttng/kernel/Makefile	2007-11-13 09:46:32.000000000 -0500
@@ -57,6 +57,7 @@ obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
 obj-$(CONFIG_MARKERS) += marker.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
Index: linux-2.6-lttng/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/vmlinux.lds.h	2007-11-13 09:25:28.000000000 -0500
+++ linux-2.6-lttng/include/asm-generic/vmlinux.lds.h	2007-11-13 09:46:32.000000000 -0500
@@ -25,6 +25,10 @@
 		*(.rodata) *(.rodata.*)					\
 		*(__vermagic)		/* Kernel version magic */	\
 		*(__markers_strings)	/* Markers: strings */		\
+		. = ALIGN(8);						\
+		VMLINUX_SYMBOL(__start___immediate) = .;		\
+		*(__immediate)		/* Immediate values: pointers */ \
+		VMLINUX_SYMBOL(__stop___immediate) = .;			\
 	}								\
 									\
 	.rodata1          : AT(ADDR(.rodata1) - LOAD_OFFSET) {		\

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 2/8] Immediate Values - Kconfig menu in EMBEDDED
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 1/8] Immediate Values - Architecture Independent Code Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 3/8] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h Mathieu Desnoyers
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Adrian Bunk, Andi Kleen, Alexey Dobriyan,
	Christoph Hellwig

[-- Attachment #1: immediate-values-kconfig-embedded.patch --]
[-- Type: text/plain, Size: 2530 bytes --]

Immediate values provide a way to use dynamic code patching to update variables
sitting within the instruction stream. It saves caches lines normally used by
static read mostly variables. Enable it by default, but let users disable it
through the EMBEDDED menu with the "Disable immediate values" submenu entry.

Note: Since I think that I really should let embedded systems developers using
RO memory the option to disable the immediate values, I choose to leave this
menu option there, in the EMBEDDED menu. Also, the "CONFIG_IMMEDIATE" makes
sense because we want to compile out all the immediate code when we decide not
to use optimized immediate values at all (it removes otherwise unused code).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Christoph Hellwig <hch@infradead.org>
---
 init/Kconfig |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: linux-2.6-lttng/init/Kconfig
===================================================================
--- linux-2.6-lttng.orig/init/Kconfig	2007-11-13 13:33:54.000000000 -0500
+++ linux-2.6-lttng/init/Kconfig	2007-11-13 13:50:02.000000000 -0500
@@ -423,6 +423,17 @@ config CC_OPTIMIZE_FOR_SIZE
 config SYSCTL
 	bool
 
+config IMMEDIATE
+	default y if !DISABLE_IMMEDIATE
+	depends on ARCH_SUPPORTS_IMMEDIATE
+	bool
+	help
+	  Immediate values are used as read-mostly variables that are rarely
+	  updated. They use code patching to modify the values inscribed in the
+	  instruction stream. It provides a way to save precious cache lines
+	  that would otherwise have to be used by these variables. They can be
+	  disabled through the EMBEDDED menu.
+
 menuconfig EMBEDDED
 	bool "Configure standard kernel features (for small systems)"
 	help
@@ -658,6 +669,16 @@ config MARKERS
 
 source "arch/Kconfig"
 
+config DISABLE_IMMEDIATE
+	default y if EMBEDDED
+	bool "Disable immediate values" if EMBEDDED
+	depends on ARCH_SUPPORTS_IMMEDIATE
+	help
+	  Disable code patching based immediate values for embedded systems. It
+	  consumes slightly more memory and requires to modify the instruction
+	  stream each time a variable is updated. Should really be disabled for
+	  embedded systems with read-only text.
+
 endmenu		# General setup
 
 config RT_MUTEXES

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 3/8] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 1/8] Immediate Values - Architecture Independent Code Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 2/8] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 4/8] Add asm-compat.h to x86 Mathieu Desnoyers
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Ananth N Mavinakayanahalli, Christoph Hellwig,
	prasanna, anil.s.keshavamurthy, davem

[-- Attachment #1: immediate-values-move-kprobes-x86-restore-interrupt-to-kdebug-h.patch --]
[-- Type: text/plain, Size: 3272 bytes --]

Since the breakpoint handler is useful both to kprobes and immediate values, it
makes sense to make the required restore_interrupt() available through
asm-i386/kdebug.h.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: prasanna@in.ibm.com
CC: anil.s.keshavamurthy@intel.com
CC: davem@davemloft.net
---
 include/asm-x86/kdebug.h     |   12 ++++++++++++
 include/asm-x86/kprobes_32.h |    9 ---------
 include/asm-x86/kprobes_64.h |    9 ---------
 3 files changed, 12 insertions(+), 18 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/kdebug.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/kdebug.h	2007-11-02 15:01:53.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/kdebug.h	2007-11-02 15:02:00.000000000 -0400
@@ -3,6 +3,9 @@
 
 #include <linux/notifier.h>
 
+#include <linux/ptrace.h>
+#include <asm/system.h>
+
 struct pt_regs;
 
 /* Grossly misnamed. */
@@ -30,4 +33,13 @@ extern void dump_pagetable(unsigned long
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long);
 
+/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
+ * if necessary, before executing the original int3/1 (trap) handler.
+ */
+static inline void restore_interrupts(struct pt_regs *regs)
+{
+	if (regs->eflags & IF_MASK)
+		local_irq_enable();
+}
+
 #endif
Index: linux-2.6-lttng/include/asm-x86/kprobes_32.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/kprobes_32.h	2007-11-02 15:01:53.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/kprobes_32.h	2007-11-02 15:02:00.000000000 -0400
@@ -79,15 +79,6 @@ struct kprobe_ctlblk {
 	struct prev_kprobe prev_kprobe;
 };
 
-/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
- * if necessary, before executing the original int3/1 (trap) handler.
- */
-static inline void restore_interrupts(struct pt_regs *regs)
-{
-	if (regs->eflags & IF_MASK)
-		local_irq_enable();
-}
-
 extern int kprobe_exceptions_notify(struct notifier_block *self,
 				    unsigned long val, void *data);
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
Index: linux-2.6-lttng/include/asm-x86/kprobes_64.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/kprobes_64.h	2007-11-02 15:02:10.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/kprobes_64.h	2007-11-02 15:02:22.000000000 -0400
@@ -72,15 +72,6 @@ struct kprobe_ctlblk {
 	struct prev_kprobe prev_kprobe;
 };
 
-/* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
- * if necessary, before executing the original int3/1 (trap) handler.
- */
-static inline void restore_interrupts(struct pt_regs *regs)
-{
-	if (regs->eflags & IF_MASK)
-		local_irq_enable();
-}
-
 extern int post_kprobe_handler(struct pt_regs *regs);
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
 extern int kprobe_handler(struct pt_regs *regs);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 4/8] Add asm-compat.h to x86
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2007-11-13 18:58 ` [patch 3/8] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 19:05   ` H. Peter Anvin
  2007-11-13 18:58 ` [patch 5/8] Immediate Values - x86 Optimization Mathieu Desnoyers
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

[-- Attachment #1: add-asm-compat-to-x86.patch --]
[-- Type: text/plain, Size: 1947 bytes --]

In assembly code and in gcc inline assembly, we need .long to express a "c long"
type on i386 and a .quad to express the same on x86_64. Use macros similar to
powerpc "PPC_LONG" to express those. Name chosen: ASM_LONG. (didn't feel like
X86_LONG was required)

This is useful in inline assembly within code shared between 32 and 64
bits architectures in x86.

More compatible assembly macros could be added in this header later when
needed.

I had to create this to implement a merged optimized immediate values
header for x86.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@elte.hu>
CC: "H. Peter Anvin" <hpa@zytor.com>
---
 include/asm-x86/asm-compat.h |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Index: linux-2.6-lttng/include/asm-x86/asm-compat.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-x86/asm-compat.h	2007-11-01 22:43:09.000000000 -0400
@@ -0,0 +1,29 @@
+#ifndef _ASM_X86_ASM_COMPAT_H
+#define _ASM_X86_ASM_COMPAT_H
+
+#include <linux/types.h>
+
+#ifdef __ASSEMBLY__
+#  define stringify_in_c(...)	__VA_ARGS__
+#  define ASM_CONST(x)		x
+#else
+/* This version of stringify will deal with commas... */
+#  define __stringify_in_c(...)	#__VA_ARGS__
+#  define stringify_in_c(...)	__stringify_in_c(__VA_ARGS__) " "
+#  define __ASM_CONST(x)	x##UL
+#  define ASM_CONST(x)		__ASM_CONST(x)
+#endif
+
+#ifdef CONFIG_X86_64
+
+/* operations for longs and pointers */
+#define ASM_LONG	stringify_in_c(.quad)
+
+#else /* 32-bit */
+
+/* operations for longs and pointers */
+#define ASM_LONG	stringify_in_c(.long)
+
+#endif
+
+#endif /* _ASM_X86_ASM_COMPAT_H */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2007-11-13 18:58 ` [patch 4/8] Add asm-compat.h to x86 Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 19:07   ` H. Peter Anvin
  2007-11-15  3:08   ` [patch 5/8] Immediate Values - x86 Optimization Rusty Russell
  2007-11-13 18:58 ` [patch 6/8] Immediate Values - Powerpc Optimization Mathieu Desnoyers
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: Mathieu Desnoyers, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge

[-- Attachment #1: immediate-values-x86-optimization.patch --]
[-- Type: text/plain, Size: 19642 bytes --]

x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Use "=g" constraint for char immediate value inline assembly.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
---
 arch/x86/Kconfig.i386       |    3 
 arch/x86/Kconfig.x86_64     |    3 
 arch/x86/kernel/Makefile_32 |    1 
 arch/x86/kernel/Makefile_64 |    1 
 arch/x86/kernel/immediate.c |  330 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps_32.c  |   10 -
 include/asm-x86/immediate.h |  100 +++++++++++++
 7 files changed, 444 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32	2007-11-13 13:50:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_32	2007-11-13 13:51:58.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_KPROBES)		+= kprobes_32.o
 obj-$(CONFIG_MODULES)		+= module_32.o
 obj-y				+= sysenter_32.o vsyscall_32.o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat_32.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_EFI) 		+= efi_32.o efi_stub_32.o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault_32.o
 obj-$(CONFIG_VM86)		+= vm86_32.o
Index: linux-2.6-lttng/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c	2007-11-13 13:50:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/traps_32.c	2007-11-13 13:51:58.000000000 -0500
@@ -544,7 +544,7 @@ fastcall void do_##name(struct pt_regs *
 }
 
 DO_VM86_ERROR_INFO( 0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->eip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
 DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
 #endif
 DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
@@ -786,7 +786,7 @@ void restart_nmi(void)
 	acpi_nmi_enable();
 }
 
-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
 fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
 	trace_hardirqs_fixup();
@@ -794,8 +794,10 @@ fastcall void __kprobes do_int3(struct p
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
 		return;
-	/* This is an interrupt gate, because kprobes wants interrupts
-	disabled.  Normal trap handlers don't. */
+	/*
+	 * This is an interrupt gate, because kprobes and immediate values wants
+	 * interrupts disabled.  Normal trap handlers don't.
+	 */
 	restore_interrupts(regs);
 	do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
 }
Index: linux-2.6-lttng/include/asm-x86/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-x86/immediate.h	2007-11-13 13:51:58.000000000 -0500
@@ -0,0 +1,100 @@
+#ifndef _ASM_I386_IMMEDIATE_H
+#define _ASM_I386_IMMEDIATE_H
+
+/*
+ * Immediate values. i386 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm-compat.h>
+
+struct __immediate {
+	long var;		/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	long immediate;		/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	long size;		/* Type size. */
+} __attribute__ ((aligned(sizeof(long))));
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
+ * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
+ * fall back on a memory read.
+ * The 64 bits load immediates produced by gas have a 2 bytes opcode.
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * what will generate an instruction with 2 bytes opcode and 8 bytes immediate
+ * value.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					ASM_LONG "%c1, (0f)+1, 1\n\t"	\
+					".previous\n\t"			\
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 2:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					ASM_LONG "%c1, (0f)+2, 2\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".align 2\n\t"			\
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 4:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					ASM_LONG "%c1, (0f)+1, 4\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".org . + 3 - (. & 3), 0x90\n\t" \
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 8:							\
+			if (sizeof(long) < 8) {				\
+				value = name##__immediate;		\
+				break;					\
+			}						\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					ASM_LONG "%c1, (0f)+1, 4\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".org . + 6 - (. & 7), 0x90\n\t" \
+					"0:\n\t"			\
+					"mov $0xFEFEFEFE01010101,%0\n\t" \
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		default:value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_I386_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-13 13:51:58.000000000 -0500
@@ -0,0 +1,330 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discoverers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel applies not so well in kernel context: it
+ * would imply disabling interrupts and looping on every CPUs while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no cpu could possibly still have the out-of-date copy of the instruction,
+ * modify the now unused 2nd byte of the instruction, and then put back the
+ * original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/notifier.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define BREAKPOINT_INS_LEN	1
+#define NR_NOPS			10
+
+static long target_after_int3;		/* EIP of the target after the int3 */
+static long bypass_eip;			/* EIP of the bypass. */
+static long bypass_after_int3;		/* EIP after the end-of-bypass int3 */
+static long after_immediate;		/*
+					 * EIP where to resume after the
+					 * single-stepping.
+					 */
+
+/*
+ * Size of the movl instruction (without the immediate value) in bytes.
+ * The 2 bytes load immediate has a 66H prefix, which makes the opcode 2 bytes
+ * wide.
+ */
+static inline size_t _immediate_get_insn_size(long size)
+{
+	switch (size) {
+	case 1: return 1;
+	case 2: return 2;
+	case 4: return 1;
+#ifdef CONFIG_X86_64
+	case 8: return 2;
+#endif
+	default: BUG();
+	};
+}
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * (10x nop) int3
+ * (maximum size is 2 bytes opcode + 8 bytes immediate value for long on x86_64)
+ * The nops are the target replaced by the instruction to single-step.
+ */
+static inline void _immediate_bypass(long *bypassaddr, long *breaknextaddr)
+{
+		asm volatile("jmp 2f;\n\t"
+				"0:\n\t"
+				".space 10, 0x90;\n\t"
+				"1:\n\t"
+				"int3;\n\t"
+				"2:\n\t"
+				"mov $(0b),%0;\n\t"
+				"mov $((1b)+1),%1;\n\t"
+				: "=r" (*bypassaddr),
+				  "=r" (*breaknextaddr));
+}
+
+static void immediate_synchronize_core(void *info)
+{
+	sync_core();	/* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass users have
+ * ended.
+ */
+static int immediate_notifier(struct notifier_block *nb,
+	unsigned long val, void *data)
+{
+	enum die_val die_val = (enum die_val) val;
+	struct die_args *args = data;
+
+	if (!args->regs || user_mode_vm(args->regs))
+		return NOTIFY_DONE;
+
+	if (die_val == DIE_INT3) {
+		if (instruction_pointer(args->regs) == target_after_int3) {
+			preempt_disable();
+			instruction_pointer(args->regs) = bypass_eip;
+			return NOTIFY_STOP;
+		} else if (instruction_pointer(args->regs)
+				== bypass_after_int3) {
+			instruction_pointer(args->regs) = after_immediate;
+			preempt_enable();
+			return NOTIFY_STOP;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block immediate_notify = {
+	.notifier_call = immediate_notifier,
+	.priority = 0x7fffffff,	/* we need to be notified first */
+};
+
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+__kprobes int arch_immediate_update(const struct __immediate *immediate)
+{
+	int ret;
+	size_t insn_size = _immediate_get_insn_size(immediate->size);
+	long insn = immediate->immediate - insn_size;
+	long len;
+	unsigned long cr0;
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	if (unlikely(*(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void *)immediate->immediate,
+				    (void *)immediate->var, immediate->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	_immediate_bypass(&bypass_eip, &bypass_after_int3);
+
+	after_immediate = immediate->immediate + immediate->size;
+
+	/*
+	 * Using the _early variants because nobody is executing the
+	 * bypass code while we patch it. It is protected by the
+	 * immediate_mutex. Since we modify the instructions non atomically (for
+	 * nops), we have to use the _early variant.
+	 * We must however deal with the WP flag in cr0 by ourself.
+	 */
+	kernel_wp_save(cr0);
+	text_poke_early((void *)bypass_eip, (void *)insn,
+			insn_size + immediate->size);
+	/*
+	 * Fill the rest with nops.
+	 */
+	len = NR_NOPS - immediate->size - insn_size;
+	add_nops((void *)(bypass_eip + insn_size + immediate->size), len);
+	kernel_wp_restore(cr0);
+
+	target_after_int3 = insn + BREAKPOINT_INS_LEN;
+	/* register_die_notifier has memory barriers */
+	register_die_notifier(&immediate_notify);
+	/* The breakpoint will single-step the bypass */
+	text_poke((void *)insn,
+		INIT_ARRAY(unsigned char, BREAKPOINT_INSTRUCTION, 1), 1);
+	/*
+	 * Make sure the breakpoint is set before we continue (visible to other
+	 * CPUs and interrupts).
+	 */
+	wmb();
+	/*
+	 * Execute serializing instruction on each CPU.
+	 */
+	ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
+	BUG_ON(ret != 0);
+
+	text_poke((void *)(insn + insn_size), (void *)immediate->var,
+			immediate->size);
+	/*
+	 * Make sure the value can be seen from other CPUs and interrupts.
+	 */
+	wmb();
+	text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+		/*
+		 * Wait for all int3 handlers to end
+		 * (interrupts are disabled in int3).
+		 * This CPU is clearly not in a int3 handler,
+		 * because int3 handler is not preemptible and
+		 * there cannot be any more int3 handler called
+		 * for this site, because we placed the original
+		 * instruction back.
+		 * synchronize_sched has memory barriers.
+		 */
+	synchronize_sched();
+	unregister_die_notifier(&immediate_notify);
+	/* unregister_die_notifier has memory barriers */
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ */
+void arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return;
+		break;
+#endif
+	default:return;
+	}
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+}
Index: linux-2.6-lttng/arch/x86/kernel/Makefile_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_64	2007-11-13 13:50:32.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_64	2007-11-13 13:51:58.000000000 -0500
@@ -33,6 +33,7 @@ obj-$(CONFIG_X86_PM_TIMER)	+= pmtimer_64
 obj-$(CONFIG_X86_VSMP)		+= vsmp_64.o
 obj-$(CONFIG_K8_NB)		+= k8.o
 obj-$(CONFIG_AUDIT)		+= audit_64.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 
 obj-$(CONFIG_MODULES)		+= module_64.o
 obj-$(CONFIG_PCI)		+= early-quirks.o
Index: linux-2.6-lttng/arch/x86/Kconfig.i386
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig.i386	2007-11-13 13:52:52.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig.i386	2007-11-13 13:53:06.000000000 -0500
@@ -97,6 +97,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 
 menu "Processor type and features"
Index: linux-2.6-lttng/arch/x86/Kconfig.x86_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig.x86_64	2007-11-13 13:53:14.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig.x86_64	2007-11-13 13:53:25.000000000 -0500
@@ -139,6 +139,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 6/8] Immediate Values - Powerpc Optimization
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2007-11-13 18:58 ` [patch 5/8] Immediate Values - x86 Optimization Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 7/8] Immediate Values - Documentation Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 8/8] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
  7 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers, Christoph Hellwig

[-- Attachment #1: immediate-values-powerpc-optimization.patch --]
[-- Type: text/plain, Size: 7433 bytes --]

PowerPC optimization of the immediate values which uses a li instruction,
patched with an immediate value.

Changelog:
- Put immediate_set and _immediate_set in the architecture independent header.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Christoph Hellwig <hch@infradead.org>
---
 arch/powerpc/Kconfig            |    3 +
 arch/powerpc/kernel/Makefile    |    1 
 arch/powerpc/kernel/immediate.c |  103 ++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/immediate.h |   71 +++++++++++++++++++++++++++
 4 files changed, 178 insertions(+)

Index: linux-2.6-lttng/include/asm-powerpc/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-powerpc/immediate.h	2007-11-13 13:53:28.000000000 -0500
@@ -0,0 +1,71 @@
+#ifndef _ASM_POWERPC_IMMEDIATE_H
+#define _ASM_POWERPC_IMMEDIATE_H
+
+/*
+ * Immediate values. PowerPC architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm-compat.h>
+
+struct __immediate {
+	long var;		/* Identifier variable of the immediate value */
+	long immediate;		/*
+				 * Pointer to the memory location that holds
+				 * the immediate value within the load immediate
+				 * instruction.
+				 */
+	long size;		/* Type size. */
+} __attribute__ ((aligned(sizeof(long))));
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 bytes update will be atomic by aligning the immediate
+ * value. Use a normal memory read for the 4 bytes immediate because there is no
+ * way to atomically update it without using a seqlock read side, which would
+ * cost more in term of total i-cache and d-cache space than a simple memory
+ * read.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					PPC_LONG "%c1, ((0f)+3), 1\n\t" \
+					".previous\n\t"			\
+					"0:\n\t"			\
+					"li %0,0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 2:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					PPC_LONG "%c1, ((0f)+2), 2\n\t" \
+					".previous\n\t"			\
+					".align 2\n\t"			\
+					"0:\n\t"			\
+					"li %0,0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		default:						\
+			value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_POWERPC_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/powerpc/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/kernel/Makefile	2007-11-13 13:49:11.000000000 -0500
+++ linux-2.6-lttng/arch/powerpc/kernel/Makefile	2007-11-13 13:53:28.000000000 -0500
@@ -91,3 +91,4 @@ obj-$(CONFIG_PPC64)		+= $(obj64-y)
 
 extra-$(CONFIG_PPC_FPU)		+= fpu.o
 extra-$(CONFIG_PPC64)		+= entry_64.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
Index: linux-2.6-lttng/arch/powerpc/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/powerpc/kernel/immediate.c	2007-11-13 13:53:28.000000000 -0500
@@ -0,0 +1,103 @@
+/*
+ * Powerpc optimized immediate values enabling/disabling.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/string.h>
+#include <linux/kprobes.h>
+#include <asm/cacheflush.h>
+#include <asm/page.h>
+
+#define LI_OPCODE_LEN	2
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+int arch_immediate_update(const struct __immediate *immediate)
+{
+#ifdef CONFIG_KPROBES
+	kprobe_opcode_t *insn;
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	switch (immediate->size) {
+	case 1:	/* The uint8_t points to the 3rd byte of the
+		 * instruction */
+		insn = (void *)(immediate->immediate - 1 - LI_OPCODE_LEN);
+		break;
+	case 2:	insn = (void *)(immediate->immediate - LI_OPCODE_LEN);
+		break;
+	default:
+	return -EINVAL;
+	}
+
+	if (unlikely(*insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void *)immediate->immediate,
+				    (void *)immediate->var, immediate->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return 0;
+		break;
+	default:return -EINVAL;
+	}
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+	flush_icache_range((unsigned long)immediate->immediate,
+		(unsigned long)immediate->immediate + immediate->size);
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ * We can use flush_icache_range, since the cpu identification has been done in
+ * the early_init stage.
+ */
+void arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return;
+		break;
+	default:return;
+	}
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+	flush_icache_range((unsigned long)immediate->immediate,
+		(unsigned long)immediate->immediate + immediate->size);
+}
Index: linux-2.6-lttng/arch/powerpc/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/Kconfig	2007-11-13 13:53:36.000000000 -0500
+++ linux-2.6-lttng/arch/powerpc/Kconfig	2007-11-13 13:53:58.000000000 -0500
@@ -169,6 +169,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 
 source "arch/powerpc/platforms/Kconfig"

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 7/8] Immediate Values - Documentation
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2007-11-13 18:58 ` [patch 6/8] Immediate Values - Powerpc Optimization Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  2007-11-13 18:58 ` [patch 8/8] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
  7 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: immediate-values-documentation.patch --]
[-- Type: text/plain, Size: 8837 bytes --]

Changelog:
- Remove immediate_set_early (removed from API).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 Documentation/immediate.txt |  221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2007-11-03 20:28:58.000000000 -0400
@@ -0,0 +1,221 @@
+		        Using the Immediate Values
+
+			    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+An immediate value is used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(immediate_read(var))) as condition surrounding the code. The
+smallest data type required for the test (an 8 bits char) is preferred, since
+some architectures, such as powerpc, only allow up to 16 bits immediate values.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMMEDIATE(char, this_immediate);
+EXPORT_IMMEDIATE_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use immediate_set(this_immediate) to set the immediate value.
+
+Use immediate_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _immediate_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMMEDIATE(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 markers sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
+
+
+  * Markers impact under heavy memory load
+
+Running a kernel with my LTTng instrumentation set, in a test that
+generates memory pressure (from userspace) by trashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+This test is done in user-space, so there are some delays due to IRQs
+coming and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results: Linear cache trashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" trash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 trashing
+surrounding a getppid() call.
+
+- Markers compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+markers than with markers. I explain it by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+markers are compiled in or out. It seems that it aligns the function's
+data better to compile-in the markers in this case.
+
+But since the interesting comparison is between the immediate values and
+global variables based markers, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based markers (2 markers) adds 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with less iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and without
+memory pressure. (sorry, my system is not setup to execute syscall_trace this
+time, but it will make the point anyway).
+
+No memory pressure
+Reading timestamps:     150.92 cycles,     std dev.    1.01 cycles
+getppid:               1462.09 cycles,     std dev.   18.87 cycles
+
+With memory pressure
+Reading timestamps:     578.22 cycles,     std dev.  269.51 cycles
+getppid:              17113.33 cycles,     std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+                       644.09 cycles,      std dev.   11.39 cycles
+L1 cache hit based branch:
+                        88.16 cycles,      std dev.    1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09±11.39 - 88.16±1.35 = 555.93±11.46 cycles
+
+Getppid without memory pressure:
+1462.09±18.87 - 150.92±1.01 = 1311.17±18.90 cycles
+
+Getppid with memory pressure:
+17113.33±1655.92 - 578.22±269.51 = 16535.11±1677.71 cycles
+
+Therefore, if we add 2 markers not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * 555.93±12.74 = 1111.86±25.48 cycles
+
+Therefore,
+
+1111.86±25.48 / 16535.11±1677.71 = 0.0672
+ relative error: sqrt(((25.48/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1040
+ absolute error: 0.1040 * 0.0672 = 0.0070
+
+Therefore: 0.0672±0.0070 * 100% = 6.72±0.70 %
+
+We can therefore affirm that adding 2 markers to getppid, on a system with high
+memory pressure, would have a performance hit of at least 6.0% on the system
+call time, all within the uncertainty limits of these tests. The same applies to
+other kernel code paths. The smaller those code paths are, the highest the
+impact ratio will be.
+
+Therefore, not only is it interesting to use the immediate values to dynamically
+activate dormant code such as the markers, but I think it should also be
+considered as a replacement for many of the "read-mostly" static variables.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [patch 8/8] Scheduler Profiling - Use Immediate Values
  2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2007-11-13 18:58 ` [patch 7/8] Immediate Values - Documentation Mathieu Desnoyers
@ 2007-11-13 18:58 ` Mathieu Desnoyers
  7 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 18:58 UTC (permalink / raw)
  To: akpm, linux-kernel; +Cc: Mathieu Desnoyers

[-- Attachment #1: profiling-use-immediate-values.patch --]
[-- Type: text/plain, Size: 5769 bytes --]

Use immediate values with lower d-cache hit in optimized version as a
condition for scheduler profiling call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 drivers/kvm/kvm_main.c  |    3 ++-
 include/linux/profile.h |    5 +++--
 kernel/profile.c        |   22 +++++++++++-----------
 kernel/sched_fair.c     |    6 +-----
 4 files changed, 17 insertions(+), 19 deletions(-)

Index: linux-2.6-lttng/kernel/profile.c
===================================================================
--- linux-2.6-lttng.orig/kernel/profile.c	2007-11-13 09:48:36.000000000 -0500
+++ linux-2.6-lttng/kernel/profile.c	2007-11-13 09:48:37.000000000 -0500
@@ -42,8 +42,8 @@ static int (*timer_hook)(struct pt_regs 
 static atomic_t *prof_buffer;
 static unsigned long prof_len, prof_shift;
 
-int prof_on __read_mostly;
-EXPORT_SYMBOL_GPL(prof_on);
+DEFINE_IMMEDIATE(char, prof_on) __read_mostly;
+EXPORT_IMMEDIATE_SYMBOL_GPL(prof_on);
 
 static cpumask_t prof_cpu_mask = CPU_MASK_ALL;
 #ifdef CONFIG_SMP
@@ -61,7 +61,7 @@ static int __init profile_setup(char * s
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
-		prof_on = SLEEP_PROFILING;
+		immediate_set(prof_on, SLEEP_PROFILING);
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
 		if (get_option(&str, &par))
@@ -74,7 +74,7 @@ static int __init profile_setup(char * s
 			"kernel sleep profiling requires CONFIG_SCHEDSTATS\n");
 #endif /* CONFIG_SCHEDSTATS */
 	} else if (!strncmp(str, schedstr, strlen(schedstr))) {
-		prof_on = SCHED_PROFILING;
+		immediate_set(prof_on, SCHED_PROFILING);
 		if (str[strlen(schedstr)] == ',')
 			str += strlen(schedstr) + 1;
 		if (get_option(&str, &par))
@@ -83,7 +83,7 @@ static int __init profile_setup(char * s
 			"kernel schedule profiling enabled (shift: %ld)\n",
 			prof_shift);
 	} else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
-		prof_on = KVM_PROFILING;
+		immediate_set(prof_on, KVM_PROFILING);
 		if (str[strlen(kvmstr)] == ',')
 			str += strlen(kvmstr) + 1;
 		if (get_option(&str, &par))
@@ -93,7 +93,7 @@ static int __init profile_setup(char * s
 			prof_shift);
 	} else if (get_option(&str, &par)) {
 		prof_shift = par;
-		prof_on = CPU_PROFILING;
+		immediate_set(prof_on, CPU_PROFILING);
 		printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
 			prof_shift);
 	}
@@ -104,7 +104,7 @@ __setup("profile=", profile_setup);
 
 void __init profile_init(void)
 {
-	if (!prof_on) 
+	if (!_immediate_read(prof_on))
 		return;
  
 	/* only text is profiled */
@@ -293,7 +293,7 @@ void profile_hits(int type, void *__pc, 
 	int i, j, cpu;
 	struct profile_hit *hits;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = min((pc - (unsigned long)_stext) >> prof_shift, prof_len - 1);
 	i = primary = (pc & (NR_PROFILE_GRP - 1)) << PROFILE_GRPSHIFT;
@@ -403,7 +403,7 @@ void profile_hits(int type, void *__pc, 
 {
 	unsigned long pc;
 
-	if (prof_on != type || !prof_buffer)
+	if (!prof_buffer)
 		return;
 	pc = ((unsigned long)__pc - (unsigned long)_stext) >> prof_shift;
 	atomic_add(nr_hits, &prof_buffer[min(pc, prof_len - 1)]);
@@ -560,7 +560,7 @@ static int __init create_hash_tables(voi
 	}
 	return 0;
 out_cleanup:
-	prof_on = 0;
+	immediate_set(prof_on, 0);
 	smp_mb();
 	on_each_cpu(profile_nop, NULL, 0, 1);
 	for_each_online_cpu(cpu) {
@@ -587,7 +587,7 @@ static int __init create_proc_profile(vo
 {
 	struct proc_dir_entry *entry;
 
-	if (!prof_on)
+	if (!_immediate_read(prof_on))
 		return 0;
 	if (create_hash_tables())
 		return -1;
Index: linux-2.6-lttng/include/linux/profile.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/profile.h	2007-11-13 09:48:36.000000000 -0500
+++ linux-2.6-lttng/include/linux/profile.h	2007-11-13 09:48:37.000000000 -0500
@@ -7,10 +7,11 @@
 #include <linux/init.h>
 #include <linux/cpumask.h>
 #include <linux/cache.h>
+#include <linux/immediate.h>
 
 #include <asm/errno.h>
 
-extern int prof_on __read_mostly;
+DECLARE_IMMEDIATE(char, prof_on) __read_mostly;
 
 #define CPU_PROFILING	1
 #define SCHED_PROFILING	2
@@ -38,7 +39,7 @@ static inline void profile_hit(int type,
 	/*
 	 * Speedup for the common (no profiling enabled) case:
 	 */
-	if (unlikely(prof_on == type))
+	if (unlikely(immediate_read(prof_on) == type))
 		profile_hits(type, ip, 1);
 }
 
Index: linux-2.6-lttng/drivers/kvm/kvm_main.c
===================================================================
--- linux-2.6-lttng.orig/drivers/kvm/kvm_main.c	2007-11-13 09:48:36.000000000 -0500
+++ linux-2.6-lttng/drivers/kvm/kvm_main.c	2007-11-13 09:48:37.000000000 -0500
@@ -2054,7 +2054,8 @@ again:
 	/*
 	 * Profile KVM exit RIPs:
 	 */
-	if (unlikely(prof_on == KVM_PROFILING)) {
+
+	if (unlikely(immediate_read(prof_on) == KVM_PROFILING)) {
 		kvm_x86_ops->cache_regs(vcpu);
 		profile_hit(KVM_PROFILING, (void *)vcpu->rip);
 	}
Index: linux-2.6-lttng/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched_fair.c	2007-11-13 09:48:36.000000000 -0500
+++ linux-2.6-lttng/kernel/sched_fair.c	2007-11-13 09:48:37.000000000 -0500
@@ -455,12 +455,8 @@ static void enqueue_sleeper(struct cfs_r
 		 * get a milliseconds-range estimation of the amount of
 		 * time that the task spent sleeping:
 		 */
-		if (unlikely(prof_on == SLEEP_PROFILING)) {
-			struct task_struct *tsk = task_of(se);
-
-			profile_hits(SLEEP_PROFILING, (void *)get_wchan(tsk),
+		profile_hits(SLEEP_PROFILING, (void *)get_wchan(task_of(se)),
 				     delta >> 20);
-		}
 	}
 #endif
 }

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 4/8] Add asm-compat.h to x86
  2007-11-13 18:58 ` [patch 4/8] Add asm-compat.h to x86 Mathieu Desnoyers
@ 2007-11-13 19:05   ` H. Peter Anvin
  2007-11-13 20:37     ` [patch 4/8] Add asm-compat.h to x86 -> use new asm.h instead Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-13 19:05 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: akpm, linux-kernel, Thomas Gleixner, Ingo Molnar

Mathieu Desnoyers wrote:
> In assembly code and in gcc inline assembly, we need .long to express a "c long"
> type on i386 and a .quad to express the same on x86_64. Use macros similar to
> powerpc "PPC_LONG" to express those. Name chosen: ASM_LONG. (didn't feel like
> X86_LONG was required)

In the x86 queue I already have a patch which adds <asm/asm.h> for this; 
I used the namespace _ASM_* and the name _ASM_PTR since in Linux it is 
what holds a pointer.

	-hpa


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 18:58 ` [patch 5/8] Immediate Values - x86 Optimization Mathieu Desnoyers
@ 2007-11-13 19:07   ` H. Peter Anvin
  2007-11-13 19:24     ` Mathieu Desnoyers
  2007-11-15  3:08   ` [patch 5/8] Immediate Values - x86 Optimization Rusty Russell
  1 sibling, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-13 19:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
- Use "=g" constraint for char immediate value inline assembly.

"=g" is the same as "=rmi" which is inherently bogus.  In your actual 
code you use "=r", the correct constraint is "=q".

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 19:07   ` H. Peter Anvin
@ 2007-11-13 19:24     ` Mathieu Desnoyers
  2007-11-13 19:36       ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 19:24 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
> - Use "=g" constraint for char immediate value inline assembly.
>
> "=g" is the same as "=rmi" which is inherently bogus.  In your actual code 
> you use "=r", the correct constraint is "=q".
>

Hi Peter,

Yup, =g wasn't what I was looking for at all, the header comment is
bogus.

From
http://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints

`r'
    A register operand is allowed provided that it is in a general register.

From
http://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
Intel 386   config/i386/constraints.md

q
    Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 64-bit mode, any integer register. 


I am worried that "=q" might exclude the si and di registers in 32-bit mode.

What exactly is wrong with "=r" ?


> 	-hpa

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 19:24     ` Mathieu Desnoyers
@ 2007-11-13 19:36       ` H. Peter Anvin
  2007-11-13 19:45         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-13 19:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
>> - Use "=g" constraint for char immediate value inline assembly.
>>
>> "=g" is the same as "=rmi" which is inherently bogus.  In your actual code 
>> you use "=r", the correct constraint is "=q".
> 
> q
>     Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 64-bit mode, any integer register. 
> 
> 
> I am worried that "=q" might exclude the si and di registers in 32-bit mode.
> 
> What exactly is wrong with "=r" ?
> 

For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.

Hence "=q".

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 19:36       ` H. Peter Anvin
@ 2007-11-13 19:45         ` Mathieu Desnoyers
  2007-11-13 19:56           ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 19:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>>> - Use "=g" constraint for char immediate value inline assembly.
>>>
>>> "=g" is the same as "=rmi" which is inherently bogus.  In your actual 
>>> code you use "=r", the correct constraint is "=q".
>> q
>>     Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 
>> 64-bit mode, any integer register. I am worried that "=q" might exclude 
>> the si and di registers in 32-bit mode.
>> What exactly is wrong with "=r" ?
>
> For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.
>
> Hence "=q".
>

Ah! yep, I see, so we say:

1 byte : "=q"
2 bytes : "=r"
4 bytes : "=r"
8 bytes : "=r"

? (si and di appear to be legal for 2 and 4 bytes in 32-bit mode)




> 	-hpa

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 19:45         ` Mathieu Desnoyers
@ 2007-11-13 19:56           ` H. Peter Anvin
  2007-11-13 20:40             ` [patch 5/8] Immediate Values - x86 Optimization (update) Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-13 19:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
> * H. Peter Anvin (hpa@zytor.com) wrote:
>> Mathieu Desnoyers wrote:
>>>> - Use "=g" constraint for char immediate value inline assembly.
>>>>
>>>> "=g" is the same as "=rmi" which is inherently bogus.  In your actual 
>>>> code you use "=r", the correct constraint is "=q".
>>> q
>>>     Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 
>>> 64-bit mode, any integer register. I am worried that "=q" might exclude 
>>> the si and di registers in 32-bit mode.
>>> What exactly is wrong with "=r" ?
>> For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.
>>
>> Hence "=q".
>>
> 
> Ah! yep, I see, so we say:
> 
> 1 byte : "=q"
> 2 bytes : "=r"
> 4 bytes : "=r"
> 8 bytes : "=r"
> 
> ? (si and di appear to be legal for 2 and 4 bytes in 32-bit mode)
> 

That's right.

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 4/8] Add asm-compat.h to x86 -> use new asm.h instead
  2007-11-13 19:05   ` H. Peter Anvin
@ 2007-11-13 20:37     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 20:37 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> In assembly code and in gcc inline assembly, we need .long to express a "c 
>> long"
>> type on i386 and a .quad to express the same on x86_64. Use macros similar 
>> to
>> powerpc "PPC_LONG" to express those. Name chosen: ASM_LONG. (didn't feel 
>> like
>> X86_LONG was required)
>
> In the x86 queue I already have a patch which adds <asm/asm.h> for this; I 
> used the namespace _ASM_* and the name _ASM_PTR since in Linux it is what 
> holds a pointer.
>
> 	-hpa
>

Andrew, this asm-compat.h patch should be replaced by the asm.h patch
from Peter. Here it is, straight from the x86 merge git repository :


From: H. Peter Anvin <hpa@zytor.com>
Date: Fri, 2 Nov 2007 20:59:47 +0000 (-0700)
Subject: x86: add <asm/asm.h>
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fhpa%2Flinux-2.6-x86-headermerge.git;a=commitdiff_plain;h=b02f15537b3bf43e347214cf14bad80aeaef1caf;hp=54866f032307063776b4eff7eadb131d47f9f9b4

x86: add <asm/asm.h>

Create <asm/asm.h>, with common definitions suitable for assembly
unification.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---

diff --git a/include/asm-x86/asm.h b/include/asm-x86/asm.h
new file mode 100644
index 0000000..b5006eb
--- /dev/null
+++ b/include/asm-x86/asm.h
@@ -0,0 +1,18 @@
+#ifndef _ASM_X86_ASM_H
+#define _ASM_X86_ASM_H
+
+#ifdef CONFIG_X86_32
+/* 32 bits */
+
+# define _ASM_PTR	" .long "
+# define _ASM_ALIGN	" .balign 4 "
+
+#else
+/* 64 bits */
+
+# define _ASM_PTR	" .quad "
+# define _ASM_ALIGN	" .balign 8 "
+
+#endif /* CONFIG_X86_32 */
+
+#endif /* _ASM_X86_ASM_H */


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update)
  2007-11-13 19:56           ` H. Peter Anvin
@ 2007-11-13 20:40             ` Mathieu Desnoyers
  2007-11-13 21:26               ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 20:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> * H. Peter Anvin (hpa@zytor.com) wrote:
>>> Mathieu Desnoyers wrote:
>>>>> - Use "=g" constraint for char immediate value inline assembly.
>>>>>
>>>>> "=g" is the same as "=rmi" which is inherently bogus.  In your actual 
>>>>> code you use "=r", the correct constraint is "=q".
>>>> q
>>>>     Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 
>>>> 64-bit mode, any integer register. I am worried that "=q" might exclude 
>>>> the si and di registers in 32-bit mode.
>>>> What exactly is wrong with "=r" ?
>>> For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.
>>>
>>> Hence "=q".
>>>
>> Ah! yep, I see, so we say:
>> 1 byte : "=q"
>> 2 bytes : "=r"
>> 4 bytes : "=r"
>> 8 bytes : "=r"
>> ? (si and di appear to be legal for 2 and 4 bytes in 32-bit mode)
>
> That's right.
>
> 	-hpa

Andrew,

This version of the immediate values x86 optimization should be used
instead of the one originally submitted. It uses the correct constraints
and uses the new asm.h instead of asm-compat.h.


Immediate Values - x86 Optimization

x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use "=r" constraint for char immediate value inline assembly.
- Use "=q" for 1 byte immediate values to exclude si, di, bp, sp from the usable
  register set.
- Use asm-x86/asm.h.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
---
 arch/x86/Kconfig.i386       |    3 
 arch/x86/Kconfig.x86_64     |    3 
 arch/x86/kernel/Makefile_32 |    1 
 arch/x86/kernel/Makefile_64 |    1 
 arch/x86/kernel/immediate.c |  330 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps_32.c  |   10 -
 include/asm-x86/immediate.h |  100 +++++++++++++
 7 files changed, 444 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32	2007-11-13 15:28:11.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_32	2007-11-13 15:29:13.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_KPROBES)		+= kprobes_32.o
 obj-$(CONFIG_MODULES)		+= module_32.o
 obj-y				+= sysenter_32.o vsyscall_32.o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat_32.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_EFI) 		+= efi_32.o efi_stub_32.o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault_32.o
 obj-$(CONFIG_VM86)		+= vm86_32.o
Index: linux-2.6-lttng/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c	2007-11-13 15:28:11.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/traps_32.c	2007-11-13 15:29:13.000000000 -0500
@@ -544,7 +544,7 @@ fastcall void do_##name(struct pt_regs *
 }
 
 DO_VM86_ERROR_INFO( 0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->eip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
 DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
 #endif
 DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
@@ -786,7 +786,7 @@ void restart_nmi(void)
 	acpi_nmi_enable();
 }
 
-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
 fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
 	trace_hardirqs_fixup();
@@ -794,8 +794,10 @@ fastcall void __kprobes do_int3(struct p
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
 		return;
-	/* This is an interrupt gate, because kprobes wants interrupts
-	disabled.  Normal trap handlers don't. */
+	/*
+	 * This is an interrupt gate, because kprobes and immediate values wants
+	 * interrupts disabled.  Normal trap handlers don't.
+	 */
 	restore_interrupts(regs);
 	do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
 }
Index: linux-2.6-lttng/include/asm-x86/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-x86/immediate.h	2007-11-13 15:31:01.000000000 -0500
@@ -0,0 +1,100 @@
+#ifndef _ASM_I386_IMMEDIATE_H
+#define _ASM_I386_IMMEDIATE_H
+
+/*
+ * Immediate values. i386 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm.h>
+
+struct __immediate {
+	long var;		/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	long immediate;		/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	long size;		/* Type size. */
+} __attribute__ ((aligned(sizeof(long))));
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
+ * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
+ * fall back on a memory read.
+ * The 64 bits load immediates produced by gas have a 2 bytes opcode.
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * what will generate an instruction with 2 bytes opcode and 8 bytes immediate
+ * value.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+1, 1\n\t"	\
+					".previous\n\t"			\
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=q" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 2:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+2, 2\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".align 2\n\t"			\
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 4:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+1, 4\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".org . + 3 - (. & 3), 0x90\n\t" \
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 8:							\
+			if (sizeof(long) < 8) {				\
+				value = name##__immediate;		\
+				break;					\
+			}						\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+1, 4\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".org . + 6 - (. & 7), 0x90\n\t" \
+					"0:\n\t"			\
+					"mov $0xFEFEFEFE01010101,%0\n\t" \
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		default:value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_I386_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-13 15:29:13.000000000 -0500
@@ -0,0 +1,330 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discoverers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel applies not so well in kernel context: it
+ * would imply disabling interrupts and looping on every CPUs while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no cpu could possibly still have the out-of-date copy of the instruction,
+ * modify the now unused 2nd byte of the instruction, and then put back the
+ * original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/notifier.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define BREAKPOINT_INS_LEN	1
+#define NR_NOPS			10
+
+static long target_after_int3;		/* EIP of the target after the int3 */
+static long bypass_eip;			/* EIP of the bypass. */
+static long bypass_after_int3;		/* EIP after the end-of-bypass int3 */
+static long after_immediate;		/*
+					 * EIP where to resume after the
+					 * single-stepping.
+					 */
+
+/*
+ * Size of the movl instruction (without the immediate value) in bytes.
+ * The 2 bytes load immediate has a 66H prefix, which makes the opcode 2 bytes
+ * wide.
+ */
+static inline size_t _immediate_get_insn_size(long size)
+{
+	switch (size) {
+	case 1: return 1;
+	case 2: return 2;
+	case 4: return 1;
+#ifdef CONFIG_X86_64
+	case 8: return 2;
+#endif
+	default: BUG();
+	};
+}
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * (10x nop) int3
+ * (maximum size is 2 bytes opcode + 8 bytes immediate value for long on x86_64)
+ * The nops are the target replaced by the instruction to single-step.
+ */
+static inline void _immediate_bypass(long *bypassaddr, long *breaknextaddr)
+{
+		asm volatile("jmp 2f;\n\t"
+				"0:\n\t"
+				".space 10, 0x90;\n\t"
+				"1:\n\t"
+				"int3;\n\t"
+				"2:\n\t"
+				"mov $(0b),%0;\n\t"
+				"mov $((1b)+1),%1;\n\t"
+				: "=r" (*bypassaddr),
+				  "=r" (*breaknextaddr));
+}
+
+static void immediate_synchronize_core(void *info)
+{
+	sync_core();	/* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass users have
+ * ended.
+ */
+static int immediate_notifier(struct notifier_block *nb,
+	unsigned long val, void *data)
+{
+	enum die_val die_val = (enum die_val) val;
+	struct die_args *args = data;
+
+	if (!args->regs || user_mode_vm(args->regs))
+		return NOTIFY_DONE;
+
+	if (die_val == DIE_INT3) {
+		if (instruction_pointer(args->regs) == target_after_int3) {
+			preempt_disable();
+			instruction_pointer(args->regs) = bypass_eip;
+			return NOTIFY_STOP;
+		} else if (instruction_pointer(args->regs)
+				== bypass_after_int3) {
+			instruction_pointer(args->regs) = after_immediate;
+			preempt_enable();
+			return NOTIFY_STOP;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block immediate_notify = {
+	.notifier_call = immediate_notifier,
+	.priority = 0x7fffffff,	/* we need to be notified first */
+};
+
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+__kprobes int arch_immediate_update(const struct __immediate *immediate)
+{
+	int ret;
+	size_t insn_size = _immediate_get_insn_size(immediate->size);
+	long insn = immediate->immediate - insn_size;
+	long len;
+	unsigned long cr0;
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	if (unlikely(*(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void *)immediate->immediate,
+				    (void *)immediate->var, immediate->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	_immediate_bypass(&bypass_eip, &bypass_after_int3);
+
+	after_immediate = immediate->immediate + immediate->size;
+
+	/*
+	 * Using the _early variants because nobody is executing the
+	 * bypass code while we patch it. It is protected by the
+	 * immediate_mutex. Since we modify the instructions non atomically (for
+	 * nops), we have to use the _early variant.
+	 * We must however deal with the WP flag in cr0 by ourself.
+	 */
+	kernel_wp_save(cr0);
+	text_poke_early((void *)bypass_eip, (void *)insn,
+			insn_size + immediate->size);
+	/*
+	 * Fill the rest with nops.
+	 */
+	len = NR_NOPS - immediate->size - insn_size;
+	add_nops((void *)(bypass_eip + insn_size + immediate->size), len);
+	kernel_wp_restore(cr0);
+
+	target_after_int3 = insn + BREAKPOINT_INS_LEN;
+	/* register_die_notifier has memory barriers */
+	register_die_notifier(&immediate_notify);
+	/* The breakpoint will single-step the bypass */
+	text_poke((void *)insn,
+		INIT_ARRAY(unsigned char, BREAKPOINT_INSTRUCTION, 1), 1);
+	/*
+	 * Make sure the breakpoint is set before we continue (visible to other
+	 * CPUs and interrupts).
+	 */
+	wmb();
+	/*
+	 * Execute serializing instruction on each CPU.
+	 */
+	ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
+	BUG_ON(ret != 0);
+
+	text_poke((void *)(insn + insn_size), (void *)immediate->var,
+			immediate->size);
+	/*
+	 * Make sure the value can be seen from other CPUs and interrupts.
+	 */
+	wmb();
+	text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+		/*
+		 * Wait for all int3 handlers to end
+		 * (interrupts are disabled in int3).
+		 * This CPU is clearly not in a int3 handler,
+		 * because int3 handler is not preemptible and
+		 * there cannot be any more int3 handler called
+		 * for this site, because we placed the original
+		 * instruction back.
+		 * synchronize_sched has memory barriers.
+		 */
+	synchronize_sched();
+	unregister_die_notifier(&immediate_notify);
+	/* unregister_die_notifier has memory barriers */
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ */
+void arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return;
+		break;
+#endif
+	default:return;
+	}
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+}
Index: linux-2.6-lttng/arch/x86/kernel/Makefile_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_64	2007-11-13 15:28:11.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_64	2007-11-13 15:29:13.000000000 -0500
@@ -33,6 +33,7 @@ obj-$(CONFIG_X86_PM_TIMER)	+= pmtimer_64
 obj-$(CONFIG_X86_VSMP)		+= vsmp_64.o
 obj-$(CONFIG_K8_NB)		+= k8.o
 obj-$(CONFIG_AUDIT)		+= audit_64.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 
 obj-$(CONFIG_MODULES)		+= module_64.o
 obj-$(CONFIG_PCI)		+= early-quirks.o
Index: linux-2.6-lttng/arch/x86/Kconfig.i386
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig.i386	2007-11-13 15:28:11.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig.i386	2007-11-13 15:29:13.000000000 -0500
@@ -97,6 +97,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 
 menu "Processor type and features"
Index: linux-2.6-lttng/arch/x86/Kconfig.x86_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig.x86_64	2007-11-13 15:28:11.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig.x86_64	2007-11-13 15:29:13.000000000 -0500
@@ -139,6 +139,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update)
  2007-11-13 20:40             ` [patch 5/8] Immediate Values - x86 Optimization (update) Mathieu Desnoyers
@ 2007-11-13 21:26               ` H. Peter Anvin
  2007-11-13 22:02                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-13 21:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
> * H. Peter Anvin (hpa@zytor.com) wrote:
>> Mathieu Desnoyers wrote:
>>> * H. Peter Anvin (hpa@zytor.com) wrote:
>>>> Mathieu Desnoyers wrote:
>>>>>> - Use "=g" constraint for char immediate value inline assembly.
>>>>>>
>>>>>> "=g" is the same as "=rmi" which is inherently bogus.  In your actual 
>>>>>> code you use "=r", the correct constraint is "=q".
>>>>> q
>>>>>     Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 
>>>>> 64-bit mode, any integer register. I am worried that "=q" might exclude 
>>>>> the si and di registers in 32-bit mode.
>>>>> What exactly is wrong with "=r" ?
>>>> For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.
>>>>
>>>> Hence "=q".
>>>>
>>> Ah! yep, I see, so we say:
>>> 1 byte : "=q"
>>> 2 bytes : "=r"
>>> 4 bytes : "=r"
>>> 8 bytes : "=r"
>>> ? (si and di appear to be legal for 2 and 4 bytes in 32-bit mode)
>> That's right.
>>
>> 	-hpa
> 

Something else to watch out for... in 64-bit mode the lengths most of 
these will depend on which register is used, since whether or not a REX 
prefix is needed will vary.

As far as I can tell, you're assuming fixed length instructions, which 
is wrong unless you manually constrain yourself to only legacy registers.

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update)
  2007-11-13 21:26               ` H. Peter Anvin
@ 2007-11-13 22:02                 ` Mathieu Desnoyers
  2007-11-13 22:35                   ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-13 22:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> * H. Peter Anvin (hpa@zytor.com) wrote:
>>> Mathieu Desnoyers wrote:
>>>> * H. Peter Anvin (hpa@zytor.com) wrote:
>>>>> Mathieu Desnoyers wrote:
>>>>>>> - Use "=g" constraint for char immediate value inline assembly.
>>>>>>>
>>>>>>> "=g" is the same as "=rmi" which is inherently bogus.  In your actual 
>>>>>>> code you use "=r", the correct constraint is "=q".
>>>>>> q
>>>>>>     Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 
>>>>>> 64-bit mode, any integer register. I am worried that "=q" might 
>>>>>> exclude the si and di registers in 32-bit mode.
>>>>>> What exactly is wrong with "=r" ?
>>>>> For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.
>>>>>
>>>>> Hence "=q".
>>>>>
>>>> Ah! yep, I see, so we say:
>>>> 1 byte : "=q"
>>>> 2 bytes : "=r"
>>>> 4 bytes : "=r"
>>>> 8 bytes : "=r"
>>>> ? (si and di appear to be legal for 2 and 4 bytes in 32-bit mode)
>>> That's right.
>>>
>>> 	-hpa
>
> Something else to watch out for... in 64-bit mode the lengths most of these 
> will depend on which register is used, since whether or not a REX prefix is 
> needed will vary.
>
> As far as I can tell, you're assuming fixed length instructions, which is 
> wrong unless you manually constrain yourself to only legacy registers.
>

This is what I was pointing out in this previous message :
http://lkml.org/lkml/2007/10/20/92

"I am still trying to figure out if we must assume that gas will produce
different length opcodes for mov instructions. The choice is:
- Either I use a "r" constraint and let gcc produce the instructions,
  that I need to assume to have correct size so I can align their
  immediate values (therefore, taking the offset from the end of the
  instruction will not help). Here, if gas changes its behavior
  dramatically for a given immediate value size, it will break.

- Second choice is to stick to a particular register, choosing the one
  with the less side-effect, and encoding the instruction ourselves. I
  start to think that this second solution might be safer, even though
  we wouldn't let the compiler select the register which has the less
  impact by itself."

Andi seemed to trust gas stability and you answered:

"The comment was referring to x86-64, but I incorrectly remembered that 
applying to "movq $imm,%reg" as opposed to loading from an absolute 
address.  gas actually has a special opcode (movabs) for the 64-bit 
version of the latter variant, which is only available with %rax and its 
subregisters.

Nevermind, in other words.  It's still true, though, that the immediate 
will always be the last thing in the instruction -- that's a fixture of 
the instruction format."

So, in the end, is there a way to make x86_64 use a fixed-size opcode
for the 1, 2, 4 and 8 bytes load immediates or we will have to force the
use of a specific register ?

(and we can't take a pointer from the end of the instruction, because we
need to align the immediate value correctly)

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update)
  2007-11-13 22:02                 ` Mathieu Desnoyers
@ 2007-11-13 22:35                   ` H. Peter Anvin
  2007-11-14  0:34                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-13 22:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
> 
> Andi seemed to trust gas stability and you answered:
> 
> "The comment was referring to x86-64, but I incorrectly remembered that 
> applying to "movq $imm,%reg" as opposed to loading from an absolute 
> address.  gas actually has a special opcode (movabs) for the 64-bit 
> version of the latter variant, which is only available with %rax and its 
> subregisters.
> 
> Nevermind, in other words.  It's still true, though, that the immediate 
> will always be the last thing in the instruction -- that's a fixture of 
> the instruction format."
> 
> So, in the end, is there a way to make x86_64 use a fixed-size opcode
> for the 1, 2, 4 and 8 bytes load immediates or we will have to force the
> use of a specific register ?
> 
> (and we can't take a pointer from the end of the instruction, because we
> need to align the immediate value correctly)
> 

For a 64-bit load, you'll always have a REX prefix.  For 8-, 16- and 
32-bit load, the length of the instruction will depend on the register 
chosen, unless you constrain to either all legacy or all upper 
registers, or you force gas to generate a prefix, but I don't think 
there is a way to do that that will work with assemblers all the way 
back to 2.12, which is at least what we officially support (I have no 
idea if assemblers that far back actually *work*, mind you.)

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update)
  2007-11-13 22:35                   ` H. Peter Anvin
@ 2007-11-14  0:34                     ` Mathieu Desnoyers
  2007-11-14  1:02                       ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-14  0:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> Andi seemed to trust gas stability and you answered:
>> "The comment was referring to x86-64, but I incorrectly remembered that 
>> applying to "movq $imm,%reg" as opposed to loading from an absolute 
>> address.  gas actually has a special opcode (movabs) for the 64-bit 
>> version of the latter variant, which is only available with %rax and its 
>> subregisters.
>> Nevermind, in other words.  It's still true, though, that the immediate 
>> will always be the last thing in the instruction -- that's a fixture of 
>> the instruction format."
>> So, in the end, is there a way to make x86_64 use a fixed-size opcode
>> for the 1, 2, 4 and 8 bytes load immediates or we will have to force the
>> use of a specific register ?
>> (and we can't take a pointer from the end of the instruction, because we
>> need to align the immediate value correctly)
>
> For a 64-bit load, you'll always have a REX prefix.  For 8-, 16- and 32-bit 
> load, the length of the instruction will depend on the register chosen, 
> unless you constrain to either all legacy or all upper registers, or you 
> force gas to generate a prefix, but I don't think there is a way to do that 
> that will work with assemblers all the way back to 2.12, which is at least 
> what we officially support (I have no idea if assemblers that far back 
> actually *work*, mind you.)
>
> 	-hpa

Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :


1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
                 on all i386 processors (a, b, c, d, si, di, bp, sp). 
8 bytes : (only for x86_64)
          "=r" : A register operand is allowed provided that it is in a
                 general register.

That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.

Does it make sense ?

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update)
  2007-11-14  0:34                     ` Mathieu Desnoyers
@ 2007-11-14  1:02                       ` H. Peter Anvin
  2007-11-14  1:44                         ` [patch 5/8] Immediate Values - x86 Optimization (update 2) Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-14  1:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
> 
> Ok, so the most flexible solution that I see, that should fit for both
> i386 and x86_64 would be :
> 
> 
> 1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
> 2, 4 bytes : "=R" : Legacy register—the eight integer registers available
>                  on all i386 processors (a, b, c, d, si, di, bp, sp). 
> 8 bytes : (only for x86_64)
>           "=r" : A register operand is allowed provided that it is in a
>                  general register.
> 
> That should make sure x86_64 won't try to use REX prefixed opcodes for
> 1, 2 and 4 bytes values.
> 
> Does it make sense ?
> 

That's probably the best bet.

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update 2)
  2007-11-14  1:02                       ` H. Peter Anvin
@ 2007-11-14  1:44                         ` Mathieu Desnoyers
  2007-11-14  2:58                           ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-14  1:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Immediate Values - x86 Optimization

x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.

Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
                 on all i386 processors (a, b, c, d, si, di, bp, sp). 8
bytes : (only for x86_64)
          "=r" : A register operand is allowed provided that it is in a
                 general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
---
 arch/x86/Kconfig.i386       |    3 
 arch/x86/Kconfig.x86_64     |    3 
 arch/x86/kernel/Makefile_32 |    1 
 arch/x86/kernel/Makefile_64 |    1 
 arch/x86/kernel/immediate.c |  330 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps_32.c  |   10 -
 include/asm-x86/immediate.h |  109 ++++++++++++++
 7 files changed, 453 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32	2007-11-13 17:06:16.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_32	2007-11-13 20:12:34.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_KPROBES)		+= kprobes_32.o
 obj-$(CONFIG_MODULES)		+= module_32.o
 obj-y				+= sysenter_32.o vsyscall_32.o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat_32.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_EFI) 		+= efi_32.o efi_stub_32.o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault_32.o
 obj-$(CONFIG_VM86)		+= vm86_32.o
Index: linux-2.6-lttng/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c	2007-11-13 17:06:16.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/traps_32.c	2007-11-13 20:12:34.000000000 -0500
@@ -544,7 +544,7 @@ fastcall void do_##name(struct pt_regs *
 }
 
 DO_VM86_ERROR_INFO( 0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->eip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
 DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
 #endif
 DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
@@ -786,7 +786,7 @@ void restart_nmi(void)
 	acpi_nmi_enable();
 }
 
-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
 fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
 	trace_hardirqs_fixup();
@@ -794,8 +794,10 @@ fastcall void __kprobes do_int3(struct p
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
 		return;
-	/* This is an interrupt gate, because kprobes wants interrupts
-	disabled.  Normal trap handlers don't. */
+	/*
+	 * This is an interrupt gate, because kprobes and immediate values wants
+	 * interrupts disabled.  Normal trap handlers don't.
+	 */
 	restore_interrupts(regs);
 	do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
 }
Index: linux-2.6-lttng/include/asm-x86/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-x86/immediate.h	2007-11-13 20:16:05.000000000 -0500
@@ -0,0 +1,109 @@
+#ifndef _ASM_I386_IMMEDIATE_H
+#define _ASM_I386_IMMEDIATE_H
+
+/*
+ * Immediate values. i386 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm.h>
+
+struct __immediate {
+	long var;		/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	long immediate;		/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	long size;		/* Type size. */
+} __attribute__ ((aligned(sizeof(long))));
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
+ * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
+ * fall back on a memory read.
+ * The 64 bits load immediates produced by gas have a 2 bytes opcode.
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * what will generate an instruction with 2 bytes opcode and 8 bytes immediate
+ * value.
+ * A solution that should fit for both i386 and x86_64 would be :
+ * 1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
+ * 2, 4 bytes : "=R" : Legacy register—the eight integer registers available
+ *               on all i386 processors (a, b, c, d, si, di, bp, sp).
+ * 8 bytes : (only for x86_64)
+ *              "=r" : A register operand is allowed provided that it is in a
+ *              general register.
+ * That should make sure x86_64 won't try to use REX prefixed opcodes for
+ * 1, 2 and 4 bytes values.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+1, 1\n\t"	\
+					".previous\n\t"			\
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=Q" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 2:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+2, 2\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".align 2\n\t"			\
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=R" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 4:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+1, 4\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".org . + 3 - (. & 3), 0x90\n\t" \
+					"0:\n\t"			\
+					"mov $0,%0\n\t"			\
+				: "=R" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		case 8:							\
+			if (sizeof(long) < 8) {				\
+				value = name##__immediate;		\
+				break;					\
+			}						\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+					_ASM_PTR "%c1, (0f)+1, 4\n\t"	\
+					".previous\n\t"			\
+					"1:\n\t"			\
+					".org . + 6 - (. & 7), 0x90\n\t" \
+					"0:\n\t"			\
+					"mov $0xFEFEFEFE01010101,%0\n\t" \
+				: "=r" (value)				\
+				: "i" (&name##__immediate));		\
+			break;						\
+		default:value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_I386_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-13 20:12:34.000000000 -0500
@@ -0,0 +1,330 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discoverers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel applies not so well in kernel context: it
+ * would imply disabling interrupts and looping on every CPUs while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no cpu could possibly still have the out-of-date copy of the instruction,
+ * modify the now unused 2nd byte of the instruction, and then put back the
+ * original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/notifier.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION  0xcc
+#define BREAKPOINT_INS_LEN	1
+#define NR_NOPS			10
+
+static long target_after_int3;		/* EIP of the target after the int3 */
+static long bypass_eip;			/* EIP of the bypass. */
+static long bypass_after_int3;		/* EIP after the end-of-bypass int3 */
+static long after_immediate;		/*
+					 * EIP where to resume after the
+					 * single-stepping.
+					 */
+
+/*
+ * Size of the movl instruction (without the immediate value) in bytes.
+ * The 2 bytes load immediate has a 66H prefix, which makes the opcode 2 bytes
+ * wide.
+ */
+static inline size_t _immediate_get_insn_size(long size)
+{
+	switch (size) {
+	case 1: return 1;
+	case 2: return 2;
+	case 4: return 1;
+#ifdef CONFIG_X86_64
+	case 8: return 2;
+#endif
+	default: BUG();
+	};
+}
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * (10x nop) int3
+ * (maximum size is 2 bytes opcode + 8 bytes immediate value for long on x86_64)
+ * The nops are the target replaced by the instruction to single-step.
+ */
+static inline void _immediate_bypass(long *bypassaddr, long *breaknextaddr)
+{
+		asm volatile("jmp 2f;\n\t"
+				"0:\n\t"
+				".space 10, 0x90;\n\t"
+				"1:\n\t"
+				"int3;\n\t"
+				"2:\n\t"
+				"mov $(0b),%0;\n\t"
+				"mov $((1b)+1),%1;\n\t"
+				: "=r" (*bypassaddr),
+				  "=r" (*breaknextaddr));
+}
+
+static void immediate_synchronize_core(void *info)
+{
+	sync_core();	/* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass users have
+ * ended.
+ */
+static int immediate_notifier(struct notifier_block *nb,
+	unsigned long val, void *data)
+{
+	enum die_val die_val = (enum die_val) val;
+	struct die_args *args = data;
+
+	if (!args->regs || user_mode_vm(args->regs))
+		return NOTIFY_DONE;
+
+	if (die_val == DIE_INT3) {
+		if (instruction_pointer(args->regs) == target_after_int3) {
+			preempt_disable();
+			instruction_pointer(args->regs) = bypass_eip;
+			return NOTIFY_STOP;
+		} else if (instruction_pointer(args->regs)
+				== bypass_after_int3) {
+			instruction_pointer(args->regs) = after_immediate;
+			preempt_enable();
+			return NOTIFY_STOP;
+		}
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block immediate_notify = {
+	.notifier_call = immediate_notifier,
+	.priority = 0x7fffffff,	/* we need to be notified first */
+};
+
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ */
+__kprobes int arch_immediate_update(const struct __immediate *immediate)
+{
+	int ret;
+	size_t insn_size = _immediate_get_insn_size(immediate->size);
+	long insn = immediate->immediate - insn_size;
+	long len;
+	unsigned long cr0;
+
+#ifdef CONFIG_KPROBES
+	/*
+	 * Fail if a kprobe has been set on this instruction.
+	 * (TODO: we could eventually do better and modify all the (possibly
+	 * nested) kprobes for this site if kprobes had an API for this.
+	 */
+	if (unlikely(*(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+		printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+				    "Variable at %p, "
+				    "instruction at %p, size %lu\n",
+				    (void *)immediate->immediate,
+				    (void *)immediate->var, immediate->size);
+		return -EBUSY;
+	}
+#endif
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	_immediate_bypass(&bypass_eip, &bypass_after_int3);
+
+	after_immediate = immediate->immediate + immediate->size;
+
+	/*
+	 * Using the _early variants because nobody is executing the
+	 * bypass code while we patch it. It is protected by the
+	 * immediate_mutex. Since we modify the instructions non atomically (for
+	 * nops), we have to use the _early variant.
+	 * We must however deal with the WP flag in cr0 by ourself.
+	 */
+	kernel_wp_save(cr0);
+	text_poke_early((void *)bypass_eip, (void *)insn,
+			insn_size + immediate->size);
+	/*
+	 * Fill the rest with nops.
+	 */
+	len = NR_NOPS - immediate->size - insn_size;
+	add_nops((void *)(bypass_eip + insn_size + immediate->size), len);
+	kernel_wp_restore(cr0);
+
+	target_after_int3 = insn + BREAKPOINT_INS_LEN;
+	/* register_die_notifier has memory barriers */
+	register_die_notifier(&immediate_notify);
+	/* The breakpoint will single-step the bypass */
+	text_poke((void *)insn,
+		INIT_ARRAY(unsigned char, BREAKPOINT_INSTRUCTION, 1), 1);
+	/*
+	 * Make sure the breakpoint is set before we continue (visible to other
+	 * CPUs and interrupts).
+	 */
+	wmb();
+	/*
+	 * Execute serializing instruction on each CPU.
+	 */
+	ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
+	BUG_ON(ret != 0);
+
+	text_poke((void *)(insn + insn_size), (void *)immediate->var,
+			immediate->size);
+	/*
+	 * Make sure the value can be seen from other CPUs and interrupts.
+	 */
+	wmb();
+	text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+		/*
+		 * Wait for all int3 handlers to end
+		 * (interrupts are disabled in int3).
+		 * This CPU is clearly not in a int3 handler,
+		 * because int3 handler is not preemptible and
+		 * there cannot be any more int3 handler called
+		 * for this site, because we placed the original
+		 * instruction back.
+		 * synchronize_sched has memory barriers.
+		 */
+	synchronize_sched();
+	unregister_die_notifier(&immediate_notify);
+	/* unregister_die_notifier has memory barriers */
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ */
+void arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return;
+		break;
+#endif
+	default:return;
+	}
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+}
Index: linux-2.6-lttng/arch/x86/kernel/Makefile_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_64	2007-11-13 17:06:16.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_64	2007-11-13 20:12:34.000000000 -0500
@@ -33,6 +33,7 @@ obj-$(CONFIG_X86_PM_TIMER)	+= pmtimer_64
 obj-$(CONFIG_X86_VSMP)		+= vsmp_64.o
 obj-$(CONFIG_K8_NB)		+= k8.o
 obj-$(CONFIG_AUDIT)		+= audit_64.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 
 obj-$(CONFIG_MODULES)		+= module_64.o
 obj-$(CONFIG_PCI)		+= early-quirks.o
Index: linux-2.6-lttng/arch/x86/Kconfig.i386
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig.i386	2007-11-13 17:06:16.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig.i386	2007-11-13 20:12:34.000000000 -0500
@@ -97,6 +97,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 
 menu "Processor type and features"
Index: linux-2.6-lttng/arch/x86/Kconfig.x86_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig.x86_64	2007-11-13 17:06:16.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig.x86_64	2007-11-13 20:12:34.000000000 -0500
@@ -139,6 +139,9 @@ config ARCH_SUPPORTS_OPROFILE
 config ARCH_SUPPORTS_KPROBES
 	def_bool y
 
+config ARCH_SUPPORTS_IMMEDIATE
+	def_bool y
+
 source "init/Kconfig"
 
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update 2)
  2007-11-14  1:44                         ` [patch 5/8] Immediate Values - x86 Optimization (update 2) Mathieu Desnoyers
@ 2007-11-14  2:58                           ` H. Peter Anvin
  2007-11-14 14:16                             ` Mathieu Desnoyers
  2007-11-14 18:52                             ` [PATCH] Immediate Values x86 Optimization Declare Discarded Instruction Mathieu Desnoyers
  0 siblings, 2 replies; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-14  2:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

Mathieu Desnoyers wrote:
> Immediate Values - x86 Optimization
> 
> x86 optimization of the immediate values which uses a movl with code patching
> to set/unset the value used to populate the register used as variable source.
> 
> Changelog:
> - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
>   non atomic writes to a code region only touched by us (nobody can execute it
>   since we are protected by the immediate_mutex).
> - Put immediate_set and _immediate_set in the architecture independent header.
> - Use $0 instead of %2 with (0) operand.
> - Add x86_64 support, ready for i386+x86_64 -> x86 merge.
> - Use asm-x86/asm.h.
> 
> Ok, so the most flexible solution that I see, that should fit for both
> i386 and x86_64 would be :
> 1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
> 2, 4 bytes : "=R" : Legacy register—the eight integer registers available
>                  on all i386 processors (a, b, c, d, si, di, bp, sp). 8
> bytes : (only for x86_64)
>           "=r" : A register operand is allowed provided that it is in a
>                  general register.
> That should make sure x86_64 won't try to use REX prefixed opcodes for
> 1, 2 and 4 bytes values.
> 

I just had a couple of utterly sick ideas.

Consider this variant (this example is for a 32-bit immediate on x86-64, 
but the obvious macroizations apply):

	.section __discard,"a",@progbits
1:	movl $0x12345678,%r9d
2:
	.previous

	.section __immediate,"a",@progbits
	.quad foo_immediate, (3f)-4, 4
	.previous
	
	.org . + ((-.-(2b-1b)) & 3), 0x90
	movl $0x12345678,%r9d
3:


The idea is that the instruction is emitted into a section, which is 
marked DISCARD in the linker script.  That lets us actually measure the 
length, and since we know the immediate is always at the end of the 
instruction... done!

	-hpa


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (update 2)
  2007-11-14  2:58                           ` H. Peter Anvin
@ 2007-11-14 14:16                             ` Mathieu Desnoyers
  2007-11-14 18:52                             ` [PATCH] Immediate Values x86 Optimization Declare Discarded Instruction Mathieu Desnoyers
  1 sibling, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-14 14:16 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> Immediate Values - x86 Optimization
>> x86 optimization of the immediate values which uses a movl with code 
>> patching
>> to set/unset the value used to populate the register used as variable 
>> source.
>> Changelog:
>> - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are 
>> doing
>>   non atomic writes to a code region only touched by us (nobody can 
>> execute it
>>   since we are protected by the immediate_mutex).
>> - Put immediate_set and _immediate_set in the architecture independent 
>> header.
>> - Use $0 instead of %2 with (0) operand.
>> - Add x86_64 support, ready for i386+x86_64 -> x86 merge.
>> - Use asm-x86/asm.h.
>> Ok, so the most flexible solution that I see, that should fit for both
>> i386 and x86_64 would be :
>> 1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
>> 2, 4 bytes : "=R" : Legacy register—the eight integer registers 
>> available
>>                  on all i386 processors (a, b, c, d, si, di, bp, sp). 8
>> bytes : (only for x86_64)
>>           "=r" : A register operand is allowed provided that it is in a
>>                  general register.
>> That should make sure x86_64 won't try to use REX prefixed opcodes for
>> 1, 2 and 4 bytes values.
>
> I just had a couple of utterly sick ideas.
>
> Consider this variant (this example is for a 32-bit immediate on x86-64, 
> but the obvious macroizations apply):
>
> 	.section __discard,"a",@progbits
> 1:	movl $0x12345678,%r9d
> 2:
> 	.previous
>
> 	.section __immediate,"a",@progbits
> 	.quad foo_immediate, (3f)-4, 4
> 	.previous
> 	
> 	.org . + ((-.-(2b-1b)) & 3), 0x90
> 	movl $0x12345678,%r9d
> 3:
>
>
> The idea is that the instruction is emitted into a section, which is marked 
> DISCARD in the linker script.  That lets us actually measure the length, 
> and since we know the immediate is always at the end of the instruction... 
> done!
>

Wow! I like this idea. I'll give it a try in a test sandbox to see if
gas or ld complains. I'll keep you posted.

Mathieu

>


> 	-hpa
>

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] Immediate Values x86 Optimization Declare Discarded Instruction
  2007-11-14  2:58                           ` H. Peter Anvin
  2007-11-14 14:16                             ` Mathieu Desnoyers
@ 2007-11-14 18:52                             ` Mathieu Desnoyers
  2007-11-14 19:00                               ` H. Peter Anvin
  1 sibling, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-14 18:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge, Thomas Gleixner, Ingo Molnar

(the patch follows at the end of the discussion, it applies on top of
the "Immediate Values x86 Optimization (update 2)" patch.

* H. Peter Anvin (hpa@zytor.com) wrote:
> Mathieu Desnoyers wrote:
>> Immediate Values - x86 Optimization
>> x86 optimization of the immediate values which uses a movl with code 
>> patching
>> to set/unset the value used to populate the register used as variable 
>> source.
>> Changelog:
>> - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are 
>> doing
>>   non atomic writes to a code region only touched by us (nobody can 
>> execute it
>>   since we are protected by the immediate_mutex).
>> - Put immediate_set and _immediate_set in the architecture independent 
>> header.
>> - Use $0 instead of %2 with (0) operand.
>> - Add x86_64 support, ready for i386+x86_64 -> x86 merge.
>> - Use asm-x86/asm.h.
>> Ok, so the most flexible solution that I see, that should fit for both
>> i386 and x86_64 would be :
>> 1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
>> 2, 4 bytes : "=R" : Legacy register—the eight integer registers 
>> available
>>                  on all i386 processors (a, b, c, d, si, di, bp, sp). 8
>> bytes : (only for x86_64)
>>           "=r" : A register operand is allowed provided that it is in a
>>                  general register.
>> That should make sure x86_64 won't try to use REX prefixed opcodes for
>> 1, 2 and 4 bytes values.
>
> I just had a couple of utterly sick ideas.
>
> Consider this variant (this example is for a 32-bit immediate on x86-64, 
> but the obvious macroizations apply):
>

nitpicking :

> 	.section __discard,"a",@progbits

could we simply declare a 

     .section __discard,"",@progbits instead ?

I think this should remove the need to tweak the linker script, since
the section would not be allocated.

> 1:	movl $0x12345678,%r9d
> 2:
> 	.previous
>
> 	.section __immediate,"a",@progbits
> 	.quad foo_immediate, (3f)-4, 4
> 	.previous
> 	
> 	.org . + ((-.-(2b-1b)) & 3), 0x90

So in general we would have :

    .org . + ((-.-(2b-1b)) & (immediate_size - 1)), 0x90

Here is the result, on i386, for an immediate value used by markers. I
normally use a 1 byte immediate value for this, but I tried to change
the data type to check the result. The resulting assembly and binary
generated on x86_64 follows.

* A 1 byte immediate value (no need to align) :

#APP
        .section __immediate,"a",@progbits
         .long __mark_kernel_sched_try_wakeup.30104+8, (3f)-1, 1
        .previous
        mov $0,%al
        3:

.LVL1116:
#NO_APP
.LBE2180:
        testb   %al, %al
        jne     .L1028


* A 2 bytes immediate value :

#APP
	.section __discard,"",@progbits
	1:
	mov $0,%ax
	2:
	.previous
	.section __immediate,"a",@progbits
	 .long __mark_kernel_sched_try_wakeup.30104+8, (3f)-2, 2
	.previous
	.org . + ((-.-(2b-1b)) & (2-1)), 0x90
	mov $0,%ax
	3:
	
.LVL1116:
#NO_APP
.LBE2180:
	testw	%ax, %ax
	jne	.L1028


* A 4 bytes immediate value :

#APP
        .section __discard,"",@progbits
        1:
        mov $0,%eax
        2:
        .previous
        .section __immediate,"a",@progbits
         .long __mark_kernel_sched_try_wakeup.30104+8, (3f)-4, 4
        .previous
        .org . + ((-.-(2b-1b)) & (4-1)), 0x90
        mov $0,%eax
        3:
        
.LVL1116:
#NO_APP
.LBE2180:
        testl   %eax, %eax
        jne     .L1028


And on x86_64 :

* A 1 byte immediate value :

#APP
        .section __immediate,"a",@progbits
         .quad __mark_kernel_sched_try_wakeup.31114+16, (3f)-1, 1
        .previous
        mov $0,%al
        3:

.LVL705:
#NO_APP
.LBE2003:
        testb   %al, %al
        je      .L676

Resulting binary :

    28c0:       b0 00                   mov    $0x0,%al
    28c2:       84 c0                   test   %al,%al
    28c4:       74 21                   je     28e7 <try_to_wake_up+0x50>


* A 2 bytes immediate value :
#APP
        .section __discard,"",@progbits
        1:
        mov $0,%ax
        2:
        .previous
        .section __immediate,"a",@progbits
         .quad __mark_kernel_sched_try_wakeup.31114+16, (3f)-2, 2
        .previous
        .org . + ((-.-(2b-1b)) & (2-1)), 0x90
        mov $0,%ax
        3:

.LVL705:
#NO_APP
.LBE2003:
        testw   %ax, %ax
        je      .L676

Resulting binary :

    28c0:       66 b8 00 00             mov    $0x0,%ax
    28c4:       66 85 c0                test   %ax,%ax
    28c7:       74 21                   je     28ea <try_to_wake_up+0x53>


* A 4 bytes immediate value :

#APP
        .section __discard,"",@progbits
        1:
        mov $0,%eax
        2:
        .previous
        .section __immediate,"a",@progbits
         .quad __mark_kernel_sched_try_wakeup.31114+16, (3f)-4, 4
        .previous
        .org . + ((-.-(2b-1b)) & (4-1)), 0x90
        mov $0,%eax
        3:

.LVL705:
#NO_APP
.LBE2003:
        testl   %eax, %eax
        je      .L676

Resulting binary :

    28c0:       90                      nop    
    28c1:       90                      nop    
    28c2:       90                      nop    
    28c3:       b8 00 00 00 00          mov    $0x0,%eax
    28c8:       85 c0                   test   %eax,%eax


* An 8 bytes immediate value :

#APP
        .section __discard,"",@progbits
        1:
        mov $0xFEFEFEFE01010101,%rax
        2:
        .previous
        .section __immediate,"a",@progbits
         .quad __mark_kernel_sched_try_wakeup.31114+16, (3f)-8, 8
        .previous
        .org . + ((-.-(2b-1b)) & (8-1)), 0x90
        mov $0xFEFEFEFE01010101,%rax
        3:

.LVL705:
#NO_APP
.LBE2003:
        testq   %rax, %rax
        je      .L676

And the corresponding binary (for x86_64, using a 8 bytes immediate) :

    28c0:       90                      nop    
    28c1:       90                      nop    
    28c2:       90                      nop    
    28c3:       90                      nop    
    28c4:       90                      nop    
    28c5:       90                      nop    
    28c6:       48 b8 01 01 01 01 fe    mov    $0xfefefefe01010101,%rax
    28cd:       fe fe fe 
    28d0:       48 85 c0                test   %rax,%rax

The patch is at the bottom of this email.

Mathieu

> 	movl $0x12345678,%r9d
> 3:
>
>
> The idea is that the instruction is emitted into a section, which is marked 
> DISCARD in the linker script.  That lets us actually measure the length, 
> and since we know the immediate is always at the end of the instruction... 
> done!
>
> 	-hpa
>

Immediate Values x86 Optimization Declare Discarded Instruction

- Create the instruction in a discarded section to calculate its size. This is
  how we can align the beginning of the instruction on an address that will
  permit atomic modificatino of the immediate value without knowing the size of
  the opcode used by the compiler.
- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
  immediate structure.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/asm-x86/immediate.h |   93 +++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 48 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/immediate.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/immediate.h	2007-11-14 13:39:39.000000000 -0500
+++ linux-2.6-lttng/include/asm-x86/immediate.h	2007-11-14 13:40:18.000000000 -0500
@@ -30,22 +30,18 @@ struct __immediate {
  * Reads the value of @name.
  * Optimized version of the immediate.
  * Do not use in __init and __exit functions. Use _immediate_read() instead.
- * Makes sure the 2 and 4 bytes update will be atomic by aligning the immediate
- * value. 2 bytes (short) uses a 66H prefix. If size is bigger than 4 bytes,
- * fall back on a memory read.
- * The 64 bits load immediates produced by gas have a 2 bytes opcode.
+ * If size is bigger than the architecture long size, fall back on a memory
+ * read.
+ *
  * Make sure to populate the initial static 64 bits opcode with a value
- * what will generate an instruction with 2 bytes opcode and 8 bytes immediate
- * value.
- * A solution that should fit for both i386 and x86_64 would be :
- * 1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
- * 2, 4 bytes : "=R" : Legacy register—the eight integer registers available
- *               on all i386 processors (a, b, c, d, si, di, bp, sp).
- * 8 bytes : (only for x86_64)
- *              "=r" : A register operand is allowed provided that it is in a
- *              general register.
- * That should make sure x86_64 won't try to use REX prefixed opcodes for
- * 1, 2 and 4 bytes values.
+ * what will generate an instruction with 8 bytes immediate value (not the REX.W
+ * prefixed one that loads a sign extended 32 bits immediate value in a r64
+ * register).
+ *
+ * Create the instruction in a discarded section to calculate its size. This is
+ * how we can align the beginning of the instruction on an address that will
+ * permit atomic modificatino of the immediate value without knowing the size of
+ * the opcode used by the compiler. The operand size is known in advance.
  */
 #define immediate_read(name)						\
 	({								\
@@ -53,49 +49,50 @@ struct __immediate {
 		switch (sizeof(value)) {				\
 		case 1:							\
 			asm(".section __immediate,\"a\",@progbits\n\t"	\
-					_ASM_PTR "%c1, (0f)+1, 1\n\t"	\
-					".previous\n\t"			\
-					"0:\n\t"			\
-					"mov $0,%0\n\t"			\
-				: "=Q" (value)				\
-				: "i" (&name##__immediate));		\
+				_ASM_PTR "%c1, (3f)-%c2, %c2\n\t"	\
+				".previous\n\t"				\
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=q" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (sizeof(value)));			\
 			break;						\
 		case 2:							\
-			asm(".section __immediate,\"a\",@progbits\n\t"	\
-					_ASM_PTR "%c1, (0f)+2, 2\n\t"	\
-					".previous\n\t"			\
-					"1:\n\t"			\
-					".align 2\n\t"			\
-					"0:\n\t"			\
-					"mov $0,%0\n\t"			\
-				: "=R" (value)				\
-				: "i" (&name##__immediate));		\
-			break;						\
 		case 4:							\
-			asm(".section __immediate,\"a\",@progbits\n\t"	\
-					_ASM_PTR "%c1, (0f)+1, 4\n\t"	\
-					".previous\n\t"			\
-					"1:\n\t"			\
-					".org . + 3 - (. & 3), 0x90\n\t" \
-					"0:\n\t"			\
-					"mov $0,%0\n\t"			\
-				: "=R" (value)				\
-				: "i" (&name##__immediate));		\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __immediate,\"a\",@progbits\n\t" \
+				_ASM_PTR "%c1, (3f)-%c2, %c2\n\t"	\
+				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (sizeof(value)));			\
 			break;						\
 		case 8:							\
 			if (sizeof(long) < 8) {				\
 				value = name##__immediate;		\
 				break;					\
 			}						\
-			asm(".section __immediate,\"a\",@progbits\n\t"	\
-					_ASM_PTR "%c1, (0f)+1, 4\n\t"	\
-					".previous\n\t"			\
-					"1:\n\t"			\
-					".org . + 6 - (. & 7), 0x90\n\t" \
-					"0:\n\t"			\
-					"mov $0xFEFEFEFE01010101,%0\n\t" \
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __immediate,\"a\",@progbits\n\t" \
+				_ASM_PTR "%c1, (3f)-%c2, %c2\n\t"	\
+				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
+				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
+				"3:\n\t"				\
 				: "=r" (value)				\
-				: "i" (&name##__immediate));		\
+				: "i" (&name##__immediate),		\
+				  "i" (sizeof(value)));			\
 			break;						\
 		default:value = name##__immediate;			\
 			break;						\
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Immediate Values x86 Optimization Declare Discarded Instruction
  2007-11-14 18:52                             ` [PATCH] Immediate Values x86 Optimization Declare Discarded Instruction Mathieu Desnoyers
@ 2007-11-14 19:00                               ` H. Peter Anvin
  2007-11-14 19:16                                 ` [PATCH] Add __discard section to x86 Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-14 19:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge, Thomas Gleixner, Ingo Molnar

Mathieu Desnoyers wrote:
> 
> nitpicking :
> 
>> 	.section __discard,"a",@progbits
> 
> could we simply declare a 
> 
>      .section __discard,"",@progbits instead ?
> 
> I think this should remove the need to tweak the linker script, since
> the section would not be allocated.
> 

The allocation flag is a loader property, not a linker property.  It 
would still be manifest in the vmlinux file; I think this is 
undesirable.  Perhaps not a big deal, but neither is adding a /DISCARD/ 
statement to the linker script:

/DISCARD/ : { *(__discard) }

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] Add __discard section to x86
  2007-11-14 19:00                               ` H. Peter Anvin
@ 2007-11-14 19:16                                 ` Mathieu Desnoyers
  2007-11-14 19:33                                   ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-14 19:16 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge, Thomas Gleixner, Ingo Molnar

Add a __discard sectionto the linker script. Code produced in this section will
not be put in the vmlinux file. This is useful when we have to calculate the
size of an instruction before actually declaring it (for alignment purposes for
instance). This is used by the immediate values.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/kernel/vmlinux_32.lds.S |    1 +
 arch/x86/kernel/vmlinux_64.lds.S |    1 +
 2 files changed, 2 insertions(+)

Index: linux-2.6-lttng/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/vmlinux_32.lds.S	2007-11-14 14:10:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/vmlinux_32.lds.S	2007-11-14 14:11:32.000000000 -0500
@@ -205,6 +205,7 @@ SECTIONS
   /* Sections to be discarded */
   /DISCARD/ : {
 	*(.exitcall.exit)
+	*(__discard)
 	}
 
   STABS_DEBUG
Index: linux-2.6-lttng/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/vmlinux_64.lds.S	2007-11-14 14:10:46.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/vmlinux_64.lds.S	2007-11-14 14:11:48.000000000 -0500
@@ -227,6 +227,7 @@ SECTIONS
   /DISCARD/ : {
 	*(.exitcall.exit)
 	*(.eh_frame)
+	*(__discard)
 	}
 
   STABS_DEBUG
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add __discard section to x86
  2007-11-14 19:16                                 ` [PATCH] Add __discard section to x86 Mathieu Desnoyers
@ 2007-11-14 19:33                                   ` H. Peter Anvin
  0 siblings, 0 replies; 40+ messages in thread
From: H. Peter Anvin @ 2007-11-14 19:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, Chuck Ebbert, Christoph Hellwig,
	Jeremy Fitzhardinge, Thomas Gleixner, Ingo Molnar

Mathieu Desnoyers wrote:
> Add a __discard sectionto the linker script. Code produced in this section will
> not be put in the vmlinux file. This is useful when we have to calculate the
> size of an instruction before actually declaring it (for alignment purposes for
> instance). This is used by the immediate values.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> CC: Andi Kleen <ak@muc.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Chuck Ebbert <cebbert@redhat.com>
> CC: Christoph Hellwig <hch@infradead.org>
> CC: Jeremy Fitzhardinge <jeremy@goop.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>

Acked-by: H. Peter Anvin <hpa@zytor.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-13 18:58 ` [patch 5/8] Immediate Values - x86 Optimization Mathieu Desnoyers
  2007-11-13 19:07   ` H. Peter Anvin
@ 2007-11-15  3:08   ` Rusty Russell
  2007-11-15  4:06     ` Mathieu Desnoyers
  1 sibling, 1 reply; 40+ messages in thread
From: Rusty Russell @ 2007-11-15  3:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge

On Wednesday 14 November 2007 05:58:05 Mathieu Desnoyers wrote:
> x86 optimization of the immediate values which uses a movl with code
> patching to set/unset the value used to populate the register used as
> variable source.

For the record, I think the patching code gross overkill.

A stop_machine (or lightweight variant using IPI) would be sufficient and 
vastly simpler.  Trying to patch NMI handlers while they're running is 
already crazy.

I'd keep this version up your sleeve for they day when it's needed.

Rusty.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-15  3:08   ` [patch 5/8] Immediate Values - x86 Optimization Rusty Russell
@ 2007-11-15  4:06     ` Mathieu Desnoyers
  2007-11-15  4:45       ` Rusty Russell
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-15  4:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Wednesday 14 November 2007 05:58:05 Mathieu Desnoyers wrote:
> > x86 optimization of the immediate values which uses a movl with code
> > patching to set/unset the value used to populate the register used as
> > variable source.
> 
> For the record, I think the patching code gross overkill.
> 
> A stop_machine (or lightweight variant using IPI) would be sufficient and 
> vastly simpler.  Trying to patch NMI handlers while they're running is 
> already crazy.
> 

I wouldn't mind if it was limited to the code within do_nmi(), but then
we would have to accept potential GPF if

A - the NMI or MCE code calls any external kernel code (printk,
  notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code,
  calls printk too for debugging)...

B - we try to patch this code at the wrong moment

I could live with that, but I would prefer to have a solid, non flaky
solution. My goal is to help the kernel quality _improve_ rather than
deteriorate. Therefore, if one decides to use the immediate values to
leave dormant spinlock instrumentation in the kernel, I wouldn't want it
to have undesirable side-effects (GPF) when the instrumentation is
being enabled, as rare as it could be.

> I'd keep this version up your sleeve for they day when it's needed.
> 

If we choose to go this way, stop_machine would have to do a sync_core()
on every CPU before it reactivates interrupts for this to respect
Intel's errata. It's not just a matter of not executing the code while
it is modified; the issue here is that we must insure that we don't have
an incoherent trace cache.  So, as is, stop_machine would not respect
the errata.

Mathieu

> Rusty.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-15  4:06     ` Mathieu Desnoyers
@ 2007-11-15  4:45       ` Rusty Russell
  2007-11-15  5:37         ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Rusty Russell @ 2007-11-15  4:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

On Thursday 15 November 2007 15:06:10 Mathieu Desnoyers wrote:
> * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > A stop_machine (or lightweight variant using IPI) would be sufficient and
> > vastly simpler.  Trying to patch NMI handlers while they're running is
> > already crazy.
>
> I wouldn't mind if it was limited to the code within do_nmi(), but then
> we would have to accept potential GPF if
>
> A - the NMI or MCE code calls any external kernel code (printk,
>   notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code,
>   calls printk too for debugging)...

Sure, but as I pointed out previously, such calls are already best effort.  
You can do very little safely from do_nmi(), and calling printk isn't one of 
them, nor is grabbing a spinlock (well, actually you could as long as it's 
*only* used by NMI handlers.  See any of those?).

> Therefore, if one decides to use the immediate values to
> leave dormant spinlock instrumentation in the kernel, I wouldn't want it
> to have undesirable side-effects (GPF) when the instrumentation is
> being enabled, as rare as it could be.

It's overengineered, since it's less likely than deadlock already.

> > I'd keep this version up your sleeve for they day when it's needed.
>
> If we choose to go this way, stop_machine would have to do a sync_core()
> on every CPU before it reactivates interrupts for this to respect
> Intel's errata.

Yes, I don't think stop_machine is actually what you want anyway, since you 
are happy to run in interrupt context.  An IPI-based scheme is probably 
better, and also has the side effect of iret doing the sync you need, IIUC.

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-15  4:45       ` Rusty Russell
@ 2007-11-15  5:37         ` Mathieu Desnoyers
  2007-11-15 11:06           ` Rusty Russell
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-15  5:37 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Thursday 15 November 2007 15:06:10 Mathieu Desnoyers wrote:
> > * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > > A stop_machine (or lightweight variant using IPI) would be sufficient and
> > > vastly simpler.  Trying to patch NMI handlers while they're running is
> > > already crazy.
> >
> > I wouldn't mind if it was limited to the code within do_nmi(), but then
> > we would have to accept potential GPF if
> >
> > A - the NMI or MCE code calls any external kernel code (printk,
> >   notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code,
> >   calls printk too for debugging)...
> 
> Sure, but as I pointed out previously, such calls are already best effort.  
> You can do very little safely from do_nmi(), and calling printk isn't one of 
> them,

yes and no.. do_nmi uses the "bust spinlocks" exactly for this. So this
is ok by design. Other than this, we can end up mixing up the console
data output with different sources of characters, but I doubt something
really bad can happen (like a deadlock).

> nor is grabbing a spinlock (well, actually you could as long as it's 
> *only* used by NMI handlers.  See any of those?).

Yup, see arch/x86/kernel/nmi_64.c : nmi_watchdog_tick()

It defines a spinlock to "Serialise the printks". I guess it's good to
protect against other nmi watchdogs running on other CPUs concurrently,
I guess.

> 
> > Therefore, if one decides to use the immediate values to
> > leave dormant spinlock instrumentation in the kernel, I wouldn't want it
> > to have undesirable side-effects (GPF) when the instrumentation is
> > being enabled, as rare as it could be.
> 
> It's overengineered, since it's less likely than deadlock already.
> 

So should we put a warning telling "enabling tracing or profiling on a
production system that also uses NMI watchdog could potentially cause a
crash" ? The rarer a bug is, the more difficult it is to debug. It does
not make the bug hurt less when it happens.

The normal thing to do when a potential deadlock is detected is to fix
it, not to leave it there under the premise that it doesn't matter since
it happens rarely. In our case, where we know there is a potential race,
I don't see any reason not to make sure it never happens. What's the
cost of it ?

arch/x86/kernel/immediate.o : 2.4K

let's compare..

kernel/stop_machine.o : 3.9K

so I think that code size is not an issue there, especially since the
immediate values are not meant to be deployed on embedded systems.

> > > I'd keep this version up your sleeve for they day when it's needed.
> >
> > If we choose to go this way, stop_machine would have to do a sync_core()
> > on every CPU before it reactivates interrupts for this to respect
> > Intel's errata.
> 
> Yes, I don't think stop_machine is actually what you want anyway, since you 
> are happy to run in interrupt context.  An IPI-based scheme is probably 
> better, and also has the side effect of iret doing the sync you need, IIUC.
> 

Yup, looping in IPIs with interrupts disabled should do the job there.
It's just awful for interrupt latency on large SMP systems :( Being
currently bad at it is not a reason to make it worse. If we have a CPU
that is within a high latency irq disable region when we send the IPI,
we can easily end up waiting for this critical section to end with
interrupts disabled on all CPUs. The fact that we would wait for the
longest interrupt disable region with IRQs disabled implies that we
increase the maximum latency of the system, by design. I'm not sure I
would like to be the new longest record-beating IRQ off region.

Mathieu

> Hope that clarifies,
> Rusty.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization
  2007-11-15  5:37         ` Mathieu Desnoyers
@ 2007-11-15 11:06           ` Rusty Russell
  2007-11-16 14:03             ` [patch 5/8] Immediate Values - x86 Optimization (simplified) Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Rusty Russell @ 2007-11-15 11:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

On Thursday 15 November 2007 16:37:51 Mathieu Desnoyers wrote:
> * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > On Thursday 15 November 2007 15:06:10 Mathieu Desnoyers wrote:
> > > A - the NMI or MCE code calls any external kernel code (printk,
> > >   notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code,
> > >   calls printk too for debugging)...
> >
> > Sure, but as I pointed out previously, such calls are already best
> > effort. You can do very little safely from do_nmi(), and calling printk
> > isn't one of them,
>
> yes and no.. do_nmi uses the "bust spinlocks" exactly for this. So this
> is ok by design.

Yeah, in the "machine is dying" case, we do make best effort to get the 
message out.  Worrying about a GPF on the off chance we happen to be 
modifying an immediate in that case seems crazy.

> > nor is grabbing a spinlock (well, actually you could as long as it's
> > *only* used by NMI handlers.  See any of those?).
>
> Yup, see arch/x86/kernel/nmi_64.c : nmi_watchdog_tick()
>
> It defines a spinlock to "Serialise the printks". I guess it's good to
> protect against other nmi watchdogs running on other CPUs concurrently,
> I guess.

So that when it's printing a backtrace from an NMI.

> > > Therefore, if one decides to use the immediate values to
> > > leave dormant spinlock instrumentation in the kernel, I wouldn't want
> > > it to have undesirable side-effects (GPF) when the instrumentation is
> > > being enabled, as rare as it could be.
> >
> > It's overengineered, since it's less likely than deadlock already.
>
> So should we put a warning telling "enabling tracing or profiling on a
> production system that also uses NMI watchdog could potentially cause a
> crash"

Or, "if your system is crashing, this may make it worse".

> arch/x86/kernel/immediate.o : 2.4K
>
> let's compare..
>
> kernel/stop_machine.o : 3.9K
>
> so I think that code size is not an issue there, especially since the
> immediate values are not meant to be deployed on embedded systems.

stop_machine is necessary (although it could use simplification, patches 
pending).

> Yup, looping in IPIs with interrupts disabled should do the job there.
> It's just awful for interrupt latency on large SMP systems :(

Yet you could have implemented it in the time it took us to have this 
conversation.

Just because code is clever doesn't mean it should go in.  There are enough 
things in the kernel which have to be complex that we should always be on the 
lookout for things which can be made simpler.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (simplified)
  2007-11-15 11:06           ` Rusty Russell
@ 2007-11-16 14:03             ` Mathieu Desnoyers
  2007-11-18 23:11               ` Rusty Russell
  0 siblings, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-16 14:03 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Thursday 15 November 2007 16:37:51 Mathieu Desnoyers wrote:
> > * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > > On Thursday 15 November 2007 15:06:10 Mathieu Desnoyers wrote:
> > > > A - the NMI or MCE code calls any external kernel code (printk,
> > > >   notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code,
> > > >   calls printk too for debugging)...
> > >
> > > Sure, but as I pointed out previously, such calls are already best
> > > effort. You can do very little safely from do_nmi(), and calling printk
> > > isn't one of them,
> >
> > yes and no.. do_nmi uses the "bust spinlocks" exactly for this. So this
> > is ok by design.
> 
> Yeah, in the "machine is dying" case, we do make best effort to get the 
> message out.  Worrying about a GPF on the off chance we happen to be 
> modifying an immediate in that case seems crazy.
> 
> > > nor is grabbing a spinlock (well, actually you could as long as it's
> > > *only* used by NMI handlers.  See any of those?).
> >
> > Yup, see arch/x86/kernel/nmi_64.c : nmi_watchdog_tick()
> >
> > It defines a spinlock to "Serialise the printks". I guess it's good to
> > protect against other nmi watchdogs running on other CPUs concurrently,
> > I guess.
> 
> So that when it's printing a backtrace from an NMI.
> 
> > > > Therefore, if one decides to use the immediate values to
> > > > leave dormant spinlock instrumentation in the kernel, I wouldn't want
> > > > it to have undesirable side-effects (GPF) when the instrumentation is
> > > > being enabled, as rare as it could be.
> > >
> > > It's overengineered, since it's less likely than deadlock already.
> >
> > So should we put a warning telling "enabling tracing or profiling on a
> > production system that also uses NMI watchdog could potentially cause a
> > crash"
> 
> Or, "if your system is crashing, this may make it worse".
> 
> > arch/x86/kernel/immediate.o : 2.4K
> >
> > let's compare..
> >
> > kernel/stop_machine.o : 3.9K
> >
> > so I think that code size is not an issue there, especially since the
> > immediate values are not meant to be deployed on embedded systems.
> 
> stop_machine is necessary (although it could use simplification, patches 
> pending).
> 
> > Yup, looping in IPIs with interrupts disabled should do the job there.
> > It's just awful for interrupt latency on large SMP systems :(
> 
> Yet you could have implemented it in the time it took us to have this 
> conversation.
> 
> Just because code is clever doesn't mean it should go in.  There are enough 
> things in the kernel which have to be complex that we should always be on the 
> lookout for things which can be made simpler.
> 
> Cheers,
> Rusty.

Ok, let's start with a simple version then. I'll plan to resubmit the
more sophisticated one when I submit a patch to make my markers use the
immediate values.

Here it is, for a first round of review :


Immediate Values - x86 Optimization

x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.

Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
  non atomic writes to a code region only touched by us (nobody can execute it
  since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.

Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte  : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
                 on all i386 processors (a, b, c, d, si, di, bp, sp). 8
bytes : (only for x86_64)
          "=r" : A register operand is allowed provided that it is in a
                 general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.

- Create the instruction in a discarded section to calculate its size. This is
  how we can align the beginning of the instruction on an address that will
  permit atomic modificatino of the immediate value without knowing the size of
  the opcode used by the compiler.
- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
  immediate structure.
- Change the immediate.c update code to support variable length opcodes.

- Vastly simplified, using a busy looping IPI with interrupts disabled.
  Does not protect against NMI nor MCE.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
---
 arch/x86/Kconfig            |    1 
 arch/x86/kernel/Makefile_32 |    1 
 arch/x86/kernel/Makefile_64 |    1 
 arch/x86/kernel/immediate.c |  143 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps_32.c  |   10 +--
 include/asm-x86/immediate.h |  107 ++++++++++++++++++++++++++++++++
 6 files changed, 259 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32	2007-11-15 23:25:57.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_32	2007-11-15 23:26:14.000000000 -0500
@@ -35,6 +35,7 @@ obj-$(CONFIG_KPROBES)		+= kprobes_32.o
 obj-$(CONFIG_MODULES)		+= module_32.o
 obj-y				+= sysenter_32.o vsyscall_32.o
 obj-$(CONFIG_ACPI_SRAT) 	+= srat_32.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 obj-$(CONFIG_EFI) 		+= efi_32.o efi_stub_32.o
 obj-$(CONFIG_DOUBLEFAULT) 	+= doublefault_32.o
 obj-$(CONFIG_VM86)		+= vm86_32.o
Index: linux-2.6-lttng/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c	2007-11-15 23:25:57.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/traps_32.c	2007-11-16 08:55:45.000000000 -0500
@@ -544,7 +544,7 @@ fastcall void do_##name(struct pt_regs *
 }
 
 DO_VM86_ERROR_INFO( 0, SIGFPE,  "divide error", divide_error, FPE_INTDIV, regs->eip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
 DO_VM86_ERROR( 3, SIGTRAP, "int3", int3)
 #endif
 DO_VM86_ERROR( 4, SIGSEGV, "overflow", overflow)
@@ -786,7 +786,7 @@ void restart_nmi(void)
 	acpi_nmi_enable();
 }
 
-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
 fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
 	trace_hardirqs_fixup();
@@ -794,8 +794,10 @@ fastcall void __kprobes do_int3(struct p
 	if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
 			== NOTIFY_STOP)
 		return;
-	/* This is an interrupt gate, because kprobes wants interrupts
-	disabled.  Normal trap handlers don't. */
+	/*
+	 * This is an interrupt gate, because kprobes and immediate values wants
+	 * interrupts disabled.  Normal trap handlers don't.
+	 */
 	restore_interrupts(regs);
 	do_trap(3, SIGTRAP, "int3", 1, regs, error_code, NULL);
 }
Index: linux-2.6-lttng/include/asm-x86/immediate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/asm-x86/immediate.h	2007-11-16 08:55:49.000000000 -0500
@@ -0,0 +1,107 @@
+#ifndef _ASM_X86_IMMEDIATE_H
+#define _ASM_X86_IMMEDIATE_H
+
+/*
+ * Immediate values. x86 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#include <asm/asm.h>
+
+struct __immediate {
+	long var;		/* Pointer to the identifier variable of the
+				 * immediate value
+				 */
+	long immediate;		/*
+				 * Pointer to the memory location of the
+				 * immediate value within the instruction.
+				 */
+	long size;		/* Type size. */
+} __attribute__ ((aligned(sizeof(long))));
+
+/**
+ * immediate_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ * Optimized version of the immediate.
+ * Do not use in __init and __exit functions. Use _immediate_read() instead.
+ * If size is bigger than the architecture long size, fall back on a memory
+ * read.
+ *
+ * Make sure to populate the initial static 64 bits opcode with a value
+ * what will generate an instruction with 8 bytes immediate value (not the REX.W
+ * prefixed one that loads a sign extended 32 bits immediate value in a r64
+ * register).
+ *
+ * Create the instruction in a discarded section to calculate its size. This is
+ * how we can align the beginning of the instruction on an address that will
+ * permit atomic modificatino of the immediate value without knowing the size of
+ * the opcode used by the compiler. The operand size is known in advance.
+ */
+#define immediate_read(name)						\
+	({								\
+		__typeof__(name##__immediate) value;			\
+		switch (sizeof(value)) {				\
+		case 1:							\
+			asm(".section __immediate,\"a\",@progbits\n\t"	\
+				_ASM_PTR "%c1, (3f)-%c2, %c2\n\t"	\
+				".previous\n\t"				\
+				"2:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=q" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (sizeof(value)));			\
+			break;						\
+		case 2:							\
+		case 4:							\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0,%0\n\t"				\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __immediate,\"a\",@progbits\n\t" \
+				_ASM_PTR "%c1, (3f)-%c2, %c2\n\t"	\
+				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
+				"mov $0,%0\n\t"				\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (sizeof(value)));			\
+			break;						\
+		case 8:							\
+			if (sizeof(long) < 8) {				\
+				value = name##__immediate;		\
+				break;					\
+			}						\
+			asm(".section __discard,\"\",@progbits\n\t"	\
+				"1:\n\t"				\
+				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
+				"2:\n\t"				\
+				".previous\n\t"				\
+				".section __immediate,\"a\",@progbits\n\t" \
+				_ASM_PTR "%c1, (3f)-%c2, %c2\n\t"	\
+				".previous\n\t"				\
+				".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
+				"mov $0xFEFEFEFE01010101,%0\n\t" 	\
+				"3:\n\t"				\
+				: "=r" (value)				\
+				: "i" (&name##__immediate),		\
+				  "i" (sizeof(value)));			\
+			break;						\
+		default:value = name##__immediate;			\
+			break;						\
+		};							\
+		value;							\
+	})
+
+extern int arch_immediate_update(const struct __immediate *immediate);
+extern void arch_immediate_update_early(const struct __immediate *immediate);
+
+#endif /* _ASM_X86_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-16 08:56:22.000000000 -0500
@@ -0,0 +1,143 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ *   Centrino Duo Processor Technology Specification Update, AH33.
+ *   Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ *   Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Non reentrant for NMI and trap handler instrumentation.
+ *
+ * Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
+ */
+
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/cpu.h>
+
+static atomic_t wait_sync;
+
+static void ipi_busy_loop(void *arg)
+{
+	long value = (long)arg;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	atomic_dec(&wait_sync);
+	do {
+		smp_mb();
+	} while (atomic_read(&wait_sync) > value);
+	atomic_dec(&wait_sync);
+	do {
+		smp_mb();
+	} while (atomic_read(&wait_sync) > 0);
+	sync_core();
+	local_irq_restore(flags);
+}
+
+/**
+ * arch_immediate_update - update one immediate value
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value. Must be called with immediate_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+int arch_immediate_update(const struct __immediate *immediate)
+{
+	unsigned long flags;
+	long online_cpus;
+
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return 0;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return 0;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return 0;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return 0;
+		break;
+#endif
+	default:return -EINVAL;
+	}
+
+	lock_cpu_hotplug();
+	online_cpus = num_online_cpus();
+	atomic_set(&wait_sync, 2 * online_cpus);
+	smp_call_function(ipi_busy_loop, (void *)online_cpus, 1, 0);
+	local_irq_save(flags);
+	atomic_dec(&wait_sync);
+	do {
+		smp_mb();
+	} while (atomic_read(&wait_sync) > online_cpus);
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+	sync_core();
+	wmb();
+	atomic_dec(&wait_sync);
+	smp_mb();
+	local_irq_restore(flags);
+	unlock_cpu_hotplug();
+	return 0;
+}
+
+/**
+ * arch_immediate_update_early - update one immediate value at boot time
+ * @immediate: pointer of type const struct __immediate to update
+ *
+ * Update one immediate value at boot time.
+ */
+void arch_immediate_update_early(const struct __immediate *immediate)
+{
+	/*
+	 * If the variable and the instruction have the same value, there is
+	 * nothing to do.
+	 */
+	switch (immediate->size) {
+	case 1:	if (*(uint8_t *)immediate->immediate
+				== *(uint8_t *)immediate->var)
+			return;
+		break;
+	case 2:	if (*(uint16_t *)immediate->immediate
+				== *(uint16_t *)immediate->var)
+			return;
+		break;
+	case 4:	if (*(uint32_t *)immediate->immediate
+				== *(uint32_t *)immediate->var)
+			return;
+		break;
+#ifdef CONFIG_X86_64
+	case 8:	if (*(uint64_t *)immediate->immediate
+				== *(uint64_t *)immediate->var)
+			return;
+		break;
+#endif
+	default:return;
+	}
+	memcpy((void *)immediate->immediate, (void *)immediate->var,
+			immediate->size);
+}
Index: linux-2.6-lttng/arch/x86/kernel/Makefile_64
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_64	2007-11-15 23:25:57.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_64	2007-11-15 23:26:14.000000000 -0500
@@ -35,6 +35,7 @@ obj-$(CONFIG_X86_PM_TIMER)	+= pmtimer_64
 obj-$(CONFIG_X86_VSMP)		+= vsmp_64.o
 obj-$(CONFIG_K8_NB)		+= k8.o
 obj-$(CONFIG_AUDIT)		+= audit_64.o
+obj-$(CONFIG_IMMEDIATE)		+= immediate.o
 
 obj-$(CONFIG_MODULES)		+= module_64.o
 obj-$(CONFIG_PCI)		+= early-quirks.o
Index: linux-2.6-lttng/arch/x86/Kconfig
===================================================================
--- linux-2.6-lttng.orig/arch/x86/Kconfig	2007-11-15 23:26:13.000000000 -0500
+++ linux-2.6-lttng/arch/x86/Kconfig	2007-11-16 08:55:43.000000000 -0500
@@ -21,6 +21,7 @@ config X86
 	default y
 	select HAVE_OPROFILE
 	select HAVE_KPROBES
+	select HAVE_IMMEDIATE
 
 config GENERIC_TIME
 	bool


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (simplified)
  2007-11-16 14:03             ` [patch 5/8] Immediate Values - x86 Optimization (simplified) Mathieu Desnoyers
@ 2007-11-18 23:11               ` Rusty Russell
  2007-11-19 14:28                 ` Mathieu Desnoyers
  2007-11-19 19:15                 ` Mathieu Desnoyers
  0 siblings, 2 replies; 40+ messages in thread
From: Rusty Russell @ 2007-11-18 23:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

On Saturday 17 November 2007 01:03:35 Mathieu Desnoyers wrote:
> x86 optimization of the immediate values which uses a movl with code
> patching to set/unset the value used to populate the register used as
> variable source.

Since immediate values are by definition an optimization, I think it makes
sense to insist they be 1, 2, 4 or 8 bytes.  A BUILD_BUG_ON() in the right
place should ensure this (probably in generic code rather than x86).

> ===================================================================
> --- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c

I don't think you need any modification to this file now.

> + * Create the instruction in a discarded section to calculate its size. This is
> + * how we can align the beginning of the instruction on an address that will 
> + * permit atomic modificatino of the immediate value without knowing the size of 
> + * the opcode used by the compiler. The operand size is known in advance.
> + */ 

This alignment is also now unnecessary.

> +++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-16
> 08:56:22.000000000 -0500 @@ -0,0 +1,143 @@
> +/*
> + * Immediate Value - x86 architecture specific code.

This is now almost entirely generic code, but I suppose we can let the next
architecture hoist it out.

> +/**
> + * arch_immediate_update_early - update one immediate value at boot time
> + * @immediate: pointer of type const struct __immediate to update
> + *
> + * Update one immediate value at boot time.
> + */
> +void arch_immediate_update_early(const struct __immediate *immediate)

I think it would be easier to just fast-path the num_online_cpus == 1 case,
even if you want to keep this "update_early" interface.

But I like your IPI algorithm: very tight.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (simplified)
  2007-11-18 23:11               ` Rusty Russell
@ 2007-11-19 14:28                 ` Mathieu Desnoyers
  2007-11-19 23:06                   ` Rusty Russell
  2007-11-19 19:15                 ` Mathieu Desnoyers
  1 sibling, 1 reply; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-19 14:28 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Saturday 17 November 2007 01:03:35 Mathieu Desnoyers wrote:
> > x86 optimization of the immediate values which uses a movl with code
> > patching to set/unset the value used to populate the register used as
> > variable source.
> 
> Since immediate values are by definition an optimization, I think it makes
> sense to insist they be 1, 2, 4 or 8 bytes.  A BUILD_BUG_ON() in the right
> place should ensure this (probably in generic code rather than x86).
> 
sure,


> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c
> 
> I don't think you need any modification to this file now.
> 
yep,

> > + * Create the instruction in a discarded section to calculate its size. This is
> > + * how we can align the beginning of the instruction on an address that will 
> > + * permit atomic modificatino of the immediate value without knowing the size of 
> > + * the opcode used by the compiler. The operand size is known in advance.
> > + */ 
> 
> This alignment is also now unnecessary.
> 

right,

> > +++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-16
> > 08:56:22.000000000 -0500 @@ -0,0 +1,143 @@
> > +/*
> > + * Immediate Value - x86 architecture specific code.
> 
> This is now almost entirely generic code, but I suppose we can let the next
> architecture hoist it out.
> 

I'll move it to kernel/immediate.c. We can therefore remove the powerpc
immediate.c as well.

> > +/**
> > + * arch_immediate_update_early - update one immediate value at boot time
> > + * @immediate: pointer of type const struct __immediate to update
> > + *
> > + * Update one immediate value at boot time.
> > + */
> > +void arch_immediate_update_early(const struct __immediate *immediate)
> 
> I think it would be easier to just fast-path the num_online_cpus == 1 case,
> even if you want to keep this "update_early" interface.
> 

Nope, that could lead to problems. I call core_immediate_update()
_very_ early, before boot_cpu_init() is called. Therefore,
cpu_online_map is not set yet. I am not sure the benefit of using
num_online_cpus outweights the added fragility wrt other boot process
initializations.

> But I like your IPI algorithm: very tight.
> 

Thanks. Many thanks for the review, I'll repost the patchset for
comments.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (simplified)
  2007-11-18 23:11               ` Rusty Russell
  2007-11-19 14:28                 ` Mathieu Desnoyers
@ 2007-11-19 19:15                 ` Mathieu Desnoyers
  1 sibling, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-19 19:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

* Rusty Russell (rusty@rustcorp.com.au) wrote:
[...]
> > +++ linux-2.6-lttng/arch/x86/kernel/immediate.c	2007-11-16
> > 08:56:22.000000000 -0500 @@ -0,0 +1,143 @@
> > +/*
> > + * Immediate Value - x86 architecture specific code.
> 
> This is now almost entirely generic code, but I suppose we can let the next
> architecture hoist it out.
> 

Almost.. actually, I have to call those architecture specific primitives :
- text_poke : a memcpy that copies to write protected memory (disables the
  WP bit)
- sync_core : synchronize the core, only defined on x86

I'll also have to add a flush_icache_range to support powerpc correctly,
but this one turns out to be implemented on each architecture and
therefore is not a problem.

So I guess the proper way to do this would be to add :

#define text_poke       memcpy
#define text_poke_early text_poke
#define sync_core()

to asm-$ARCH/cacheflush.h for each architecture we add support for
HAS_IMMEDIATE ?

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (simplified)
  2007-11-19 14:28                 ` Mathieu Desnoyers
@ 2007-11-19 23:06                   ` Rusty Russell
  2007-11-20 17:02                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 40+ messages in thread
From: Rusty Russell @ 2007-11-19 23:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

On Tuesday 20 November 2007 01:28:03 Mathieu Desnoyers wrote:
> * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > I think it would be easier to just fast-path the num_online_cpus == 1
> > case, even if you want to keep this "update_early" interface.
>
> Nope, that could lead to problems. I call core_immediate_update()
> _very_ early, before boot_cpu_init() is called.

Ah, I see the problem.  It would in fact be clearer for us to move 
boot_cpu_init() up to just after smp_setup_processor_id() in start_kernel 
anyway, not just for this code, but in general.

> Therefore, 
> cpu_online_map is not set yet. I am not sure the benefit of using
> num_online_cpus outweights the added fragility wrt other boot process
> initializations.

I think it's still a win, though worth a comment that we always go via the 
non-IPI path for the early boot case.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [patch 5/8] Immediate Values - x86 Optimization (simplified)
  2007-11-19 23:06                   ` Rusty Russell
@ 2007-11-20 17:02                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 40+ messages in thread
From: Mathieu Desnoyers @ 2007-11-20 17:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-kernel, Andi Kleen, H. Peter Anvin, Chuck Ebbert,
	Christoph Hellwig, Jeremy Fitzhardinge, Ingo Molnar,
	Thomas Gleixner

* Rusty Russell (rusty@rustcorp.com.au) wrote:
> On Tuesday 20 November 2007 01:28:03 Mathieu Desnoyers wrote:
> > * Rusty Russell (rusty@rustcorp.com.au) wrote:
> > > I think it would be easier to just fast-path the num_online_cpus == 1
> > > case, even if you want to keep this "update_early" interface.
> >
> > Nope, that could lead to problems. I call core_immediate_update()
> > _very_ early, before boot_cpu_init() is called.
> 
> Ah, I see the problem.  It would in fact be clearer for us to move 
> boot_cpu_init() up to just after smp_setup_processor_id() in start_kernel 
> anyway, not just for this code, but in general.
> 

Could be done.

> > Therefore, 
> > cpu_online_map is not set yet. I am not sure the benefit of using
> > num_online_cpus outweights the added fragility wrt other boot process
> > initializations.
> 
> I think it's still a win, though worth a comment that we always go via the 
> non-IPI path for the early boot case.
> 

Ok, another potential problem then :

If we fast-path the num_online_cpus == 1, then updates done after the
boot process will have to go through this code. Therefore, we would have
to disable interrupts in the early boot code.

However, I doubt it is safe to use the paravirtualized
local_irq_disable/enable before the paravirt code is executed ?

Mathieu

> Cheers,
> Rusty.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2007-11-20 17:12 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-13 18:58 [patch 0/8] Immediate Values (now with merged x86 support) Mathieu Desnoyers
2007-11-13 18:58 ` [patch 1/8] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2007-11-13 18:58 ` [patch 2/8] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
2007-11-13 18:58 ` [patch 3/8] Immediate Values - Move Kprobes x86 restore_interrupt to kdebug.h Mathieu Desnoyers
2007-11-13 18:58 ` [patch 4/8] Add asm-compat.h to x86 Mathieu Desnoyers
2007-11-13 19:05   ` H. Peter Anvin
2007-11-13 20:37     ` [patch 4/8] Add asm-compat.h to x86 -> use new asm.h instead Mathieu Desnoyers
2007-11-13 18:58 ` [patch 5/8] Immediate Values - x86 Optimization Mathieu Desnoyers
2007-11-13 19:07   ` H. Peter Anvin
2007-11-13 19:24     ` Mathieu Desnoyers
2007-11-13 19:36       ` H. Peter Anvin
2007-11-13 19:45         ` Mathieu Desnoyers
2007-11-13 19:56           ` H. Peter Anvin
2007-11-13 20:40             ` [patch 5/8] Immediate Values - x86 Optimization (update) Mathieu Desnoyers
2007-11-13 21:26               ` H. Peter Anvin
2007-11-13 22:02                 ` Mathieu Desnoyers
2007-11-13 22:35                   ` H. Peter Anvin
2007-11-14  0:34                     ` Mathieu Desnoyers
2007-11-14  1:02                       ` H. Peter Anvin
2007-11-14  1:44                         ` [patch 5/8] Immediate Values - x86 Optimization (update 2) Mathieu Desnoyers
2007-11-14  2:58                           ` H. Peter Anvin
2007-11-14 14:16                             ` Mathieu Desnoyers
2007-11-14 18:52                             ` [PATCH] Immediate Values x86 Optimization Declare Discarded Instruction Mathieu Desnoyers
2007-11-14 19:00                               ` H. Peter Anvin
2007-11-14 19:16                                 ` [PATCH] Add __discard section to x86 Mathieu Desnoyers
2007-11-14 19:33                                   ` H. Peter Anvin
2007-11-15  3:08   ` [patch 5/8] Immediate Values - x86 Optimization Rusty Russell
2007-11-15  4:06     ` Mathieu Desnoyers
2007-11-15  4:45       ` Rusty Russell
2007-11-15  5:37         ` Mathieu Desnoyers
2007-11-15 11:06           ` Rusty Russell
2007-11-16 14:03             ` [patch 5/8] Immediate Values - x86 Optimization (simplified) Mathieu Desnoyers
2007-11-18 23:11               ` Rusty Russell
2007-11-19 14:28                 ` Mathieu Desnoyers
2007-11-19 23:06                   ` Rusty Russell
2007-11-20 17:02                     ` Mathieu Desnoyers
2007-11-19 19:15                 ` Mathieu Desnoyers
2007-11-13 18:58 ` [patch 6/8] Immediate Values - Powerpc Optimization Mathieu Desnoyers
2007-11-13 18:58 ` [patch 7/8] Immediate Values - Documentation Mathieu Desnoyers
2007-11-13 18:58 ` [patch 8/8] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).