LKML Archive on lore.kernel.org
* i386 PDA patches use of %gs
@ 2006-09-12  7:35 Arjan van de Ven
  2006-09-12  7:48 ` Jeremy Fitzhardinge
  2006-09-13  1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge
  0 siblings, 2 replies; 45+ messages in thread
From: Arjan van de Ven @ 2006-09-12  7:35 UTC (permalink / raw)
  To: akpm, ak, mingo, Jeremy Fitzhardinge; +Cc: linux-kernel

Hi,

Userspace uses %gs for its per-thread data (and in modern Linux
versions that means "all the time"; errno is there, for example).

On x86-64 this is the reason that the kernel uses the OTHER segment
register; so for the PDA patches this would mean using %fs and not %gs.

The advantage of this is very simple: %fs will be 0 for userspace most
of the time. Putting 0 in a segment register is cheap for the cpu,
putting anything else in is quite expensive (a LOT of security checks
need to happen). As such I would MUCH rather see that the i386 PDA
patches use %fs and not %gs... 

Jeremy, is there a reason you're specifically using %gs and not %fs? If
not, would you mind a switch to using %fs instead?

Greetings,
   Arjan van de Ven
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: i386 PDA patches use of %gs
  2006-09-12  7:35 i386 PDA patches use of %gs Arjan van de Ven
@ 2006-09-12  7:48 ` Jeremy Fitzhardinge
  2006-09-12  7:56   ` Arjan van de Ven
  2006-09-13  1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-09-12  7:48 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: akpm, ak, mingo, linux-kernel

Arjan van de Ven wrote:
> Jeremy, is there a reason you're specifically using %gs and not %fs? If
> not, would you mind a switch to using %fs instead?
>   

The main reason for using %gs was to take advantage of gcc's TLS 
support.  I intend to measure the cost of gs vs fs, and if there's a 
significant difference I'll switch.

    J


* Re: i386 PDA patches use of %gs
  2006-09-12  7:48 ` Jeremy Fitzhardinge
@ 2006-09-12  7:56   ` Arjan van de Ven
  2006-09-12  8:31     ` Jeremy Fitzhardinge
  2006-11-15 11:27     ` [PATCH] i386-pda UP optimization Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Arjan van de Ven @ 2006-09-12  7:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: akpm, ak, mingo, linux-kernel

On Tue, 2006-09-12 at 00:48 -0700, Jeremy Fitzhardinge wrote:
> Arjan van de Ven wrote:
> > Jeremy, is there a reason you're specifically using %gs and not %fs? If
> > not, would you mind a switch to using %fs instead?
> >   
> 
> The main reason for using %gs was to take advantage of gcc's TLS 
> support.  I intend to measure the cost of gs vs fs, and if there's a 
> significant difference I'll switch.

gcc can be fixed if needed. I don't see the kernel switching to use that
any time soon though...




* Re: i386 PDA patches use of %gs
  2006-09-12  7:56   ` Arjan van de Ven
@ 2006-09-12  8:31     ` Jeremy Fitzhardinge
  2006-11-15 11:27     ` [PATCH] i386-pda UP optimization Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-09-12  8:31 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: akpm, ak, mingo, linux-kernel

Arjan van de Ven wrote:
> gcc can be fixed if needed. I don't see the kernel switching to use that
> any time soon though...

I have a preliminary patch to implement per_cpu() in terms of __thread.

Hm, my initial tests comparing reloading a NULL selector vs a real 
selector show absolutely no measurable difference, on either a modern 
Core Duo or an old P4...  Admittedly this is with an artificial 
usermode test program, but I'd expect to see *some* difference if 
there were a real one.

    J


--

/* gcc -o time-segops time-segops.c -O2 -Wall -lrt -fomit-frame-pointer -funroll-loops */
#include <stdio.h>
#include <time.h>

#define COUNT 10000000

static inline void sync(void)
{
	int a,b,c,d;

	asm volatile("cpuid"
		     : "=a" (a), "=b" (b), "=c" (c), "=d" (d)
		     : "0" (0), "2" (0)
		     : "memory");
}

static void test_none(void)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		sync();
	}
}

static void test_fs(void)
{
	int i, ds;
	asm volatile("mov %%ds,%0" : "=r" (ds));

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%fs; mov %0, %%fs; popl %%fs"
			     : : "r" (ds));
		sync();
	}
}

static void test_gs(void)
{
	int i, ds;
	asm volatile("mov %%ds,%0" : "=r" (ds));

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%gs; mov %0, %%gs; popl %%gs"
			     : : "r" (ds));
		sync();
	}
}

typedef void (*test_t)(void);

static test_t tests[] = {
	test_none,
	test_fs,
	test_gs,
	NULL,
};

int main()
{
	int i;
	int ds, fs, gs;

	asm volatile("mov %%ds, %0; "
		     "mov %%fs, %1; "
		     "mov %%gs, %2"
		     : "=r" (ds), "=r" (fs), "=r" (gs) : : "memory");

	printf("fs=%x gs=%x\n", fs, gs);
	for(i = 0; tests[i]; i++) {
		struct timespec start, end;
		unsigned long long delta;

		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
		(*tests[i])();
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

		delta = (end.tv_sec * 1000000000ull + end.tv_nsec) - 
			(start.tv_sec * 1000000000ull + start.tv_nsec);
		delta /= COUNT;

		printf("%lluns/iteration\n", delta);
	}

	return 0;
}



* Re: i386 PDA patches use of %gs
  2006-09-12  7:35 i386 PDA patches use of %gs Arjan van de Ven
  2006-09-12  7:48 ` Jeremy Fitzhardinge
@ 2006-09-13  1:00 ` Jeremy Fitzhardinge
  2006-09-13  9:59   ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-09-13  1:00 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: akpm, ak, mingo, linux-kernel, Michael.Fetterman, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 3131 bytes --]

Arjan van de Ven wrote:
> The advantage of this is very simple: %fs will be 0 for userspace most
> of the time. Putting 0 in a segment register is cheap for the cpu,
> putting anything else in is quite expensive (a LOT of security checks
> need to happen). As such I would MUCH rather see that the i386 PDA
> patches use %fs and not %gs... 
Hi Arjan,

I spent some time trying to measure this, to see if there really is a 
difference between loading a null selector vs a non-null.

The short answer is no, I couldn't measure any difference at all, on any 
CPU going back to a P166, up to a current Core Duo machine.

I used a usermode test model of the entry.S code in order to make it 
easier to test on more machines.  The basic inner loop is:

	push %segreg
	mov  %selectorreg, %segreg
	add  $1,%segreg:offset	# use the segment register
	pop  %segreg


I also unrolled the loop to minimize the overhead from anything else.  
This is clearly much more segment-register-intensive than any real use, so 
I'm hoping that this will exaggerate any performance differences.  I 
also tried to put cpuid in the loop in order to approximate the 
synchronizing effects of taking an exception, but it didn't seem to make 
much difference other than slow everything down by a constant amount 
(the cpuid slowdown swamped pretty much everything else on Intel CPUs, 
but was much less intrusive on the Athlon64).

I tried the push/load/pop sequence with both %fs and %gs, where pop %fs 
would result in a null selector load, and pop %gs would load the normal 
userspace TLS selector.

I also tried loading 3 types of selector after the push:

    * the normal usermode ds selector, on the grounds that the CPU might
      be more efficient in reloading a selector which is already in use
    * an ldt selector, which I thought might be slower since (at least
      conceptually) there's an indirection into a different descriptor table
    * and a gdt selector (the normally unused second TLS selector)


In general, I got identical results for all of these.  There were two 
exceptions:

    * The 1.8 GHz P4 Northwood was slower loading the LDT selector as
      expected, and pop %fs was faster than pop %gs.  The GDT and data
      selector results were the same independent of %fs or %gs.
    * The AMD K6 was consistently *slower* with pop %fs; pop %gs was
      faster.  I didn't try reversing the uses of %fs and %gs to see if
      it was the null selector being slower, or some inherent slowness
      in using %fs.


It's possible I got something wrong, and I'm not really measuring what I 
think I'm measuring.  The main thing that worries me about the results 
is that they don't scale much at all in proportion to the clock speed.  
Otherwise the results look sensible to me.  I'd appreciate it if people 
could review the test program to see if I've overlooked something.

So, in summary, I don't think there's much point in switching to %fs.  I 
may get around to confirming this by doing a %gs->%fs conversion patch, 
but given these results that's at a fairly low priority.

I've attached my test program and results.

    J

[-- Attachment #2: time-segops.c --]
[-- Type: text/x-csrc, Size: 5235 bytes --]

/* gcc -m32 -O3 -Wall -fomit-frame-pointer -funroll-loops -g  -o time-segops time-segops.c -lrt */
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>
#include <ctype.h>
#include <asm/unistd.h>

#define GTOD	0
#define SYNC	0
#define COUNT 50000000

/* different glibc's call this different things, so define our own */
struct desc {
	unsigned int  entry_number;
	unsigned long base_addr;
	unsigned int  limit;
	unsigned int  seg_32bit:1;
	unsigned int  contents:2;
	unsigned int  read_exec_only:1;
	unsigned int  limit_in_pages:1;
	unsigned int  seg_not_present:1;
	unsigned int  useable:1;
};

/* These don't seem to be consistently defined in glibc */
static int set_thread_area(struct desc *desc)
{
	int ret;
	asm("int $0x80"
	    : "=a" (ret)
	    : "0" (__NR_set_thread_area), "b" (desc)
	    : "memory");
	if (ret < 0) {
		errno = -ret;
		ret = -1;

	}
	return ret;
}
static int modify_ldt(int func, struct desc *desc, int size)
{
	int ret;
	asm("int $0x80"
	    : "=a" (ret)
	    : "0" (__NR_modify_ldt), "b" (func), "c" (desc), "d" (size)
	    : "memory");
	if (ret < 0) {
		errno = -ret;
		ret = -1;

	}
	return ret;
}

static inline unsigned long long now(void)
{
#if GTOD
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000000ull + tv.tv_usec * 1000ull;
#else
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec * 1000000000ull + ts.tv_nsec;
#endif
}

/* Simulate an exception's effect on the pipeline? */
static inline void sync(void)
{
	if (SYNC) {	/* honor the SYNC switch that main() reports */
		int a,b,c,d;
		asm volatile("cpuid"
			     : "=a" (a), "=b" (b), "=c" (c), "=d" (d)
			     : "0" (0), "2" (0)
			     : "memory");
	} else
		asm volatile("" : : : "memory");
}

static const char *test_none(int seg, int *offset)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		sync();
	}

	return "<none>";
}

static const char *test_fs(int seg, int *offset)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
			     : "+m" (*offset): "r" (seg) : "memory");
		sync();
	}
	return "fs";
}

static const char *test_gs(int seg, int *offset)
{
	int i;

	for(i = 0; i < COUNT; i++) {
		asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
			     : "+m" (*offset): "r" (seg) : "memory");
		sync();
	}
	return "gs";
}

typedef const char *(*test_t)(int, int *);

static const test_t tests[] = {
	test_none,
	test_fs,
	test_gs,
	NULL,
};

static int segment[1];


static void test(int seg, int *offset, const char *segdesc)
{
	int i;

	for(i = 0; tests[i]; i++) {
		unsigned long long start, end;
		unsigned long long delta;
		const char *t;

		start = now();
		t = (*tests[i])(seg, offset);
		end = now();

		delta = (end - start);

		printf("   %s with %s selector: %lluns/iteration\n",
		       t, segdesc, delta / COUNT);
	}
}

struct cpu
{
	char modelname[100];
	int family, model, stepping;
	float speed;
};

static int cpu_details(struct cpu *cpu)
{
	FILE *fp = fopen("/proc/cpuinfo", "r");
	char buf[500];

	if (fp == NULL) {
		perror("open /proc/cpuinfo");
		return 0;
	}

	while(fgets(buf, sizeof(buf), fp) != NULL) {
		char *col = strchr(buf, ':');
		char *val;

		if (col == NULL)
			continue;

		val = col+1;
		while(*val == ' ')
			val++;

		col--;
		while(col > buf && isspace(*col))
			col--;
		col[1] = 0;

		col = strchr(val, '\n');
		if (col)
			*col = 0;

		//printf("name=%s val=%s\n", buf, val);

		if (strcmp(buf, "model name") == 0)
			strcpy(cpu->modelname, val);
		if (strcmp(buf, "cpu family") == 0)
			sscanf(val, "%d", &cpu->family);
		if (strcmp(buf, "model") == 0)
			sscanf(val, "%d", &cpu->model);
		if (strcmp(buf, "stepping") == 0)
			sscanf(val, "%d", &cpu->stepping);
		if (strcmp(buf, "cpu MHz") == 0)
			sscanf(val, "%f", &cpu->speed);

		if (strcmp(buf, "processor") == 0 && strcmp(val, "0") != 0)
			break;
	}
	fclose(fp);

	return 1;
}

int main()
{
	int ds, fs, gs;
	static struct desc desc = {
		.entry_number = 1,
		.base_addr = (unsigned long)segment,
		.limit = sizeof(segment)-1,
		.seg_32bit = 1,
		.contents = 0,
		.read_exec_only = 0,
		.limit_in_pages = 0,
		.seg_not_present = 0,
		.useable = 1,
	};
	int gdtseg, ldtseg;
	struct cpu cpu;
	float speed;

	if (!cpu_details(&cpu)) {
		printf("can't read CPU details\n");
		return 1;
	}
	speed = cpu.speed;


	if (modify_ldt(1, &desc, sizeof(desc)) == -1)
		perror("modify ldt");
	ldtseg = desc.entry_number * 8 | 4 | 3;

	desc.entry_number = -1;
	if (set_thread_area(&desc) == -1)
		perror("set_thread_area");
	gdtseg = desc.entry_number * 8 | 3;

	asm volatile("mov %%ds, %0; "
		     "mov %%fs, %1; "
		     "mov %%gs, %2"
		     : "=r" (ds), "=r" (fs), "=r" (gs) : : "memory");

	printf("\"%s\" @%gMhz (%d,%d,%d):\n",
	       cpu.modelname, cpu.speed, cpu.family, cpu.model, cpu.stepping);
	printf("ds=%x fs=%x gs=%x ldt=%x gdt=%x %s %s\n",
	       ds, fs, gs, ldtseg, gdtseg,
	       GTOD ? "GTOD" : "CPUTIME",
	       SYNC ? "SYNC" : "");

	test(ds, segment, "data");
	printf("\n");
	test(ldtseg, 0, "LDT");
	printf("\n");
	test(gdtseg, 0, "GDT");

	if (cpu_details(&cpu)) {
		if (speed != cpu.speed)
			printf("cpu speed changed %f->%f?! disable CPUFREQ\n",
			       speed, cpu.speed);
	}

	return 0;
}

[-- Attachment #3: results-nosync.txt --]
[-- Type: text/plain, Size: 3164 bytes --]

"Genuine Intel(R) CPU           T2400  @ 1.83GHz" @1000Mhz (6,14,8):
fs=0 gs=33 ldt=f gdt=3b
   <none> with data selector: 0ns/iteration
   fs with data selector: 27ns/iteration
   gs with data selector: 28ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 27ns/iteration
   gs with LDT selector: 28ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 27ns/iteration
   gs with GDT selector: 28ns/iteration

"AMD Athlon(tm) 64 Processor 3500+" @1000Mhz (15,15,0):
fs=0 gs=63 ldt=f gdt=6b
   <none> with data selector: 0ns/iteration
   fs with data selector: 10ns/iteration
   gs with data selector: 10ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 10ns/iteration
   gs with LDT selector: 10ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 10ns/iteration
   gs with GDT selector: 10ns/iteration

"Intel(R) Pentium(R) 4 CPU 1.80GHz" @1817.91Mhz (15,2,4):
fs=0 gs=33 ldt=f gdt=3b
   <none> with data selector: 0ns/iteration
   fs with data selector: 30ns/iteration
   gs with data selector: 31ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 40ns/iteration
   gs with LDT selector: 44ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 30ns/iteration
   gs with GDT selector: 31ns/iteration

"Intel(R) Celeron(R) CPU 2.40GHz" @2394.47Mhz (15,2,9):
fs=0 gs=33 ldt=f gdt=3b
   <none> with data selector: 0ns/iteration
   fs with data selector: 27ns/iteration
   gs with data selector: 25ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 25ns/iteration
   gs with LDT selector: 25ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 24ns/iteration
   gs with GDT selector: 25ns/iteration

"Pentium 75 - 200" @166.213Mhz (5,2,12):
fs=0 gs=33 ldt=f gdt=3b
   <none> with data selector: 1ns/iteration
   fs with data selector: 57ns/iteration
   gs with data selector: 57ns/iteration

   <none> with LDT selector: 1ns/iteration
   fs with LDT selector: 57ns/iteration
   gs with LDT selector: 57ns/iteration

   <none> with GDT selector: 1ns/iteration
   fs with GDT selector: 57ns/iteration
   gs with GDT selector: 57ns/iteration

"AMD-K6(tm) 3D+ Processor" @451.105Mhz (5,9,1):
fs=0 gs=33 ldt=f gdt=3b
   <none> with data selector: 0ns/iteration
   fs with data selector: 57ns/iteration
   gs with data selector: 44ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 57ns/iteration
   gs with LDT selector: 44ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 57ns/iteration
   gs with GDT selector: 44ns/iteration

"Pentium III (Coppermine)" @700Mhz (6,8,6):
fs=0 gs=33 ldt=f gdt=3b
   <none> with data selector: 0ns/iteration
   fs with data selector: 46ns/iteration
   gs with data selector: 46ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 46ns/iteration
   gs with LDT selector: 47ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 46ns/iteration
   gs with GDT selector: 47ns/iteration


* Re: i386 PDA patches use of %gs
  2006-09-13  1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge
@ 2006-09-13  9:59   ` Ingo Molnar
  2006-09-13 16:17     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-09-13  9:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> [...]  The basic inner loop is:
> 
> 	push %segreg
> 	mov  %selectorreg, %segreg
> 	add  $1,%segreg:offset	# use the segment register
> 	pop  %segreg

well, the most important thing i believe you didn't test: the effect of 
mixing two descriptors on the _same_ selector: one %gs selector value 
loaded and used by glibc, and another %gs selector value loaded and used 
by the kernel, intermixed. It's the mixing that causes the descriptor 
cache reload. (unless i missed some detail about your testcase)

	Ingo


* Re: i386 PDA patches use of %gs
  2006-09-13  9:59   ` Ingo Molnar
@ 2006-09-13 16:17     ` Jeremy Fitzhardinge
  2006-11-15 18:26       ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-09-13 16:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell

Ingo Molnar wrote:
> well, the most important thing i believe you didn't test: the effect of 
> mixing two descriptors on the _same_ selector: one %gs selector value 
> loaded and used by glibc, and another %gs selector value loaded and used 
> by the kernel, intermixed. It's the mixing that causes the descriptor 
> cache reload. (unless i missed some detail about your testcase)

But it doesn't mix different descriptors on the same selector; the GDT 
is initialized when the CPU is brought up, and is unchanged from then 
on.  The PDA descriptor is GDT entry 27 and the userspace TLS entries 
are 6-8, so in the typical case %gs will alternate between 0x33 and 0xd8 
as it enters and leaves the kernel.

My test program does the same thing, except using GDT entries 6 and 7 
(selectors 0x33 and 0x3b).

    J



* [PATCH] i386-pda UP optimization
  2006-09-12  7:56   ` Arjan van de Ven
  2006-09-12  8:31     ` Jeremy Fitzhardinge
@ 2006-11-15 11:27     ` Eric Dumazet
  2006-11-15 11:32       ` Andi Kleen
                         ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 11:27 UTC (permalink / raw)
  To: akpm; +Cc: Arjan van de Ven, Jeremy Fitzhardinge, ak, mingo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1397 bytes --]

Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile 
results on Opteron machines.

I really think %gs prefixes can be expensive in some (most ?) cases, even if 
the Intel/AMD docs say they are free.

I wrote this trivial user program to benchmark vfs_read()/vfs_write(), which 
happen to use 'current' many times.

#include <unistd.h>
#include <errno.h>

int main()
{
        int i, fd[2];
        char c = 0;
        pipe(fd);
        for (i = 0; i < 10000000; i++) {
                errno = 0; // glibc also uses %gs
                write(fd[1], &c, 1);
                read(fd[0], &c, 1);
        }
        return 0;
}

The best elapsed time I got for this program over 10 runs was 12.811 s
(Intel(R) Pentium(R) M processor 1.60GHz).

With the attached patch, I got 12.212 s, and a kernel text size reduction of 
3400 bytes.

I wish Jeremy would give us patches for UP machines so that %gs can be left untouched 
in entry.S (syscall entry/exit). A lot of ia32 machines still have only one 
CPU.

Note: I don't have an x86_64 machine here, but I suspect a similar patch could 
be done for x86_64 too.

Thank you

[PATCH] i386-pda UP optimization

On a !CONFIG_SMP machine, there is only one PDA (one CPU).
We can avoid %gs prefixes when reading/writing fields in the PDA.
This reduces kernel text size and also gives better performance.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: i386-pda-up.patch --]
[-- Type: text/plain, Size: 2182 bytes --]

--- linux-2.6.19-rc5-mm2/include/asm-i386/pda.h	2006-11-15 11:21:24.000000000 +0100
+++ linux-2.6.19-rc5-mm2-ed/include/asm-i386/pda.h	2006-11-15 11:23:49.000000000 +0100
@@ -91,10 +91,19 @@
 	((typeof(_proxy_pda.field) *)((unsigned char *)read_pda(_pda) + \
 				      pda_offset(field)))
 
+#if defined(CONFIG_SMP)
 #define read_pda(field) pda_from_op("mov",field)
 #define write_pda(field,val) pda_to_op("mov",field,val)
 #define add_pda(field,val) pda_to_op("add",field,val)
 #define sub_pda(field,val) pda_to_op("sub",field,val)
 #define or_pda(field,val) pda_to_op("or",field,val)
+#else
+extern struct i386_pda boot_pda;
+#define read_pda(field)      boot_pda.field 
+#define write_pda(field,val) do { boot_pda.field = (val);} while (0)
+#define add_pda(field,val)   do { boot_pda.field += (val);} while (0)
+#define sub_pda(field,val)   do { boot_pda.field -= (val);} while (0)
+#define or_pda(field,val)    do { boot_pda.field |= (val);} while (0)
+#endif
 
 #endif	/* _I386_PDA_H */
--- linux-2.6.19-rc5-mm2/arch/i386/kernel/cpu/common.c	2006-11-15 11:21:25.000000000 +0100
+++ linux-2.6.19-rc5-mm2-ed/arch/i386/kernel/cpu/common.c	2006-11-15 11:45:09.000000000 +0100
@@ -609,6 +609,14 @@
 	return regs;
 }
 
+/* Initial PDA used by boot CPU */
+struct i386_pda boot_pda = {
+	._pda = &boot_pda,
+	.cpu_number = 0,
+	.pcurrent = &init_task,
+};
+EXPORT_SYMBOL(boot_pda);
+
 static __cpuinit int alloc_gdt(int cpu)
 {
 	struct Xgt_desc_struct *cpu_gdt_descr = &per_cpu(cpu_gdt_descr, cpu);
@@ -628,11 +636,10 @@
 		BUG_ON(gdt != NULL || pda != NULL);
 
 		gdt = alloc_bootmem_pages(PAGE_SIZE);
-		pda = alloc_bootmem(sizeof(*pda));
+		pda = &boot_pda;
 		/* alloc_bootmem(_pages) panics on failure, so no check */
 
 		memset(gdt, 0, PAGE_SIZE);
-		memset(pda, 0, sizeof(*pda));
 	} else {
 		/* GDT and PDA might already have been allocated if
 		   this is a CPU hotplug re-insertion. */
@@ -655,13 +662,6 @@
 	return 1;
 }
 
-/* Initial PDA used by boot CPU */
-struct i386_pda boot_pda = {
-	._pda = &boot_pda,
-	.cpu_number = 0,
-	.pcurrent = &init_task,
-};
-
 static inline void set_kernel_gs(void)
 {
 	/* Set %gs for this CPU's PDA.  Memory clobber is to create a


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 11:27     ` [PATCH] i386-pda UP optimization Eric Dumazet
@ 2006-11-15 11:32       ` Andi Kleen
  2006-11-15 17:20         ` Ingo Molnar
  2006-11-15 17:52       ` Jeremy Fitzhardinge
  2006-11-28 23:12       ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2006-11-15 11:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: akpm, Arjan van de Ven, Jeremy Fitzhardinge, mingo, linux-kernel

On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile 
> results on Opteron machines.
> 
> I really think %gs prefixes can be expensive in some (most ?) cases, even if 
> the Intel/AMD docs say they are free.

They aren't free, just very cheap.

> 
> With the attached patch, I got 12.212 s, and a kernel text size reduction of 
> 3400 bytes.

Are the benchmark numbers stable? i.e. if you repeat them multiple times
with reboots do you still get the same difference?

-Andi


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 11:32       ` Andi Kleen
@ 2006-11-15 17:20         ` Ingo Molnar
  2006-11-15 17:24           ` Andi Kleen
  2006-11-15 17:28           ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 17:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric Dumazet, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> > Seeing %gs prefixes used now by i386 port, I recalled seeing strange 
> > oprofile results on Opteron machines.
> > 
> > I really think %gs prefixes can be expensive in some (most ?) cases, 
> > even if the Intel/AMD docs say they are free.
> 
> They aren't free, just very cheap.

Eric's test shows a 5% slowdown. That's far from cheap.

	Ingo


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:20         ` Ingo Molnar
@ 2006-11-15 17:24           ` Andi Kleen
  2006-11-15 17:46             ` Eric Dumazet
  2006-11-15 17:28           ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2006-11-15 17:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel

On Wednesday 15 November 2006 18:20, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> > > Seeing %gs prefixes used now by i386 port, I recalled seeing strange 
> > > oprofile results on Opteron machines.
> > > 
> > > I really think %gs prefixes can be expensive in some (most ?) cases, 
> > > even if the Intel/AMD docs say they are free.
> > 
> > They aren't free, just very cheap.
> 
> Eric's test shows a 5% slowdown. That's far from cheap.

I have my doubts about the accuracy of his test results. That is why I asked 
him to double check.

-Andi


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:20         ` Ingo Molnar
  2006-11-15 17:24           ` Andi Kleen
@ 2006-11-15 17:28           ` Jeremy Fitzhardinge
  2006-11-15 17:32             ` Ingo Molnar
  2006-11-15 18:01             ` Arjan van de Ven
  1 sibling, 2 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 17:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Eric Dumazet, akpm, Arjan van de Ven, linux-kernel

Ingo Molnar wrote:
> Eric's test shows a 5% slowdown. That's far from cheap.
>   

It seems like an absurdly large difference.  PDA references aren't all
that common in the kernel; for the %gs prefix on PDA accesses to be
causing a 5% overall difference in a test like this means that the
prefixes would have to be costing hundreds or thousands of cycles, which
seems absurd.  Particularly since Eric's patch doesn't touch head.S, so
the %gs save/restore is still being executed.

Are we sure this isn't a cache layout issue?  Eric, did you try evicting
your executable from pagecache between runs to see if you get variation
depending on what physical pages it gets put into?  (Making several
copies of the executable should have the same effect.)

    J


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:28           ` Jeremy Fitzhardinge
@ 2006-11-15 17:32             ` Ingo Molnar
  2006-11-15 17:59               ` Jeremy Fitzhardinge
  2006-11-15 18:01             ` Arjan van de Ven
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 17:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Eric Dumazet, akpm, Arjan van de Ven, linux-kernel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > Eric's test shows a 5% slowdown. That's far from cheap.
> 
> It seems like an absurdly large difference.  PDA references aren't all 
> that common in the kernel; for the %gs prefix on PDA accesses to be 
> causing a 5% overall difference in a test like this means that the 
> prefixes would have to be costing hundreds or thousands of cycles, 
> which seems absurd.  Particularly since Eric's patch doesn't touch 
> head.S, so the %gs save/restore is still being executed.

i said this before: using segmentation tricks these days is /insane/. 
Segmentation is not free, and it's not going to be cheap in the 
future. In fact, chances are that it will be /more/ expensive in the 
future, because sane OSs just make no use of it besides the trivial 
"they don't even exist" uses.

so /at a minimum/, as i suggested before, the kernel's segment use 
should not overlap that of glibc. I.e. the kernel should use %fs, not 
%gs.

	Ingo


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:24           ` Andi Kleen
@ 2006-11-15 17:46             ` Eric Dumazet
  2006-11-15 17:49               ` Ingo Molnar
  2006-11-21 11:38               ` Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 17:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On Wednesday 15 November 2006 18:24, Andi Kleen wrote:
> On Wednesday 15 November 2006 18:20, Ingo Molnar wrote:
> > * Andi Kleen <ak@suse.de> wrote:
> > > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> > > > Seeing %gs prefixes used now by i386 port, I recalled seeing strange
> > > > oprofile results on Opteron machines.
> > > >
> > > > I really think %gs prefixes can be expensive in some (most ?) cases,
> > > > even if the Intel/AMD docs say they are free.
> > >
> > > They aren't free, just very cheap.
> >
> > Eric's test shows a 5% slowdown. That's far from cheap.
>
> I have my doubts about the accuracy of his test results. That is why I
> asked him to double check.

Fair enough :)

I plan on doing a *lot* of tests as soon as possible (not possible during daytime, 
unfortunately; I lack a dev machine).

By the way, I tried this patch to avoid reloading %gs at syscall start. Since %gs 
is no longer used inside the kernel (after the i386-pda UP optimization is 
applied), we can leave the user program's %gs value in %gs. (I still force a 
reload of %gs before syscall exit, of course.)

The machine boots but freezes when init starts. Any idea?

Thank you
Eric

[-- Attachment #2: entry.patch --]
[-- Type: text/plain, Size: 1398 bytes --]

--- linux-2.6.19-rc5-mm2/arch/i386/kernel/entry.S	2006-11-15 11:21:25.000000000 +0100
+++ linux-2.6.19-rc5-mm2-ed/arch/i386/kernel/entry.S	2006-11-15 18:40:53.000000000 +0100
@@ -97,6 +97,16 @@
 #define resume_userspace_sig	resume_userspace
 #endif
 
+/*
+ * On UP, we dont need to change %gs since PDA accesses dont use %gs
+ */
+#if defined(CONFIG_SMP)
+#define LOAD_KERNEL_GS(reg)	movl $(__KERNEL_PDA), reg; \
+	movl reg, %gs
+#else
+#define LOAD_KERNEL_GS(reg)
+#endif
+
 #define SAVE_ALL \
 	cld; \
 	pushl %gs; \
@@ -132,8 +142,7 @@
 	movl $(__USER_DS), %edx; \
 	movl %edx, %ds; \
 	movl %edx, %es; \
-	movl $(__KERNEL_PDA), %edx; \
-	movl %edx, %gs
+	LOAD_KERNEL_GS(%edx);
 
 #define RESTORE_INT_REGS \
 	popl %ebx;	\
@@ -544,9 +553,15 @@
 	jmp resume_userspace
 	CFI_ENDPROC
 
+#ifdef CONFIG_SMP
+# define GET_CPU_NUM(reg) movl %gs:PDA_cpu, reg;
+#else
+# define GET_CPU_NUM(reg)
+#endif
+
 #define FIXUP_ESPFIX_STACK \
 	/* since we are on a wrong stack, we cant make it a C code :( */ \
-	movl %gs:PDA_cpu, %ebx; \
+	GET_CPU_NUM(%ebx) \
 	PER_CPU(cpu_gdt_descr, %ebx); \
 	movl GDS_address(%ebx), %ebx; \
 	GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \
@@ -660,8 +675,7 @@
 	pushl %gs
 	CFI_ADJUST_CFA_OFFSET 4
 	/*CFI_REL_OFFSET gs, 0*/
-	movl $(__KERNEL_PDA), %ecx
-	movl %ecx, %gs
+	LOAD_KERNEL_GS(%ecx)
 	UNWIND_ESPFIX_STACK
 	popl %ecx
 	CFI_ADJUST_CFA_OFFSET -4

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:46             ` Eric Dumazet
@ 2006-11-15 17:49               ` Ingo Molnar
  2006-11-15 17:58                 ` Eric Dumazet
  2006-11-21 11:38               ` Eric Dumazet
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 17:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Machine boots but freeze when init starts. Any idea ?

probably caused by this:

> +# define GET_CPU_NUM(reg)

>  #define FIXUP_ESPFIX_STACK \
>  	/* since we are on a wrong stack, we cant make it a C code :( */ \
> -	movl %gs:PDA_cpu, %ebx; \
> +	GET_CPU_NUM(%ebx) \
>  	PER_CPU(cpu_gdt_descr, %ebx); \
>  	movl GDS_address(%ebx), %ebx; \

%ebx very definitely wants to have a current CPU number loaded ;) Pick 
it up from the task struct.

	Ingo

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 11:27     ` [PATCH] i386-pda UP optimization Eric Dumazet
  2006-11-15 11:32       ` Andi Kleen
@ 2006-11-15 17:52       ` Jeremy Fitzhardinge
  2006-11-28 23:12       ` Jeremy Fitzhardinge
  2 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 17:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel

Eric Dumazet wrote:
> I wish Jeremy give us patches for UP machines so that %gs can be let untouched 
> in entry.S (syscall entry/exit). A lot of ia32 machines are still using one 
> CPU.
>   
Unfortunately that would add cruft in a number of places.  At the
moment, context switch, ptrace and vm86 all assume entry.S has saved %gs
into pt_regs, so they can treat it like any other register.  If this
were conditional, it would require multiple places to add #ifndef
CONFIG_SMP code, which is not something I'd like to do without a good
reason.

    J

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:49               ` Ingo Molnar
@ 2006-11-15 17:58                 ` Eric Dumazet
  2006-11-15 18:01                   ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 17:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel

On Wednesday 15 November 2006 18:49, Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> > Machine boots but freeze when init starts. Any idea ?
>
> probably caused by this:
> > +# define GET_CPU_NUM(reg)
> >
> >  #define FIXUP_ESPFIX_STACK \
> >  	/* since we are on a wrong stack, we cant make it a C code :( */ \
> > -	movl %gs:PDA_cpu, %ebx; \
> > +	GET_CPU_NUM(%ebx) \
> >  	PER_CPU(cpu_gdt_descr, %ebx); \
> >  	movl GDS_address(%ebx), %ebx; \
>
> %ebx very definitely wants to have a current CPU number loaded ;) Pick
> it up from the task struct.

Hum.... Are you sure ?

For UP we have this PER_CPU definition :

#define PER_CPU(var, cpu) \
        movl $per_cpu__/**/var, cpu;

You can see that 'cpu' is a pure output, not an input value.

So I basically deleted the first instruction of this sequence:

movl %gs:PDA_cpu, %ebx
movl $per_cpu__cpu_gdt_descr, %ebx;

Did I miss something ?
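
The point about 'cpu' being a pure output on UP can be modelled in plain C.
This is only a sketch: the UP function mirrors the PER_CPU definition quoted
above, while the SMP function (index an offset table by the CPU number, then
add the symbol address) is an assumption about the SMP definition, which is
not quoted in this thread. The names per_cpu__demo, demo_per_cpu_offset,
per_cpu_up and per_cpu_smp are hypothetical, and 'reg' stands in for %ebx.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 2

/* Hypothetical per-CPU variable: one copy per CPU, laid out back to back. */
int per_cpu__demo[DEMO_NR_CPUS];

/* Assumed SMP layout: byte offset of each CPU's copy from the base symbol. */
const uintptr_t demo_per_cpu_offset[DEMO_NR_CPUS] = { 0, sizeof(int) };

/*
 * UP variant, modelled on "movl $per_cpu__var, cpu":
 * reg is overwritten and its previous value is never read.
 */
uintptr_t per_cpu_up(uintptr_t reg)
{
	(void)reg;                           /* incoming value is ignored */
	reg = (uintptr_t)&per_cpu__demo[0];
	return reg;
}

/*
 * Assumed SMP variant: reg holds the CPU number on entry and is used
 * as an index before being overwritten with the final address.
 */
uintptr_t per_cpu_smp(uintptr_t reg)
{
	assert(reg < DEMO_NR_CPUS);
	reg = demo_per_cpu_offset[reg];      /* reads the CPU number */
	reg += (uintptr_t)&per_cpu__demo[0];
	return reg;
}
```

In this model per_cpu_up() returns the same address whatever value is passed
in, which is why dropping the preceding load of the CPU number should be
harmless on UP; only the assumed SMP form depends on it.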

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:32             ` Ingo Molnar
@ 2006-11-15 17:59               ` Jeremy Fitzhardinge
  2006-11-15 18:05                 ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 17:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Eric Dumazet, akpm, Arjan van de Ven, linux-kernel

Ingo Molnar wrote:
> i said this before: using segmentation tricks these days is /insane/. 
> Segmentation is not for free, and it's not going to be cheap in the 
> future. In fact, chances are that it will be /more/ expensive in the 
> future, because sane OSs just make no use of them besides the trivial 
> "they dont even exist" uses.
>   

Many, many systems use %fs/%gs to implement some kind of thread-local
storage, and such usage is becoming more common; the PDA's use of it in
the kernel is no different.  I would agree that using all the obscure
corners of segmentation is just asking for trouble, but using %gs as an
address offset seems like something that's going to be efficient on x86
32/64 processors indefinitely.

> so /at a minimum/, as i suggested it before, the kernel's segment use 
> should not overlap that of glibc's. I.e. the kernel should use %fs, not 
> %gs.

Last time you raised this I did a pretty comprehensive set of tests
which showed there was flat out zero difference between using %fs and
%gs.  There doesn't seem to be anything to the theory that reloading a
null segment selector is in any way cheaper than loading a real
selector.  Did you find a problem in my methodology?

    J

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:58                 ` Eric Dumazet
@ 2006-11-15 18:01                   ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> > > +	GET_CPU_NUM(%ebx) \
> > >  	PER_CPU(cpu_gdt_descr, %ebx); \
> > >  	movl GDS_address(%ebx), %ebx; \
> >
> > %ebx very definitely wants to have a current CPU number loaded ;) Pick
> > it up from the task struct.
> 
> Hum.... Are you sure ?
> 
> For UP we have this PER_CPU definition :
> 
> #define PER_CPU(var, cpu) \
>         movl $per_cpu__/**/var, cpu;

hm, you are right. No quick ideas then.

	Ingo

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:28           ` Jeremy Fitzhardinge
  2006-11-15 17:32             ` Ingo Molnar
@ 2006-11-15 18:01             ` Arjan van de Ven
  2006-11-15 18:24               ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Arjan van de Ven @ 2006-11-15 18:01 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andi Kleen, Eric Dumazet, akpm, linux-kernel

On Wed, 2006-11-15 at 09:28 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > Eric's test shows a 5% slowdown. That's far from cheap.
> >   
> 
> It seems like an absurdly large difference.  PDA references aren't all
> that common in the kernel; for the %gs prefix on PDA accesses to be
> causing a 5% overall difference in a test like this means that the
> prefixes would have to be costing hundreds or thousands of cycles, which
> seems absurd.  Particularly since Eric's patch doesn't touch head.S, so
> the %gs save/restore is still being executed.


Segment register accesses really are not cheap. 
Also, it would really be better to use the register that userspace is not
using, but we have had that discussion before; could you remind me why you
picked %gs in the first place?


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:59               ` Jeremy Fitzhardinge
@ 2006-11-15 18:05                 ` Eric Dumazet
  2006-11-15 18:28                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2006-11-15 18:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andi Kleen, akpm, Arjan van de Ven, linux-kernel

On Wednesday 15 November 2006 18:59, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > i said this before: using segmentation tricks these days is /insane/.
> > Segmentation is not for free, and it's not going to be cheap in the
> > future. In fact, chances are that it will be /more/ expensive in the
> > future, because sane OSs just make no use of them besides the trivial
> > "they dont even exist" uses.
>
> Many, many systems use %fs/%gs to implement some kind of thread-local
> storage, and such usage is becoming more common; the PDA's use of it in
> the kernel is no different.  I would agree that using all the obscure
> corners of segmentation is just asking for trouble, but using %gs as an
> address offset seems like something that's going to be efficient on x86
> 32/64 processors indefinitely.
>
> > so /at a minimum/, as i suggested it before, the kernel's segment use
> > should not overlap that of glibc's. I.e. the kernel should use %fs, not
> > %gs.
>
> Last time you raised this I did a pretty comprehensive set of tests
> which showed there was flat out zero difference between using %fs and
> %gs.  There doesn't seem to be anything to the theory that reloading a
> null segment selector is in any way cheaper than loading a real
> selector.  Did you find a problem in my methodology?

I have the feeling (most probably wrong, but I would rather say it than keep 
it to myself) that the cost of a segment load is deferred until the first use 
of the segment selector. Sort of a lazy reload...

I had this crazy idea while looking at oprofile numbers.


* Re: [PATCH] i386-pda UP optimization
  2006-11-15 18:01             ` Arjan van de Ven
@ 2006-11-15 18:24               ` Jeremy Fitzhardinge
  2006-11-15 19:06                 ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 18:24 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andi Kleen, Eric Dumazet, akpm, linux-kernel

Arjan van de Ven wrote:
> segment register accesses really are not cheap. 
> Also really it'll be better to use the register userspace is not using,
> but we had that discussion before; could you remind me why you picked 
> %gs in the first place?
>   

To leave open the possibility of using the compiler's TLS support in the
kernel for percpu.  I also measured the cost of reloading %gs vs %fs,
and found no difference between reloading a null selector vs a non-null
selector.

    J


* Re: i386 PDA patches use of %gs
  2006-09-13 16:17     ` Jeremy Fitzhardinge
@ 2006-11-15 18:26       ` Ingo Molnar
  2006-11-15 18:29         ` Ingo Molnar
  2006-11-15 18:39         ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> >well, the most important thing i believe you didnt test: the effect of 
> >mixing two descriptors on the _same_ selector: one %gs selector value 
> >loaded and used by glibc, and another %gs selector value loaded and used 
> >by the kernel, intermixed. It's the mixing that causes the descriptor 
> >cache reload. (unless i missed some detail about your testcase)
> 
> But it doesn't mix different descriptors on the same selector; the GDT 
> is initialized when the CPU is brought up, and is unchanged from then 
> on.  The PDA descriptor is GDT entry 27 and the userspace TLS entries 
> are 6-8, so in the typical case %gs will alternate between 0x33 and 
> 0xd8 as it enters and leaves the kernel.
> 
> My test program does the same thing, except using GDT entries 6 and 7 
> (selectors 0x33 and 0x3b).

no, that's not what it does. It measures 50000000 switches of the _same_ 
selector value, without using any of the selectors in the loop itself. 
I.e. no mixing at all! But when the kernel and userspace use %gs, it's 
the cost of switching between two selector values of %gs that has to be 
measured. Your code does not measure that at all, AFAICS.

	Ingo

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 18:05                 ` Eric Dumazet
@ 2006-11-15 18:28                   ` Jeremy Fitzhardinge
  2006-11-15 18:31                     ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 18:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Andi Kleen, akpm, Arjan van de Ven, linux-kernel

Eric Dumazet wrote:
> I have the feeling (most probably wrong, but I prefer to speak than keeping 
> this for myself) that the cost of segment load is delayed up to the first use 
> of a segment selector. Sort of a lazy reload...
>   

Probably not too much, since the load itself has to raise a fault if
there's any problem with the segment itself, and once it is loaded you
can change the underlying descriptor without affecting the segment
register. Even if it were lazy, that would only make the first %gs use a
bit slow, and shouldn't affect the subsequent ones.  However, when I
measured segment register use timings, I didn't see any dramatic costs
associated with segment register use which would account for a 5% hit in
your benchmark.

    J

* Re: i386 PDA patches use of %gs
  2006-11-15 18:26       ` Ingo Molnar
@ 2006-11-15 18:29         ` Ingo Molnar
  2006-11-15 18:43           ` Jeremy Fitzhardinge
  2006-11-15 18:39         ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Ingo Molnar <mingo@elte.hu> wrote:

> > My test program does the same thing, except using GDT entries 6 and 
> > 7 (selectors 0x33 and 0x3b).
> 
> no, that's not what it does. It measures 50000000 switches of the 
> _same_ selector value, without using any of the selectors in the loop 
> itself. I.e. no mixing at all! But when the kernel and userspace uses 
> %gs, it's the cost of switching between two selector values of %gs 
> that has to be measured. Your code does not measure that at all, 
> AFAICS.

for example, your test_fs() code does:

        for(i = 0; i < COUNT; i++) {
                asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
                             : "+m" (*offset): "r" (seg) : "memory");
                sync();
        }

that loads (and uses) a single selector value for %fs, and doesnt do any 
mixed use as far as i can see.

	Ingo

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 18:28                   ` Jeremy Fitzhardinge
@ 2006-11-15 18:31                     ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Eric Dumazet, Andi Kleen, akpm, Arjan van de Ven, linux-kernel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> [...] However, when I measured segment register use timings, I didn't 
> see any dramatic costs associated with segment register use which 
> would account for a 5% hit in your benchmark.

if by that measurement you mean time-segops.c, i dont think it correctly 
measures 'mixed' use of different selector values for the same %gs 
segment selector. And that's what i suggested for you to measure in 
September, and that's what Eric's testcase triggers too.

	Ingo

* Re: i386 PDA patches use of %gs
  2006-11-15 18:26       ` Ingo Molnar
  2006-11-15 18:29         ` Ingo Molnar
@ 2006-11-15 18:39         ` Jeremy Fitzhardinge
  2006-11-15 18:43           ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 18:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell

Ingo Molnar wrote:
> no, that's not what it does. It measures 50000000 switches of the _same_ 
> selector value, without using any of the selectors in the loop itself. 
> I.e. no mixing at all! But when the kernel and userspace uses %gs, it's 
> the cost of switching between two selector values of %gs that has to be 
> measured. Your code does not measure that at all, AFAICS.
>   
I think you're misreading it.  This is the inner loop:

        for(i = 0; i < COUNT; i++) {
                asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
                             : "+m" (*offset): "r" (seg) : "memory");
                sync();
        }
        return "gs";

On entry, %gs will contain the normal usermode TLS selector.  "seg" is
another selector allocated with set_thread_area().  The asm pushes the
old %gs, loads the new one, uses a memory address via the new segment,
then restores the previous %gs.

So given this output:

"Genuine Intel(R) CPU           T2400  @ 1.83GHz" @1000Mhz (6,14,8):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME 
[...]

The initial %fs and %gs are 0 and 0x33 respectively, and it is using
0x3b as the other GDT selector (and 0xf as the other LDT selector).
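
For reference, the selector values quoted in this thread follow from the
standard x86 selector layout: descriptor index in bits 15..3, table indicator
in bit 2 (0 = GDT, 1 = LDT), requested privilege level in bits 1..0. The
helpers below are a minimal sketch of that decoding; the struct and function
names are hypothetical and are not taken from time-segops.c or the kernel.

```c
#include <assert.h>
#include <stdint.h>

struct selector_fields {
	unsigned index;  /* descriptor table entry number, bits 15..3 */
	unsigned ldt;    /* table indicator, bit 2: 0 = GDT, 1 = LDT */
	unsigned rpl;    /* requested privilege level, bits 1..0 */
};

/* Split a 16-bit segment selector into its three fields. */
struct selector_fields decode_selector(uint16_t sel)
{
	struct selector_fields f;

	f.index = sel >> 3;
	f.ldt = (sel >> 2) & 1;
	f.rpl = sel & 3;
	return f;
}

/* Build a selector from a table entry number, table indicator and RPL. */
uint16_t make_selector(unsigned index, unsigned ldt, unsigned rpl)
{
	assert(index < 8192 && ldt < 2 && rpl < 4);
	return (uint16_t)((index << 3) | (ldt << 2) | rpl);
}
```

With these, GDT entries 6 and 7 at RPL 3 give 0x33 and 0x3b, GDT entry 27 at
RPL 0 gives 0xd8 (the PDA selector mentioned earlier), and 0xf decodes as LDT
entry 1 at RPL 3.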

    J

* Re: i386 PDA patches use of %gs
  2006-11-15 18:29         ` Ingo Molnar
@ 2006-11-15 18:43           ` Jeremy Fitzhardinge
  2006-11-15 18:44             ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell

Ingo Molnar wrote:
> for example, your test_fs() code does:
>
>         for(i = 0; i < COUNT; i++) {
>                 asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
>                              : "+m" (*offset): "r" (seg) : "memory");
>                 sync();
>         }
>
> that loads (and uses) a single selector value for %fs, and doesnt do any 
> mixed use as far as i can see.

I'm not sure what you're getting at.  Each loop iteration is analogous
to a user->kernel->user transition with respect to the
save/reload/use/restore pattern on the segment register.  In this case,
%fs starts as a null selector, gets reloaded with a non NULL selector,
and then is restored to null.  Do you mean some other mixing?

    J

* Re: i386 PDA patches use of %gs
  2006-11-15 18:39         ` Jeremy Fitzhardinge
@ 2006-11-15 18:43           ` Ingo Molnar
  2006-11-15 18:49             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:43 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> > no, that's not what it does. It measures 50000000 switches of the _same_ 
> > selector value, without using any of the selectors in the loop itself. 
> > I.e. no mixing at all! But when the kernel and userspace uses %gs, it's 
> > the cost of switching between two selector values of %gs that has to be 
> > measured. Your code does not measure that at all, AFAICS.
> >   
> I think you're misreading it.  This is the inner loop:
> 
>         for(i = 0; i < COUNT; i++) {
>                 asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
>                              : "+m" (*offset): "r" (seg) : "memory");
>                 sync();
>         }
>         return "gs";
> 
> On entry, %gs will contain the normal usermode TLS selector.  "seg" is 
> another selector allocated with set_thread_area().  The asm pushes the 
> old %gs, loads the new one, uses a memory address via the new segment, 
> then restores the previous %gs.

but it does not actually use the 'normal usermode TLS selector' - it 
only loads it.

a meaningful test would be to allocate two selector values and load and 
read+write memory through both of them.

	Ingo

* Re: i386 PDA patches use of %gs
  2006-11-15 18:43           ` Jeremy Fitzhardinge
@ 2006-11-15 18:44             ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> > that loads (and uses) a single selector value for %fs, and doesnt do 
> > any mixed use as far as i can see.
> 
> I'm not sure what you're getting at.  Each loop iteration is analogous 
> to a user->kernel->user transition with respect to the 
> save/reload/use/restore pattern on the segment register.  In this 
> case, %fs starts as a null selector, gets reloaded with a non NULL 
> selector, and then is restored to null.  Do you mean some other 
> mixing?

yeah, mixed use: i.e. set up /two/ selector values and load them into 
%gs and read+write memory through them. It might not change the results, 
but that's what i meant under 'mixed use'.

	Ingo

* Re: i386 PDA patches use of %gs
  2006-11-15 18:43           ` Ingo Molnar
@ 2006-11-15 18:49             ` Jeremy Fitzhardinge
  2006-11-15 18:49               ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 18:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell

Ingo Molnar wrote:
> but it does not actually use the 'normal usermode TLS selector' - it 
> only loads it.
>
> a meaningful test would be to allocate two selector values and load and 
> read+write memory through both of them.
>   

Well, obviously in one case it would need to switch between
null/non-null/null.  But yes, good point about using the "usermode" %gs
each iteration.  I'll do some more tests.

    J

* Re: i386 PDA patches use of %gs
  2006-11-15 18:49             ` Jeremy Fitzhardinge
@ 2006-11-15 18:49               ` Ingo Molnar
  2006-11-15 19:00                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 18:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> > but it does not actually use the 'normal usermode TLS selector' - it 
> > only loads it.
> >
> > a meaningful test would be to allocate two selector values and load and 
> > read+write memory through both of them.
> >   
> 
> Well, obviously in one case it would need to switch between 
> null/non-null/null.  But yes, good point about using the "usermode" 
> %gs each iteration.  I'll do some more tests.

i'd not even use glibc's %gs but set up two separate selectors. (that's 
a more controlled experiment - someone might run a non-TLS glibc, etc.)

	Ingo

* Re: i386 PDA patches use of %gs
  2006-11-15 18:49               ` Ingo Molnar
@ 2006-11-15 19:00                 ` Jeremy Fitzhardinge
  2006-11-15 19:03                   ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-15 19:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell

Ingo Molnar wrote:
> i'd not even use glibc's %gs but set up two separate selectors. (that's 
> a more controlled experiment - someone might run a non-TLS glibc, etc.)
>   

Well, in that case they probably don't care whether the kernel uses %fs
or %gs ;)

But either way, this doesn't have much bearing on Eric's test; we'd be
only talking about a few ns per kernel exit, rather than 5% for read/write.

    J

* Re: i386 PDA patches use of %gs
  2006-11-15 19:00                 ` Jeremy Fitzhardinge
@ 2006-11-15 19:03                   ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 19:03 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, akpm, ak, linux-kernel, Michael.Fetterman,
	Ian Campbell


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> > i'd not even use glibc's %gs but set up two separate selectors. 
> > (that's a more controlled experiment - someone might run a non-TLS 
> > glibc, etc.)
> >   
> 
> Well, in that case they probably don't care whether the kernel uses 
> %fs or %gs ;)
> 
> But either way, this doesn't have much bearing on Eric's test; we'd be 
> only talking about a few ns per kernel exit, rather than 5% for 
> read/write.

if the timings are different then it very much has bearing on the 
argument that i made against the current i386 PDA patchset, that mixed 
use segments are suboptimal.

So i'm NAK-ing the i386 PDA patchset until this has been properly 
measured (and fixed if needed).

	Ingo

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 18:24               ` Jeremy Fitzhardinge
@ 2006-11-15 19:06                 ` Ingo Molnar
  2006-11-17  0:24                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2006-11-15 19:06 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Arjan van de Ven, Andi Kleen, Eric Dumazet, akpm, linux-kernel


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Arjan van de Ven wrote:
> > segment register accesses really are not cheap. 
> > Also really it'll be better to use the register userspace is not using,
> > but we had that discussion before; could you remind me why you picked 
> > %gs in the first place?
> >   
> 
> To leave open the possibility of using the compiler's TLS support in 
> the kernel for percpu.  I also measured the cost of reloading %gs vs 
> %fs, and found no difference between reloading a null selector vs a 
> non-null selector.

what point would there be in using it? It's not like the kernel could 
make use of the thread keyword anytime soon (it would need /all/ 
architectures to support it) ... and the kernel doesnt mind how the 
current per_cpu() primitives are implemented, via assembly or via C. In 
any case, it very much matters to see the precise cost of having the pda 
selector value in %gs versus %fs.

	Ingo

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 19:06                 ` Ingo Molnar
@ 2006-11-17  0:24                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-17  0:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Andi Kleen, Eric Dumazet, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 843 bytes --]

Ingo Molnar wrote:
> what point would there be in using it? It's not like the kernel could 
> make use of the thread keyword anytime soon (it would need /all/ 
> architectures to support it) ...

The plan was to implement the x86 arch-specific percpu stuff to use it,
since it allows gcc better optimisation opportunities.

>  and the kernel doesnt mind how the 
> current per_cpu() primitives are implemented, via assembly or via C. In 
> any case, it very much matters to see the precise cost of having the pda 
> selector value in %gs versus %fs.
>   

Hm, well, unfortunately for me, there is a small but distinct advantage
to using %fs rather than %gs (around 0-5ns per iteration).  The notable
exception being the "AMD-K6(tm) 3D+ Processor", where %gs is about 25%
(15ns) faster.

I'll revise the patches to use %fs and resubmit.

    J

[-- Attachment #2: results-mixed.txt --]
[-- Type: text/plain, Size: 3720 bytes --]

"Genuine Intel(R) CPU           T2400  @ 1.83GHz" @1000Mhz (6,14,8):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME 
   <none> with data selector: 0ns/iteration
   fs with data selector: 26ns/iteration
   gs with data selector: 30ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 26ns/iteration
   gs with LDT selector: 26ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 26ns/iteration
   gs with GDT selector: 30ns/iteration

"Intel(R) Pentium(R) 4 CPU 1.80GHz" @1817.9Mhz (15,2,4):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME 
   <none> with data selector: 0ns/iteration
   fs with data selector: 33ns/iteration
   gs with data selector: 34ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 43ns/iteration
   gs with LDT selector: 52ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 33ns/iteration
   gs with GDT selector: 34ns/iteration

"Intel(R) Celeron(R) CPU 2.40GHz" @2394.47Mhz (15,2,9):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME 
   <none> with data selector: 0ns/iteration
   fs with data selector: 20ns/iteration
   gs with data selector: 24ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 21ns/iteration
   gs with LDT selector: 26ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 21ns/iteration
   gs with GDT selector: 26ns/iteration

"Pentium 75 - 200" @166.206Mhz (5,2,12):
ds=7b fs=0 gs=33 ldt=f gdt=3b GTOD
   <none> with data selector: 1ns/iteration
   fs with data selector: 74ns/iteration
   gs with data selector: 75ns/iteration

   <none> with LDT selector: 1ns/iteration
   fs with LDT selector: 74ns/iteration
   gs with LDT selector: 75ns/iteration

   <none> with GDT selector: 1ns/iteration
   fs with GDT selector: 74ns/iteration
   gs with GDT selector: 74ns/iteration

"AMD-K6(tm) 3D+ Processor" @451.105Mhz (5,9,1):
ds=7b fs=0 gs=33 ldt=f gdt=3b GTOD
   <none> with data selector: 0ns/iteration
   fs with data selector: 59ns/iteration
   gs with data selector: 44ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 59ns/iteration
   gs with LDT selector: 44ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 59ns/iteration
   gs with GDT selector: 44ns/iteration

"AMD Athlon(tm) XP 3000+" @2162.74Mhz (6,10,0):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME
   <none> with data selector: 0ns/iteration
   fs with data selector: 10ns/iteration
   gs with data selector: 11ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 11ns/iteration
   gs with LDT selector: 11ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 11ns/iteration
   gs with GDT selector: 11ns/iteration


"AMD Athlon(tm) 64 Processor 3500+" @2210.23Mhz (15,31,0):
ds=2b fs=0 gs=63 ldt=f gdt=6b GTOD
   <none> with data selector: 0ns/iteration
   fs with data selector: 11ns/iteration
   gs with data selector: 11ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 10ns/iteration
   gs with LDT selector: 11ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 10ns/iteration
   gs with GDT selector: 11ns/iteration

"Pentium III (Coppermine)" @700Mhz (6,8,6):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME
   <none> with data selector: 0ns/iteration
   fs with data selector: 38ns/iteration
   gs with data selector: 45ns/iteration

   <none> with LDT selector: 0ns/iteration
   fs with LDT selector: 39ns/iteration
   gs with LDT selector: 41ns/iteration

   <none> with GDT selector: 0ns/iteration
   fs with GDT selector: 39ns/iteration
   gs with GDT selector: 44ns/iteration
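
To compare these figures across machines with very different clock rates, it can help to express them in CPU cycles rather than nanoseconds. The following is a minimal sketch, assuming the reported clock frequency is accurate and constant during the run; the helper name ns_to_cycles is hypothetical and not part of the benchmark program that produced the output above.

```c
#include <assert.h>

/*
 * Convert a per-iteration time in nanoseconds to CPU cycles.
 * cycles = ns * (MHz * 1e6 cycles/s) / (1e9 ns/s) = ns * MHz / 1000
 */
static double ns_to_cycles(double ns, double mhz)
{
	assert(mhz > 0.0);	/* a clock rate of zero makes no sense here */
	return ns * mhz / 1000.0;
}
```

For example, the Pentium's 74 ns at 166.206 MHz works out to roughly 12 cycles, while the Celeron's 20 ns at 2394.47 MHz is roughly 48 cycles.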

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 17:46             ` Eric Dumazet
  2006-11-15 17:49               ` Ingo Molnar
@ 2006-11-21 11:38               ` Eric Dumazet
  2006-11-21 21:42                 ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2006-11-21 11:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, akpm, Arjan van de Ven, Jeremy Fitzhardinge, linux-kernel

On Wednesday 15 November 2006 18:46, Eric Dumazet wrote:
> On Wednesday 15 November 2006 18:24, Andi Kleen wrote:
> > On Wednesday 15 November 2006 18:20, Ingo Molnar wrote:
> > > * Andi Kleen <ak@suse.de> wrote:
> > > > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> > > > > Seeing %gs prefixes used now by i386 port, I recalled seeing
> > > > > strange oprofile results on Opteron machines.
> > > > >
> > > > > I really think %gs prefixes can be expensive in some (most ?)
> > > > > cases, even if the Intel/AMD docs say they are free.
> > > >
> > > > They aren't free, just very cheap.
> > >
> > > Eric's test shows a 5% slowdown. That's far from cheap.
> >
> > I have my doubts about the accuracy of his test results. That is why I
> > asked him to double check.
>
> Fair enough :)
>
> I plan doing *lot* of tests as soon as possible (not possible during
> daytime unfortunately, I miss a dev machine)
>

I did a *lot* of reboots of my Dell D610 machine, with some trivial benchmarks 
using pipe/write()/read(), umask(), or getppid(), with and without oprofile.

I managed to avoid reloading %gs in sysenter_entry
(avoiding the two instructions: movl $(__KERNEL_PDA), %edx; movl %edx, %gs).

I could not avoid reloading %gs in system_call; I don't know why, but modern 
glibc uses sysenter so I don't care :)

I confirm I got better results with my patched kernel in all tests I've done.

umask : 12.64 s instead of 12.90 s
getppid : 13.37 s instead of 13.72 s
pipe/read/write : 9.10 s instead of 9.52 s

(I got very different results in the umask() bench after patching it not to use 
xchg(), since this instruction is expensive on x86 and really changes the 
oprofile results. I will submit a patch for this.)

Eric

* Re: [PATCH] i386-pda UP optimization
  2006-11-21 11:38               ` Eric Dumazet
@ 2006-11-21 21:42                 ` Jeremy Fitzhardinge
  2006-11-21 21:52                   ` Andi Kleen
  2006-11-21 21:58                   ` Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-21 21:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel

Eric Dumazet wrote:
> I did a *lot* of reboots of my Dell D610 machine, with some trivial benchmarks 
> using pipe/write()/read(), umask(), or getppid(), with and without oprofile.
>
> I managed to avoid reloading %gs in sysenter_entry
> (avoiding the two instructions: movl $(__KERNEL_PDA), %edx; movl %edx, %gs).
>
> I could not avoid reloading %gs in system_call; I don't know why, but modern 
> glibc uses sysenter so I don't care :)
>
> I confirm I got better results with my patched kernel in all tests I've done.
>
> umask : 12.64 s instead of 12.90 s
> getppid : 13.37 s instead of 13.72 s
> pipe/read/write : 9.10 s instead of 9.52 s
>
> (I got very different results in the umask() bench after patching it not to use 
> xchg(), since this instruction is expensive on x86 and really changes the 
> oprofile results. I will submit a patch for this.)
>   

Could you go into more detail about what you're actually measuring
here?  Is it 10,000,000 loops of the single syscall?  pipe/read/write
suggests that you're doing at least 2 syscalls per loop, but it takes
the smallest elapsed time.

What are you using as your time reference?  Real time?  Process time?

For umask/getppid, assuming you're just running 1e7 iterations, you're
seeing a difference of 25 and 35ns per iteration.  I wonder
why it would be different for different syscalls; I would expect it to
be a constant overhead either way.  Certainly these numbers are much
larger than I saw when I benchmarked pda-vs-nopda using lmbench's null
syscall (getppid) test; I saw an overall 9ns difference in null syscall
time on my Core Duo run at 1GHz.  What's your CPU and speed?

One possibility is a cache miss on the gdt while reloading %gs.  I've
been planning on a patch to rearrange the gdt in order to pack all the
commonly used segment descriptors into one or two cache lines so that
all the segment register reloads can be done with a minimum of cache
misses.  It would be interesting for you to replace the:

    movl $(__KERNEL_PDA), %edx; movl %edx, %gs

with an appropriate read of the gdt entry, hm, which is a bit complex to
find.
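
To make the cache-line argument concrete: an x86 segment selector holds the descriptor index in bits 15..3, and each descriptor is 8 bytes, so the cache line a descriptor lands in follows directly from the selector value. The following is a rough sketch, assuming the table base is aligned to the cache-line size; the helper names and the 64-byte line size used in the note below are illustrative assumptions, not taken from the kernel.

```c
#include <assert.h>

#define GDT_DESC_SIZE 8u	/* each x86 segment descriptor is 8 bytes */

/* Descriptor index encoded in a selector (bits 15..3; bits 2..0 are TI and RPL). */
static unsigned int selector_index(unsigned int selector)
{
	return selector >> 3;
}

/*
 * Cache line, counted from the table base, that holds the descriptor.
 * Assumes the table base is aligned to line_size bytes.
 */
static unsigned int desc_cache_line(unsigned int selector, unsigned int line_size)
{
	assert(line_size >= GDT_DESC_SIZE);
	return selector_index(selector) * GDT_DESC_SIZE / line_size;
}
```

With an assumed 64-byte line, a selector such as 0x7b (index 15) falls in the second line of the table, while 0x33 (index 6) falls in the first, so loading both touches two lines.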

    J

* Re: [PATCH] i386-pda UP optimization
  2006-11-21 21:42                 ` Jeremy Fitzhardinge
@ 2006-11-21 21:52                   ` Andi Kleen
  2006-11-21 22:10                     ` Jeremy Fitzhardinge
  2006-11-21 21:58                   ` Eric Dumazet
  1 sibling, 1 reply; 45+ messages in thread
From: Andi Kleen @ 2006-11-21 21:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Eric Dumazet, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel


> For umask/getppid, assuming you're just running 1e7 iterations, you're
> seeing a difference of 25 and 35ns per iteration.  I wonder
> why it would be different for different syscalls; I would expect it to
> be a constant overhead either way.

They got different numbers of current references? 

> Certainly these numbers are much 
> larger than I saw when I benchmarked pda-vs-nopda using lmbench's null
> syscall (getppid) test; I saw an overall 9ns difference in null syscall
> time on my Core Duo run at 1GHz.  What's your CPU and speed?
> 
> One possibility is a cache miss on the gdt while reloading %gs.  I've

On such micro benchmarks everything should be cache hot in theory
(unless it's a system with really small cache)

> been planning on a patch to rearrange the gdt in order to pack all the
> commonly used segment descriptors into one or two cache lines so that
> all the segment register reloads can be done with a minimum of cache
> misses.  It would be interesting for you to replace the:
> 
>     movl $(__KERNEL_PDA), %edx; movl %edx, %gs
> 
> with an appropriate read of the gdt entry, hm, which is a bit complex to
> find.

On UP it could be hardcoded. And oprofile can be used to profile for cache misses.

-Andi

* Re: [PATCH] i386-pda UP optimization
  2006-11-21 21:42                 ` Jeremy Fitzhardinge
  2006-11-21 21:52                   ` Andi Kleen
@ 2006-11-21 21:58                   ` Eric Dumazet
  2006-11-21 23:12                     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2006-11-21 21:58 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel

Jeremy Fitzhardinge wrote:
> Eric Dumazet wrote:
>> I did a *lot* of reboots of my Dell D610 machine, with some trivial benchmarks 
>> using pipe/write()/read(), umask(), or getppid(), with and without oprofile.
>>
>> I managed to avoid reloading %gs in sysenter_entry
>> (avoiding the two instructions: movl $(__KERNEL_PDA), %edx; movl %edx, %gs).
>>
>> I could not avoid reloading %gs in system_call; I don't know why, but modern 
>> glibc uses sysenter so I don't care :)
>>
>> I confirm I got better results with my patched kernel in all tests I've done.
>>
>> umask : 12.64 s instead of 12.90 s
>> getppid : 13.37 s instead of 13.72 s
>> pipe/read/write : 9.10 s instead of 9.52 s
>>
>> (I got very different results in the umask() bench after patching it not to use 
>> xchg(), since this instruction is expensive on x86 and really changes the 
>> oprofile results. I will submit a patch for this.)
>>   
> 
> Could you go into more detail about what you're actually measuring
> here?  Is it 10,000,000 loops of the single syscall?  pipe/read/write
> suggests that you're doing at least 2 syscalls per loop, but it takes
> the smallest elapsed time.

For umask()/getppid(), it's a basic loop with 100,000,000 iterations;
for read()/write(), a loop with 10,000,000 iterations.
> 
> What are you using as your time reference?  Real time?  Process time?
> 

elapsed time (/usr/bin/time ./prog)
10 runs, and the minimum time is taken.
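
A benchmark of this shape can be sketched roughly as follows. This is not Eric's actual program: it times the loop in-process with clock_gettime(CLOCK_MONOTONIC) instead of using /usr/bin/time, it assumes a Linux system, and the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Wall-clock seconds taken by `iters` getppid() system calls. */
static double time_getppid_loop(long iters)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iters; i++)
		syscall(SYS_getppid);	/* raw syscall, bypassing the libc wrapper */
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Repeat the loop `runs` times and keep the minimum elapsed time. */
static double min_elapsed(int runs, long iters)
{
	assert(runs > 0);
	double best = time_getppid_loop(iters);

	for (int r = 1; r < runs; r++) {
		double t = time_getppid_loop(iters);
		if (t < best)
			best = t;
	}
	return best;
}
```

Calling min_elapsed(10, 100000000L) would mirror the setup described above; taking the minimum rather than the mean filters out runs disturbed by unrelated system activity.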

> For umask/getppid, assuming you're just running 1e7 iterations, you're
> seeing a difference of 25 and 35ns per iteration.  I wonder
> why it would be different for different syscalls; I would expect it to
> be a constant overhead either way.  Certainly these numbers are much
> larger than I saw when I benchmarked pda-vs-nopda using lmbench's null
> syscall (getppid) test; I saw an overall 9ns difference in null syscall
> time on my Core Duo run at 1GHz.  What's your CPU and speed?

It's a 1.6GHz Pentium-M CPU (Dell D610)

> 
> One possibility is a cache miss on the gdt while reloading %gs.  I've
> been planning on a patch to rearrange the gdt in order to pack all the
> commonly used segment descriptors into one or two cache lines so that
> all the segment register reloads can be done with a minimum of cache
> misses.  It would be interesting for you to replace the:
> 
>     movl $(__KERNEL_PDA), %edx; movl %edx, %gs
> 
> with an appropriate read of the gdt entry, hm, which is a bit complex to
> find.
> 

Hmm... Do you mean a cache miss every time we do a syscall? What could 
invalidate this cache exactly?



* Re: [PATCH] i386-pda UP optimization
  2006-11-21 21:52                   ` Andi Kleen
@ 2006-11-21 22:10                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-21 22:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric Dumazet, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel

Andi Kleen wrote:
>> For umask/getppid, assuming you're just running 1e7 iterations, you're
>> seeing a difference of 25 and 35ns per iteration.  I wonder
>> why it would be different for different syscalls; I would expect it to
>> be a constant overhead either way.
>>     
>
> They got different numbers of current references? 
>   

My understanding is that Eric has changed UP current (and other PDA ops)
to not touch %gs at all, and the difference in reported times is due to
omitting the %gs load in entry.S (though %gs is still saved/restored on
the stack).

> On such micro benchmarks everything should be cache hot in theory
> (unless it's a system with really small cache)
>   

Yes, that would be my thought too.  Maybe there's excessive aliasing
on one of the ways, but I think he's using a Pentium M, which has an 8-way L1.

>> been planning on a patch to rearrange the gdt in order to pack all the
>> commonly used segment descriptors into one or two cache lines so that
>> all the segment register reloads can be done with a minimum of cache
>> misses.  It would be interesting for you to replace the:
>>
>>     movl $(__KERNEL_PDA), %edx; movl %edx, %gs
>>
>> with an appropriate read of the gdt entry, hm, which is a bit complex to
>> find.
>>     
>
> On UP it could be hardcoded. And oprofile can be used to profile for cache misses.
>   

Yes, assuming oprofile doesn't interfere with things too much. 
Actually, just counting cache miss events during the course of a syscall
would be most interesting (ie, no need to sample).


    J

* Re: [PATCH] i386-pda UP optimization
  2006-11-21 21:58                   ` Eric Dumazet
@ 2006-11-21 23:12                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-21 23:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Ingo Molnar, akpm, Arjan van de Ven, linux-kernel

Eric Dumazet wrote:
> for umask()/getppid(), it's a basic loop with 100,000,000 iterations

Ah, OK, so there's about 2.5-3.5ns difference due to the instructions
you removed.  That's very much in line with what I saw in my measurements.
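
The per-iteration figures follow from dividing the difference in total run time by the iteration count. As a small worked sketch (the helper name per_iter_ns is hypothetical; the inputs are the run times and iteration counts reported earlier in the thread):

```c
#include <assert.h>

/* Per-iteration saving in nanoseconds, given two total run times in seconds. */
static double per_iter_ns(double before_s, double after_s, double iterations)
{
	assert(iterations > 0.0);
	return (before_s - after_s) * 1e9 / iterations;
}
```

With 1e8 iterations, umask gives per_iter_ns(12.90, 12.64, 1e8), about 2.6 ns, and getppid gives per_iter_ns(13.72, 13.37, 1e8), about 3.5 ns. The pipe/read/write test with 1e7 iterations gives per_iter_ns(9.52, 9.10, 1e7), about 42 ns per loop iteration.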

> for read()/write(), a loop with 10,000,000 iterations

2 syscalls/iteration?  It's interesting you measured about the same
absolute time difference (.42s) even though you're doing 1/5th the
number of syscalls.

> elapsed time (/usr/bin/time ./prog)
> 10 runs, and the minimum time is taken.

Hm, but "time" measures user, system and real time.  You used real time?

> Hmm... Do you mean a cache miss every time we do a syscall? What
> could invalidate this cache exactly?

Well, there might be a miss simply because the line got evicted.  But as
Andi pointed out, a hot benchmark like this is very unlikely to get any
cache misses unless there's something very unfortunate happening.

    J

* Re: [PATCH] i386-pda UP optimization
  2006-11-15 11:27     ` [PATCH] i386-pda UP optimization Eric Dumazet
  2006-11-15 11:32       ` Andi Kleen
  2006-11-15 17:52       ` Jeremy Fitzhardinge
@ 2006-11-28 23:12       ` Jeremy Fitzhardinge
  2006-11-29  9:30         ` Eric Dumazet
  2 siblings, 1 reply; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-28 23:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel

Eric Dumazet wrote:
> Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile 
> results on Opteron machines.

Hi Eric,

Could you try this patch out and see if it makes much performance
difference for you.  You should apply this on top of the %fs patch I
posted earlier (and use the %fs patch as the baseline for your comparisons).

Thanks,
    J

Don't bother with segment references for UP PDA

When compiled for UP, don't bother prefixing PDA references with a
segment override.  Also doesn't bother reloading the PDA segment
register (though it still gets saved and restored, because the value
is used elsewhere in the kernel, and the restore is necessary for
correct context switches).

I'm not very keen on the extra #ifdefs this adds, though I've tried to
keep them minimal.  Eric Dumazet reports small performance gains from
a similar patch, however.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@muc.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>

diff -r 022c29ea754e arch/i386/kernel/cpu/common.c
--- a/arch/i386/kernel/cpu/common.c	Tue Nov 21 18:54:56 2006 -0800
+++ b/arch/i386/kernel/cpu/common.c	Wed Nov 22 01:54:02 2006 -0800
@@ -628,7 +628,11 @@ static __cpuinit int alloc_gdt(int cpu)
 		BUG_ON(gdt != NULL || pda != NULL);
 
 		gdt = alloc_bootmem_pages(PAGE_SIZE);
+#ifndef CONFIG_SMP
+		pda = &boot_pda;
+#else
 		pda = alloc_bootmem(sizeof(*pda));
+#endif
 		/* alloc_bootmem(_pages) panics on failure, so no check */
 
 		memset(gdt, 0, PAGE_SIZE);
@@ -661,6 +665,10 @@ struct i386_pda boot_pda = {
 	.cpu_number = 0,
 	.pcurrent = &init_task,
 };
+#ifndef CONFIG_SMP
+/* boot_pda is used for all PDA access in UP */
+EXPORT_SYMBOL(boot_pda);
+#endif
 
 static inline void set_kernel_fs(void)
 {
diff -r 022c29ea754e arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S	Tue Nov 21 18:54:56 2006 -0800
+++ b/arch/i386/kernel/entry.S	Wed Nov 22 13:38:56 2006 -0800
@@ -97,6 +97,16 @@ 1:
 #define resume_userspace_sig	resume_userspace
 #endif
 
+#ifdef CONFIG_SMP
+#define LOAD_PDA_SEG(reg)	\
+	movl $(__KERNEL_PDA), reg; \
+	movl reg, %fs
+#define CUR_CPU(reg)	movl %fs:PDA_cpu, reg
+#else
+#define LOAD_PDA_SEG(reg)
+#define CUR_CPU(reg)	movl boot_pda+PDA_cpu, reg
+#endif
+	
 #define SAVE_ALL \
 	cld; \
 	pushl %fs; \
@@ -132,8 +142,7 @@ 1:
 	movl $(__USER_DS), %edx; \
 	movl %edx, %ds; \
 	movl %edx, %es; \
-	movl $(__KERNEL_PDA), %edx; \
-	movl %edx, %fs
+	LOAD_PDA_SEG(%edx)
 
 #define RESTORE_INT_REGS \
 	popl %ebx;	\
@@ -546,7 +555,7 @@ syscall_badsys:
 
 #define FIXUP_ESPFIX_STACK \
 	/* since we are on a wrong stack, we cant make it a C code :( */ \
-	movl %fs:PDA_cpu, %ebx; \
+	CUR_CPU(%ebx); \
 	PER_CPU(cpu_gdt_descr, %ebx); \
 	movl GDS_address(%ebx), %ebx; \
 	GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \
diff -r 022c29ea754e include/asm-i386/pda.h
--- a/include/asm-i386/pda.h	Tue Nov 21 18:54:56 2006 -0800
+++ b/include/asm-i386/pda.h	Wed Nov 22 02:35:24 2006 -0800
@@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[];
 
 #define cpu_pda(i)	(_cpu_pda[i])
 
+/* Use boot-time PDA for UP.  For SMP we still need to declare it, but
+   it isn't used. */
+extern struct i386_pda boot_pda;
+
+#ifdef CONFIG_SMP
+#define PDA_REF		"%%fs:%c[off]"
+#else
+#define PDA_REF		"%[mem]"
+#endif
+
 #define pda_offset(field) offsetof(struct i386_pda, field)
 
 extern void __bad_pda_field(void);
@@ -33,28 +43,31 @@ extern void __bad_pda_field(void);
    clobbers, so gcc can readily analyse them. */
 extern struct i386_pda _proxy_pda;
 
-#define pda_to_op(op,field,val)						\
+#define pda_to_op(op,field,_val)					\
 	do {								\
 		typedef typeof(_proxy_pda.field) T__;			\
-		if (0) { T__ tmp__; tmp__ = (val); }			\
+		if (0) { T__ tmp__; tmp__ = (_val); }			\
 		switch (sizeof(_proxy_pda.field)) {			\
 		case 1:							\
-			asm(op "b %1,%%fs:%c2"				\
-			    : "+m" (_proxy_pda.field)			\
-			    :"ri" ((T__)val),				\
-			     "i"(pda_offset(field)));			\
+			asm(op "b %[val]," PDA_REF			\
+			    : "+m" (_proxy_pda.field),			\
+			      [mem] "+m" (boot_pda.field)		\
+			    : [val] "ri" ((T__)_val),			\
+			      [off] "i" (pda_offset(field)));		\
 			break;						\
 		case 2:							\
-			asm(op "w %1,%%fs:%c2"				\
-			    : "+m" (_proxy_pda.field)			\
-			    :"ri" ((T__)val),				\
-			     "i"(pda_offset(field)));			\
+			asm(op "w %[val]," PDA_REF			\
+			    : "+m" (_proxy_pda.field),			\
+			      [mem] "+m" (boot_pda.field)		\
+			    : [val] "ri" ((T__)_val),			\
+			      [off] "i" (pda_offset(field)));		\
 			break;						\
 		case 4:							\
-			asm(op "l %1,%%fs:%c2"				\
-			    : "+m" (_proxy_pda.field)			\
-			    :"ri" ((T__)val),				\
-			     "i"(pda_offset(field)));			\
+			asm(op "l %[val]," PDA_REF			\
+			    : "+m" (_proxy_pda.field),			\
+			      [mem] "+m" (boot_pda.field)		\
+			    : [val] "ri" ((T__)_val),			\
+			      [off] "i" (pda_offset(field)));		\
 			break;						\
 		default: __bad_pda_field();				\
 		}							\
@@ -65,22 +78,25 @@ extern struct i386_pda _proxy_pda;
 		typeof(_proxy_pda.field) ret__;				\
 		switch (sizeof(_proxy_pda.field)) {			\
 		case 1:							\
-			asm(op "b %%fs:%c1,%0"				\
-			    : "=r" (ret__)				\
-			    : "i" (pda_offset(field)),			\
-			      "m" (_proxy_pda.field));			\
+			asm(op "b " PDA_REF ",%[ret]"			\
+			    : [ret] "=r" (ret__)			\
+			    : [off] "i" (pda_offset(field)),		\
+			      "m" (_proxy_pda.field),			\
+			      [mem] "m" (boot_pda.field));		\
 			break;						\
 		case 2:							\
-			asm(op "w %%fs:%c1,%0"				\
-			    : "=r" (ret__)				\
-			    : "i" (pda_offset(field)),			\
-			      "m" (_proxy_pda.field));			\
+			asm(op "w " PDA_REF ",%[ret]"			\
+			    : [ret] "=r" (ret__)			\
+			    : [off] "i" (pda_offset(field)),		\
+			      "m" (_proxy_pda.field),			\
+			      [mem] "m" (boot_pda.field));		\
 			break;						\
 		case 4:							\
-			asm(op "l %%fs:%c1,%0"				\
-			    : "=r" (ret__)				\
-			    : "i" (pda_offset(field)),			\
-			      "m" (_proxy_pda.field));			\
+			asm(op "l " PDA_REF ",%[ret]"			\
+			    : [ret] "=r" (ret__)			\
+			    : [off] "i" (pda_offset(field)),		\
+			      "m" (_proxy_pda.field),			\
+			      [mem] "m" (boot_pda.field));		\
 			break;						\
 		default: __bad_pda_field();				\
 		}							\



* Re: [PATCH] i386-pda UP optimization
  2006-11-28 23:12       ` Jeremy Fitzhardinge
@ 2006-11-29  9:30         ` Eric Dumazet
  2006-11-29  9:56           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2006-11-29  9:30 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel

On Wednesday 29 November 2006 00:12, Jeremy Fitzhardinge wrote:

> Hi Eric,
>
> Could you try this patch out and see if it makes much performance
> difference for you.  You should apply this on top of the %fs patch I
> posted earlier (and use the %fs patch as the baseline for your
> comparisons).

Hi Jeremy

I will try this as soon as possible, thank you.

However, I have some remarks from browsing your patch.


> +#ifdef CONFIG_SMP
> +#define LOAD_PDA_SEG(reg)	\
> +	movl $(__KERNEL_PDA), reg; \
> +	movl reg, %fs
> +#define CUR_CPU(reg)	movl %fs:PDA_cpu, reg
> +#else
> +#define LOAD_PDA_SEG(reg)
> +#define CUR_CPU(reg)	movl boot_pda+PDA_cpu, reg

If !CONFIG_SMP, why even dereference boot_pda+PDA_cpu to get 0?
And as PER_CPU(cpu_gdt_descr, %ebx) in !CONFIG_SMP doesn't need a value in 
ebx, you can just do:

#define CUR_CPU(reg) /* nothing */


> --- a/include/asm-i386/pda.h	Tue Nov 21 18:54:56 2006 -0800
> +++ b/include/asm-i386/pda.h	Wed Nov 22 02:35:24 2006 -0800
> @@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[];
>

My patch was better IMHO: we don't need to force asm() instructions to 
perform regular C variable reads/writes in the !CONFIG_SMP case.

Using plain C allows the compiler to generate better code.

Eric
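
The plain-C approach for the UP case could look roughly like the sketch below. It is not Eric's actual patch: the struct is a cut-down stand-in for the kernel's struct i386_pda (only the two fields visible in the diff above, with pcurrent reduced to a void pointer), and the up_read_pda/up_write_pda macro names are hypothetical.

```c
#include <assert.h>

/* Cut-down stand-in for the kernel's struct i386_pda. */
struct i386_pda {
	int cpu_number;
	void *pcurrent;		/* struct task_struct * in the kernel */
};

/* On UP there is only one PDA, so it can be an ordinary global. */
static struct i386_pda boot_pda = { .cpu_number = 0, .pcurrent = 0 };

/* Hypothetical UP-only accessors: plain C, no inline asm, no segment prefix. */
#define up_read_pda(field)		(boot_pda.field)
#define up_write_pda(field, val)	((void)(boot_pda.field = (val)))

/* Example reader; the compiler is free to fold, cache, or reorder this load. */
static int current_cpu(void)
{
	int cpu = up_read_pda(cpu_number);

	assert(cpu >= 0);
	return cpu;
}
```

Because the accesses are ordinary loads and stores, the compiler can combine them with surrounding code, which an opaque asm() statement prevents.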

* Re: [PATCH] i386-pda UP optimization
  2006-11-29  9:30         ` Eric Dumazet
@ 2006-11-29  9:56           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 45+ messages in thread
From: Jeremy Fitzhardinge @ 2006-11-29  9:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: akpm, Arjan van de Ven, ak, mingo, linux-kernel

Eric Dumazet wrote:
> If !CONFIG_SMP, why even dereference boot_pda+PDA_cpu to get 0?
> And as PER_CPU(cpu_gdt_descr, %ebx) in !CONFIG_SMP doesn't need a value in 
> ebx, you can just do:
>
> #define CUR_CPU(reg) /* nothing */
>   

Yep.  On the other hand, I think that's an incredibly rare path anyway,
so it won't make any difference either way.

>> --- a/include/asm-i386/pda.h	Tue Nov 21 18:54:56 2006 -0800
>> +++ b/include/asm-i386/pda.h	Wed Nov 22 02:35:24 2006 -0800
>> @@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[];
>>
>>     
>
> My patch was better IMHO: we don't need to force asm() instructions to 
> perform regular C variable reads/writes in the !CONFIG_SMP case.
>
> Using plain C allows the compiler to generate better code.
>   

Probably, but I'm interested in comparing apples with apples; how much
do the actual segment prefixes make a difference?

    J

Thread overview: 45+ messages
2006-09-12  7:35 i386 PDA patches use of %gs Arjan van de Ven
2006-09-12  7:48 ` Jeremy Fitzhardinge
2006-09-12  7:56   ` Arjan van de Ven
2006-09-12  8:31     ` Jeremy Fitzhardinge
2006-11-15 11:27     ` [PATCH] i386-pda UP optimization Eric Dumazet
2006-11-15 11:32       ` Andi Kleen
2006-11-15 17:20         ` Ingo Molnar
2006-11-15 17:24           ` Andi Kleen
2006-11-15 17:46             ` Eric Dumazet
2006-11-15 17:49               ` Ingo Molnar
2006-11-15 17:58                 ` Eric Dumazet
2006-11-15 18:01                   ` Ingo Molnar
2006-11-21 11:38               ` Eric Dumazet
2006-11-21 21:42                 ` Jeremy Fitzhardinge
2006-11-21 21:52                   ` Andi Kleen
2006-11-21 22:10                     ` Jeremy Fitzhardinge
2006-11-21 21:58                   ` Eric Dumazet
2006-11-21 23:12                     ` Jeremy Fitzhardinge
2006-11-15 17:28           ` Jeremy Fitzhardinge
2006-11-15 17:32             ` Ingo Molnar
2006-11-15 17:59               ` Jeremy Fitzhardinge
2006-11-15 18:05                 ` Eric Dumazet
2006-11-15 18:28                   ` Jeremy Fitzhardinge
2006-11-15 18:31                     ` Ingo Molnar
2006-11-15 18:01             ` Arjan van de Ven
2006-11-15 18:24               ` Jeremy Fitzhardinge
2006-11-15 19:06                 ` Ingo Molnar
2006-11-17  0:24                   ` Jeremy Fitzhardinge
2006-11-15 17:52       ` Jeremy Fitzhardinge
2006-11-28 23:12       ` Jeremy Fitzhardinge
2006-11-29  9:30         ` Eric Dumazet
2006-11-29  9:56           ` Jeremy Fitzhardinge
2006-09-13  1:00 ` i386 PDA patches use of %gs Jeremy Fitzhardinge
2006-09-13  9:59   ` Ingo Molnar
2006-09-13 16:17     ` Jeremy Fitzhardinge
2006-11-15 18:26       ` Ingo Molnar
2006-11-15 18:29         ` Ingo Molnar
2006-11-15 18:43           ` Jeremy Fitzhardinge
2006-11-15 18:44             ` Ingo Molnar
2006-11-15 18:39         ` Jeremy Fitzhardinge
2006-11-15 18:43           ` Ingo Molnar
2006-11-15 18:49             ` Jeremy Fitzhardinge
2006-11-15 18:49               ` Ingo Molnar
2006-11-15 19:00                 ` Jeremy Fitzhardinge
2006-11-15 19:03                   ` Ingo Molnar
