LKML Archive on
help / color / mirror / Atom feed
From: Olof Johansson <>
To: Willy Tarreau <>
Cc: Mike Galbraith <>,,
	Peter Zijlstra <>,
	Ingo Molnar <>
Subject: Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads
Date: Sun, 10 Feb 2008 01:00:56 -0600	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> > 40M:
> > 2.6.22		time 94315 ms
> > 2.6.23		time 107930 ms
> > 2.6.24		time 113291 ms
> > 2.6.24-git19	time 110360 ms
> > 
> > So with more work per thread, the differences become less but they're
> > still there. At the 40M loop, with 500 threads it's quite a bit of
> > runtime per thread.
> No, it's really nothing. I had to push the loop to 1 billion to make the load
> noticeable. You don't have 500 threads, you have 2 threads and that load is
> repeated 500 times. And if we look at the numbers, let's take the worst one :
> > 40M:
> > 2.6.24                time 113291 ms
> 113291/500 = 227 microseconds/loop. This is still very low compared to the
> smallest timeslice you would have (1 ms at HZ=1000).
> So your threads are still completing *before* the scheduler has to preempt
> them.

Hmm? I get that to be 227ms per loop, which is way more than a full
timeslice. Running the program took in the range of 2 minutes, so it's
110000 milliseconds, not microseconds.

> > It seems generally unfortunate that it takes longer for a new thread to
> > move over to the second cpu even when the first is busy with the original
> > thread. I can certainly see cases where this causes suboptimal overall
> > system behaviour.
> In fact, I don't think it takes longer, I think it does not do it at their
> creation, but will do it immediately after the first slice is consumed. This
> would explain the important differences here. I don't know how we could ensure
> that the new thread is created on the second CPU from the start, though.

The math doesn't add up for me. Even if it rebalanced at the end of
the first slice (i.e. after 1ms), that would be a 1ms penalty per
iteration. With 500 threads that'd be a total penalty of 500ms.

> I tried inserting a sched_yield() at the top of the busy loop (1M loops).
> By default, it did not change a thing. Then I simply set sched_compat_yield
> to 1, and the two threads then ran simultaneously with a stable low time
> (2700 ms instead of 10-12 seconds).
> Doing so with 10k loops (initial test) shows times in the range 240-300 ms
> only instead of 2200-6500 ms.

Right, likely because the long-running cases got stuck at the busy loop
at the end, which would end up aborting quicker if the other thread got
scheduled for just a bit. It was a mistake to post that variant of the
testcase, it's not as relevant and doesn't mimic the original workload I
was trying to mimic as well as if the first loop was made larger.

> Ingo, would it be possible (and wise) to ensure that a new thread being
> created gets immediately rebalanced in order to emulate what is done here
> with sched_compat_yield=1 and sched_yield() in both threads just after the
> thread creation ? I don't expect any performance difference doing this,
> but maybe some shell scripts reliying on short-lived pipes would get faster
> on SMP.

There's always the tradeoff of losing cache warmth whenever a thread is
moved, so I'm not sure if it's a good idea to always migrate it at
creation time. It's not a simple problem, really.

> > I agree that the testcase is highly artificial. Unfortunately, it's
> > not uncommon to see these kind of weird testcases from customers tring
> > to evaluate new hardware. :( They tend to be pared-down versions of
> > whatever their real workload is (the real workload is doing things more
> > appropriately, but the smaller version is used for testing). I was lucky
> > enough to get source snippets to base a standalone reproduction case on
> > for this, normally we wouldn't even get copies of their binaries.
> I'm well aware of that. What's important is to be able to explain what is
> causing the difference and why the test case does not represent anything
> related to performance. Maybe the code author wanted to get 500 parallel
> threads and got his code wrong ?

I believe it started out as a simple attempt to parallelize a workload
that sliced the problem too low, instead of slicing it in larger chunks
and have each thread do more work at a time. It did well on 2.6.22 with
almost a 2x speedup, but did worse than the single-treaded testcase on a
2.6.24 kernel.

So yes, it can clearly be handled through explanations and education
and fixen the broken testcase, but I'm still not sure the new behaviour
is desired.


  reply	other threads:[~2008-02-10  6:59 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-09  0:04 Olof Johansson
2008-02-09  0:08 ` Ingo Molnar
2008-02-09  0:32   ` Olof Johansson
2008-02-09  7:58 ` Mike Galbraith
2008-02-09  8:03   ` Willy Tarreau
2008-02-09 10:58     ` Mike Galbraith
2008-02-09 11:40       ` Willy Tarreau
2008-02-09 13:37         ` Mike Galbraith
2008-02-09 16:19           ` Willy Tarreau
2008-02-09 17:33             ` Mike Galbraith
2008-02-10  5:29             ` Olof Johansson
2008-02-10  6:15               ` Willy Tarreau
2008-02-10  7:00                 ` Olof Johansson [this message]
2008-02-10  7:58                   ` Willy Tarreau
2008-02-11  8:15                   ` Mike Galbraith
2008-02-11 17:26                     ` Olof Johansson
2008-02-11 19:58                       ` Mike Galbraith
2008-02-11 20:31                         ` Olof Johansson
2008-02-12  9:23                           ` Mike Galbraith
2008-02-13  5:49                             ` Mike Galbraith
2008-02-11 21:45               ` Bill Davidsen
2008-02-12  4:30                 ` Mike Galbraith
     [not found] <>
2008-02-09  1:49 ` Robert Hancock

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \
    --subject='Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).