LKML Archive on lore.kernel.org
From: Steven Sistare <steven.sistare@oracle.com>
To: Matt Fleming <matt@codeblueprint.co.uk>,
	Peter Zijlstra <peterz@infradead.org>,
	subhra.mazumdar@oracle.com, rohit.k.jain@oracle.com
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org,
	dhaval.giani@oracle.com, umgwanakikbuti@gmail.com,
	riel@surriel.com
Subject: Re: [RFC 00/11] select_idle_sibling rework
Date: Wed, 20 Jun 2018 18:20:17 -0400	[thread overview]
Message-ID: <52394289-639e-38e0-727d-8b8d91e3dd35@oracle.com> (raw)
In-Reply-To: <20180619220639.GA14960@codeblueprint.co.uk>

On 6/19/2018 6:06 PM, Matt Fleming wrote:
> On Wed, 30 May, at 04:22:36PM, Peter Zijlstra wrote:
>> Hi all,
>>
>> This is all still very preliminary and could all still go up in flames (it has
>> only seen hackbench so far). This is mostly the same code I posted yesterday,
>> but hopefully in a more readable form.
>>
>> This fixes the SIS_PROP as per the outline here:
>>
>>   https://lkml.kernel.org/r/20180425153600.GA4043@hirez.programming.kicks-ass.net
>>
>> and Rohit's suggestion of folding the iteration loops.
>>
>> For testing I would suggest to ignore the last 3 patches, those are purely
>> cleanups once the first lot is found to actually work as advertised.
> 
> This series looks pretty good from my testing. I see double-digit
> improvements to hackbench results and only one case of a clear
> regression (easily offset by all the wins).
> 
> Are you aware of any regressions for particular benchmarks I should
> take a look at?

Hi folks,
  Just a heads up that I am working on a different patch series for
improving the utilization of idle CPUs.  After reviewing Subhra's series
and Peter's series for inspiration (thanks guys), I started from scratch.
My series improves efficiency on both the push side (find idle) and
the pull side (find a task), and is showing substantial speedups: a 10%
to 88% improvement in hackbench depending on load level.  I will send
an RFC soon, but here is a quick summary.

On the push side, I start by extending Rohit's proposal to use an idle
cpu mask.  I define a per-LLC bitmask of idle CPUs and a bitmask of idle
cores, maintained using atomic operations during idle_task transitions
with the help of a per-core busy-cpu counter.  select_idle_core,
select_idle_cpu, and select_idle_smt search the masks and claim bits
atomically.  However, to reduce contention amongst readers and writers
of the masks, I define a new sparsemask type which only uses 8 bits in
the first word of every 64 bytes; the remaining 63*8 bits are not used.
For example, a set that can hold 128 items is 128/8*64 = 1024 bytes
in size.  This reduces contention for the atomic operations that update
the mask, but still allows efficient traversal of the mask when searching
for set bits.  The number of bits per 64 bytes is a creation time parameter.
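As a rough illustration, the layout could look something like the
userspace sketch below.  All names and the non-atomic set/clear are
mine; the real patch would use the kernel's atomic bitops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Sketch of the sparsemask layout described above: each 64-byte chunk
 * (one cache line) holds only 8 usable bits, in its first byte, so
 * updates to bits in different chunks never contend on the same line.
 */
#define SMASK_CHUNK_BYTES    64   /* one cache line per chunk        */
#define SMASK_BITS_PER_CHUNK 8    /* creation-time density parameter */

struct sparsemask {
	unsigned int nbits;
	unsigned char chunks[];   /* nbits/8 chunks of 64 bytes each */
};

static size_t sparsemask_data_size(unsigned int nbits)
{
	/* e.g. 128 bits -> 128/8 * 64 = 1024 bytes, as in the text */
	return (size_t)nbits / SMASK_BITS_PER_CHUNK * SMASK_CHUNK_BYTES;
}

static struct sparsemask *sparsemask_alloc(unsigned int nbits)
{
	struct sparsemask *m;

	m = calloc(1, sizeof(*m) + sparsemask_data_size(nbits));
	if (m)
		m->nbits = nbits;
	return m;
}

/* Byte holding 'bit': first byte of the bit's 64-byte chunk. */
static unsigned char *sparsemask_byte(struct sparsemask *m, unsigned int bit)
{
	return &m->chunks[bit / SMASK_BITS_PER_CHUNK * SMASK_CHUNK_BYTES];
}

static void sparsemask_set(struct sparsemask *m, unsigned int bit)
{
	*sparsemask_byte(m, bit) |= 1u << (bit % SMASK_BITS_PER_CHUNK);
}

static void sparsemask_clear(struct sparsemask *m, unsigned int bit)
{
	*sparsemask_byte(m, bit) &= ~(1u << (bit % SMASK_BITS_PER_CHUNK));
}

static int sparsemask_test(struct sparsemask *m, unsigned int bit)
{
	return (*sparsemask_byte(m, bit) >> (bit % SMASK_BITS_PER_CHUNK)) & 1;
}
```

A traversal can skip any chunk whose first byte is zero, so a full
search touches at most nbits/8 cache lines.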

On the pull side, I define a per-LLC sparsemask of overloaded CPUs, which
are those with 2 or more runnable CFS tasks, updated whenever h_nr_running
rises above 1 or falls below 2.  Before pick_next_task_fair
calls idle_balance(), it tries to steal a task from an overloaded CPU,
using the mask to efficiently find candidates.  It first looks on
overloaded CPUs on the same core, then looks on all overloaded CPUs.
It steals the first migratable task it finds, searching each overloaded
CPU starting at the leftmost task on cfs_rq->tasks_timeline for fairness.
The cost to steal is very low compared to the cost of idle_balance.  If
stealing fails to find a task, idle_balance is called, with no changes to 
its algorithm.
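The control flow might look like this toy single-threaded model.  The
CPU numbering, the cpu^1 sibling mapping, and all helper names are
assumptions of the sketch, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the steal path described above: scan a bitmask of
 * overloaded CPUs, preferring a CPU on the same core, and pull one
 * task from the first candidate found.
 */
#define NR_CPUS 8

struct rq {
	int h_nr_running;           /* runnable CFS tasks on this CPU */
};

static struct rq rqs[NR_CPUS];
static uint64_t overload_mask;      /* a sparsemask in the real series */

static void update_overload(int cpu)
{
	if (rqs[cpu].h_nr_running >= 2)
		overload_mask |= 1ull << cpu;
	else
		overload_mask &= ~(1ull << cpu);
}

static int steal_from(int dst, int src)
{
	if (src == dst || !(overload_mask & (1ull << src)))
		return -1;
	rqs[src].h_nr_running--;    /* migrate one task src -> dst */
	rqs[dst].h_nr_running++;
	update_overload(src);
	update_overload(dst);
	return src;
}

/* Called before idle_balance(); returns the CPU stolen from, or -1. */
static int try_steal(int dst)
{
	int cpu;

	/* pass 1: an overloaded CPU on the same core (SMT sibling) */
	if (steal_from(dst, dst ^ 1) >= 0)
		return dst ^ 1;

	/* pass 2: any overloaded CPU in the LLC */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (steal_from(dst, cpu) >= 0)
			return cpu;

	return -1;                  /* fall back to idle_balance() */
}
```

In the real patch the claim on a candidate's task would of course need
to be properly synchronized, since several newly idle CPUs may race to
steal from the same overloaded CPU.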

To measure overhead, I added a timer around all the code that
updates masks, searches for idle CPUs, searches for overloaded CPUs,
and steals a task.  The sum of this time is exposed as a schedstat,
labelled as search_time below.  This is temporary, for evaluation purposes.
I added timers around select_idle_sibling in the baseline kernel for
comparison.
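The shape of that instrumentation is roughly the following sketch,
with local_clock() modeled by a fake monotonic counter and all names
illustrative:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bracket the mask-update/search/steal paths with timestamps and
 * accumulate the delta into a counter exposed as a schedstat.
 */
static uint64_t search_time_ns;     /* per-CPU in a real patch */
static uint64_t fake_now_ns;        /* stand-in for a monotonic clock */

static uint64_t local_clock(void)
{
	return fake_now_ns;
}

static int do_search(void)
{
	fake_now_ns += 500;         /* pretend the search took 500 ns */
	return 0;
}

static int timed_search(void)
{
	uint64_t t0 = local_clock();
	int ret = do_search();

	search_time_ns += local_clock() - t0;
	return ret;
}
```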

More testing is needed, but in my limited testing so far using hackbench,
all the metrics look good: throughput is up, CPU utilization is up, and
search time is down.

Test is "hackbench <groups> process 100000"
Configuration is a single node Xeon, 28 cores, 56 CPUs.
In the tables below:
  Time = elapsed time in seconds.  
         Number in parens is improvement vs baseline.  
  %util is the overall CPU utilization.
  %search_time is the fraction of CPU time spent in the scheduler code
    mentioned above.

baseline:

groups  tasks   Time              %util   %search_time
1       40      13.023            46.95   0.0
2       80      16.939            76.97   0.2
3       120     18.866            79.29   0.3
4       160     23.454            85.31   0.4
8       320     44.095            98.54   1.3

new:

groups  tasks   Time              %util   %search_time
1       40      13.171 (- 1.1%)   39.97   0.2
2       80       8.993 (+88.4%)   98.74   0.1
3       120     13.446 (+40.3%)   99.02   0.3
4       160     20.249 (+15.8%)   99.70   0.3
8       320     40.121 (+ 9.9%)   99.91   0.3
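For reference, the improvement column above is computed as
baseline_time/new_time - 1:

```c
#include <assert.h>
#include <math.h>

/* "Improvement vs baseline": how much faster the new kernel ran. */
static double improvement_pct(double baseline_s, double new_s)
{
	return (baseline_s / new_s - 1.0) * 100.0;
}
```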

For this workload, most of the improvement comes from the pull side changes,
but I still like the push side changes because they guarantee we find an
idle core or cpu if one is available, at low cost.  This is important if
an idle CPU has transitioned to the idle_task and is no longer trying to
steal work.

I only handle CFS tasks, but the same algorithms could be applied for RT.
The pushing and pulling only occurs within the LLC, but it could be
applied across nodes if the LLC search fails.  Those are potential future
patches if all goes well.  I will send a patch series for the CFS LLC work
soon.

- Steve


Thread overview: 14+ messages
2018-05-30 14:22 Peter Zijlstra
2018-05-30 14:22 ` [RFC 01/11] sched/fair: Fix select_idle_cpu()s cost accounting Peter Zijlstra
2018-05-30 14:22 ` [RFC 02/11] sched/fair: Age the average idle time Peter Zijlstra
2018-05-30 14:22 ` [RFC 03/11] sched/fair: Only use time once Peter Zijlstra
2018-05-30 14:22 ` [RFC 04/11] sched/topology: Introduce sched_domain_cores() Peter Zijlstra
2018-05-30 14:22 ` [RFC 05/11] sched/fair: Re-arrange select_idle_cpu() Peter Zijlstra
2018-05-30 14:22 ` [RFC 06/11] sched/fair: Make select_idle_cpu() proportional to cores Peter Zijlstra
2018-05-30 14:22 ` [RFC 07/11] sched/fair: Fold the select_idle_sibling() scans Peter Zijlstra
2018-05-30 14:22 ` [RFC 08/11] sched/fair: Optimize SIS_FOLD Peter Zijlstra
2018-05-30 14:22 ` [RFC 09/11] sched/fair: Remove SIS_AVG_PROP Peter Zijlstra
2018-05-30 14:22 ` [RFC 10/11] sched/fair: Remove SIS_AGE/SIS_ONCE Peter Zijlstra
2018-05-30 14:22 ` [RFC 11/11] sched/fair: Remove SIS_FOLD Peter Zijlstra
2018-06-19 22:06 ` [RFC 00/11] select_idle_sibling rework Matt Fleming
2018-06-20 22:20   ` Steven Sistare [this message]
