LKML Archive on lore.kernel.org
From: "Li, Aubrey" <aubrey.li@linux.intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: mingo@redhat.com, peterz@infradead.org, hpa@zytor.com,
	ak@linux.intel.com, tim.c.chen@linux.intel.com,
	dave.hansen@intel.com, arjan@linux.intel.com,
	aubrey.li@intel.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
Date: Fri, 15 Feb 2019 12:35:51 +0800
Message-ID: <85b22e16-fffc-7ebe-2ab1-3b6fe7e036ab@linux.intel.com>
In-Reply-To: <alpine.DEB.2.21.1902141212400.1561@nanos.tec.linutronix.de>

On 2019/2/14 19:29, Thomas Gleixner wrote:
> On Wed, 13 Feb 2019, Aubrey Li wrote:
> 
>> AVX-512 components use could cause core turbo frequency drop. So
>> it's useful to expose AVX-512 usage elapsed time as a heuristic hint
>> for the user space job scheduler to cluster the AVX-512 using tasks
>> together.
>>
>> Example:
>> $ cat /proc/pid/status | grep AVX512_elapsed_ms
>> AVX512_elapsed_ms:      1020
>>
>> The number '1020' denotes 1020 millisecond elapsed since last time
>> context switch the off-CPU task using AVX-512 components, thus the
> 
> I know what you are trying to say, but this sentence does not parse. So
> what you want to say is:
> 
>   This means that 1020 milliseconds have elapsed since the AVX512 usage of
>   the task was detected when the task was scheduled out.

Thanks, will refine this.

> 
> Aside of that 1020ms is hardly evidence for real AVX512 usage, so you want
> to come up with a better example than that.

Oh, I wrote a simple benchmark that loops {AVX ops for a while, then non-AVX
ops for a while}, so this is expected. Yeah, I should use a real AVX512
workload. Below is the output while tensorflow trains a neural network model
to classify images (HZ = 250 on my side). I will change to use this example.

$ while [ 1 ]; do cat /proc/83226/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	0
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	0
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4
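
A minimal user-space sketch of the polling loop above, in Python rather than
shell. It assumes the AVX512_elapsed_ms field added by this patch is present in
/proc/<pid>/status; the helper names are hypothetical and not part of the
patch. With HZ = 250 one tick is 4 ms, which would explain why every sample
above is a multiple of 4, assuming the value is derived from jiffies.

```python
import re
from typing import Optional

# Field name exactly as it appears in the output above.
_FIELD = re.compile(r"^AVX512_elapsed_ms:[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def parse_avx512_elapsed_ms(status_text: str) -> Optional[int]:
    """Return AVX512_elapsed_ms from /proc/<pid>/status text, or None if absent."""
    match = _FIELD.search(status_text)
    return int(match.group(1)) if match else None


def read_avx512_elapsed_ms(pid: int) -> Optional[int]:
    """Read the field for a live task; None if the running kernel lacks it."""
    with open(f"/proc/{pid}/status") as f:
        return parse_avx512_elapsed_ms(f.read())


def tick_ms(hz: int) -> float:
    """Length of one timer tick in milliseconds for a given HZ."""
    return 1000.0 / hz
```

For example, parse_avx512_elapsed_ms("AVX512_elapsed_ms:\t1020\n") yields 1020,
and tick_ms(250) yields 4.0.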

> 
> But that makes me think about the usefulness of this hint in general.
> 
> A AVX512 using task which runs alone on a CPU, is going to have either no
> AVX512 usage recorded at all or the time elapsed since the last recording
> is absurdly long.

I did an experiment on this; please correct me if I am wrong.

I isolated CPU 103 and ran an AVX512 micro-benchmark (spinning on AVX512 ops) on it.

$ cat /proc/cmdline 
root=UUID=e6503b72-57d7-433a-ab09-a4b9a39e9128 ro isolcpus=103

I still saw context switches:
aubrey@aubrey-skl:~$ sudo trace-cmd report --cpu 103
cpus=104
        avx_demo-6985  [103]  5055.442432: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5055.442434: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5059.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5059.442432: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5063.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5063.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5067.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5067.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]

It looks like some kernel threads still take part in context switches on the
isolated CPU. As in the trace above, each CPU has one migration thread to do
migration jobs.

In this scenario the elapsed time does indeed become longer than normal, see below:
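
To make the interval in the trace above explicit, here is a rough Python sketch
that extracts the timestamps at which avx_demo is switched out and computes the
gaps between them. The helper names are hypothetical, and the regular
expression is written only for the sched_switch lines shown above, not for
trace-cmd report output in general.

```python
import re

# Matches lines such as:
#   avx_demo-6985  [103]  5055.442432: sched_switch:  avx_demo:6985 [120] R ==> migration/103:527 [0]
_SWITCH = re.compile(
    r"\[\d+\]\s+(?P<ts>\d+\.\d+):\s+sched_switch:\s+"
    r"(?P<prev>\S+):\d+\s+\[\d+\]\s+\S+\s+==>"
)


def switch_out_times(report: str, comm: str) -> list[float]:
    """Timestamps (seconds) at which task `comm` was switched out."""
    times = []
    for line in report.splitlines():
        m = _SWITCH.search(line)
        if m and m.group("prev") == comm:
            times.append(float(m.group("ts")))
    return times


def gaps(times: list[float]) -> list[float]:
    """Differences between consecutive timestamps, rounded to microseconds."""
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


# The eight sched_switch lines from the trace above.
TRACE = """\
        avx_demo-6985  [103]  5055.442432: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5055.442434: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5059.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5059.442432: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5063.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5063.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5067.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5067.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
"""

print(gaps(switch_out_times(TRACE, "avx_demo")))
```

On this trace the gaps come out at roughly 4 seconds each, and the migration
thread runs for only a microsecond or two each time.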

$ while [ 1 ]; do cat /proc/6985/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	3432
AVX512_elapsed_ms:	440
AVX512_elapsed_ms:	1444
AVX512_elapsed_ms:	2448
AVX512_elapsed_ms:	3456
AVX512_elapsed_ms:	460
AVX512_elapsed_ms:	1464
AVX512_elapsed_ms:	2468

But AFAIK, Google's Heracles polls every 15s, so this worst case is still acceptable?
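
The sawtooth in the samples above is consistent with the trace: avx_demo is
switched out about every 4 s, and the reported value grows by roughly 1004 ms
per one-second poll until the next switch-out resets it. A small Python sketch
of that arithmetic follows. It assumes a task that uses AVX512 continuously and
is switched out at a fixed period. The function names and the threshold rule
are illustrative assumptions, not something the patch or Heracles defines.

```python
def next_sample_ms(prev_elapsed_ms: int, step_ms: int, switch_period_ms: int) -> int:
    """Elapsed value expected at the next poll, if the timestamp is refreshed
    at every switch-out and switch-outs happen every switch_period_ms."""
    return (prev_elapsed_ms + step_ms) % switch_period_ms


def within_poll_window(elapsed_ms: int, poll_interval_ms: int = 15_000) -> bool:
    """Assumed rule: treat the task as a recent AVX512 user if the elapsed
    time is no larger than the poll interval of the user space scheduler."""
    return 0 <= elapsed_ms <= poll_interval_ms


# Samples from the isolated-CPU run above.
ISOLATED = [3432, 440, 1444, 2448, 3456, 460, 1464, 2468]

# 3456 ms plus one ~1004 ms poll step wraps past the 4000 ms switch period.
print(next_sample_ms(3456, 1004, 4000))
print(all(within_poll_window(v) for v in ISOLATED))
```

Under these assumptions the value stays below the 4000 ms switch period, so a
15 s poller would still see every sample inside its window.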

> IOW, this needs crystal ball magic to decode because
> there is no correlation between that elapsed time and the time when the
> last context switch happened simply because that time is not available in
> /proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
> even more crystal ball magic, but there is no explanation at all.
> 
> There may be use case scenarios where this crystal ball prediction is
> actually useful, but the inaccuracy of that information and the possible
> pitfalls for any user space application which uses it need to be documented
> in detail. Without that, this is going to cause more trouble and confusion
> than benefit.
> 
I am not sure whether the above experiment addresses your concern; please
correct me if I totally misunderstood.

Thanks,
-Aubrey

