Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: David Ahern <dsahern@gmail.com>
To: Andrei Vagin <avagin@gmail.com>,
	Eugene Lubarsky <elubarsky.linux@gmail.com>
Cc: linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, adobriyan@gmail.com
Subject: Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
Date: Wed, 12 Aug 2020 22:47:32 -0600	[thread overview]
Message-ID: <ffc908a7-c94b-56e6-8bb6-c47c52747d77@gmail.com> (raw)
In-Reply-To: <20200812075135.GA191218@gmail.com>

On 8/12/20 1:51 AM, Andrei Vagin wrote:
> 
> I rebased the task_diag patches on top of v5.8:
> https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag

Thanks for updating the patches.

> 
> /proc/pid files have three major limitations:
> * Requires at least three syscalls per process per file
>   open(), read(), close()
> * Variety of formats, mostly text based
>   The kernel spent time to encode binary data into a text format and
>   then tools like top and ps spent time to decode them back to a binary
>   format.
> * Sometimes slow due to extra attributes
>   For example, /proc/PID/smaps contains a lot of useful informations
>   about memory mappings and memory consumption for each of them. But
>   even if we don't need memory consumption fields, the kernel will
>   spend time to collect this information.

that's what I recall as well.

> 
> More details and numbers are in this article:
> https://avagin.github.io/how-fast-is-procfs
> 
> This new interface doesn't have only one of these limitations, but
> task_diag doesn't have all of them.
> 
> And I compared how fast each of these interfaces:
> 
> The test environment:
> CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> RAM: 16GB
> kernel: v5.8 with task_diag and /proc/all patches.
> 100K processes:
> $ ps ax | wc -l
> 10228

100k processes but showing 10k here??

> 
> $ time cat /proc/all/status > /dev/null
> 
> real	0m0.577s
> user	0m0.017s
> sys	0m0.559s
> 
> task_proc_all is used to read /proc/pid/status for all tasks:
> https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c
> 
> $ time ./task_proc_all status
> tasks: 100230
> 
> real	0m0.924s
> user	0m0.054s
> sys	0m0.858s
> 
> 
> /proc/all/status is about 40% faster than /proc/*/status.
> 
> Now let's take a look at the perf output:
> 
> $ time perf record -g cat /proc/all/status > /dev/null
> $ perf report
> -   98.08%     1.38%  cat      [kernel.vmlinux]  [k] entry_SYSCALL_64
>    - 96.70% entry_SYSCALL_64
>       - do_syscall_64
>          - 94.97% ksys_read
>             - 94.80% vfs_read
>                - 94.58% proc_reg_read
>                   - seq_read
>                      - 87.95% proc_pid_status
>                         + 13.10% seq_put_decimal_ull_width
>                         - 11.69% task_mem
>                            + 9.48% seq_put_decimal_ull_width
>                         + 10.63% seq_printf
>                         - 10.35% cpuset_task_status_allowed
>                            + seq_printf
>                         - 9.84% render_sigset_t
>                              1.61% seq_putc
>                            + 1.61% seq_puts
>                         + 4.99% proc_task_name
>                         + 4.11% seq_puts
>                         - 3.76% render_cap_t
>                              2.38% seq_put_hex_ll
>                            + 1.25% seq_puts
>                           2.64% __task_pid_nr_ns
>                         + 1.54% get_task_mm
>                         + 1.34% __lock_task_sighand
>                         + 0.70% from_kuid_munged
>                           0.61% get_task_cred
>                           0.56% seq_putc
>                           0.52% hugetlb_report_usage
>                           0.52% from_kgid_munged
>                      + 4.30% proc_all_next
>                      + 0.82% _copy_to_user 
> 
> We can see that the kernel spent more than 50% of the time to encode binary
> data into a text format.
> 
> Now let's see how fast task_diag:
> 
> $ time ./task_diag_all all -c -q
> 
> real	0m0.087s
> user	0m0.001s
> sys	0m0.082s
> 
> Maybe we need resurrect the task_diag series instead of inventing
> another less-effective interface...

I think the netlink message design is the better way to go. As system
sizes continue to increase (> 100 cpus is common now) you need to be
able to pass the right data to userspace as fast as possible to keep up
with what can be a very dynamic userspace and set of processes.

When you first proposed this idea I was working on systems with >= 1k
cpus and the netlink option was able to keep up with a 'make -j N' on
those systems. `perf record` walking /proc would never finish
initializing - I had to add a "done initializing" message to know when
to start a test. With the task_diag approach, perf could collect the
data in short order and move on to recording data.


  reply	other threads:[~2020-08-13  4:47 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-10 14:58 Eugene Lubarsky
2020-08-10 14:58 ` [RFC PATCH 1/5] fs/proc: Introduce /proc/all/stat Eugene Lubarsky
2020-08-10 14:58 ` [RFC PATCH 2/5] fs/proc: Introduce /proc/all/statm Eugene Lubarsky
2020-08-10 14:58 ` [RFC PATCH 3/5] fs/proc: Introduce /proc/all/status Eugene Lubarsky
2020-08-10 14:58 ` [RFC PATCH 4/5] fs/proc: Introduce /proc/all/io Eugene Lubarsky
2020-08-10 14:58 ` [RFC PATCH 5/5] fs/proc: Introduce /proc/all/statx Eugene Lubarsky
2020-08-10 15:04 ` [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes Greg KH
2020-08-10 15:27   ` Eugene Lubarsky
2020-08-10 15:41     ` Greg KH
2020-08-25  9:59       ` Eugene Lubarsky
2020-08-12  7:51 ` Andrei Vagin
2020-08-13  4:47   ` David Ahern [this message]
2020-08-13  8:03     ` Andrei Vagin
2020-08-13 15:01   ` Eugene Lubarsky
2020-08-20 17:41     ` Andrei Vagin
2020-08-25 10:00       ` Eugene Lubarsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ffc908a7-c94b-56e6-8bb6-c47c52747d77@gmail.com \
    --to=dsahern@gmail.com \
    --cc=adobriyan@gmail.com \
    --cc=avagin@gmail.com \
    --cc=elubarsky.linux@gmail.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --subject='Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).