LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFD] Provide virtualized CPU system information for containers
@ 2021-07-22 7:53 Pratik Sampat
2021-07-22 15:22 ` Eric W. Biederman
0 siblings, 1 reply; 3+ messages in thread
From: Pratik Sampat @ 2021-07-22 7:53 UTC (permalink / raw)
To: Linux Kernel Mailing List, containers, containers
Cc: legion, akpm, christian.brauner, ebiederm, hannes, mhocko,
Alexey Makhalov, llong, Pratik Sampat, pratik.r.sampat
Abstract
========
Today, applications that run on containers enforce their CPU and memory
limits, requirements with the help of cgroups. However, many applications
legacy or otherwise get the view of the system through sysfs/procfs and
allocate resources like number of threads/processes, memory allocation based
on that information. This can lead to unexpected running behaviors as well as
have a high impact on performance.
The problem is not only limited to the coherency of information. Cloud runtime
environments requests for CPU runtime in millicores[1], which translate to
using CFS period and quota to limit CPU runtime in cgroups. However, generally,
applications operate in terms of threads with little to no cognizance of the
millicore limit or its connotation.
The scope of the RFD, along with the experimental results is anchored
towards CPU system information, rather than the challenges posed by Memory
limits information or its likes in this proposal.
Problem Statement
=================
Provide Virtualized CPU system information to applications running within
the container semantics.
Experiments
===========
Picked a relatively common container application nginx[2] configured with
"worker_processes: auto"[3] (which ensures that the number of processes to spawn
will be derived from resources viewed on the system) and a benchmark/driver
application wrk[4]
Nginx: Nginx is a web server that can also be used as a reverse proxy, load
balancer, mail proxy and HTTP cache
Wrk: wrk is a modern HTTP benchmarking tool capable of generating significant
load when run on a single multi-core CPU
Docker is used as the containerization platform of choice.
For the scope of experimentation a fake sysfs (/sys/devices/system/cpu) is
mounted which encapsulates information in coherence with the limits set to
the container.
The aim of the experiment is to quantify the effects of incoherent information
on resources allocated as well as performance
System configuration1 -- Intel
1. Intel(R) Xeon(R) CPU E5-2470
2. CPUs: 32
3. Memory: 94Gi
System configuration2 -- IBM POWER
1. IBM POWER 9
2. CPUs: 176
3. Memory: 127GB
Exp1: Effects of incorrect CPU information with cpuset
------------------------------------------------------
See [12] for detailed stats -- POWER
See [13] for detailed stats -- Intel
Case1: The container has access to all the CPUs
Case2: cpuset limits set on nginx container to only "0-3". However, the default
sys/ and proc/ file systems display system CPUs
Case3: cpuset limits set to "0-3" and sysfs faked to give coherent information
pertaining to only 0-3
No significant improvement or degradation in terms of performance is observed.
Summary stats -- IBM POWER
+----------------+--------+--------+--------+
| Metric | Case 1 | Case 2 | Case 3 |
+----------------+--------+--------+--------+
| PIDs | 177 | 177 | 5 |
| mem usg (init) | 411.1 | 290.8 | 26.69 |
| mem usg (peak) | 662.8 | 295.3 | 30.69 |
+----------------+--------+--------+--------+
Summary stats -- Intel
+----------------+--------+--------+--------+
| Metric | Case 1 | Case 2 | Case 3 |
+----------------+--------+--------+--------+
| PIDs | 33 | 33 | 5 |
| mem usg (init) | 28.63 | 25.37 | 5.914 |
| mem usg (peak) | 40.14 | 30.7 | 9.914 |
+----------------+--------+--------+--------+
Observations -- Both platforms show the same trend in statistics:
1. The number of PIDs in case 3 are in coherence with the cpu limit provided.
4 worker threads + 1 Master thread, compared for the former cases where the
number of threads spawned were based on the CPUs on the system
2. The memory footprint dropped significantly from case1 to case3 just because
the application received a coherent view of the system
Exp2: Effects of Period and quota information
---------------------------------------------
See [14] for detailed stats -- POWER
See [15] for detailed stats -- Intel
Case1: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto - No limits
Case2: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 4 cpus - Exact CPUs
Case3: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 8 cpus - Overcommit of CPUs
Case4: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 8 cpus - Undercommit of CPUs
Summary statistics of the experiment -- IBM POWER:
+----------------+----------+----------+----------+----------+
| Metric | case1 | case2 | case3 | case4 |
+----------------+----------+----------+----------+----------+
| PIDs | 177 | 5 | 9 | 3 |
| mem usg (init) | 422.2 | 67.5 | 87.12 | 62.5 |
| mem usg (peak) | 571.4 | 130.6 | 131.6 | 85.38 |
| Throttle % | 96.8 | 20.12 | 97.08 | 0 |
| Requests/sec | 18849.97 | 66356.02 | 61121.65 | 35265.99 |
| Transfer/sec | 15.28 | 53.79 | 49.54 | 28.59 |
+----------------+----------+----------+----------+----------+
Summary statistics of the experiment -- Intel:
+----------------+----------+----------+----------+----------+
| Metric | case1 | case2 | case3 | case4 |
+----------------+----------+----------+----------+----------+
| PIDs | 33 | 5 | 9 | 3 |
| mem usg (init) | 29.12 | 7.574 | 10.83 | 6.07 |
| mem usg (peak) | 37.78 | 16.34 | 18.59 | 12.69 |
| Throttle % | 97.4 | 19.80 | 97.4 | 0 |
| Requests/sec | 32778.57 | 44754.85 | 42296.64 | 22500.00 |
| Transfer/sec | 26.57 | 36.28 | 34.28 | 18.24 |
+----------------+----------+----------+----------+----------+
Obervations -- Both platforms show the same trend in statistics:
When the CPU quota limit is set to run for the duration of 4 CPUs and,
Case1: Nginx spawns processes based on the view of the system then there is a
high amount of throttling, high memory footprint as well as low
performance
Case2: A fake sysfs is mounted to display 4 cpus, when period and quota
reflects 4 cpus worth of runtime then the throttling is the lowest as
well as the performance is the highest.
Also, memory footprint is seen to improve.
Case3: A fake sysfs is mounted to display 8 cpus i.e overcommit, then
throttling is seen to increase, while the throttle time is lesser
than case1, the throttle % is the same. Performance also drops as well
as higher memory footprint can be seen when compared to case 2 but
less than case 1
Case4: A fake sysfs is mounted to display 2 cpus i.e undercommit, There is
virtually no throttling to be observed as there is no contention.
The memory footprint is also the lowest, however the performance takes
a dip too and is the worst of all the cases
The above experiments show us that there is merit for applications observing
coherent information in terms of tasks spawned, memory footprint and
performance.
Existing solutions
==================
1. Why don't current applications look at the cgroupfs interface instead of the
old sys and procfs if they need coherent information?
Most of the information that applications seek from the traditional
filesystems is correctly populated in the cgroupfs and that applications
should modify their libraries to receive coherent information from there. This
is a strong argument and cannot be discounted, however it does present two
problems along with it.
a. There are a lot of applications that currently use the traditional
interface which can be range from legacy applications as well relatively
modern applications like nginx as we have seen. Therefore, the sheer volume
of applications and their libraries may make it difficult to implement this
currently.
b. Applications which previously didn't know the concept of millicores would
now have to incorporate that into their business logic for their thread
requirements as well by deriving and interpreting this information from CFS
period and quota
2. Userspace tools like LXCFS[5]
In the experiment above, to give a coherent view of the system we mounted
fake sysfs directories, which is precisely the modus operandi of LXFCS.
LXCFS is a userspace tools which uses FUSE filesystem to provide coherency of
information and mount cgroupfs based information in sys, procfs like:
/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu/*
/sys/devices/system/cpu/online
It is also capable to virtualize period and quota information with --enable-cfs
option[6]. It divides period by quota and the resulting number of CPUs "N" is
presented in /sys/devices/system/cpu/online as "0-N".
The benefit of LXCFS is that it is a light, relatively easy to setup userspace
tool which can be used by applications to get coherent information presented
from cgroupfs to sysfs. It does seem to be currently in use with
Kubenetes as described by Google Anthos[7] and the Alibab Cloud tutorial[8]
However, it does pose a couple of concerns too:
a. From a CPU point of view, when it comes to virtualizing of CPUs based on
periods and quotas will always lead to list of CPUs starting from 0 to N,
where N is the translation of number of CPUs it should get a runtime of.
The question aries if this can become an issue where the applications
depend of the CPU list itself, that it is task-setting or setting affinity
to those CPUs?
If that is possible, then in that case where there are multiple
container applications running with the same taskset CPUset; can experience
unwarranted throttling.
b. LXCFS is an external solution that needs to be explicitly setup for
applications that experience problems from incorrect information
in sys/procfs
Hence, I believe an argument can be made to have an in-kernel interface that
can virtualize CPU information and namespace each logical container into its
own view of the CPU topology.
3. Introduce a new interface to present information in-kernel
A patchset was suggested[9] which added /proc/self/meminfo which contained a
subset of /proc/meminfo respecting cgroup restrictions for the memory
incoherence problem.
This design can also be ported for the CPU view of the system too.
The advantage of this approach is that a new interface is setup
without overriding the current interfaces which enables us to not break any
assumptions already established on those sys and procfs interface.
However, this could turn out to be a potential disadvantage too.
As there can be two kinds of applications that the solution is currently
designed for:
a. Legacy applications
b. Newer applications that still look at traditional interfaces
For both a.) and b.) if they do not currently look at the cgroupfs interfaces;
then introduction of yet another interface may not be motivating enough to
modify their codebase to receive this information.
This argument was also presented by Christian Brauner in the same patchset[10],
while also highlighting overlapping points presented from this proposal.
Honorable mention: Kubenetes CPU manager[11]. The CPU manager is a feature for
QoS in container orchestration, here the CPU manager manages the cpuset given
exclusively to pods based on the requests of CPUs in its configuration.
While it is a nifty feature to manage cpuset information, it still does not
reflect this information in traditional sys/procfs interfaces and a LXCFS hook
is needed along with it for the same.
Proposed Solution -> CPU namespace
==================================
This RFD proposes the inclusion of a new namespace feature - CPU namespace.
A CPU namespace can present coherent system CPU information to the contain
applications that reside within it in accordance with the cgroups limits set
onto it. The namespace also virtualizes CPU information and can maintain an
internal translation from the namespace CPU to the logical CPU in the kernel.
Designing a namespace this way presents a coherent interface as well as is able
to cleanly abstract details about the system and it's configuration from the
higher level applications.
The advantage of this approach is also that this can be acheived without the
introduction of a new interface and by just reimagining the interpretation of
the existing sys and proc interfaces.
On the lines of namespaces, an alternative namespace that could also be
proposed is a sys/proc namespace that can virtualize information presented from
cgroupfs. It could be CPUs, memory, even other system topology. This would
resolve memory limits inconsistency issues as reported in [9]. However,
presenting CPU information this way does pose a challenge. There are metrics
like period and quota as discussed earlier which need to be derived to present
as CPUs as well as needs to be abstracted out. If a coherent interpretation of
these derived metrics can be agreed upon then the following could also be a
viable alternative.
The aim of the above proposal is to:
a. Garner perspective from the community around the problem, its implications
in the real world and the cementing a consensus if there is a need to
solving it
b. Spark a discussion around a potential solution
If a consensus can be reached, first towards acceptance of the problem and then
towards a coherent CPU namespace mechanism; I would gladly volunteer to help
in building it out.
Thanks,
Pratik Sampat
IBM, Linux Technology Center
[1]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[2]: https://docs.nginx.com/nginx/
[3]: http://nginx.org/en/docs/ngx_core_module.html#worker_processes
[4]: https://github.com/wg/wrk
[5]: https://linuxcontainers.org/lxcfs/
[6]: https://www.mankier.com/1/lxcfs#--enable-cfs
[7]: https://cloud.google.com/blog/products/containers-kubernetes/migrate-for-anthos-streamlines-legacy-java-app-modernization
[8]: https://www.alibabacloud.com/blog/kubernetes-demystified-using-lxcfs-to-improve-container-resource-visibility_594109
[9]: https://lore.kernel.org/lkml/ac070cd90c0d45b7a554366f235262fa5c566435.1622716926.git.legion@kernel.org/
[10]: https://lore.kernel.org/lkml/20210615113222.edzkaqfvrris4nth@wittgenstein/
[11]: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
[12]: POWER - EXP1: Effects of incorrect CPU information with cpuset
Case1: The container has access to all the CPUs (0-175)
IDLE container stat
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 411.1MiB / 127.5GiB 0.31% 2.29kB / 0B 0B / 8.19kB 177
PEAK WORKLOAD
pnginx 14383.42% 662.8MiB / 127.5GiB 0.51% 389MB / 2.11GB 0B / 8.19kB 177
Case2: cpuset limits set on nginx container to only "0-3". However the
default sys/ and proc/ file systems display 176 CPUs.
IDLE container stat
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 290.8MiB / 127.5GiB 0.22% 2.29kB / 0B 0B / 8.19kB 177
PEAK WORKLOAD
pnginx 399.21% 295.3MiB / 127.5GiB 0.23% 197MB / 1.1GB 0B / 8.19kB 177
Case3: cpuset limits set to "0-3" and sysfs faked to give coherent
information pertaining to only 0-3
IDLE container stat
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 26.69MiB / 127.5GiB 0.02% 2.22kB / 0B 0B / 8.19kB 5
PEAK WORKLOAD
pnginx 399.24% 30.69MiB / 127.5GiB 0.02% 183MB / 1.03GB 0B / 8.19kB 5
[13]: Intel - EXP1: Effects of incorrect CPU information with cpuset
Case1: The container has access to all the CPUs (0-31)
IDLE container stat
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 28.63MiB / 94.38GiB 0.03% 1.54kB / 0B 69.6kB / 8.19kB 33
PEAK WORKLOAD
pnginx 1562.51% 40.14MiB / 94.38GiB 0.04% 765MB / 4.08GB 0B / 8.19kB 33
Case2: cpuset limits set on nginx container to only "0-3". However the
default sys/ and proc/ file systems display 32 CPUs.
IDLE container stat
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 25.37MiB / 94.38GiB 0.03% 2.01kB / 0B 0B / 8.19kB 33
PEAK WORKLOAD
pnginx 406.82% 30.7MiB / 94.38GiB 0.03% 243MB / 1.36GB 0B / 8.19kB 33
Case3: cpuset limits set to "0-3" and sysfs faked to give coherent
information pertaining to only 0-3
IDLE container stat
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 5.914MiB / 94.38GiB 0.01% 2.08kB / 0B 0B / 8.19kB 5
PEAK WORKLOAD
pnginx 406.04% 9.914MiB / 94.38GiB 0.01% 251MB / 1.41GB 0B / 8.19kB 5
[14]: POWER - Exp2: Effects of Period and quota information
Case1: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto - No limits
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 422.2MiB / 127.5GiB 0.32% 2.36kB / 0B 0B / 8.19kB 177
--throttle stats--
nr_periods 7
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 391.18% 571.4MiB / 127.5GiB 0.44% 101MB / 561MB 0B / 8.19kB 177
--throttle stats--
nr_periods 313
nr_throttled 303
throttled_time 2168846281268
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.2:80/index.html
Running 30s test @ http://172.17.0.2:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 59.17ms 89.55ms 1.19s 88.62%
Req/Sec 4.75k 4.03k 27.79k 74.00%
567045 requests in 30.08s, 459.63MB read
Requests/sec: 18849.97
Transfer/sec: 15.28MB
Case2: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 4 cpus - Exact CPUs
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 67.5MiB / 127.5GiB 0.05% 2.29kB / 0B 0B / 8.19kB 5
--throttle stats--
nr_periods 5
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 398.36% 130.6MiB / 127.5GiB 0.10% 337MB / 1.9GB 0B / 8.19kB 5
--throttle stats--
nr_periods 308
nr_throttled 62
throttled_time 375890674
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.2:80/index.html
Running 30s test @ http://172.17.0.2:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 17.57ms 32.08ms 341.08ms 89.20%
Req/Sec 16.71k 1.26k 24.71k 78.17%
1996404 requests in 30.09s, 1.58GB read
Requests/sec: 66356.02
Transfer/sec: 53.79MB
Case3: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 8 cpus - Overcommit of CPUs
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 87.12MiB / 127.5GiB 0.07% 2.36kB / 0B 0B / 8.19kB 9
--throttle stats--
nr_periods 5
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 401.48% 131.6MiB / 127.5GiB 0.10% 300MB / 1.7GB 0B / 8.19kB 9
--throttle stats--
nr_periods 309
nr_throttled 300
throttled_time 119159115734
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.2:80/index.html
Running 30s test @ http://172.17.0.2:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 14.39ms 16.52ms 151.55ms 81.31%
Req/Sec 15.39k 0.91k 30.95k 90.08%
1838179 requests in 30.07s, 1.46GB read
Requests/sec: 61121.65
Transfer/sec: 49.54MB
Case4: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 2 cpus - Undercommit of CPUs
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 62.5MiB / 127.5GiB 0.05% 2.29kB / 0B 0B / 8.19kB 3
--throttle stats--
nr_periods 5
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 199.47% 85.38MiB / 127.5GiB 0.07% 170MB / 963MB 0B / 8.19kB 3
--throttle stats--
nr_periods 308
nr_throttled 0
throttled_time 0
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.2:80/index.html
Running 30s test @ http://172.17.0.2:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 159.81ms 251.64ms 1.05s 81.16%
Req/Sec 8.88k 1.89k 15.59k 71.00%
1060592 requests in 30.07s, 859.69MB read
Requests/sec: 35265.99
Transfer/sec: 28.59MB
[15]: Intel - Exp2: Effects of Period and quota information
Case1: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto - No limits
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 29.12MiB / 94.38GiB 0.03% 1.74kB / 0B 2.26MB / 8.19kB 33
--throttle stats--
nr_periods 5
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 403.43% 37.78MiB / 94.38GiB 0.04% 184MB / 912MB 2.26MB / 8.19kB 33
--throttle stats--
nr_periods 309
nr_throttled 301
throttled_time 506059002784
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.4:80/index.html
Running 30s test @ http://172.17.0.4:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 26.10ms 31.45ms 189.88ms 79.53%
Req/Sec 8.25k 1.67k 22.62k 79.92%
985441 requests in 30.06s, 798.78MB read
Requests/sec: 32778.57
Transfer/sec: 26.57MB
Case2: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 4 cpus - Exact CPUs
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 7.574MiB / 94.38GiB 0.01% 2.01kB / 0B 90.1kB / 8.19kB 5
--throttle stats--
nr_periods 5
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 408.06% 16.34MiB / 94.38GiB 0.02% 227MB / 1.28GB 90.1kB / 8.19kB 5
--throttle stats--
nr_periods 308
nr_throttled 61
throttled_time 100989735
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.4:80/index.html
Running 30s test @ http://172.17.0.4:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 26.47ms 48.54ms 448.54ms 89.32%
Req/Sec 11.26k 844.04 14.61k 68.67%
1344115 requests in 30.03s, 1.06GB read
Requests/sec: 44754.85
Transfer/sec: 36.28MB
Case3: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 8 cpus - Overcommit of CPUs
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 10.83MiB / 94.38GiB 0.01% 2.01kB / 0B 0B / 8.19kB 9
--throttle stats--
nr_periods 6
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 403.62% 18.59MiB / 94.38GiB 0.02% 236MB / 1.23GB 0B / 8.19kB 9
--throttle stats--
nr_periods 308
nr_throttled 300
throttled_time 11847978641
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.4:80/index.html
Running 30s test @ http://172.17.0.4:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 17.52ms 18.08ms 176.48ms 81.30%
Req/Sec 10.64k 692.48 19.12k 80.50%
1270019 requests in 30.03s, 1.01GB read
Requests/sec: 42296.64
Transfer/sec: 34.28MB
Case4: 4 CPUs worth of runtime (period: 100000us quota: 400000 us) ,
worker_processes: auto, fake sysfs to export 2 cpus - Undercommit of CPUs
Inital nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 0.00% 6.07MiB / 94.38GiB 0.01% 2.15kB / 0B 0B / 8.19kB 3
--throttle stats--
nr_periods 6
nr_throttled 0
throttled_time 0
Peak workload nginx stats
--docker stats--
NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
pnginx 202.32% 12.69MiB / 94.38GiB 0.01% 126MB / 681MB 0B / 8.19kB 3
--throttle stats--
nr_periods 308
nr_throttled 0
throttled_time 0
Benchmark stats
# ./wrk -t4 -c500 --latency -d30s http://172.17.0.4:80/index.html
Running 30s test @ http://172.17.0.4:80/index.html
4 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 237.39ms 385.12ms 1.49s 81.66%
Req/Sec 5.66k 1.24k 8.34k 63.42%
676025 requests in 30.05s, 547.97MB read
Requests/sec: 22500.00
Transfer/sec: 18.24MB
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [RFD] Provide virtualized CPU system information for containers
2021-07-22 7:53 [RFD] Provide virtualized CPU system information for containers Pratik Sampat
@ 2021-07-22 15:22 ` Eric W. Biederman
2021-07-26 11:39 ` Pratik Sampat
0 siblings, 1 reply; 3+ messages in thread
From: Eric W. Biederman @ 2021-07-22 15:22 UTC (permalink / raw)
To: Pratik Sampat
Cc: Linux Kernel Mailing List, containers, containers, legion, akpm,
christian.brauner, hannes, mhocko, Alexey Makhalov, llong,
pratik.r.sampat
As stated I think this idea is a non-starter.
There is a real problem that there are applications that have a
legitimate need to know what cpu resources are available for them to use
and we don't have a good interfaces for them to request that
information.
I think MESOS solved this by passing a MAX_CPUS environment variable,
and at least the JVM was modified to use that variable.
That said as situations can be a bit more dynamic and fluid having
something where an application can look and see what resources are
available from it's view of the world seems reasonable.
AKA we need something so applications can stop conflating physical
cpu resources that are available with cpu resources that are allowed
to be used in an application.
This might be as simple as implementing a /proc/self/cpus_available
file.
Without the will to go through find existing open source applications
that care and update them so that they will use the new interface I
don't think anything will really happen.
The problem I see with changing existing interfaces that describe the
hardware is that the definition becomes unclear and so different
applications can legitimately expect different things, and it would
become impossible to implement what is needed correctly.
The problem I see with using cgroup interfaces is that they are not
targeted at end user applications and but rather are targeted at the
problem of controlling access to a resource. Using them report what is
available again gets you into the multiple master problem. Especially
as cgroups may not be the only thing in the system controlling access to
your resource.
So I really think the only good solution that people won't mind is to go
through the applications figure out what information is legitimately
needed from an application perspective, and build an interface tailored
for applications to get that information.
Then applications can be updated to use the new interface, and as the
implementation of the system changes the implementation in the kernel
can be updated to keep the applications working.
Eric
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [RFD] Provide virtualized CPU system information for containers
2021-07-22 15:22 ` Eric W. Biederman
@ 2021-07-26 11:39 ` Pratik Sampat
0 siblings, 0 replies; 3+ messages in thread
From: Pratik Sampat @ 2021-07-26 11:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linux Kernel Mailing List, containers, containers, legion, akpm,
christian.brauner, hannes, mhocko, Alexey Makhalov, llong,
pratik.r.sampat
Thank you for your comments.
On 22/07/21 8:52 pm, Eric W. Biederman wrote:
> As stated I think this idea is a non-starter.
>
> There is a real problem that there are applications that have a
> legitimate need to know what cpu resources are available for them to use
> and we don't have a good interfaces for them to request that
> information.
>
> I think MESOS solved this by passing a MAX_CPUS environment variable,
> and at least the JVM was modified to use that variable.
>
> That said as situations can be a bit more dynamic and fluid having
> something where an application can look and see what resources are
> available from it's view of the world seems reasonable.
>
> AKA we need something so applications can stop conflating physical
> cpu resources that are available with cpu resources that are allowed
> to be used in an application.
>
> This might be as simple as implementing a /proc/self/cpus_available
> file.
>
> Without the will to go through find existing open source applications
> that care and update them so that they will use the new interface I
> don't think anything will really happen.
From a process granular point of view I believe a /proc/self approach solves
this problem at root. However, as you have stated too; applications will now
have to look at another interface for the correct information and that could
potentially be a challenge.
>
> The problem I see with changing existing interfaces that describe the
> hardware is that the definition becomes unclear and so different
> applications can legitimately expect different things, and it would
> become impossible to implement what is needed correctly.
In our experimentation and survey we found out that container applications which
were restricted based on a cgroup restriction - both cpuset or period/quota
benefited from coherent information. That was also my understanding with the
usage of tools like LXCFS in the userspace.
Would you happen to know if there are any applications that expect the full
hardware/topology view even though it itself is restricted in its usage?
>
> The problem I see with using cgroup interfaces is that they are not
> targeted at end user applications and but rather are targeted at the
> problem of controlling access to a resource. Using them report what is
> available again gets you into the multiple master problem. Especially
> as cgroups may not be the only thing in the system controlling access to
> your resource.
I agree, cgroup is a control interface and should not be used for presenting of
information and cgroups may not be the only thing in the system controlling
access to the resources.
This is where the idea for a different interface really stemmed from.
That although there are mechanisms to restrict and control usage, there is no
interface that presents information coherently to the userspace
> So I really think the only good solution that people won't mind is to go
> through the applications figure out what information is legitimately
> needed from an application perspective, and build an interface tailored
> for applications to get that information.
>
> Then applications can be updated to use the new interface, and as the
> implementation of the system changes the implementation in the kernel
> can be updated to keep the applications working.
I concur with this approach to build an application first interface.
My current frame of reference for the problems come from tools like LXCFS which
are built around the existing interfaces to present information and the
experiments were designed to quantify those shortcomings.
We could definitely use some help in understanding the shortcomings of the
current interfaces from people who use these applications.
--
Pratik
> Eric
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2021-07-26 11:39 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 7:53 [RFD] Provide virtualized CPU system information for containers Pratik Sampat
2021-07-22 15:22 ` Eric W. Biederman
2021-07-26 11:39 ` Pratik Sampat
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).