LKML Archive on lore.kernel.org
From: Paolo Valente <paolo.valente@linaro.org>
To: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu>
Cc: linux-fsdevel@vger.kernel.org,
	linux-block <linux-block@vger.kernel.org>,
	linux-ext4@vger.kernel.org, cgroups@vger.kernel.org,
	kernel list <linux-kernel@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>, Jan Kara <jack@suse.cz>,
	jmoyer@redhat.com, Theodore Ts'o <tytso@mit.edu>,
	amakhalov@vmware.com, anishs@vmware.com, srivatsab@vmware.com
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
Date: Thu, 23 May 2019 19:22:38 +0200
Message-ID: <2A58C239-EF3F-422B-8D87-E7A3B500C57C@linaro.org>
In-Reply-To: <6FE0A98F-1E3D-4EF6-8B38-2C85741924A4@linaro.org>


[-- Attachment #1.1: Type: text/plain, Size: 7728 bytes --]



> On 23 May 2019, at 11:19, Paolo Valente <paolo.valente@linaro.org> wrote:
> 
> 
> 
>> On 23 May 2019, at 04:30, Srivatsa S. Bhat <srivatsa@csail.mit.edu> wrote:
>> 
>> On 5/22/19 3:54 AM, Paolo Valente wrote:
>>> 
>>> 
>>>> On 22 May 2019, at 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> wrote:
>>>> 
>>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>>> 
>>>>> First, thank you very much for testing my patches, and, above all, for
>>>>> sharing those huge traces!
>>>>> 
>>>>> According to your traces, the residual 20% lower throughput that you
>>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>>> hundredths of a second to stabilize, at the beginning of the workload.
>>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>>> that you see without this new patch.  After that time, there
>>>>> seems to be no loss according to the trace.
>>>>> 
>>>>> The problem is that a loss lasting only a few hundredths of a second is
>>>>> nonetheless not negligible for a write workload that lasts only 3-4
>>>>> seconds.  Could you please try writing a larger file?
>>>>> 
>>>> 
>>>> I tried running dd for longer (about 100 seconds), but still saw around
>>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s and 1.6 MB/s with
>>>> mq-deadline and noop.
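
As a worked check of the figures just quoted, the relative gap between BFQ and the other schedulers can be computed directly. This is only a sketch: the helper name `relative_loss` is made up for illustration, and the inputs are the throughputs reported above (1.4 MB/s versus 1.5 to 1.6 MB/s).

```python
def relative_loss(measured: float, baseline: float) -> float:
    """Fraction by which `measured` throughput falls short of `baseline`."""
    return (baseline - measured) / baseline


bfq_mbps = 1.4  # BFQ throughput reported in this thread
for baseline_mbps in (1.5, 1.6):  # mq-deadline / noop range reported above
    loss = relative_loss(bfq_mbps, baseline_mbps)
    print(f"vs {baseline_mbps} MB/s: {loss:.1%} lower")
```

With these inputs the gap comes out between roughly 7% and 12.5%, which is in line with the "~10% loss" mentioned later in the thread.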
>>> 
>>> Ok, then now the cause is the periodic reset of the mechanism.
>>> 
>>> It would be super easy to fill this gap, by just gearing the mechanism
>>> toward a very aggressive injection.  The problem is maintaining
>>> control.  As you can imagine from the performance gap between CFQ (or
>>> BFQ with malfunctioning injection) and BFQ with this fix, it is very
>>> hard to succeed in maximizing the throughput while at the same time
>>> preserving control over per-group I/O.
>>> 
>> 
>> Ah, I see. Just to make sure that this fix doesn't overly optimize for
>> total throughput (because of the testcase we've been using) and end up
>> causing regressions in per-group I/O control, I ran a test with
>> multiple simultaneous dd instances, each writing to a different
>> portion of the filesystem (well separated, to induce seeks), and each
>> dd task bound to its own blkio cgroup. I saw similar results with and
>> without this patch, and the throughput was equally distributed among
>> all the dd tasks.
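
A minimal sketch of how "equally distributed" could be checked from per-task numbers, assuming all groups have the same weight. The function name `share_deviation` and the sample throughputs are hypothetical; no per-task figures were posted in this message.

```python
def share_deviation(throughputs):
    """Largest absolute gap between any task's share of the total
    throughput and the equal share 1/N (assumes equal group weights)."""
    total = sum(throughputs)
    fair_share = 1.0 / len(throughputs)
    return max(abs(t / total - fair_share) for t in throughputs)


# Hypothetical per-dd throughputs in KB/s, one entry per blkio cgroup.
samples = [350, 360, 345, 355]
print(f"max deviation from equal share: {share_deviation(samples):.3f}")
```

A value close to 0 means every task received about 1/N of the total; a value near 0.25 with two tasks would mean one task got three quarters of the bandwidth.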
>> 
> 
> Thank you very much for pre-testing this change; this lets me know in
> advance that I shouldn't find issues when I test for regressions, at
> the end of this change phase.
> 
>>> On the bright side, you might be interested in one of the benefits
>>> that BFQ gives in return for this ~10% loss of throughput, in a
>>> scenario that may be important for you (according to the affiliation
>>> you report): from ~500% to ~1000% higher throughput when you have to serve
>>> the I/O of multiple VMs, and to guarantee at least no starvation to
>>> any VM [1].  The same holds with multiple clients or containers, and
>>> in general with any set of entities that may compete for storage.
>>> 
>>> [1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
>>> 
>> 
>> Great article! :) Thank you for sharing it!
> 
> Thanks! I mentioned it just to better put things into context.
> 
>> 
>>>> But I'm not too worried about that difference.
>>>> 
>>>>> In addition, I wanted to ask you whether you measured BFQ throughput
>>>>> with traces disabled.  This may make a difference.
>>>>> 
>>>> 
>>>> The above result (1.4 MB/s) was obtained with traces disabled.
>>>> 
>>>>> After trying to write a larger file, you can try with low_latency on.
>>>>> On my side, it causes results to become a little unstable across
>>>>> repetitions (which is expected).
>>>>> 
>>>> With low_latency on, I get between 60 KB/s and 100 KB/s.
>>>> 
>>> 
>>> Gosh, full regression.  Fortunately, it is simply meaningless to use
>>> low_latency in a scenario where the goal is to guarantee per-group
>>> bandwidths.  Low-latency heuristics, to reach their (low-latency)
>>> goals, modify the I/O schedule compared to the best schedule for
>>> honoring group weights and boosting throughput.  So, as recommended in
>>> BFQ documentation, just switch low_latency off if you want to control
>>> I/O with groups.  It may still make sense to leave low_latency on
>>> in some specific case, which I don't want to bother you about.
>>> 
>> 
>> My main concern here is about Linux's I/O performance out-of-the-box,
>> i.e., with all default settings, which are:
>> 
>> - cgroups and blkio enabled (systemd default)
>> - blkio non-root cgroups in use (this is the implicit systemd behavior
>> if docker is installed; i.e., it runs tasks under user.slice)
>> - I/O scheduler with blkio group sched support: bfq
>> - bfq default configuration: low_latency = 1
>> 
>> If this yields a throughput that is 10x-30x slower than what is
>> achievable, I think we should either fix the code (if possible) or
>> change the defaults such that they don't lead to this performance
>> collapse (perhaps default low_latency to 0 if bfq group scheduling
>> is in use?)
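
The default configuration listed above can be inspected through sysfs. The following is a sketch, not a definitive tool: it assumes the usual layout `/sys/block/<dev>/queue/scheduler` (active scheduler shown in square brackets) and `/sys/block/<dev>/queue/iosched/low_latency` (present when bfq is active). The helper names and the device name `sda` are placeholders.

```python
from pathlib import Path


def active_scheduler(scheduler_line: str) -> str:
    """Return the bracketed (active) entry of a queue/scheduler line,
    e.g. 'mq-deadline kyber [bfq] none' -> 'bfq'."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Fallback: no brackets found, return the line as-is (e.g. 'none').
    return scheduler_line.strip()


def bfq_low_latency(dev: str, sysfs: Path = Path("/sys/block")):
    """Return (scheduler, low_latency); low_latency is None unless bfq is active."""
    queue = sysfs / dev / "queue"
    sched = active_scheduler((queue / "scheduler").read_text())
    if sched != "bfq":
        return sched, None
    return sched, int((queue / "iosched" / "low_latency").read_text())


if __name__ == "__main__":
    # 'sda' is a placeholder; substitute the device under test.
    print(bfq_low_latency("sda"))
```

Writing 0 to the same `low_latency` file (as root) is the way to switch the heuristics off for a test run, as suggested earlier in the thread.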
> 
> Yeah, I thought of this after sending my last email yesterday. Group
> scheduling and low-latency heuristics may simply happen to fight
> against each other in personal systems.  Let's proceed this way. I'll
> try first to make the BFQ low-latency mechanism clever enough to not
> hinder throughput when groups are in place.  If I succeed, then we
> will get the best of both worlds: group isolation and intra-group
> low latency, with no configuration change needed.  If I don't,
> I'll try to think of the best solution to cope with this non-trivial
> situation.
> 
> 
>>> However, I feel bad about such a low throughput :)  Would you be so
>>> kind as to provide me with a trace?
>>> 
>> Certainly! Short runs of dd resulted in a lot of variation in the
>> throughput (between 60 KB/s and 1 MB/s), so I increased dd's runtime
>> to get repeatable numbers (~70 KB/s). As a result, the trace file
>> (trace-bfq-boost-injection-low-latency-71KBps) is quite large, and
>> is available here:
>> 
>> https://www.dropbox.com/s/svqfbv0idcg17pn/bfq-traces.tar.gz?dl=0
>> 
> 
> Thank you very much for your patience and professional help.
> 
>> Also, I'm very happy to run additional tests or experiments to help
>> track down this issue. So, please don't hesitate to let me know if
>> you'd like me to try anything else or get you additional traces etc. :)
>> 
> 
> Here's to you!  :) I've attached a new small improvement that may
> reduce fluctuations (patch to apply on top of the others, of course).
> Unfortunately, I don't expect this change to boost the throughput
> though.
> 
> In contrast, I've thought of a solution that might be rather
> effective: making BFQ aware (heuristically) of trivial
> synchronizations between processes in different groups.  This will
> require a little more work and time.
> 

Hi Srivatsa,
I'm back :)

First, there was a mistake in the last patch I sent you, namely in
0001-block-bfq-re-sample-req-service-times-when-possible.patch.
Please don't apply that patch at all.

I've attached a new series of patches instead.  The first patch in this
series is a fixed version of the faulty patch above (if I'm creating too
much confusion, I'll resend you all the patches to apply on top of
mainline).

This series also implements the more effective idea I told you about a
few hours ago.  On my system, the loss is now only around 10%, even with
low_latency on.

Looking forward to your results,
Paolo


[-- Attachment #1.2: patches-with-waker-detection.tgz --]
[-- Type: application/octet-stream, Size: 2956 bytes --]

[-- Attachment #1.3: Type: text/plain, Size: 162 bytes --]



> 
> Thanks,
> Paolo
> 
> <0001-block-bfq-re-sample-req-service-times-when-possible.patch.gz>
> 
>> Thank you!
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
