LKML Archive on lore.kernel.org
From: "Theodore Ts'o" <tytso@mit.edu>
To: Jan Kara <jack@suse.cz>
Cc: Paolo Valente <paolo.valente@linaro.org>,
	"Srivatsa S. Bhat" <srivatsa@csail.mit.edu>,
	linux-fsdevel@vger.kernel.org,
	linux-block <linux-block@vger.kernel.org>,
	linux-ext4@vger.kernel.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, axboe@kernel.dk, jmoyer@redhat.com,
	amakhalov@vmware.com, anishs@vmware.com, srivatsab@vmware.com
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
Date: Tue, 21 May 2019 12:48:14 -0400
Message-ID: <20190521164814.GC2591@mit.edu>
In-Reply-To: <20190520091558.GC2172@quack2.suse.cz>

On Mon, May 20, 2019 at 11:15:58AM +0200, Jan Kara wrote:
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?

Good point, yes, it can.  It depends on what cgroup the file system is
mounted in (and hence what cgroup the jbd2 kernel thread is in).  If it
was mounted in the root cgroup, then the jbd2 thread is going to be
completely unthrottled (except for the data=ordered writebacks, which
will be charged to the cgroup which writes those pages) so the only
thing which is nuking us will be the slice_idle timeout --- both for
the writebacks (which could get charged to N different cgroups, with
disastrous effects --- and this is going to be true for any file
system on a syncfs(2) call as well) and switching between the jbd2
thread's cgroup and the writeback cgroup.
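
To put a rough number on the slice_idle cost described above, here is a
back-of-the-envelope sketch in userspace C. It is not kernel code. It
assumes a deliberately simple model in which the scheduler idles for up
to one full slice_idle period each time it leaves a cgroup's queue; the
function name and the model itself are illustrative assumptions, not
taken from CFQ's source.

```c
#include <assert.h>
#include <limits.h>

/*
 * Toy model (assumption): one commit touches n_queues distinct cgroup
 * queues, and the scheduler idles up to slice_idle_ms on each of them
 * before switching to the next one.  Returns the worst-case time, in
 * milliseconds, spent idling rather than doing I/O.
 */
static unsigned int worst_case_idle_ms(unsigned int n_queues,
                                       unsigned int slice_idle_ms)
{
    /* Refuse inputs whose product would overflow unsigned int. */
    assert(slice_idle_ms == 0 || n_queues <= UINT_MAX / slice_idle_ms);
    return n_queues * slice_idle_ms;
}
```

Under this model, a commit whose writeback is spread over 16 cgroups
with a slice_idle of 8 ms could spend up to
worst_case_idle_ms(16, 8) = 128 ms idling, regardless of how little
data is actually written.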

One thing the I/O scheduler could do is use the synchronous flag as a
hint that it should ix-nay on the idle-way.  Or maybe we need to have
a different way to signal this to the jbd2 thread, since I do
recognize that this issue is ext4-specific, *because* we do the
transaction handling in a separate thread, and because of the
data=ordered scheme, both of which are unique to ext4.  So exempting
synchronous writes from cgroup control doesn't make sense for other
file systems.

So maybe a special flag meaning "entangled writes", where the
slice_idle hacks should get suppressed for the data=ordered
writebacks, but we still charge the block I/O to the relevant CSS's?
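
As a rough illustration of the "entangled writes" idea, the userspace
sketch below separates the two decisions: accounting always charges the
I/O to its cgroup, while the idling decision is skipped when the flag is
set. Everything here is hypothetical: TOY_BIO_ENTANGLED, struct toy_bio,
and both helpers are invented for this example, and no such flag exists
in the kernel's bio or request flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; these are NOT real kernel REQ_* flags. */
enum toy_bio_flags {
    TOY_BIO_SYNC      = 1u << 0,
    TOY_BIO_ENTANGLED = 1u << 1, /* part of a jbd2 commit's write set */
};

/* Minimal stand-in for a bio: flags, owning cgroup, and size. */
struct toy_bio {
    unsigned int flags;
    int cgroup_id;          /* index of the cgroup charged for the I/O */
    unsigned int bytes;
};

struct toy_cgroup_stats {
    unsigned long long charged_bytes;
};

/* Accounting is unconditional: entangled I/O is still charged. */
static void toy_charge(struct toy_cgroup_stats *stats, size_t n_cgroups,
                       const struct toy_bio *bio)
{
    assert(stats != NULL && bio != NULL);
    assert(bio->cgroup_id >= 0 && (size_t)bio->cgroup_id < n_cgroups);
    stats[bio->cgroup_id].charged_bytes += bio->bytes;
}

/*
 * Idling decision: this toy idles after every bio unless it is marked
 * entangled.  A real scheduler weighs many more factors.
 */
static bool toy_should_idle(const struct toy_bio *bio)
{
    assert(bio != NULL);
    return (bio->flags & TOY_BIO_ENTANGLED) == 0;
}
```

The point of keeping the two helpers separate is that suppressing the
idle wait does not exempt the write from cgroup accounting, which is the
property the proposal is trying to preserve.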

I could also imagine if there was some way that file system could
track whether all of the file system modifications were charged to a
single cgroup, we could in that case charge it to that cgroup?

> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time. But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.

The best I've come up with is some way of signalling that all of the
writes coming from the jbd2 commit are entangled, probably via a bio
flag.

If we don't have cgroup support, the other thing we could do is assume
that the jbd2 thread should always be in the root (unconstrained)
cgroup, and then force all writes, including data=ordered writebacks, to
be in the jbd2's cgroup.  But that would make the block cgroup
controls trivially bypassable by an application, which could just be
fsync-happy and exempt all of its buffered I/O writes from cgroup
control.  So that's probably not a great way to go --- but it would at
least fix this particular performance issue.  :-/

						- Ted
