Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	syzbot <syzbot+96cc7aba7e969b1d305c@syzkaller.appspotmail.com>,
	syzkaller-bugs <syzkaller-bugs@googlegroups.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: INFO: task hung in pipe_read (2)
Date: Mon, 10 Aug 2020 15:29:41 -0400	[thread overview]
Message-ID: <20200810192941.GA16925@redhat.com> (raw)
In-Reply-To: <e673cccb-1b67-802a-84e3-6aeea4513a09@i-love.sakura.ne.jp>

Hello Tetsuo,

On Sat, Aug 08, 2020 at 10:01:21AM +0900, Tetsuo Handa wrote:
> use of killable waits disables ability to detect possibility of deadlock (because
> lockdep can't check possibility of deadlock which involves actions in userspace), for
> syzkaller process is SIGKILLed after 5 seconds while khungtaskd's timeout is 140 seconds.
> 
> If we encounter a deadlock in an unattended operation (e.g. some server process),
> we don't have a method for resolving the deadlock. Therefore, I consider that
> t->state == TASK_UNINTERRUPTIBLE check is a bad choice. Unless a sleep is neutral
> (e.g. no lock is held, or obviously safe to sleep with that specific lock held),
> sleeping for 140 seconds inside the kernel is a bad sign even if interruptible/killable.

Task in killable state for seconds as result of another task taking
too long to do something in kernel sounds bad, if the other task had a
legitimate reason to take a long time in normal operations, i.e. like
if the other task was just doing an getdents of a large directory.

Nobody force any app to use userfaultfd, if an app uses it and the
other side of the pipe trusts to read from it, and it gets stuck for
seconds in uninterruptible and killable state, it's either an app bug
resolvable with kill -9. We also can't enforce all signals to run in
presence of other bugs, for example if the task that won't respond to
any signal other than CONT and KILL was blocked in stopped state by a
buggy SIGSTOP. The pipe also can get stuck if the network is down and
it's swapping in from NFS and nobody is forced to take the risk of
using network attached storage as swap device either.

The hangcheck is currently correct to report a concern, because the
other side of the pipe may be another process of another user that
cannot SIGKILL the task blocked in the userfault. That sounds far
fetched and it's not particular concerning anyway, but it's not
technically impossible so I agree with the hangcheck timer reporting
an issue that needs correction.

However once the mutex is killable there's no concern anymore and the
hangcheck timer is correct also not reporting any misbehavior anymore.

Instead of userfaultfd, you can think at 100% kernel faults backed by
swapin from NFS or swaping from attached network storage or swapin
from scsi with a scsi fibre channel accidentally pulled out of a few
seconds. It's nice if uffd can survive as well as nfs or scsi would by
retrying and waiting more than 1sec.

> Can we do something like this?
> 
>   bool retried = false;
> retry:
>   lock();
>   disable_fault();
>   ret = access_memory_that_might_fault();
>   enable_fault();
>   if (ret == -EWOULDFAULT && !retried)
>     goto retry_without_lock;
>   if (ret == 0)
>     ret = do_something();
>   unlock();
>   return ret;
> retry_without_lock:
>   unlock();
>   ret = access_memory_that_might_fault();
>   retried = true;
>   goto retry;

This would work, but it'll make the kernel more complex than using a
killable mutex.

It'd also give a worse runtime than the killable mutex, if the only
source of blocking events while holding the mutex wouldn't be the page
fault.

With just 2 processes in this case probably it would be fine and there
are likely won't be other sources of contention, so the main cons is
just the code complexity to be maintained and the fact it won't
provide any measurable practical benefit, if something it'll run
slower by having to repeat the same fault in blocking and non blocking
mode.

With regard to the reporting of the hangcheck timer most modern paging
code uses killable mutex because unlike the pipe code, there can be
other sources of blockage and you don't want to wait for shared
resources to unblock a process that is waiting on a mutex. I think
trying to reduce the usage of killable mutex overall is a ship that
has sailed, it won't move the needle to just avoid it in pipe code
since it'll remain everywhere else.

So I'm certainly not against your proposal, but if we increase the
complexity like above then I'd find it more attractive if it was for
some other benefit unrelated to userfaultfd, or swapin from NFS or
network attached storage for that matter, and I don't see a big enough
benefit to justify it.

Thanks!
Andrea

PS. I'll be busy until Wed sorry if I don't answer promptly to
    followups. If somebody could give a try to add the killable mutex
    bailout failure paths that return to userland direct, or your more
    complex alternative it'd be great.


  reply	other threads:[~2020-08-10 19:29 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-01  8:45 syzbot
2020-08-01 15:29 ` Tetsuo Handa
2020-08-01 17:39   ` Linus Torvalds
2020-08-07  5:31     ` Andrea Arcangeli
2020-08-08  1:01       ` Tetsuo Handa
2020-08-10 19:29         ` Andrea Arcangeli [this message]
2020-08-13  7:00           ` Tetsuo Handa
2020-08-13 11:20             ` Tetsuo Handa
2020-08-22  4:34               ` [RFC PATCH] pipe: make pipe_release() deferrable Tetsuo Handa
2020-08-22 16:30                 ` Linus Torvalds
2020-08-23  0:21                   ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200810192941.GA16925@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dvyukov@google.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=syzbot+96cc7aba7e969b1d305c@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --subject='Re: INFO: task hung in pipe_read (2)' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).