Linux-Fsdevel Archive on
From: Michal Hocko <>
To: Johannes Weiner <>
Cc: Peter Zijlstra <>,
	Waiman Long <>,
	Andrew Morton <>,
	Vladimir Davydov <>,
	Jonathan Corbet <>,
	Alexey Dobriyan <>,
	Ingo Molnar <>,
	Juri Lelli <>,
	Vincent Guittot <>, ...
Subject: Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
Date: Wed, 9 Sep 2020 13:53:57 +0200	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

[Sorry, this slipped through the cracks]

On Mon 24-08-20 12:58:50, Johannes Weiner wrote:
> On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
> > Arguably seeing the rate drop to near 0 is a very good point to consider
> > running cgroup-OOM.
> Agreed. In the past, that's actually what we did: In cgroup1, you
> could disable the kernel OOM killer, and when reclaim failed at the
> limit, the allocating task would be put on a waitqueue until woken up
> by a freeing event. Conceptually this is clean & straight-forward.
> However,
> 1. Putting allocation contexts with unknown locks to indefinite sleep
>    caused deadlocks, for obvious reasons. Userspace OOM killing tends
>    to take a lot of task-specific locks when scanning through /proc
>    files for kill candidates, and can easily get stuck.
>    Using bounded over indefinite waits is simply acknowledging that
>    the deadlock potential when connecting arbitrary task stacks in the
>    system through free->alloc ordering is equally difficult to plan
>    out as alloc->free ordering.
>    The non-cgroup OOM killer actually has the same deadlock potential,
>    where the allocating/killing task can hold resources that the OOM
>    victim requires to exit. The OOM reaper hides it, the static
>    emergency reserves hide it - but to truly solve this problem, you
>    would have to have full knowledge of memory & lock ordering
>    dependencies of those tasks. And then can still end up with
>    scenarios where the only answer is panic().

Yes. Even killing all eligible tasks is not guaranteed to help the
situation, because a) resources might not be bound to a process's
lifetime (e.g. tmpfs), or b) an ineligible task might be holding
resources that block others from doing the proper cleanup. The OOM
reaper is there to make sure we reclaim some of the victim's address
space, and we iterate over all eligible tasks rather than getting stuck
on the first victim forever.
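The bounded-sleep idea Johannes raises above can be sketched roughly as
follows (illustrative Python, not kernel code; the timeout value is an
arbitrary example). The point is that a bounded wait caps how long an
allocator can block on a freeing event, so a stuck userspace OOM handler
cannot deadlock it the way the old cgroup1-style indefinite wait could:

```python
# Illustrative sketch (not kernel code): bounded vs. indefinite waits
# when reclaim has failed at the limit. The allocator waits for a
# freeing event, but never longer than `timeout`, so it cannot be
# parked forever while holding resources others may need.
import threading

free_event = threading.Condition()

def allocate_bounded(timeout=2.0):
    """Wait for a freeing event, but never longer than `timeout` seconds."""
    with free_event:
        woken = free_event.wait(timeout=timeout)
    # Whether woken by a free or by the timeout, we proceed: the limit
    # may be temporarily exceeded, but nothing blocks indefinitely.
    return "woken-by-free" if woken else "timed-out"

def free_memory():
    """A freeing event wakes any allocators currently waiting."""
    with free_event:
        free_event.notify_all()
```

With an indefinite wait (`free_event.wait()` with no timeout), the
allocator's progress depends entirely on someone else freeing memory,
which is exactly the free->alloc ordering hazard described above.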
> 2. I don't recall ever seeing situations in cgroup1 where the precise
>    matching of allocation rate to freeing rate has allowed cgroups to
>    run sustainably after reclaim has failed. The practical benefit of
>    a complicated feedback loop over something crude & robust once
>    we're in an OOM situation is not apparent to me.

Yes, the usual answer here is to go OOM and kill something. Running on
the very edge of (memcg) OOM doesn't tend to be sustainable, and I am
not sure it makes sense to optimize for.

>    [ That's different from the IO-throttling *while still doing
>      reclaim* that Dave brought up. *That* justifies the same effort
>      we put into dirty throttling. I'm only talking about the
>      situation where reclaim has already failed and we need to
>      facilitate userspace OOM handling. ]
> So that was the motivation for the bounded sleeps. They do not
> guarantee containment, but they provide a reasonable amount of time
> for the userspace OOM handler to intervene, without deadlocking.

Yes, memory.high is mostly best-effort containment. We do have the hard
limit to put a stop to runaways, and if you are watching PSI, the
high-limit throttling should give you enough signal to take action from
userspace.
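For context, "watching PSI" here means reading pressure-stall lines of
the form exposed by /proc/pressure/memory and cgroup2's memory.pressure.
A minimal sketch of parsing such a line and deciding whether userspace
should act (the 10.0 threshold is an arbitrary example, not a
recommendation):

```python
# Illustrative sketch: parse a PSI line like those in
# /proc/pressure/memory or a cgroup2 memory.pressure file, e.g.
#   some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
# and decide whether a userspace handler should take action.
def parse_psi_line(line):
    """Return (kind, stats) for one PSI line, e.g. ('some', {'avg10': 1.23, ...})."""
    kind, *fields = line.split()
    stats = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return kind, stats

def should_act(line, avg10_threshold=10.0):
    # Act when tasks have recently been stalled on memory for a
    # significant fraction of the time (avg10 is a percentage).
    kind, stats = parse_psi_line(line)
    return kind == "some" and stats["avg10"] >= avg10_threshold
```

In practice a userspace OOM handler would poll the pressure file (or use
PSI triggers) rather than parse one-off strings, but the decision logic
is of this shape.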

> That all being said, the semantics of the new 'high' limit in cgroup2
> have allowed us to move reclaim/limit enforcement out of the
> allocation context and into the userspace return path.
> See the call to mem_cgroup_handle_over_high() from
> tracehook_notify_resume(), and the comments in try_charge() around
> set_notify_resume().
> This already solves the free->alloc ordering problem by allowing the
> allocation to exceed the limit temporarily until at least all locks
> are dropped, we know we can sleep etc., before performing enforcement.
> That means we may not need the timed sleeps anymore for that purpose,
> and could bring back directed waits for freeing-events again.
> What do you think? Any hazards around indefinite sleeps in that resume
> path? It's called before __rseq_handle_notify_resume and the
> arch-specific resume callback (which appears to be a no-op currently).
> Chris, Michal, what are your thoughts? It would certainly be simpler
> conceptually on the memcg side.

I would need a more specific description. But as I've already said, it
doesn't seem that we need to fix any practical problem here. The high
limit implementation has changed quite a lot recently. I would rather
see it settle for a while and see how it behaves across a wider variety
of workloads before changing the implementation again.
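For readers following along, the deferred-enforcement pattern Johannes
describes above — try_charge() lets the allocation temporarily exceed
memory.high and merely flags the task, with the actual reclaim/throttling
work running later on the return-to-userspace path — can be sketched in
simplified form (illustrative Python, not kernel code; the names mirror
the kernel's but the mechanics are heavily reduced):

```python
# Simplified sketch of deferred memory.high enforcement: the charge
# path never sleeps or fails, it only sets a flag (analogous to
# set_notify_resume()); the work happens at a safe point later, when
# no kernel locks are held (analogous to mem_cgroup_handle_over_high()
# called from the userspace return path).
class Task:
    def __init__(self):
        self.over_high = False  # "work pending on return to userspace"

class Memcg:
    def __init__(self, high):
        self.high = high    # memory.high limit, in pages
        self.usage = 0      # current charge, in pages

    def try_charge(self, task, nr_pages):
        # Charge first, unconditionally: the limit may be exceeded
        # temporarily. Only flag the task for deferred enforcement.
        self.usage += nr_pages
        if self.usage > self.high:
            task.over_high = True
        return True

    def handle_over_high(self, task):
        # Runs on the return-to-userspace path: safe to sleep, reclaim,
        # or throttle here. Returns the number of pages "reclaimed".
        if not task.over_high:
            return 0
        reclaimed = max(self.usage - self.high, 0)
        self.usage -= reclaimed  # stand-in for actual reclaim
        task.over_high = False
        return reclaimed
```

This is what solves the free->alloc ordering problem in the quoted text:
by the time handle_over_high() runs, all locks taken during the
allocation have been dropped.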

Michal Hocko

Thread overview: 42+ messages
2020-08-17 14:08 Waiman Long
2020-08-17 14:08 ` [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Waiman Long
2020-08-17 14:30   ` Chris Down
2020-08-17 15:38     ` Waiman Long
2020-08-17 16:11       ` Chris Down
2020-08-17 16:44   ` Shakeel Butt
2020-08-17 16:56     ` Chris Down
2020-08-18 19:12       ` Waiman Long
2020-08-18 19:14     ` Waiman Long
2020-08-17 14:08 ` [RFC PATCH 2/8] memcg, mm: Return ENOMEM or delay if memcg_over_limit Waiman Long
2020-08-17 14:08 ` [RFC PATCH 3/8] memcg: Allow the use of task RSS memory as over-high action trigger Waiman Long
2020-08-17 14:08 ` [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file Waiman Long
2020-08-17 14:08 ` [RFC PATCH 5/8] memcg: Allow direct per-task memory limit checking Waiman Long
2020-08-17 14:08 ` [RFC PATCH 6/8] memcg: Introduce additional memory control slowdown if needed Waiman Long
2020-08-17 14:08 ` [RFC PATCH 7/8] memcg: Enable logging of memory control mitigation action Waiman Long
2020-08-17 14:08 ` [RFC PATCH 8/8] memcg: Add over-high action prctl() documentation Waiman Long
2020-08-17 15:26 ` [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control Michal Hocko
2020-08-17 15:55   ` Waiman Long
2020-08-17 19:26     ` Michal Hocko
2020-08-18 19:20       ` Waiman Long
2020-08-18  9:14 ` peterz
2020-08-18  9:26   ` Michal Hocko
2020-08-18  9:59     ` peterz
2020-08-18 10:05       ` Michal Hocko
2020-08-18 10:18         ` peterz
2020-08-18 10:30           ` Michal Hocko
2020-08-18 10:36             ` peterz
2020-08-18 13:49           ` Johannes Weiner
2020-08-21 19:37             ` Peter Zijlstra
2020-08-24 16:58               ` Johannes Weiner
2020-09-07 11:47                 ` Chris Down
2020-09-09 11:53                 ` Michal Hocko [this message]
2020-08-18 10:17       ` Chris Down
2020-08-18 10:26         ` peterz
2020-08-18 10:35           ` Chris Down
2020-08-23  2:49         ` Waiman Long
2020-08-18  9:27   ` Chris Down
2020-08-18 10:04     ` peterz
2020-08-18 12:55       ` Matthew Wilcox
2020-08-20  6:11         ` Dave Chinner
2020-08-18 19:30     ` Waiman Long
2020-08-18 19:27   ` Waiman Long
