LKML Archive on lore.kernel.org
From: Shakeel Butt <firstname.lastname@example.org>
To: Roman Gushchin <email@example.com>
Cc: Andrew Morton <firstname.lastname@example.org>,
Linux MM <email@example.com>,
Kernel Team <firstname.lastname@example.org>,
Johannes Weiner <email@example.com>,
Michal Hocko <firstname.lastname@example.org>, Rik van Riel <email@example.com>,
Christoph Lameter <firstname.lastname@example.org>,
Vladimir Davydov <email@example.com>
Subject: Re: [PATCH v3 0/7] mm: reparent slab memory on cgroup removal
Date: Fri, 10 May 2019 17:32:15 -0700 [thread overview]
Message-ID: <CALvZod4WGVVq+UY_TZdKP_PHdifDrkYqPGgKYTeUB6DsxGAdVw@mail.gmail.com> (raw)
On Wed, May 8, 2019 at 1:30 PM Roman Gushchin <firstname.lastname@example.org> wrote:
> # Why do we need this?
> We've noticed that the number of dying cgroups is steadily growing on most
> of our hosts in production. The following investigation revealed an issue
> in the userspace memory reclaim code [1], in the accounting of kernel
> stacks [2], and also the main reason: slab objects.
> The underlying problem is quite simple: any page charged to a cgroup holds
> a reference to it, so the cgroup can't be released unless all charged pages
> are gone. If a slab object is actively used by other cgroups, it won't be
> reclaimed, and will prevent the origin cgroup from being released.
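
(As an aside, for readers following along: the pinning described above works
roughly like the sketch below. This is a simplified illustration, not the
exact mainline code; css_get()/css_put() are the real cgroup refcounting
primitives, but charge_page()/uncharge_page() are hypothetical stand-ins
for the memcg charge paths.)

/* A charged page pins its memcg via a css reference. */
static void charge_page(struct page *page, struct mem_cgroup *memcg)
{
        css_get(&memcg->css);           /* page now pins the cgroup */
        page->mem_cgroup = memcg;
}

static void uncharge_page(struct page *page)
{
        struct mem_cgroup *memcg = page->mem_cgroup;

        page->mem_cgroup = NULL;
        css_put(&memcg->css);           /* last put lets a dying cgroup be freed */
}
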
> Slab objects, and first of all the vfs cache, are shared between cgroups
> that use the same underlying fs and, even more importantly, between
> multiple generations of the same workload. So if something runs
> periodically in a fresh cgroup each time (which is how systemd works), we
> accumulate multiple dying cgroups.
> Strictly speaking the pagecache isn't different here, but there is a key
> difference: we disable protection and apply some extra pressure on the LRUs
> of dying cgroups, and these LRUs contain all charged pages.

How do you apply extra pressure on dying cgroups? cgroup-v2 does not
have memory.force_empty.
> My experiments show that with kernel memory accounting disabled, the number
> of dying cgroups stabilizes at a relatively small number (~100, depending
> on memory pressure and the cgroup creation rate), while with kernel memory
> accounting it grows pretty steadily, up to several thousand.
> Memory cgroups are quite complex and big objects (mostly due to percpu
> stats), so this leads to noticeable memory losses. The memory occupied by
> dying cgroups is measured in hundreds of megabytes; I've even seen a host
> with more than 100GB of memory wasted on dying cgroups. This degrades
> performance over uptime and generally limits the usefulness of cgroups.
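
(For scale, a rough back-of-envelope with assumed numbers: if a memcg keeps
on the order of 100 percpu stat counters of 8 bytes each, a 64-CPU machine
pays roughly 100 * 8 * 64 = ~50KB of percpu memory per cgroup, so a few
thousand dying cgroups can pin a few hundred megabytes before even counting
the slab objects themselves. The counter count and CPU count here are
illustrative assumptions, not measured values.)
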
> My previous attempt [3] to fix the problem by applying extra pressure on
> the slab shrinker lists caused regressions with xfs and ext4, and has been
> reverted [4]. The following attempts to find the right balance [5, 6] were
> not successful.
> So instead of trying to find a possibly non-existent balance, let's
> reparent the accounted slabs to the parent cgroup on cgroup removal.
> # Implementation approach
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are on shrinker lists,
> but not all. Introducing a new list is really not an option.
> But fortunately there is a way forward: every slab page has a stable
> pointer to the corresponding kmem_cache. So the idea is to reparent
> kmem_caches instead of slab pages.
> It's actually simpler and cheaper, but requires some underlying changes
> (rough sketches follow below):
> 1) Make kmem_caches hold a single reference to the memory cgroup,
> instead of a separate reference per every slab page.
> 2) Stop setting the page->mem_cgroup pointer for memcg slab pages and use
> the page->kmem_cache->memcg indirection instead. It's used only on
> slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
> be able to destroy kmem_caches when they become empty and to release
> the associated memory cgroup.
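
To make the shape of 1) and 2) concrete, here is a rough, hypothetical
sketch; the field names loosely follow the 2019-era layout
(page->slab_cache, memcg_params.memcg), but these functions are
illustrative rather than the series' actual code, and locking plus the
root-cache case are omitted:

/* 1) The cache, not each slab page, pins the memcg. */
static void link_cache_to_memcg(struct kmem_cache *s,
                                struct mem_cgroup *memcg)
{
        css_get(&memcg->css);           /* single reference per cache */
        s->memcg_params.memcg = memcg;
}

/* 2) Resolve a slab page's memcg through its kmem_cache. */
static struct mem_cgroup *slab_page_memcg(struct page *page)
{
        return page->slab_cache->memcg_params.memcg;
}

/* Reparenting then reduces to switching the cache's memcg pointer. */
static void reparent_cache(struct kmem_cache *s,
                           struct mem_cgroup *parent)
{
        struct mem_cgroup *old = s->memcg_params.memcg;

        css_get(&parent->css);
        s->memcg_params.memcg = parent;
        css_put(&old->css);             /* the dying cgroup is unpinned */
}
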
> There is a bonus: currently we do release empty kmem_caches on cgroup
> removal, but all others have to wait for the memory cgroup itself to be
> released. These refactorings allow kmem_caches to be released as soon as
> they become inactive and free.
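
For 3), a plausible sketch assuming a percpu_ref per non-root cache (the
v2 changelog below mentions switching to a percpu refcounter); the release
path here is illustrative and skips the deferred-work details of the real
series:

#include <linux/percpu-refcount.h>

/*
 * Hypothetical release callback: runs once the last reference
 * (roughly, the last charged slab page) of the cache is dropped.
 */
static void kmem_cache_refcnt_release(struct percpu_ref *ref)
{
        struct kmem_cache *s = container_of(ref, struct kmem_cache,
                                            memcg_params.refcnt);

        css_put(&s->memcg_params.memcg->css);   /* unpin the memcg */
        /* ...and the now-empty cache itself can be destroyed */
}

static int init_cache_refcnt(struct kmem_cache *s)
{
        /* starts at 1; killed on deactivation, released on last put */
        return percpu_ref_init(&s->memcg_params.refcnt,
                               kmem_cache_refcnt_release, 0, GFP_KERNEL);
}
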
> Some additional implementation details are provided in the corresponding
> commit messages.
> # Results
> Below is the average number of dying cgroups on two groups of our
> production hosts. They run some sort of web frontend workload, and the
> memory pressure is moderate. As we can see, with kernel memory reparenting
> the number stabilizes in the 50s range; with the original version it grows
> almost linearly and shows no signs of plateauing. The difference in slab
> and percpu usage between the patched and unpatched versions also grows
> linearly; in 6 days it reached 200MB.
> day              0     1     2     3     4     5     6
> original        39   338   580   827  1098  1349  1574
> patched         23    44    45    47    50    46    55
> mem diff (MB)   53    73    99   137   148   182   209
> # History
> v3:
> 1) reworked memcg kmem_cache search on the allocation path
> 2) fixed the /proc/kpagecgroup interface
>
> v2:
> 1) switched to a percpu kmem_cache refcounter
> 2) a reference to the kmem_cache is held during allocation
> 3) slab stats are fixed for the !MEMCG case (and the refactoring
> is separated into a standalone patch)
> 4) kmem_cache reparenting is performed from the deactivation context
> # Links
> [1]: commit 68600f623d69 ("mm: don't miss the last page because of
> round-off error")
> [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
> [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively
> small number of objects")
> [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs
> with a relatively small number of objects"")
> [5]: https://lkml.org/lkml/2019/1/28/1865
> [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
> Roman Gushchin (7):
> mm: postpone kmem_cache memcg pointer initialization to
> memcg_link_cache()
> mm: generalize postponed non-root kmem_cache deactivation
> mm: introduce __memcg_kmem_uncharge_memcg()
> mm: unify SLAB and SLUB page accounting
> mm: rework non-root kmem_cache lifecycle management
> mm: reparent slab memory on cgroup removal
> mm: fix /proc/kpagecgroup interface for slab pages
> include/linux/memcontrol.h |  10 +++
> include/linux/slab.h       |  13 ++--
> mm/memcontrol.c            |  97 ++++++++++++++++--------
> mm/slab.c                  |  25 ++----
> mm/slab.h                  | 120 +++++++++++++++++++++--------
> mm/slab_common.c           | 151 ++++++++++++++++++++-----------------
> mm/slub.c                  |  36 ++-------
> 7 files changed, 267 insertions(+), 185 deletions(-)
Thread overview: 12+ messages
2019-05-11  0:32 ` Shakeel Butt [this message]
2019-05-13 20:21 ` Roman Gushchin
2019-05-14 19:22 ` Shakeel Butt
2019-05-14 20:04 ` Roman Gushchin
2019-05-11  0:32 ` [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Shakeel Butt
2019-05-11  0:33 ` [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation Shakeel Butt
2019-05-11  0:33 ` [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg() Shakeel Butt
2019-05-11  0:33 ` [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management Shakeel Butt
2019-05-11  0:34 ` [PATCH v3 6/7] mm: reparent slab memory on cgroup removal Shakeel Butt
2019-05-11  0:34 ` [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages Shakeel Butt
2019-05-11  0:33 ` [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting Shakeel Butt
2019-05-13 18:01 ` Christopher Lameter