LKML Archive on lore.kernel.org
* [PATCH v2 RESEND 1/2] mm: introduce memory.min
@ 2018-05-02 15:47 Roman Gushchin
2018-05-02 15:47 ` [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups Roman Gushchin
0 siblings, 1 reply; 6+ messages in thread
From: Roman Gushchin @ 2018-05-02 15:47 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, kernel-team, Roman Gushchin, Johannes Weiner,
Michal Hocko, Vladimir Davydov, Tejun Heo
The memory controller implements memory.low, a best-effort memory
protection mechanism that works well in many cases and
allows protecting the working sets of important workloads from
sudden reclaim.
But its semantics have a significant limitation: it works
only as long as there is a supply of reclaimable memory.
This makes it largely useless against slow memory leaks
or gradual memory usage increases, especially
on swapless systems. If swap is enabled, soft protection
effectively postpones the problem, allowing a leaking application
to fill the entire swap area, which makes no sense.
The only effective way to guarantee memory protection
in this case is to invoke the OOM killer.
It's possible to handle this case in userspace by reacting
to MEMCG_LOW events; but there is still a place for a fail-safe
in-kernel mechanism to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2
memory controller. It works very similarly to memory.low
(sharing the same hierarchical behavior), except that it's
not disabled if there is no more reclaimable memory in the system.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroup-v2.txt | 24 ++++++++-
include/linux/memcontrol.h | 15 ++++--
include/linux/page_counter.h | 11 +++-
mm/memcontrol.c | 118 ++++++++++++++++++++++++++++++++++---------
mm/page_counter.c | 63 ++++++++++++++++-------
mm/vmscan.c | 18 ++++++-
6 files changed, 199 insertions(+), 50 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 657fe1769c75..a413118b9c29 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1002,6 +1002,26 @@ PAGE_SIZE multiple when read back.
The total amount of memory currently being used by the cgroup
and its descendants.
+ memory.min
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ Hard memory protection. If the memory usage of a cgroup
+ is within its effective min boundary, the cgroup's memory
+ won't be reclaimed under any conditions. If there is no
+ unprotected reclaimable memory available, the OOM killer
+ is invoked.
+
+ Effective min boundary is limited by memory.min values of
+ all ancestor cgroups. If there is memory.min overcommitment
+ (child cgroup or cgroups are requiring more protected memory
+ than parent will allow), then each child cgroup will get
+ the part of parent's protection proportional to its
+ actual memory usage below memory.min.
+
+ Putting more memory than generally available under this
+ protection is discouraged and may lead to constant OOMs.
+
memory.low
A read-write single value file which exists on non-root
cgroups. The default is "0".
@@ -1013,9 +1033,9 @@ PAGE_SIZE multiple when read back.
Effective low boundary is limited by memory.low values of
all ancestor cgroups. If there is memory.low overcommitment
- (child cgroup or cgroups are requiring more protected memory,
+ (child cgroup or cgroups are requiring more protected memory
than parent will allow), then each child cgroup will get
- the part of parent's protection proportional to the its
+ the part of parent's protection proportional to its
actual memory usage below memory.low.
Putting more memory than generally available under this
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a2dfb1872dca..3b65d092614f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -59,6 +59,12 @@ enum memcg_memory_event {
MEMCG_NR_MEMORY_EVENTS,
};
+enum mem_cgroup_protection {
+ MEMCG_PROT_NONE,
+ MEMCG_PROT_LOW,
+ MEMCG_PROT_MIN,
+};
+
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
int priority;
@@ -297,7 +303,8 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
-bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
+enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
+ struct mem_cgroup *memcg);
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -756,10 +763,10 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg,
{
}
-static inline bool mem_cgroup_low(struct mem_cgroup *root,
- struct mem_cgroup *memcg)
+static inline enum mem_cgroup_protection mem_cgroup_protected(
+ struct mem_cgroup *root, struct mem_cgroup *memcg)
{
- return false;
+ return MEMCG_PROT_NONE;
}
static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 7902a727d3b6..bab7e57f659b 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -8,10 +8,16 @@
struct page_counter {
atomic_long_t usage;
- unsigned long max;
+ unsigned long min;
unsigned long low;
+ unsigned long max;
struct page_counter *parent;
+ /* effective memory.min and memory.min usage tracking */
+ unsigned long emin;
+ atomic_long_t min_usage;
+ atomic_long_t children_min_usage;
+
/* effective memory.low and memory.low usage tracking */
unsigned long elow;
atomic_long_t low_usage;
@@ -47,8 +53,9 @@ bool page_counter_try_charge(struct page_counter *counter,
unsigned long nr_pages,
struct page_counter **fail);
void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
-int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages);
int page_counter_memparse(const char *buf, const char *max,
unsigned long *nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db89468c231c..d298b06a7fad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4503,6 +4503,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
}
spin_unlock(&memcg->event_list_lock);
+ page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
memcg_offline_kmem(memcg);
@@ -4557,6 +4558,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_max(&memcg->memsw, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
+ page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
@@ -5294,6 +5296,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
}
+static int memory_min_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ unsigned long min = READ_ONCE(memcg->memory.min);
+
+ if (min == PAGE_COUNTER_MAX)
+ seq_puts(m, "max\n");
+ else
+ seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+ return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long min;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "max", &min);
+ if (err)
+ return err;
+
+ page_counter_set_min(&memcg->memory, min);
+
+ return nbytes;
+}
+
static int memory_low_show(struct seq_file *m, void *v)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5561,6 +5593,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_current_read,
},
+ {
+ .name = "min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = memory_min_show,
+ .write = memory_min_write,
+ },
{
.name = "low",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -5616,19 +5654,24 @@ struct cgroup_subsys memory_cgrp_subsys = {
};
/**
- * mem_cgroup_low - check if memory consumption is in the normal range
+ * mem_cgroup_protected - check if memory consumption is in the normal range
* @root: the top ancestor of the sub-tree being checked
* @memcg: the memory cgroup to check
*
* WARNING: This function is not stateless! It can only be used as part
* of a top-down tree iteration, not for isolated queries.
*
- * Returns %true if memory consumption of @memcg is in the normal range.
+ * Returns one of the following:
+ * MEMCG_PROT_NONE: cgroup memory is not protected
+ * MEMCG_PROT_LOW: cgroup memory is protected as long as there is
+ * an unprotected supply of reclaimable memory from other cgroups.
+ * MEMCG_PROT_MIN: cgroup memory is protected
*
- * @root is exclusive; it is never low when looked at directly
+ * @root is exclusive; it is never protected when looked at directly
*
- * To provide a proper hierarchical behavior, effective memory.low value
- * is used.
+ * To provide a proper hierarchical behavior, effective memory.min/low values
+ * are used. Below is the description of how effective memory.low is calculated.
+ * The effective memory.min value is calculated in the same way.
*
* Effective memory.low is always equal or less than the original memory.low.
* If there is no memory.low overcommittment (which is always true for
@@ -5673,51 +5716,78 @@ struct cgroup_subsys memory_cgrp_subsys = {
* E/memory.current = 0
*
* These calculations require constant tracking of the actual low usages
- * (see propagate_low_usage()), as well as recursive calculation of
- * effective memory.low values. But as we do call mem_cgroup_low()
+ * (see propagate_protected_usage()), as well as recursive calculation of
+ * effective memory.low values. But as we do call mem_cgroup_protected()
* path for each memory cgroup top-down from the reclaim,
* it's possible to optimize this part, and save calculated elow
* for next usage. This part is intentionally racy, but it's ok,
* as memory.low is a best-effort mechanism.
*/
-bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
+enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
{
- unsigned long usage, low_usage, siblings_low_usage;
- unsigned long elow, parent_elow;
struct mem_cgroup *parent;
+ unsigned long emin, parent_emin;
+ unsigned long elow, parent_elow;
+ unsigned long usage;
if (mem_cgroup_disabled())
- return false;
+ return MEMCG_PROT_NONE;
if (!root)
root = root_mem_cgroup;
if (memcg == root)
- return false;
+ return MEMCG_PROT_NONE;
- elow = memcg->memory.low;
usage = page_counter_read(&memcg->memory);
- parent = parent_mem_cgroup(memcg);
+ if (!usage)
+ return MEMCG_PROT_NONE;
+
+ emin = memcg->memory.min;
+ elow = memcg->memory.low;
+ parent = parent_mem_cgroup(memcg);
if (parent == root)
goto exit;
+ parent_emin = READ_ONCE(parent->memory.emin);
+ emin = min(emin, parent_emin);
+ if (emin && parent_emin) {
+ unsigned long min_usage, siblings_min_usage;
+
+ min_usage = min(usage, memcg->memory.min);
+ siblings_min_usage = atomic_long_read(
+ &parent->memory.children_min_usage);
+
+ if (min_usage && siblings_min_usage)
+ emin = min(emin, parent_emin * min_usage /
+ siblings_min_usage);
+ }
+
parent_elow = READ_ONCE(parent->memory.elow);
elow = min(elow, parent_elow);
+ if (elow && parent_elow) {
+ unsigned long low_usage, siblings_low_usage;
- if (!elow || !parent_elow)
- goto exit;
+ low_usage = min(usage, memcg->memory.low);
+ siblings_low_usage = atomic_long_read(
+ &parent->memory.children_low_usage);
- low_usage = min(usage, memcg->memory.low);
- siblings_low_usage = atomic_long_read(
- &parent->memory.children_low_usage);
-
- if (!low_usage || !siblings_low_usage)
- goto exit;
+ if (low_usage && siblings_low_usage)
+ elow = min(elow, parent_elow * low_usage /
+ siblings_low_usage);
+ }
- elow = min(elow, parent_elow * low_usage / siblings_low_usage);
exit:
+ memcg->memory.emin = emin;
memcg->memory.elow = elow;
- return usage && usage <= elow;
+
+ if (usage <= emin)
+ return MEMCG_PROT_MIN;
+ else if (usage <= elow)
+ return MEMCG_PROT_LOW;
+ else
+ return MEMCG_PROT_NONE;
}
/**
diff --git a/mm/page_counter.c b/mm/page_counter.c
index a5ff4cbc355a..de31470655f6 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,26 +13,38 @@
#include <linux/bug.h>
#include <asm/page.h>
-static void propagate_low_usage(struct page_counter *c, unsigned long usage)
+static void propagate_protected_usage(struct page_counter *c,
+ unsigned long usage)
{
- unsigned long low_usage, old;
+ unsigned long protected, old_protected;
long delta;
if (!c->parent)
return;
- if (!c->low && !atomic_long_read(&c->low_usage))
- return;
+ if (c->min || atomic_long_read(&c->min_usage)) {
+ if (usage <= c->min)
+ protected = usage;
+ else
+ protected = 0;
+
+ old_protected = atomic_long_xchg(&c->min_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_min_usage);
+ }
- if (usage <= c->low)
- low_usage = usage;
- else
- low_usage = 0;
+ if (c->low || atomic_long_read(&c->low_usage)) {
+ if (usage <= c->low)
+ protected = usage;
+ else
+ protected = 0;
- old = atomic_long_xchg(&c->low_usage, low_usage);
- delta = low_usage - old;
- if (delta)
- atomic_long_add(delta, &c->parent->children_low_usage);
+ old_protected = atomic_long_xchg(&c->low_usage, protected);
+ delta = protected - old_protected;
+ if (delta)
+ atomic_long_add(delta, &c->parent->children_low_usage);
+ }
}
/**
@@ -45,7 +57,7 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
long new;
new = atomic_long_sub_return(nr_pages, &counter->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/* More uncharges than charges? */
WARN_ON_ONCE(new < 0);
}
@@ -65,7 +77,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
long new;
new = atomic_long_add_return(nr_pages, &c->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* This is indeed racy, but we can live with some
* inaccuracy in the watermark.
@@ -109,7 +121,7 @@ bool page_counter_try_charge(struct page_counter *counter,
new = atomic_long_add_return(nr_pages, &c->usage);
if (new > c->max) {
atomic_long_sub(nr_pages, &c->usage);
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* This is racy, but we can live with some
* inaccuracy in the failcnt.
@@ -118,7 +130,7 @@ bool page_counter_try_charge(struct page_counter *counter,
*fail = c;
goto failed;
}
- propagate_low_usage(counter, new);
+ propagate_protected_usage(counter, new);
/*
* Just like with failcnt, we can live with some
* inaccuracy in the watermark.
@@ -190,6 +202,23 @@ int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages)
}
}
+/**
+ * page_counter_set_min - set the amount of protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
+{
+ struct page_counter *c;
+
+ counter->min = nr_pages;
+
+ for (c = counter; c; c = c->parent)
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
/**
* page_counter_set_low - set the amount of protected memory
* @counter: counter
@@ -204,7 +233,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
counter->low = nr_pages;
for (c = counter; c; c = c->parent)
- propagate_low_usage(c, atomic_long_read(&c->usage));
+ propagate_protected_usage(c, atomic_long_read(&c->usage));
}
/**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10c8a38c5eef..50055d72f294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2544,12 +2544,28 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
- if (mem_cgroup_low(root, memcg)) {
+ switch (mem_cgroup_protected(root, memcg)) {
+ case MEMCG_PROT_MIN:
+ /*
+ * Hard protection.
+ * If there is no reclaimable memory, OOM.
+ */
+ continue;
+ case MEMCG_PROT_LOW:
+ /*
+ * Soft protection.
+ * Respect the protection only as long as
+ * there is an unprotected supply
+ * of reclaimable memory from other cgroups.
+ */
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
}
memcg_memory_event(memcg, MEMCG_LOW);
+ break;
+ case MEMCG_PROT_NONE:
+ break;
}
reclaimed = sc->nr_reclaimed;
--
2.14.3
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups
2018-05-02 15:47 [PATCH v2 RESEND 1/2] mm: introduce memory.min Roman Gushchin
@ 2018-05-02 15:47 ` Roman Gushchin
2018-05-02 23:45 ` kbuild test robot
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Roman Gushchin @ 2018-05-02 15:47 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, kernel-team, Roman Gushchin, Johannes Weiner,
Michal Hocko, Vladimir Davydov, Tejun Heo
If a cgroup has no associated tasks, invoking the OOM killer
won't help release any memory, so respecting its memory.min
can lead to an infinite OOM loop or a system stall.
Let's ignore memory.min of unpopulated cgroups.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 50055d72f294..709237feddc1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2549,8 +2549,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
+ * Abandoned cgroups lose protection,
+ * because the OOM killer won't release any memory.
*/
- continue;
+ if (cgroup_is_populated(memcg->css.cgroup))
+ continue;
case MEMCG_PROT_LOW:
/*
* Soft protection.
--
2.14.3
* Re: [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups
2018-05-02 15:47 ` [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups Roman Gushchin
@ 2018-05-02 23:45 ` kbuild test robot
2018-05-03 1:08 ` kbuild test robot
2018-05-03 2:31 ` Matthew Wilcox
2 siblings, 0 replies; 6+ messages in thread
From: kbuild test robot @ 2018-05-02 23:45 UTC (permalink / raw)
To: Roman Gushchin
Cc: kbuild-all, linux-mm, linux-kernel, kernel-team, Roman Gushchin,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Tejun Heo
Hi Roman,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20180502]
[cannot apply to v4.17-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Roman-Gushchin/mm-introduce-memory-min/20180503-064145
base: git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x006-201817 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
mm/vmscan.c: In function 'shrink_node':
>> mm/vmscan.c:2555:34: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
if (cgroup_is_populated(memcg->css.cgroup))
^~
vim +2555 mm/vmscan.c
2520
2521 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
2522 {
2523 struct reclaim_state *reclaim_state = current->reclaim_state;
2524 unsigned long nr_reclaimed, nr_scanned;
2525 bool reclaimable = false;
2526
2527 do {
2528 struct mem_cgroup *root = sc->target_mem_cgroup;
2529 struct mem_cgroup_reclaim_cookie reclaim = {
2530 .pgdat = pgdat,
2531 .priority = sc->priority,
2532 };
2533 unsigned long node_lru_pages = 0;
2534 struct mem_cgroup *memcg;
2535
2536 memset(&sc->nr, 0, sizeof(sc->nr));
2537
2538 nr_reclaimed = sc->nr_reclaimed;
2539 nr_scanned = sc->nr_scanned;
2540
2541 memcg = mem_cgroup_iter(root, NULL, &reclaim);
2542 do {
2543 unsigned long lru_pages;
2544 unsigned long reclaimed;
2545 unsigned long scanned;
2546
2547 switch (mem_cgroup_protected(root, memcg)) {
2548 case MEMCG_PROT_MIN:
2549 /*
2550 * Hard protection.
2551 * If there is no reclaimable memory, OOM.
2552 * Abandoned cgroups are loosing protection,
2553 * because OOM killer won't release any memory.
2554 */
> 2555 if (cgroup_is_populated(memcg->css.cgroup))
2556 continue;
2557 case MEMCG_PROT_LOW:
2558 /*
2559 * Soft protection.
2560 * Respect the protection only as long as
2561 * there is an unprotected supply
2562 * of reclaimable memory from other cgroups.
2563 */
2564 if (!sc->memcg_low_reclaim) {
2565 sc->memcg_low_skipped = 1;
2566 continue;
2567 }
2568 memcg_memory_event(memcg, MEMCG_LOW);
2569 break;
2570 case MEMCG_PROT_NONE:
2571 break;
2572 }
2573
2574 reclaimed = sc->nr_reclaimed;
2575 scanned = sc->nr_scanned;
2576 shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
2577 node_lru_pages += lru_pages;
2578
2579 if (memcg)
2580 shrink_slab(sc->gfp_mask, pgdat->node_id,
2581 memcg, sc->priority);
2582
2583 /* Record the group's reclaim efficiency */
2584 vmpressure(sc->gfp_mask, memcg, false,
2585 sc->nr_scanned - scanned,
2586 sc->nr_reclaimed - reclaimed);
2587
2588 /*
2589 * Direct reclaim and kswapd have to scan all memory
2590 * cgroups to fulfill the overall scan target for the
2591 * node.
2592 *
2593 * Limit reclaim, on the other hand, only cares about
2594 * nr_to_reclaim pages to be reclaimed and it will
2595 * retry with decreasing priority if one round over the
2596 * whole hierarchy is not sufficient.
2597 */
2598 if (!global_reclaim(sc) &&
2599 sc->nr_reclaimed >= sc->nr_to_reclaim) {
2600 mem_cgroup_iter_break(root, memcg);
2601 break;
2602 }
2603 } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
2604
2605 if (global_reclaim(sc))
2606 shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
2607 sc->priority);
2608
2609 if (reclaim_state) {
2610 sc->nr_reclaimed += reclaim_state->reclaimed_slab;
2611 reclaim_state->reclaimed_slab = 0;
2612 }
2613
2614 /* Record the subtree's reclaim efficiency */
2615 vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
2616 sc->nr_scanned - nr_scanned,
2617 sc->nr_reclaimed - nr_reclaimed);
2618
2619 if (sc->nr_reclaimed - nr_reclaimed)
2620 reclaimable = true;
2621
2622 if (current_is_kswapd()) {
2623 /*
2624 * If reclaim is isolating dirty pages under writeback,
2625 * it implies that the long-lived page allocation rate
2626 * is exceeding the page laundering rate. Either the
2627 * global limits are not being effective at throttling
2628 * processes due to the page distribution throughout
2629 * zones or there is heavy usage of a slow backing
2630 * device. The only option is to throttle from reclaim
2631 * context which is not ideal as there is no guarantee
2632 * the dirtying process is throttled in the same way
2633 * balance_dirty_pages() manages.
2634 *
2635 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
2636 * count the number of pages under pages flagged for
2637 * immediate reclaim and stall if any are encountered
2638 * in the nr_immediate check below.
2639 */
2640 if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
2641 set_bit(PGDAT_WRITEBACK, &pgdat->flags);
2642
2643 /*
2644 * Tag a node as congested if all the dirty pages
2645 * scanned were backed by a congested BDI and
2646 * wait_iff_congested will stall.
2647 */
2648 if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
2649 set_bit(PGDAT_CONGESTED, &pgdat->flags);
2650
2651 /* Allow kswapd to start writing pages during reclaim.*/
2652 if (sc->nr.unqueued_dirty == sc->nr.file_taken)
2653 set_bit(PGDAT_DIRTY, &pgdat->flags);
2654
2655 /*
2656 * If kswapd scans pages marked marked for immediate
2657 * reclaim and under writeback (nr_immediate), it
2658 * implies that pages are cycling through the LRU
2659 * faster than they are written so also forcibly stall.
2660 */
2661 if (sc->nr.immediate)
2662 congestion_wait(BLK_RW_ASYNC, HZ/10);
2663 }
2664
2665 /*
2666 * Legacy memcg will stall in page writeback so avoid forcibly
2667 * stalling in wait_iff_congested().
2668 */
2669 if (!global_reclaim(sc) && sane_reclaim(sc) &&
2670 sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
2671 set_memcg_congestion(pgdat, root, true);
2672
2673 /*
2674 * Stall direct reclaim for IO completions if underlying BDIs
2675 * and node is congested. Allow kswapd to continue until it
2676 * starts encountering unqueued dirty pages or cycling through
2677 * the LRU too quickly.
2678 */
2679 if (!sc->hibernation_mode && !current_is_kswapd() &&
2680 current_may_throttle() && pgdat_memcg_congested(pgdat, root))
2681 wait_iff_congested(BLK_RW_ASYNC, HZ/10);
2682
2683 } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
2684 sc->nr_scanned - nr_scanned, sc));
2685
2686 /*
2687 * Kswapd gives up on balancing particular nodes after too
2688 * many failures to reclaim anything from them and goes to
2689 * sleep. On reclaim progress, reset the failure counter. A
2690 * successful direct reclaim run will revive a dormant kswapd.
2691 */
2692 if (reclaimable)
2693 pgdat->kswapd_failures = 0;
2694
2695 return reclaimable;
2696 }
2697
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
* Re: [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups
2018-05-02 15:47 ` [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups Roman Gushchin
2018-05-02 23:45 ` kbuild test robot
@ 2018-05-03 1:08 ` kbuild test robot
2018-05-03 2:31 ` Matthew Wilcox
2 siblings, 0 replies; 6+ messages in thread
From: kbuild test robot @ 2018-05-03 1:08 UTC (permalink / raw)
To: Roman Gushchin
Cc: kbuild-all, linux-mm, linux-kernel, kernel-team, Roman Gushchin,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Tejun Heo
Hi Roman,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20180502]
[cannot apply to v4.17-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Roman-Gushchin/mm-introduce-memory-min/20180503-064145
base: git://git.cmpxchg.org/linux-mmotm.git master
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
mm/vmscan.c: In function 'shrink_node':
>> mm/vmscan.c:2555:9: error: implicit declaration of function 'cgroup_is_populated'; did you mean 'cgroup_bpf_put'? [-Werror=implicit-function-declaration]
if (cgroup_is_populated(memcg->css.cgroup))
^~~~~~~~~~~~~~~~~~~
cgroup_bpf_put
mm/vmscan.c:2555:34: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
if (cgroup_is_populated(memcg->css.cgroup))
^~
cc1: some warnings being treated as errors
vim +2555 mm/vmscan.c
2520
2521 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
2522 {
2523 struct reclaim_state *reclaim_state = current->reclaim_state;
2524 unsigned long nr_reclaimed, nr_scanned;
2525 bool reclaimable = false;
2526
2527 do {
2528 struct mem_cgroup *root = sc->target_mem_cgroup;
2529 struct mem_cgroup_reclaim_cookie reclaim = {
2530 .pgdat = pgdat,
2531 .priority = sc->priority,
2532 };
2533 unsigned long node_lru_pages = 0;
2534 struct mem_cgroup *memcg;
2535
2536 memset(&sc->nr, 0, sizeof(sc->nr));
2537
2538 nr_reclaimed = sc->nr_reclaimed;
2539 nr_scanned = sc->nr_scanned;
2540
2541 memcg = mem_cgroup_iter(root, NULL, &reclaim);
2542 do {
2543 unsigned long lru_pages;
2544 unsigned long reclaimed;
2545 unsigned long scanned;
2546
2547 switch (mem_cgroup_protected(root, memcg)) {
2548 case MEMCG_PROT_MIN:
2549 /*
2550 * Hard protection.
2551 * If there is no reclaimable memory, OOM.
2552 * Abandoned cgroups are loosing protection,
2553 * because OOM killer won't release any memory.
2554 */
> 2555 if (cgroup_is_populated(memcg->css.cgroup))
2556 continue;
2557 case MEMCG_PROT_LOW:
2558 /*
2559 * Soft protection.
2560 * Respect the protection only as long as
2561 * there is an unprotected supply
2562 * of reclaimable memory from other cgroups.
2563 */
2564 if (!sc->memcg_low_reclaim) {
2565 sc->memcg_low_skipped = 1;
2566 continue;
2567 }
2568 memcg_memory_event(memcg, MEMCG_LOW);
2569 break;
2570 case MEMCG_PROT_NONE:
2571 break;
2572 }
2573
2574 reclaimed = sc->nr_reclaimed;
2575 scanned = sc->nr_scanned;
2576 shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
2577 node_lru_pages += lru_pages;
2578
2579 if (memcg)
2580 shrink_slab(sc->gfp_mask, pgdat->node_id,
2581 memcg, sc->priority);
2582
2583 /* Record the group's reclaim efficiency */
2584 vmpressure(sc->gfp_mask, memcg, false,
2585 sc->nr_scanned - scanned,
2586 sc->nr_reclaimed - reclaimed);
2587
2588 /*
2589 * Direct reclaim and kswapd have to scan all memory
2590 * cgroups to fulfill the overall scan target for the
2591 * node.
2592 *
2593 * Limit reclaim, on the other hand, only cares about
2594 * nr_to_reclaim pages to be reclaimed and it will
		 * retry with decreasing priority if one round over the
		 * whole hierarchy is not sufficient.
		 */
		if (!global_reclaim(sc) &&
		    sc->nr_reclaimed >= sc->nr_to_reclaim) {
			mem_cgroup_iter_break(root, memcg);
			break;
		}
	} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

	if (global_reclaim(sc))
		shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
			    sc->priority);

	if (reclaim_state) {
		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}

	/* Record the subtree's reclaim efficiency */
	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
		   sc->nr_scanned - nr_scanned,
		   sc->nr_reclaimed - nr_reclaimed);

	if (sc->nr_reclaimed - nr_reclaimed)
		reclaimable = true;

	if (current_is_kswapd()) {
		/*
		 * If reclaim is isolating dirty pages under writeback,
		 * it implies that the long-lived page allocation rate
		 * is exceeding the page laundering rate. Either the
		 * global limits are not being effective at throttling
		 * processes due to the page distribution throughout
		 * zones or there is heavy usage of a slow backing
		 * device. The only option is to throttle from reclaim
		 * context, which is not ideal as there is no guarantee
		 * the dirtying process is throttled in the same way
		 * balance_dirty_pages() manages.
		 *
		 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
		 * count the number of pages under writeback that are
		 * flagged for immediate reclaim and stall if any are
		 * encountered in the nr_immediate check below.
		 */
		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
			set_bit(PGDAT_WRITEBACK, &pgdat->flags);

		/*
		 * Tag a node as congested if all the dirty pages
		 * scanned were backed by a congested BDI and
		 * wait_iff_congested will stall.
		 */
		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
			set_bit(PGDAT_CONGESTED, &pgdat->flags);

		/* Allow kswapd to start writing pages during reclaim. */
		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
			set_bit(PGDAT_DIRTY, &pgdat->flags);

		/*
		 * If kswapd scans pages marked for immediate reclaim
		 * and under writeback (nr_immediate), it implies that
		 * pages are cycling through the LRU faster than they
		 * are written, so also forcibly stall.
		 */
		if (sc->nr.immediate)
			congestion_wait(BLK_RW_ASYNC, HZ/10);
	}

	/*
	 * Legacy memcg will stall in page writeback so avoid forcibly
	 * stalling in wait_iff_congested().
	 */
	if (!global_reclaim(sc) && sane_reclaim(sc) &&
	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
		set_memcg_congestion(pgdat, root, true);

	/*
	 * Stall direct reclaim for IO completions if the underlying
	 * BDIs and the node are congested. Allow kswapd to continue
	 * until it starts encountering unqueued dirty pages or cycling
	 * through the LRU too quickly.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
				 sc->nr_scanned - nr_scanned, sc));

/*
 * Kswapd gives up on balancing particular nodes after too
 * many failures to reclaim anything from them and goes to
 * sleep. On reclaim progress, reset the failure counter. A
 * successful direct reclaim run will revive a dormant kswapd.
 */
if (reclaimable)
	pgdat->kswapd_failures = 0;

return reclaimable;
}
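To make the kswapd heuristics above easier to follow, here is a standalone sketch of the node-flag decisions in the current_is_kswapd() branch. This is not kernel code: the struct and function names are invented for illustration, the counters mirror the sc->nr fields, and the booleans stand in for the PGDAT_WRITEBACK, PGDAT_CONGESTED and PGDAT_DIRTY bits and the congestion_wait() stall.

```c
#include <stdbool.h>

/* Mirrors the sc->nr counters gathered during one reclaim pass. */
struct nr_stats {
	unsigned long writeback;      /* pages found under writeback */
	unsigned long taken;          /* pages isolated from the LRU */
	unsigned long dirty;          /* dirty pages scanned */
	unsigned long congested;      /* dirty pages backed by a congested BDI */
	unsigned long unqueued_dirty; /* dirty pages not yet queued for writeback */
	unsigned long file_taken;     /* file pages isolated */
	unsigned long immediate;      /* pages marked for immediate reclaim */
};

/* Stand-ins for the pgdat flag bits and the forced stall. */
struct node_flags {
	bool writeback; /* PGDAT_WRITEBACK: allocation outpaces laundering */
	bool congested; /* PGDAT_CONGESTED: every scanned dirty page hit a congested BDI */
	bool dirty;     /* PGDAT_DIRTY: let kswapd write pages itself */
	bool stall;     /* kswapd would congestion_wait() */
};

/* Same decision logic as the current_is_kswapd() branch above. */
struct node_flags kswapd_flags(const struct nr_stats *nr)
{
	struct node_flags f = {0};

	/* All isolated pages were already under writeback. */
	if (nr->writeback && nr->writeback == nr->taken)
		f.writeback = true;
	/* Every dirty page scanned sat on a congested BDI. */
	if (nr->dirty && nr->dirty == nr->congested)
		f.congested = true;
	/* Every file page taken was dirty but unqueued. */
	if (nr->unqueued_dirty == nr->file_taken)
		f.dirty = true;
	/* Pages cycling through the LRU faster than they are written. */
	if (nr->immediate)
		f.stall = true;
	return f;
}
```

Note how each condition is an equality test against the total taken or dirty count: the flags fire only when *all* scanned pages exhibit the problem, which keeps a few stragglers from throttling the whole node.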
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
* Re: [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups
2018-05-02 15:47 ` [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups Roman Gushchin
2018-05-02 23:45 ` kbuild test robot
2018-05-03 1:08 ` kbuild test robot
@ 2018-05-03 2:31 ` Matthew Wilcox
2018-05-03 11:45 ` Roman Gushchin
2 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2018-05-03 2:31 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
Michal Hocko, Vladimir Davydov, Tejun Heo
On Wed, May 02, 2018 at 04:47:10PM +0100, Roman Gushchin wrote:
> + * Abandoned cgroups are loosing protection,
"losing".
* Re: [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups
2018-05-03 2:31 ` Matthew Wilcox
@ 2018-05-03 11:45 ` Roman Gushchin
0 siblings, 0 replies; 6+ messages in thread
From: Roman Gushchin @ 2018-05-03 11:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
Michal Hocko, Vladimir Davydov, Tejun Heo
On Wed, May 02, 2018 at 07:31:43PM -0700, Matthew Wilcox wrote:
> On Wed, May 02, 2018 at 04:47:10PM +0100, Roman Gushchin wrote:
> > + * Abandoned cgroups are loosing protection,
>
> "losing".
>
Fixed in v3.
Thanks!
end of thread, other threads:[~2018-05-03 11:47 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-02 15:47 [PATCH v2 RESEND 1/2] mm: introduce memory.min Roman Gushchin
2018-05-02 15:47 ` [PATCH v2 RESEND 2/2] mm: ignore memory.min of abandoned memory cgroups Roman Gushchin
2018-05-02 23:45 ` kbuild test robot
2018-05-03 1:08 ` kbuild test robot
2018-05-03 2:31 ` Matthew Wilcox
2018-05-03 11:45 ` Roman Gushchin