LKML Archive on lore.kernel.org
* [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
@ 2015-03-25  6:17 Johannes Weiner
  2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
                   ` (12 more replies)
  0 siblings, 13 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

Hi everybody,

in the recent past we've had several reports and discussions on how to
deal with allocations hanging in the allocator upon OOM.

The idea of this series is mainly to make the mechanism of detecting
OOM situations reliable enough that we can be confident about failing
allocations, and then leave the fallback strategy to the caller rather
than looping forever in the allocator.

The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
least for the short term while we don't have a reservation system yet.

Here is a breakdown of the proposed changes:

 mm: oom_kill: remove pointless locking in oom_enable()
 mm: oom_kill: clean up victim marking and exiting interfaces
 mm: oom_kill: remove misleading test-and-clear of known TIF_MEMDIE
 mm: oom_kill: remove pointless locking in exit_oom_victim()
 mm: oom_kill: generalize OOM progress waitqueue
 mm: oom_kill: simplify OOM killer locking
 mm: page_alloc: inline should_alloc_retry() contents

These are preparatory patches that clean up parts of the OOM killer
and the page allocator.  Filesystem folks and others who only care
about allocation semantics may want to skip over these.

 mm: page_alloc: wait for OOM killer progress before retrying

One of the hangs we have seen reported is from lower order allocations
that loop infinitely in the allocator.  In an attempt to address that,
it has been proposed to limit the number of retry loops - possibly
even make that number configurable from userspace - and return NULL
once we are certain that the system is "truly OOM".  But it wasn't
clear how high that number needs to be to reliably determine a global
OOM situation from the perspective of an individual allocation.

An issue is that OOM killing is currently an asynchronous operation,
and the optimal retry count depends on how long it takes an OOM kill
victim to exit and release its memory - which of course varies with
system load and the exiting task.

To address this, this patch makes OOM killing synchronous and returns
to the allocator only once the victim has actually exited.  With that,
the allocator no longer requires retry loops just to poll for the
victim releasing memory.

 mm: page_alloc: private memory reserves for OOM-killing allocations

Once out_of_memory() is synchronous, there are still two issues that
can make determining system-wide OOM from a single allocation context
unreliable.  For one, concurrent allocations can swoop in right after
a kill and steal the memory, causing spurious allocation failures for
contexts that actually freed memory.  But also, the OOM victim could
get blocked on some state that the allocation is holding, which would
delay the release of the memory (and refilling of the reserves) until
after the allocation has completed.

This patch creates private reserves for allocations that have issued
an OOM kill.  Once these reserves run dry, it seems reasonable to
assume that other allocations are no longer succeeding either.

 mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations

__GFP_NOFAIL allocations exacerbate the victim-stuck-behind-allocation
scenario, because they will actually deadlock.  To avoid this, or at
least try to, give __GFP_NOFAIL allocations access to not just the OOM
reserves but also the system's emergency reserves.

This is basically a poor man's reservation system, which could or
should be replaced later on with an explicit reservation system that
e.g. filesystems have control over for use by transactions.

It's obviously not bulletproof and might still lock up, but it should
greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
been using this successfully for some time.

 mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM

Another hang that was reported was from NOFS allocations.  The trouble
with these is that they can't issue or wait for writeback during page
reclaim, and so we don't want to OOM kill on their behalf.  However,
with such restrictions on making progress, they are prone to hangs.

This patch makes NOFS allocations fail if reclaim can't free anything.

It would be good if the filesystem people could weigh in on whether
they can deal with failing GFP_NOFS allocations, or annotate the
exceptions with __GFP_NOFAIL etc.  It could well be that a middle
ground is required that allows using the OOM killer before giving up.

 mm: page_alloc: do not lock up low-order allocations upon OOM

With both OOM killing and "true OOM situation" detection more
reliable, this patch finally allows allocations up to order 3 to
actually fail on OOM and leave the fallback strategy to the caller -
as opposed to the current policy of hanging in the allocator.

Comments?

 drivers/staging/android/lowmemorykiller.c |   2 +-
 include/linux/mmzone.h                    |   2 +
 include/linux/oom.h                       |  12 +-
 kernel/exit.c                             |   2 +-
 mm/internal.h                             |   3 +-
 mm/memcontrol.c                           |  20 +--
 mm/oom_kill.c                             | 167 +++++++-----------------
 mm/page_alloc.c                           | 189 +++++++++++++---------------
 mm/vmstat.c                               |   2 +
 9 files changed, 154 insertions(+), 245 deletions(-)


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26  0:51   ` David Rientjes
                     ` (2 more replies)
  2015-03-25  6:17 ` [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces Johannes Weiner
                   ` (11 subsequent siblings)
  12 siblings, 3 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

Setting oom_killer_disabled to false is atomic; there is no need for
further synchronization with ongoing allocations trying to OOM-kill.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2b665da1b3c9..73763e489e86 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,9 +488,7 @@ bool oom_killer_disable(void)
  */
 void oom_killer_enable(void)
 {
-	down_write(&oom_sem);
 	oom_killer_disabled = false;
-	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
-- 
2.3.3



* [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
  2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26  3:34   ` David Rientjes
  2015-03-26 11:54   ` Michal Hocko
  2015-03-25  6:17 ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Johannes Weiner
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

Rename unmark_oom_victim() to exit_oom_victim().  Marking and
unmarking are related in functionality, but the interface is not
symmetrical at all: one is an internal OOM killer function used during
the killing, the other is for an OOM victim to signal its own death on
exit later on.  This has locking implications, see follow-up changes.

While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 drivers/staging/android/lowmemorykiller.c |  2 +-
 include/linux/oom.h                       |  7 ++++---
 kernel/exit.c                             |  2 +-
 mm/memcontrol.c                           |  2 +-
 mm/oom_kill.c                             | 16 +++++++---------
 5 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index feafa172b155..2345ee7342d9 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -165,7 +165,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 		 * infrastructure. There is no real reason why the selected
 		 * task should have access to the memory reserves.
 		 */
-		mark_tsk_oom_victim(selected);
+		mark_oom_victim(selected);
 		send_sig(SIGKILL, selected, 0);
 		rem += selected_tasksize;
 	}
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 44b2f6f7bbd8..a8e6a498cbcb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -47,9 +47,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
 	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
 }
 
-extern void mark_tsk_oom_victim(struct task_struct *tsk);
-
-extern void unmark_oom_victim(void);
+extern void mark_oom_victim(struct task_struct *tsk);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -75,6 +73,9 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 
 extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
+
+extern void exit_oom_victim(void);
+
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
diff --git a/kernel/exit.c b/kernel/exit.c
index feff10bbb307..4089c2fd373e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,7 +436,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		unmark_oom_victim();
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a9641d8f9f..aab5604e0ac4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1536,7 +1536,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
-		mark_tsk_oom_victim(current);
+		mark_oom_victim(current);
 		return;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 73763e489e86..b2f081fe4b1a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -408,13 +408,13 @@ bool oom_killer_disabled __read_mostly;
 static DECLARE_RWSEM(oom_sem);
 
 /**
- * mark_tsk_oom_victim - marks the given task as OOM victim.
+ * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
  *
  * Has to be called with oom_sem taken for read and never after
  * oom has been disabled already.
  */
-void mark_tsk_oom_victim(struct task_struct *tsk)
+void mark_oom_victim(struct task_struct *tsk)
 {
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
@@ -431,11 +431,9 @@ void mark_tsk_oom_victim(struct task_struct *tsk)
 }
 
 /**
- * unmark_oom_victim - unmarks the current task as OOM victim.
- *
- * Wakes up all waiters in oom_killer_disable()
+ * exit_oom_victim - note the exit of an OOM victim
  */
-void unmark_oom_victim(void)
+void exit_oom_victim(void)
 {
 	if (!test_and_clear_thread_flag(TIF_MEMDIE))
 		return;
@@ -515,7 +513,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 */
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
-		mark_tsk_oom_victim(p);
+		mark_oom_victim(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -570,7 +568,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
-	mark_tsk_oom_victim(victim);
+	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -728,7 +726,7 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_tsk_oom_victim(current);
+		mark_oom_victim(current);
 		return;
 	}
 
-- 
2.3.3



* [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
  2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
  2015-03-25  6:17 ` [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26  3:31   ` David Rientjes
  2015-03-26 11:57   ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Michal Hocko
  2015-03-25  6:17 ` [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim() Johannes Weiner
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody
else can clear it concurrently.  Use clear_thread_flag() directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b2f081fe4b1a..4b9547be9170 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -435,8 +435,7 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(void)
 {
-	if (!test_and_clear_thread_flag(TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	down_read(&oom_sem);
 	/*
-- 
2.3.3



* [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim()
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (2 preceding siblings ...)
  2015-03-25  6:17 ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26 12:53   ` Michal Hocko
  2015-03-25  6:17 ` [patch 05/12] mm: oom_kill: generalize OOM progress waitqueue Johannes Weiner
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

Disabling the OOM killer needs to exclude allocators from entering,
not existing victims from exiting.

Right now the only waiter is suspend code, which achieves quiescence
by disabling the OOM killer.  But later on we want to add waits that
hold the lock instead to stop new victims from showing up.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b9547be9170..88aa9ba40fa5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -437,14 +437,12 @@ void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
 
-	down_read(&oom_sem);
 	/*
 	 * There is no need to signal the lasst oom_victim if there
 	 * is nobody who cares.
 	 */
 	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
 		wake_up_all(&oom_victims_wait);
-	up_read(&oom_sem);
 }
 
 /**
-- 
2.3.3



* [patch 05/12] mm: oom_kill: generalize OOM progress waitqueue
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (3 preceding siblings ...)
  2015-03-25  6:17 ` [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim() Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26 13:03   ` Michal Hocko
  2015-03-25  6:17 ` [patch 06/12] mm: oom_kill: simplify OOM killer locking Johannes Weiner
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

It turns out that the mechanism to wait for exiting OOM victims is
less generic than it looks: it won't issue wakeups unless the OOM
killer is disabled.

This check was added on the assumption that, since only the OOM
disabling code would wait on this queue, wakeup operations could be
skipped when that specific consumer is known to be absent.

However, this is quite the hand grenade.  Later attempts to reuse the
waitqueue for other purposes will lead to completely unexpected bugs,
and the failure mode will appear seemingly illogical.  Generally,
providers shouldn't make unnecessary assumptions about consumers.

This could have been replaced with waitqueue_active(), but it only
saves a few instructions in one of the coldest paths in the kernel.
Simply remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 88aa9ba40fa5..d3490b019d46 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -437,11 +437,7 @@ void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
 
-	/*
-	 * There is no need to signal the lasst oom_victim if there
-	 * is nobody who cares.
-	 */
-	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
 }
 
-- 
2.3.3



* [patch 06/12] mm: oom_kill: simplify OOM killer locking
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (4 preceding siblings ...)
  2015-03-25  6:17 ` [patch 05/12] mm: oom_kill: generalize OOM progress waitqueue Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26 13:31   ` Michal Hocko
  2015-03-25  6:17 ` [patch 07/12] mm: page_alloc: inline should_alloc_retry() Johannes Weiner
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

The zonelist locking and the oom_sem are two overlapping locks that
are used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations
with overlapping zonelists against each other to prevent killing more
tasks than necessary in the same memory domain.  Only when neither
tasklists nor zonelists from two concurrent OOM kills overlap (tasks
in separate memcgs bound to separate nodes) are OOM kills allowed to
execute in parallel.

The younger oom_sem is a read-write lock to serialize OOM killing
against the PM code trying to disable the OOM killer altogether.

However, the OOM killer is a fairly cold error path; there is really
no reason to optimize for highly performant and concurrent OOM kills.
And the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing
OOM kills regardless of context.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/oom.h |   5 +--
 mm/memcontrol.c     |  18 +++++---
 mm/oom_kill.c       | 127 +++++++++++-----------------------------------------
 mm/page_alloc.c     |   8 ++--
 4 files changed, 44 insertions(+), 114 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index a8e6a498cbcb..7deecb7bca5e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -32,6 +32,8 @@ enum oom_scan_t {
 /* Thread is the potential origin of an oom condition; kill first on oom */
 #define OOM_FLAG_ORIGIN		((__force oom_flags_t)0x1)
 
+extern struct mutex oom_lock;
+
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flags |= OOM_FLAG_ORIGIN;
@@ -60,9 +62,6 @@ extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			     struct mem_cgroup *memcg, nodemask_t *nodemask,
 			     const char *message);
 
-extern bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_flags);
-extern void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_flags);
-
 extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 			       int order, const nodemask_t *nodemask,
 			       struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aab5604e0ac4..9f280b9df848 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1530,6 +1530,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
+	mutex_lock(&oom_lock);
+
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
@@ -1537,7 +1539,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		return;
+		goto unlock;
 	}
 
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
@@ -1564,7 +1566,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				mem_cgroup_iter_break(memcg, iter);
 				if (chosen)
 					put_task_struct(chosen);
-				return;
+				goto unlock;
 			case OOM_SCAN_OK:
 				break;
 			};
@@ -1585,11 +1587,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		css_task_iter_end(&it);
 	}
 
-	if (!chosen)
-		return;
-	points = chosen_points * 1000 / totalpages;
-	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
-			 NULL, "Memory cgroup out of memory");
+	if (chosen) {
+		points = chosen_points * 1000 / totalpages;
+		oom_kill_process(chosen, gfp_mask, order, points, totalpages,
+				 memcg, NULL, "Memory cgroup out of memory");
+	}
+unlock:
+	mutex_unlock(&oom_lock);
 }
 
 #if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d3490b019d46..5cfda39b3268 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -42,7 +42,8 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
-static DEFINE_SPINLOCK(zone_scan_lock);
+
+DEFINE_MUTEX(oom_lock);
 
 #ifdef CONFIG_NUMA
 /**
@@ -405,13 +406,12 @@ static atomic_t oom_victims = ATOMIC_INIT(0);
 static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
-static DECLARE_RWSEM(oom_sem);
 
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
  *
- * Has to be called with oom_sem taken for read and never after
+ * Has to be called with oom_lock held and never after
  * oom has been disabled already.
  */
 void mark_oom_victim(struct task_struct *tsk)
@@ -460,14 +460,14 @@ bool oom_killer_disable(void)
 	 * Make sure to not race with an ongoing OOM killer
 	 * and that the current is not the victim.
 	 */
-	down_write(&oom_sem);
+	mutex_lock(&oom_lock);
 	if (test_thread_flag(TIF_MEMDIE)) {
-		up_write(&oom_sem);
+		mutex_unlock(&oom_lock);
 		return false;
 	}
 
 	oom_killer_disabled = true;
-	up_write(&oom_sem);
+	mutex_unlock(&oom_lock);
 
 	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
 
@@ -634,52 +634,6 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
-/*
- * Try to acquire the OOM killer lock for the zones in zonelist.  Returns zero
- * if a parallel OOM killing is already taking place that includes a zone in
- * the zonelist.  Otherwise, locks all zones in the zonelist and returns 1.
- */
-bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_mask)
-{
-	struct zoneref *z;
-	struct zone *zone;
-	bool ret = true;
-
-	spin_lock(&zone_scan_lock);
-	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
-		if (test_bit(ZONE_OOM_LOCKED, &zone->flags)) {
-			ret = false;
-			goto out;
-		}
-
-	/*
-	 * Lock each zone in the zonelist under zone_scan_lock so a parallel
-	 * call to oom_zonelist_trylock() doesn't succeed when it shouldn't.
-	 */
-	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
-		set_bit(ZONE_OOM_LOCKED, &zone->flags);
-
-out:
-	spin_unlock(&zone_scan_lock);
-	return ret;
-}
-
-/*
- * Clears the ZONE_OOM_LOCKED flag for all zones in the zonelist so that failed
- * allocation attempts with zonelists containing them may now recall the OOM
- * killer, if necessary.
- */
-void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	spin_lock(&zone_scan_lock);
-	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
-		clear_bit(ZONE_OOM_LOCKED, &zone->flags);
-	spin_unlock(&zone_scan_lock);
-}
-
 /**
  * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
@@ -693,8 +647,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
-		int order, nodemask_t *nodemask, bool force_kill)
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		   int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
 	struct task_struct *p;
@@ -704,10 +658,13 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 	int killed = 0;
 
+	if (oom_killer_disabled)
+		return false;
+
 	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 	if (freed > 0)
 		/* Got some memory back in the last second. */
-		return;
+		goto out;
 
 	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
@@ -720,7 +677,7 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		mark_oom_victim(current);
-		return;
+		goto out;
 	}
 
 	/*
@@ -760,32 +717,8 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 */
 	if (killed)
 		schedule_timeout_killable(1);
-}
-
-/**
- * out_of_memory -  tries to invoke OOM killer.
- * @zonelist: zonelist pointer
- * @gfp_mask: memory allocation flags
- * @order: amount of memory being requested as a power of 2
- * @nodemask: nodemask passed to page allocator
- * @force_kill: true if a task must be killed, even if others are exiting
- *
- * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
- * when it returns false. Otherwise returns true.
- */
-bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
-		int order, nodemask_t *nodemask, bool force_kill)
-{
-	bool ret = false;
-
-	down_read(&oom_sem);
-	if (!oom_killer_disabled) {
-		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
-		ret = true;
-	}
-	up_read(&oom_sem);
 
-	return ret;
+	return true;
 }
 
 /*
@@ -795,27 +728,21 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
  */
 void pagefault_out_of_memory(void)
 {
-	struct zonelist *zonelist;
-
-	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		goto unlock;
+		return;
 
-	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
-	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		if (!oom_killer_disabled)
-			__out_of_memory(NULL, 0, 0, NULL, false);
-		else
-			/*
-			 * There shouldn't be any user tasks runable while the
-			 * OOM killer is disabled so the current task has to
-			 * be a racing OOM victim for which oom_killer_disable()
-			 * is waiting for.
-			 */
-			WARN_ON(test_thread_flag(TIF_MEMDIE));
+	if (!mutex_trylock(&oom_lock))
+		return;
 
-		oom_zonelist_unlock(zonelist, GFP_KERNEL);
+	if (!out_of_memory(NULL, 0, 0, NULL, false)) {
+		/*
+		 * There shouldn't be any user tasks runnable while the
+		 * OOM killer is disabled, so the current task has to
+		 * be a racing OOM victim for which oom_killer_disable()
+		 * is waiting for.
+		 */
+		WARN_ON(test_thread_flag(TIF_MEMDIE));
 	}
-unlock:
-	up_read(&oom_sem);
+
+	mutex_unlock(&oom_lock);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 656379820190..9ebc760187ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2380,10 +2380,10 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the per-zone oom lock for each zone.  If that
-	 * fails, somebody else is making progress for us.
+	 * Acquire the oom lock.  If that fails, somebody else is
+	 * making progress for us.
 	 */
-	if (!oom_zonelist_trylock(ac->zonelist, gfp_mask)) {
+	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
 		schedule_timeout_uninterruptible(1);
 		return NULL;
@@ -2428,7 +2428,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 		*did_some_progress = 1;
 out:
-	oom_zonelist_unlock(ac->zonelist, gfp_mask);
+	mutex_unlock(&oom_lock);
 	return page;
 }
 
-- 
2.3.3



* [patch 07/12] mm: page_alloc: inline should_alloc_retry()
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (5 preceding siblings ...)
  2015-03-25  6:17 ` [patch 06/12] mm: oom_kill: simplify OOM killer locking Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26 14:11   ` Michal Hocko
  2015-03-25  6:17 ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Johannes Weiner
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

The should_alloc_retry() function was meant to encapsulate the retry
conditions of the allocator slowpath, but there are still checks
remaining in the main function, and much of how the retrying is
performed also depends on the OOM killer's progress.  The physical
separation of those conditions makes the code hard to follow.

Inline the should_alloc_retry() checks.  Notes:

- The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
  replace it with looping on OOM killer progress

- The pm_suspended_storage() check is meant to skip the OOM killer
  when reclaim has no IO available, move to __alloc_pages_may_oom()

- The order < PAGE_ALLOC_COSTLY_ORDER check is reunited with its
  original counterpart: checking whether reclaim actually made any
  progress
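
For illustration, the consolidated retry policy can be modeled in a
few lines of standalone C (a sketch with simplified flag values, not
the kernel's actual gfp definitions):

```c
#include <stdbool.h>

/* simplified stand-ins for the kernel's gfp flags and constants */
#define __GFP_NORETRY (1u << 0)
#define __GFP_REPEAT  (1u << 1)
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Model of the retry decision as inlined into the slowpath:
 * never loop for __GFP_NORETRY; keep going while reclaim makes
 * progress on cheap orders, or while __GFP_REPEAT hasn't yet
 * reclaimed an order's worth of pages.
 */
static bool should_retry(unsigned int gfp_mask, unsigned int order,
			 unsigned long did_some_progress,
			 unsigned long pages_reclaimed)
{
	if (gfp_mask & __GFP_NORETRY)
		return false;
	if (did_some_progress && order < PAGE_ALLOC_COSTLY_ORDER)
		return true;
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1UL << order))
		return true;
	return false;
}
```

When this comes out false, the slowpath falls through to the OOM
killer and, failing that, to direct compaction.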

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 104 +++++++++++++++++---------------------------------------
 1 file changed, 32 insertions(+), 72 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ebc760187ac..c1224ba45548 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2329,48 +2329,6 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		show_mem(filter);
 }
 
-static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
-				unsigned long did_some_progress,
-				unsigned long pages_reclaimed)
-{
-	/* Do not loop if specifically requested */
-	if (gfp_mask & __GFP_NORETRY)
-		return 0;
-
-	/* Always retry if specifically requested */
-	if (gfp_mask & __GFP_NOFAIL)
-		return 1;
-
-	/*
-	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
-	 * making forward progress without invoking OOM. Suspend also disables
-	 * storage devices so kswapd will not help. Bail if we are suspending.
-	 */
-	if (!did_some_progress && pm_suspended_storage())
-		return 0;
-
-	/*
-	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
-	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER)
-		return 1;
-
-	/*
-	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
-	 * specified, then we retry until we no longer reclaim any pages
-	 * (above), or we've reclaimed an order of pages at least as
-	 * large as the allocation's order. In both cases, if the
-	 * allocation still fails, we stop retrying.
-	 */
-	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
-		return 1;
-
-	return 0;
-}
-
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
@@ -2409,16 +2367,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for light reclaim */
+		/* The OOM killer does not compensate for IO-less reclaim */
 		if (!(gfp_mask & __GFP_FS)) {
 			/*
 			 * XXX: Page reclaim didn't yield anything,
 			 * and the OOM killer can't be invoked, but
-			 * keep looping as per should_alloc_retry().
+			 * keep looping as per tradition.
 			 */
 			*did_some_progress = 1;
 			goto out;
 		}
+		if (pm_suspended_storage())
+			goto out;
 		/* The OOM killer may not free memory on a specific node */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
@@ -2801,40 +2761,40 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
-	/* Check if we should retry the allocation */
+	/* Do not loop if specifically requested */
+	if (gfp_mask & __GFP_NORETRY)
+		goto noretry;
+
+	/* Keep reclaiming pages as long as there is reasonable progress */
 	pages_reclaimed += did_some_progress;
-	if (should_alloc_retry(gfp_mask, order, did_some_progress,
-						pages_reclaimed)) {
-		/*
-		 * If we fail to make progress by freeing individual
-		 * pages, but the allocation wants us to keep going,
-		 * start OOM killing tasks.
-		 */
-		if (!did_some_progress) {
-			page = __alloc_pages_may_oom(gfp_mask, order, ac,
-							&did_some_progress);
-			if (page)
-				goto got_pg;
-			if (!did_some_progress)
-				goto nopage;
-		}
+	if ((did_some_progress && order < PAGE_ALLOC_COSTLY_ORDER) ||
+	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
-	} else {
-		/*
-		 * High-order allocations do not necessarily loop after
-		 * direct reclaim and reclaim/compaction depends on compaction
-		 * being called after reclaim so call directly if necessary
-		 */
-		page = __alloc_pages_direct_compact(gfp_mask, order,
-					alloc_flags, ac, migration_mode,
-					&contended_compaction,
-					&deferred_compaction);
-		if (page)
-			goto got_pg;
 	}
 
+	/* Reclaim has failed us, start killing things */
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	if (page)
+		goto got_pg;
+
+	/* Retry as long as the OOM killer is making progress */
+	if (did_some_progress)
+		goto retry;
+
+noretry:
+	/*
+	 * High-order allocations do not necessarily loop after
+	 * direct reclaim and reclaim/compaction depends on compaction
+	 * being called after reclaim so call directly if necessary
+	 */
+	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
+					    ac, migration_mode,
+					    &contended_compaction,
+					    &deferred_compaction);
+	if (page)
+		goto got_pg;
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
-- 
2.3.3



* [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (6 preceding siblings ...)
  2015-03-25  6:17 ` [patch 07/12] mm: page_alloc: inline should_alloc_retry() Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-25 14:15   ` Tetsuo Handa
  2015-03-26 15:58   ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Michal Hocko
  2015-03-25  6:17 ` [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations Johannes Weiner
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

There is not much point in rushing back to the freelists and burning
CPU cycles in direct reclaim when somebody else is in the process of
OOM killing, or right after issuing a kill ourselves, because it could
take some time for the OOM victim to release memory.

This is a very cold error path, so there is not much hurry.  Use the
OOM victim waitqueue to wait for victims to actually exit, which is a
solid signal that the memory pinned by those tasks has been released.
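
For reference, the wait_event_timeout() semantics relied on here can
be modeled as follows (a single-threaded toy, not the kernel's
waitqueue implementation; the demo condition is a made-up stand-in
that simulates a victim exiting after a few polls):

```c
/*
 * Toy model of wait_event_timeout(): poll a condition, burning one
 * "tick" of the timeout per failed check.  Returns the ticks left
 * when the condition came true, or 0 when the timeout expired.  Like
 * the real thing, the condition is sampled once more on expiry.
 */
static int wait_event_timeout_model(int (*cond)(void *), void *arg,
				    int timeout)
{
	while (timeout > 0) {
		if (cond(arg))
			return timeout;
		timeout--;	/* one tick of sleeping elapses */
	}
	return cond(arg) ? 1 : 0;
}

/* demo condition: the "victim" exits after a few polls */
static int victim_exits(void *p)
{
	int *polls_left = p;

	return (*polls_left)-- <= 0;
}
```

A nonzero return means a victim exited within the bound; zero means
the wait was cut off and the caller retries (or fails) anyway.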

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c   | 11 +++++++----
 mm/page_alloc.c | 42 +++++++++++++++++++++++++-----------------
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39b3268..e066ac7353a4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
+	if (test_thread_flag(TIF_MEMDIE))
+		return true;
 	/*
-	 * Give the killed threads a good chance of exiting before trying to
-	 * allocate memory again.
+	 * Wait for any outstanding OOM victims to die.  In rare cases
+	 * victims can get stuck behind the allocating tasks, so the
+	 * wait needs to be bounded.  It's crude alright, but cheaper
+	 * than keeping a global dependency tree between all tasks.
 	 */
-	if (killed)
-		schedule_timeout_killable(1);
+	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
 
 	return true;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1224ba45548..9ce9c4c083a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 }
 
 static inline struct page *
-__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
+	 * This allocating task can become the OOM victim itself at
+	 * any point before acquiring the lock.  In that case, exit
+	 * quickly and don't block on the lock held by another task
+	 * waiting for us to exit.
 	 */
-	if (!mutex_trylock(&oom_lock)) {
-		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
-		return NULL;
+	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+		goto alloc;
 	}
 
 	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure.
+	 * While we have been waiting for the lock, the previous OOM
+	 * kill might have released enough memory for the both of us.
 	 */
-	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
-					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto out;
 
@@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
-	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+
+	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
 		*did_some_progress = 1;
+	} else {
+		/* Oops, these shouldn't happen with the OOM killer disabled */
+		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+			*did_some_progress = 1;
+	}
 out:
 	mutex_unlock(&oom_lock);
+alloc:
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+
 	return page;
 }
 
@@ -2775,7 +2782,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
+				     &did_some_progress);
 	if (page)
 		goto got_pg;
 
-- 
2.3.3



* [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (7 preceding siblings ...)
  2015-03-25  6:17 ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-04-14 16:49   ` Michal Hocko
  2015-03-25  6:17 ` [patch 10/12] mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations Johannes Weiner
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

The OOM killer connects random tasks in the system with unknown
dependencies between them, and the OOM victim might well get blocked
behind the task that is trying to allocate.  That means that while
allocations can issue OOM kills to improve the low memory situation,
which generally frees more than they are going to take out, they
cannot rely on their *own* OOM kills to make forward progress.

Secondly, we want to avoid a racing allocation swooping in to steal
the work of the OOM killing allocation, causing spurious allocation
failures.  The one that put in the work must have priority - if its
efforts are enough to serve both allocations that's fine, otherwise
concurrent allocations should be forced to issue their own OOM kills.

Keep some pages below the min watermark reserved for OOM-killing
allocations to protect them from blocking victims and concurrent
allocations not pulling their weight.
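
The resulting watermark layout can be sketched with a standalone
model of the __setup_per_zone_wmarks() arithmetic (a simplified
sketch, not the kernel code; tmp stands for the zone's proportional
share of the minimum reserve, in pages):

```c
/* simplified per-zone watermark computation after this patch */
struct zone_marks {
	unsigned long oom, min, low, high;
};

static struct zone_marks setup_wmarks(unsigned long tmp)
{
	struct zone_marks z;

	z.oom  = tmp;			/* reserved for OOM-killing allocations */
	z.min  = z.oom + (tmp >> 3);	/* old WMARK_MIN, now above the reserve */
	z.low  = z.oom + (tmp >> 2);
	z.high = z.oom + (tmp >> 1);
	return z;
}
```

For any reasonable tmp the ordering oom < min < low < high holds, so
regular allocations fail the min watermark before the OOM reserve is
touched.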

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  2 ++
 mm/internal.h          |  3 ++-
 mm/page_alloc.c        | 27 +++++++++++++++++++++++----
 mm/vmstat.c            |  2 ++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 218f89208e83..284a36c7c1ce 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -239,12 +239,14 @@ struct lruvec {
 typedef unsigned __bitwise__ isolate_mode_t;
 
 enum zone_watermarks {
+	WMARK_OOM,
 	WMARK_MIN,
 	WMARK_LOW,
 	WMARK_HIGH,
 	NR_WMARK
 };
 
+#define oom_wmark_pages(z) (z->watermark[WMARK_OOM])
 #define min_wmark_pages(z) (z->watermark[WMARK_MIN])
 #define low_wmark_pages(z) (z->watermark[WMARK_LOW])
 #define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
diff --git a/mm/internal.h b/mm/internal.h
index edaab69a9c35..f59f3711f26c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -419,10 +419,11 @@ extern void set_pageblock_order(void);
 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 					    struct list_head *page_list);
 /* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_OOM		WMARK_OOM
 #define ALLOC_WMARK_MIN		WMARK_MIN
 #define ALLOC_WMARK_LOW		WMARK_LOW
 #define ALLOC_WMARK_HIGH	WMARK_HIGH
-#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
+#define ALLOC_NO_WATERMARKS	0x08 /* don't check watermarks at all */
 
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ce9c4c083a0..3c165016175d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2390,6 +2390,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
 			*did_some_progress = 1;
 	}
+
+	/*
+	 * Allocate from the OOM killer reserves.
+	 *
+	 * For one, this prevents parallel allocations from stealing
+	 * our work and cause us to fail the allocation prematurely.
+	 * If our kill is not enough for both, a racing allocation
+	 * should issue a kill on its own.
+	 *
+	 * We might also hold a lock or state that keeps the OOM kill
+	 * from exiting.  While allocations can use OOM kills to free
+	 * memory, they can not necessarily rely on their *own* kills
+	 * to make forward progress.
+	 */
+	alloc_flags &= ~ALLOC_WMARK_MASK;
+	alloc_flags |= ALLOC_WMARK_OOM;
 out:
 	mutex_unlock(&oom_lock);
 alloc:
@@ -3274,6 +3290,7 @@ void show_free_areas(unsigned int filter)
 		show_node(zone);
 		printk("%s"
 			" free:%lukB"
+			" oom:%lukB"
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
@@ -3306,6 +3323,7 @@ void show_free_areas(unsigned int filter)
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(oom_wmark_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
@@ -5747,17 +5765,18 @@ static void __setup_per_zone_wmarks(void)
 
 			min_pages = zone->managed_pages / 1024;
 			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
-			zone->watermark[WMARK_MIN] = min_pages;
+			zone->watermark[WMARK_OOM] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->watermark[WMARK_MIN] = tmp;
+			zone->watermark[WMARK_OOM] = tmp;
 		}
 
-		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
-		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
+		zone->watermark[WMARK_MIN]  = oom_wmark_pages(zone) + (tmp >> 3);
+		zone->watermark[WMARK_LOW]  = oom_wmark_pages(zone) + (tmp >> 2);
+		zone->watermark[WMARK_HIGH] = oom_wmark_pages(zone) + (tmp >> 1);
 
 		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
 			high_wmark_pages(zone) - low_wmark_pages(zone) -
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886a389f..a62f16ef524c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1188,6 +1188,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
 	seq_printf(m,
 		   "\n  pages free     %lu"
+		   "\n        oom      %lu"
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
@@ -1196,6 +1197,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
+		   oom_wmark_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
2.3.3



* [patch 10/12] mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (8 preceding siblings ...)
  2015-03-25  6:17 ` [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-04-14 16:55   ` Michal Hocko
  2015-03-25  6:17 ` [patch 11/12] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM Johannes Weiner
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

__GFP_NOFAIL allocations can deadlock the OOM killer when they're
holding locks that the OOM victim might need to exit.  When that
happens the allocation may never complete, which has disastrous
effects on things like in-flight filesystem transactions.

When the system is OOM, allow __GFP_NOFAIL allocations to dip into the
emergency reserves in the hope that this will allow transactions and
writeback to complete and the deadlock can be avoided.
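
The flag selection can be sketched standalone (the constants mirror
the values used in this series; the gfp_nofail parameter is a
simplified stand-in for testing gfp_mask & __GFP_NOFAIL):

```c
/* watermark index bits and the no-watermark override, as in patch 09 */
#define ALLOC_WMARK_OOM		0x00
#define ALLOC_WMARK_MIN		0x01
#define ALLOC_WMARK_LOW		0x02
#define ALLOC_WMARK_HIGH	0x03
#define ALLOC_NO_WATERMARKS	0x08
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)

/*
 * After an OOM kill: __GFP_NOFAIL allocations may dip into the
 * emergency reserves entirely; everybody else retries against the
 * new OOM watermark below WMARK_MIN.
 */
static unsigned int oom_alloc_flags(unsigned int gfp_nofail,
				    unsigned int alloc_flags)
{
	if (gfp_nofail) {
		alloc_flags |= ALLOC_NO_WATERMARKS;
	} else {
		alloc_flags &= ~ALLOC_WMARK_MASK;
		alloc_flags |= ALLOC_WMARK_OOM;
	}
	return alloc_flags;
}
```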

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c165016175d..832ad1c7cd4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2403,9 +2403,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * from exiting.  While allocations can use OOM kills to free
 	 * memory, they can not necessarily rely on their *own* kills
 	 * to make forward progress.
+	 *
+	 * This last point is crucial for __GFP_NOFAIL allocations.
+	 * Since they can't quit, they might actually deadlock, so
+	 * give them hail mary access to the emergency reserves.
 	 */
-	alloc_flags &= ~ALLOC_WMARK_MASK;
-	alloc_flags |= ALLOC_WMARK_OOM;
+	if (gfp_mask & __GFP_NOFAIL) {
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+	} else {
+		alloc_flags &= ~ALLOC_WMARK_MASK;
+		alloc_flags |= ALLOC_WMARK_OOM;
+	}
 out:
 	mutex_unlock(&oom_lock);
 alloc:
-- 
2.3.3



* [patch 11/12] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (9 preceding siblings ...)
  2015-03-25  6:17 ` [patch 10/12] mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26 14:50   ` Michal Hocko
  2015-03-25  6:17 ` [patch 12/12] mm: page_alloc: do not lock up low-order " Johannes Weiner
  2015-03-26 19:58 ` [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Dave Chinner
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

GFP_NOFS allocations are not allowed to invoke the OOM killer since
their reclaim abilities are severely diminished.  However, without the
OOM killer available there is no hope of progress once the reclaimable
pages have been exhausted.

Don't risk hanging these allocations.  Leave it to the allocation site
to implement the fallback policy for failing allocations.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 832ad1c7cd4f..9e45e97aa934 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2367,15 +2367,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
+		if (!(gfp_mask & __GFP_FS))
 			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
-- 
2.3.3



* [patch 12/12] mm: page_alloc: do not lock up low-order allocations upon OOM
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (10 preceding siblings ...)
  2015-03-25  6:17 ` [patch 11/12] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM Johannes Weiner
@ 2015-03-25  6:17 ` Johannes Weiner
  2015-03-26 15:32   ` Michal Hocko
  2015-03-26 19:58 ` [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Dave Chinner
  12 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-25  6:17 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Michal Hocko, Theodore Ts'o

When both page reclaim and the OOM killer fail to free memory, there
are no more options for the allocator to make progress on its own.

Don't risk hanging these allocations.  Leave it to the allocation site
to implement the fallback policy for failing allocations.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e45e97aa934..f2b1a17416c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2331,12 +2331,10 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+		      const struct alloc_context *ac)
 {
 	struct page *page = NULL;
 
-	*did_some_progress = 0;
-
 	/*
 	 * This allocating task can become the OOM victim itself at
 	 * any point before acquiring the lock.  In that case, exit
@@ -2376,13 +2374,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			goto out;
 	}
 
-	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
-		*did_some_progress = 1;
-	} else {
+	if (!out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false))
 		/* Oops, these shouldn't happen with the OOM killer disabled */
-		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
-			*did_some_progress = 1;
-	}
+		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
 
 	/*
 	 * Allocate from the OOM killer reserves.
@@ -2799,13 +2793,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
-				     &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
 
-	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	/* Wait for user to order more dimms, cuz these are done */
+	if (gfp_mask & __GFP_NOFAIL)
 		goto retry;
 
 noretry:
-- 
2.3.3



* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-25  6:17 ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Johannes Weiner
@ 2015-03-25 14:15   ` Tetsuo Handa
  2015-03-25 17:01     ` Vlastimil Babka
  2015-03-26 11:24     ` Johannes Weiner
  2015-03-26 15:58   ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Michal Hocko
  1 sibling, 2 replies; 69+ messages in thread
From: Tetsuo Handa @ 2015-03-25 14:15 UTC (permalink / raw)
  To: hannes, linux-mm, linux-fsdevel, linux-kernel
  Cc: torvalds, akpm, ying.huang, aarcange, david, mhocko, tytso

Johannes Weiner wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5cfda39b3268..e066ac7353a4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		killed = 1;
>  	}
>  out:
> +	if (test_thread_flag(TIF_MEMDIE))
> +		return true;
>  	/*
> -	 * Give the killed threads a good chance of exiting before trying to
> -	 * allocate memory again.
> +	 * Wait for any outstanding OOM victims to die.  In rare cases
> +	 * victims can get stuck behind the allocating tasks, so the
> +	 * wait needs to be bounded.  It's crude alright, but cheaper
> +	 * than keeping a global dependency tree between all tasks.
>  	 */
> -	if (killed)
> -		schedule_timeout_killable(1);
> +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
>  
>  	return true;
>  }

out_of_memory() returning true with a bounded wait effectively means
that we wait forever without choosing subsequent OOM victims when the
first OOM victim fails to die.  The system will lock up, won't it?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c1224ba45548..9ce9c4c083a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  }
>  
>  static inline struct page *
> -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  
>  	*did_some_progress = 0;
>  
>  	/*
> -	 * Acquire the oom lock.  If that fails, somebody else is
> -	 * making progress for us.
> +	 * This allocating task can become the OOM victim itself at
> +	 * any point before acquiring the lock.  In that case, exit
> +	 * quickly and don't block on the lock held by another task
> +	 * waiting for us to exit.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> -		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> -		return NULL;
> +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +		goto alloc;
>  	}

When a thread group has 1000 threads and most of them are doing memory
allocation requests, all of them will get fatal_signal_pending() == true
when one of them is chosen by the OOM killer.
This code will allow most of them to access memory reserves, won't it?

> @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> -	/* Exhausted what can be done so it's blamo time */
> -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +
> +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
>  		*did_some_progress = 1;
> +	} else {
> +		/* Oops, these shouldn't happen with the OOM killer disabled */
> +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +			*did_some_progress = 1;
> +	}

I think __GFP_NOFAIL allocations need to involve the OOM killer rather
than pretending that forward progress was made.  If all of the in-flight
allocation requests are __GFP_NOFAIL, the system will lock up.

After all, if we wait for OOM killer progress before retrying, I think
we should involve the OOM killer after some bounded timeout regardless
of gfp flags, and let the OOM killer kill more threads after another
bounded timeout.  Otherwise, these corner cases will lock up the system.


* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-25 14:15   ` Tetsuo Handa
@ 2015-03-25 17:01     ` Vlastimil Babka
  2015-03-26 11:28       ` Johannes Weiner
  2015-03-26 11:24     ` Johannes Weiner
  1 sibling, 1 reply; 69+ messages in thread
From: Vlastimil Babka @ 2015-03-25 17:01 UTC (permalink / raw)
  To: Tetsuo Handa, hannes, linux-mm, linux-fsdevel, linux-kernel
  Cc: torvalds, akpm, ying.huang, aarcange, david, mhocko, tytso

On 03/25/2015 03:15 PM, Tetsuo Handa wrote:
> Johannes Weiner wrote:
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 5cfda39b3268..e066ac7353a4 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>>   		killed = 1;
>>   	}
>>   out:
>> +	if (test_thread_flag(TIF_MEMDIE))
>> +		return true;
>>   	/*
>> -	 * Give the killed threads a good chance of exiting before trying to
>> -	 * allocate memory again.
>> +	 * Wait for any outstanding OOM victims to die.  In rare cases
>> +	 * victims can get stuck behind the allocating tasks, so the
>> +	 * wait needs to be bounded.  It's crude alright, but cheaper
>> +	 * than keeping a global dependency tree between all tasks.
>>   	 */
>> -	if (killed)
>> -		schedule_timeout_killable(1);
>> +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
>>
>>   	return true;
>>   }
>
> out_of_memory() returning true with bounded wait effectively means that
> wait forever without choosing subsequent OOM victims when first OOM victim
> failed to die. The system will lock up, won't it?

And after patch 12, does this mean that you may not be waiting long 
enough for the victim to die before you fail the allocation 
prematurely? I can imagine there would be situations where the victim 
is not deadlocked, but still takes more than HZ to finish, no?



* Re: [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
@ 2015-03-26  0:51   ` David Rientjes
  2015-03-26 11:51     ` Michal Hocko
  2015-03-26 11:43   ` Michal Hocko
  2015-03-26 20:05   ` David Rientjes
  2 siblings, 1 reply; 69+ messages in thread
From: David Rientjes @ 2015-03-26  0:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Michal Hocko, Theodore Ts'o

On Wed, 25 Mar 2015, Johannes Weiner wrote:

> Setting oom_killer_disabled to false is atomic, there is no need for
> further synchronization with ongoing allocations trying to OOM-kill.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/oom_kill.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 2b665da1b3c9..73763e489e86 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -488,9 +488,7 @@ bool oom_killer_disable(void)
>   */
>  void oom_killer_enable(void)
>  {
> -	down_write(&oom_sem);
>  	oom_killer_disabled = false;
> -	up_write(&oom_sem);
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))

I haven't looked through the new disable-oom-killer-for-pm patchset that 
was merged, but this oom_killer_disabled thing already looks improperly 
handled.  I think any correctness fixes or cleanups in this area would be 
very 
helpful.

I think mark_tsk_oom_victim() in mem_cgroup_out_of_memory() is just 
luckily not racing with a call to oom_killer_enable() and triggering the 
WARN_ON(oom_killer_disabled) since there's no "oom_sem" held here, and 
it's an improper context based on the comment of mark_tsk_oom_victim().  
There might be something else that is intended but not implemented 
correctly that I'm unaware of, but I know of no reason why setting of 
oom_killer_disabled would need to take a semaphore?

I'm thinking it has something to do with the remainder of that comment, 
specifically the "never after oom has been disabled already."

Michal?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear
  2015-03-25  6:17 ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Johannes Weiner
@ 2015-03-26  3:31   ` David Rientjes
  2015-03-26 11:05     ` Johannes Weiner
  2015-03-26 11:57   ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: David Rientjes @ 2015-03-26  3:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Michal Hocko, Theodore Ts'o

On Wed, 25 Mar 2015, Johannes Weiner wrote:

> exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody
> else can clear it concurrently.  Use clear_thread_flag() directly.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

For the oom killer, that's true because of task_lock(): we always only set 
TIF_MEMDIE when there is a valid p->mm and it's cleared in the exit path 
after the unlock, acting as a barrier, when p->mm is set to NULL so it's 
no longer a valid victim.  So that part is fine.

The problem is the android low memory killer that does 
mark_tsk_oom_victim() without the protection of task_lock(); it's just RCU 
protected, so the reference to the task itself is guaranteed to still be 
valid.

I assume that's why Michal implemented it this way and added the comment 
to the lmk in commit 49550b605587 ("oom: add helpers for setting and 
clearing TIF_MEMDIE") to avoid TIF_MEMDIE entirely there.

> ---
>  mm/oom_kill.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b2f081fe4b1a..4b9547be9170 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -435,8 +435,7 @@ void mark_oom_victim(struct task_struct *tsk)
>   */
>  void exit_oom_victim(void)
>  {
> -	if (!test_and_clear_thread_flag(TIF_MEMDIE))
> -		return;
> +	clear_thread_flag(TIF_MEMDIE);
>  
>  	down_read(&oom_sem);
>  	/*

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces
  2015-03-25  6:17 ` [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces Johannes Weiner
@ 2015-03-26  3:34   ` David Rientjes
  2015-03-26 11:54   ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: David Rientjes @ 2015-03-26  3:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Michal Hocko, Theodore Ts'o

On Wed, 25 Mar 2015, Johannes Weiner wrote:

> Rename unmark_oom_victim() to exit_oom_victim().  Marking and
> unmarking are related in functionality, but the interface is not
> symmetrical at all: one is an internal OOM killer function used during
> the killing, the other is for an OOM victim to signal its own death on
> exit later on.  This has locking implications, see follow-up changes.
> 
> While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
> is easier on the eye.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David Rientjes <rientjes@google.com>

I like exit_oom_victim() better since it follows the format of other 
do_exit() functions.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear
  2015-03-26  3:31   ` David Rientjes
@ 2015-03-26 11:05     ` Johannes Weiner
  2015-03-26 19:50       ` David Rientjes
  0 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 11:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Michal Hocko, Theodore Ts'o

Hi David,

On Wed, Mar 25, 2015 at 08:31:49PM -0700, David Rientjes wrote:
> On Wed, 25 Mar 2015, Johannes Weiner wrote:
> 
> > exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody
> > else can clear it concurrently.  Use clear_thread_flag() directly.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> For the oom killer, that's true because of task_lock(): we always only set 
> TIF_MEMDIE when there is a valid p->mm and it's cleared in the exit path 
> after the unlock, acting as a barrier, when p->mm is set to NULL so it's 
> no longer a valid victim.  So that part is fine.
> 
> The problem is the android low memory killer that does 
> mark_tsk_oom_victim() without the protection of task_lock(), it's just rcu 
> protected so the reference to the task itself is guaranteed to still be 
> valid.

But this is about *setting* it without a lock.  My point was that once
TIF_MEMDIE is actually set, the task owns it and nobody else can clear
it for them, so it's safe to test and clear non-atomically from the
task's own context.  Am I missing something?
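
To illustrate the distinction being argued here, below is a rough user-space sketch using C11 atomics. The bit position and helper names are made up for illustration and are not the kernel's implementation: the atomic read-modify-write variant is only needed when another context might clear the bit concurrently; once the invariant holds that only the owning task ever clears it, a plain clear is enough.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TIF_MEMDIE (1UL << 0)	/* made-up bit position for illustration */

static _Atomic unsigned long thread_flags;

static void set_flag(unsigned long flag)
{
	atomic_fetch_or(&thread_flags, flag);
}

/* atomic read-modify-write: needed only if some other context might
 * clear the bit concurrently, so the "was it set?" answer stays exact */
static bool test_and_clear_flag(unsigned long flag)
{
	return atomic_fetch_and(&thread_flags, ~flag) & flag;
}

/* plain clear: sufficient when the owning task is the only one that
 * ever clears the bit, which is the invariant discussed above */
static void clear_flag(unsigned long flag)
{
	atomic_fetch_and(&thread_flags, ~flag);
}
```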

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-25 14:15   ` Tetsuo Handa
  2015-03-25 17:01     ` Vlastimil Babka
@ 2015-03-26 11:24     ` Johannes Weiner
  2015-03-26 14:32       ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 11:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, linux-fsdevel, linux-kernel, torvalds, akpm,
	ying.huang, aarcange, david, mhocko, tytso

On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5cfda39b3268..e066ac7353a4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		killed = 1;
> >  	}
> >  out:
> > +	if (test_thread_flag(TIF_MEMDIE))
> > +		return true;
> >  	/*
> > -	 * Give the killed threads a good chance of exiting before trying to
> > -	 * allocate memory again.
> > +	 * Wait for any outstanding OOM victims to die.  In rare cases
> > +	 * victims can get stuck behind the allocating tasks, so the
> > +	 * wait needs to be bounded.  It's crude alright, but cheaper
> > +	 * than keeping a global dependency tree between all tasks.
> >  	 */
> > -	if (killed)
> > -		schedule_timeout_killable(1);
> > +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
> >  
> >  	return true;
> >  }
> 
> out_of_memory() returning true with a bounded wait effectively means that
> we wait forever without choosing subsequent OOM victims when the first OOM
> victim fails to die. The system will lock up, won't it?

The OOM killer already refuses to choose another victim as long as the
first one hasn't exited, see oom_scan_process_thread().  That's why
later patches in this series introduce a reserve for OOM-killing tasks
and give nofail allocations access to emergency reserves, in case they
themselves prevent that single OOM victim from exiting.  But otherwise
victims should be exiting eventually.

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c1224ba45548..9ce9c4c083a0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> >  }
> >  
> >  static inline struct page *
> > -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	const struct alloc_context *ac, unsigned long *did_some_progress)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> >  
> >  	*did_some_progress = 0;
> >  
> >  	/*
> > -	 * Acquire the oom lock.  If that fails, somebody else is
> > -	 * making progress for us.
> > +	 * This allocating task can become the OOM victim itself at
> > +	 * any point before acquiring the lock.  In that case, exit
> > +	 * quickly and don't block on the lock held by another task
> > +	 * waiting for us to exit.
> >  	 */
> > -	if (!mutex_trylock(&oom_lock)) {
> > -		*did_some_progress = 1;
> > -		schedule_timeout_uninterruptible(1);
> > -		return NULL;
> > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		goto alloc;
> >  	}
> 
> When a thread group has 1000 threads and most of them are doing memory allocation
> requests, all of them will get fatal_signal_pending() == true when one of them is
> chosen by the OOM killer.
> This code will allow most of them to access memory reserves, won't it?

Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
any dying thread.  Thanks, I'll fix it in v2.

> > @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (gfp_mask & __GFP_THISNODE)
> >  			goto out;
> >  	}
> > -	/* Exhausted what can be done so it's blamo time */
> > -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> > -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > +
> > +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
> >  		*did_some_progress = 1;
> > +	} else {
> > +		/* Oops, these shouldn't happen with the OOM killer disabled */
> > +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > +			*did_some_progress = 1;
> > +	}
> 
> I think GFP_NOFAIL allocations need to involve the OOM killer rather than
> pretending that forward progress is made. If all in-flight
> allocation requests are GFP_NOFAIL, the system will lock up.

Hm?  They do involve the OOM killer, but once userspace is frozen for
suspend/hibernate we shouldn't kill and thaw random tasks anymore as
that might corrupt the memory snapshot, so nofail allocations are a
bug at this point.

> After all, if we wait for OOM killer progress before retrying, I think
> we should involve the OOM killer after some bounded timeout regardless of
> gfp flags, and let the OOM killer kill more threads after another bounded
> timeout. Otherwise, the corner cases will lock up the system.

Giving nofail allocations access to emergency reserves targets this
problem, but I agree with you that it's still possible for the system
to lock up if they have been consumed and still no task made enough
forward progress to release memory.  It is unlikely but possible.

I will probably come back to the OOM victim timeout patch some time in
the future as that seems more robust.  It would also drastically
simplify memcg OOM handling.  But that patch was controversial in the
past and seemed beyond the scope of this patch set.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-25 17:01     ` Vlastimil Babka
@ 2015-03-26 11:28       ` Johannes Weiner
  0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 11:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, linux-kernel, torvalds,
	akpm, ying.huang, aarcange, david, mhocko, tytso

On Wed, Mar 25, 2015 at 06:01:48PM +0100, Vlastimil Babka wrote:
> On 03/25/2015 03:15 PM, Tetsuo Handa wrote:
> >Johannes Weiner wrote:
> >>diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >>index 5cfda39b3268..e066ac7353a4 100644
> >>--- a/mm/oom_kill.c
> >>+++ b/mm/oom_kill.c
> >>@@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >>  		killed = 1;
> >>  	}
> >>  out:
> >>+	if (test_thread_flag(TIF_MEMDIE))
> >>+		return true;
> >>  	/*
> >>-	 * Give the killed threads a good chance of exiting before trying to
> >>-	 * allocate memory again.
> >>+	 * Wait for any outstanding OOM victims to die.  In rare cases
> >>+	 * victims can get stuck behind the allocating tasks, so the
> >>+	 * wait needs to be bounded.  It's crude alright, but cheaper
> >>+	 * than keeping a global dependency tree between all tasks.
> >>  	 */
> >>-	if (killed)
> >>-		schedule_timeout_killable(1);
> >>+	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
> >>
> >>  	return true;
> >>  }
> >
> >out_of_memory() returning true with a bounded wait effectively means that
> >we wait forever without choosing subsequent OOM victims when the first OOM
> >victim fails to die. The system will lock up, won't it?
> 
> And after patch 12, does this mean that you may not be waiting long enough
> for the victim to die, before you fail the allocation, prematurely? I can
> imagine there would be situations where the victim is not deadlocked, but
> still take more than HZ to finish, no?

Arguably it's reasonable to fail allocations once the OOM victim has
been stuck for over a second and the OOM reserves have been depleted.

On the other hand, we don't need to play it that tight, because that
timeout only targets the victim-blocked-on-alloc situations, which
aren't all that common.  Something like 5 seconds should still
be okay.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
  2015-03-26  0:51   ` David Rientjes
@ 2015-03-26 11:43   ` Michal Hocko
  2015-03-26 20:05   ` David Rientjes
  2 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 11:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:05, Johannes Weiner wrote:
> Setting oom_killer_disabled to false is atomic, there is no need for
> further synchronization with ongoing allocations trying to OOM-kill.

True, races with ongoing allocations are not harmful.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/oom_kill.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 2b665da1b3c9..73763e489e86 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -488,9 +488,7 @@ bool oom_killer_disable(void)
>   */
>  void oom_killer_enable(void)
>  {
> -	down_write(&oom_sem);
>  	oom_killer_disabled = false;
> -	up_write(&oom_sem);
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> -- 
> 2.3.3
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-26  0:51   ` David Rientjes
@ 2015-03-26 11:51     ` Michal Hocko
  2015-03-26 13:18       ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 11:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Theodore Ts'o

On Wed 25-03-15 17:51:31, David Rientjes wrote:
> On Wed, 25 Mar 2015, Johannes Weiner wrote:
> 
> > Setting oom_killer_disabled to false is atomic, there is no need for
> > further synchronization with ongoing allocations trying to OOM-kill.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/oom_kill.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 2b665da1b3c9..73763e489e86 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -488,9 +488,7 @@ bool oom_killer_disable(void)
> >   */
> >  void oom_killer_enable(void)
> >  {
> > -	down_write(&oom_sem);
> >  	oom_killer_disabled = false;
> > -	up_write(&oom_sem);
> >  }
> >  
> >  #define K(x) ((x) << (PAGE_SHIFT-10))
> 
> I haven't looked through the new disable-oom-killer-for-pm patchset that 
> was merged, but this oom_killer_disabled thing already looks improperly 
> handled.  I think any correctness or cleanups in this area would be very 
> helpful.
> 
> I think mark_tsk_oom_victim() in mem_cgroup_out_of_memory() is just 
> luckily not racing with a call to oom_killer_enable() and triggering the 
                                    ^^^^^^^^^^
                                    oom_killer_disable?

> WARN_ON(oom_killer_disabled) since there's no "oom_sem" held here, and 
> it's an improper context based on the comment of mark_tsk_oom_victim().

The OOM killer is disabled only _after_ all user tasks have been frozen,
so we cannot get any page faults, and hence no race. The semaphore is
therefore not needed in this path, although the comment says otherwise.
I can add a comment clarifying this...
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f2017e37..20828ecaf3ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1536,6 +1536,11 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * quickly exit and free its memory.
 	 */
 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+		/*
+		 * We do not hold oom_sem in this path because we know
+		 * we cannot race with oom_killer_disable(). No user runnable
+		 * tasks are allowed at the time oom_killer_disable() is called.
+		 */
 		mark_tsk_oom_victim(current);
 		return;
 	}

> There might be something else that is intended but not implemented 
> correctly that I'm unaware of, but I know of no reason why setting of 
> oom_killer_disabled would need to take a semaphore?
> 
> I'm thinking it has something to do with the remainder of that comment, 
> specifically the "never after oom has been disabled already."
> 
> Michal?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces
  2015-03-25  6:17 ` [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces Johannes Weiner
  2015-03-26  3:34   ` David Rientjes
@ 2015-03-26 11:54   ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 11:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:06, Johannes Weiner wrote:
> Rename unmark_oom_victim() to exit_oom_victim().  Marking and
> unmarking are related in functionality, but the interface is not
> symmetrical at all: one is an internal OOM killer function used during
> the killing, the other is for an OOM victim to signal its own death on
> exit later on.  This has locking implications, see follow-up changes.
> 
> While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
> is easier on the eye.

The reason _tsk_ was used was to stay in sync with the *_tsk_thread_flag() API.
I do not mind changing that though.

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  drivers/staging/android/lowmemorykiller.c |  2 +-
>  include/linux/oom.h                       |  7 ++++---
>  kernel/exit.c                             |  2 +-
>  mm/memcontrol.c                           |  2 +-
>  mm/oom_kill.c                             | 16 +++++++---------
>  5 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index feafa172b155..2345ee7342d9 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -165,7 +165,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
>  		 * infrastructure. There is no real reason why the selected
>  		 * task should have access to the memory reserves.
>  		 */
> -		mark_tsk_oom_victim(selected);
> +		mark_oom_victim(selected);
>  		send_sig(SIGKILL, selected, 0);
>  		rem += selected_tasksize;
>  	}
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 44b2f6f7bbd8..a8e6a498cbcb 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -47,9 +47,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
>  	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
>  }
>  
> -extern void mark_tsk_oom_victim(struct task_struct *tsk);
> -
> -extern void unmark_oom_victim(void);
> +extern void mark_oom_victim(struct task_struct *tsk);
>  
>  extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> @@ -75,6 +73,9 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  
>  extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *mask, bool force_kill);
> +
> +extern void exit_oom_victim(void);
> +
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
>  
> diff --git a/kernel/exit.c b/kernel/exit.c
> index feff10bbb307..4089c2fd373e 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -436,7 +436,7 @@ static void exit_mm(struct task_struct *tsk)
>  	mm_update_next_owner(mm);
>  	mmput(mm);
>  	if (test_thread_flag(TIF_MEMDIE))
> -		unmark_oom_victim();
> +		exit_oom_victim();
>  }
>  
>  static struct task_struct *find_alive_thread(struct task_struct *p)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 74a9641d8f9f..aab5604e0ac4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1536,7 +1536,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * quickly exit and free its memory.
>  	 */
>  	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> -		mark_tsk_oom_victim(current);
> +		mark_oom_victim(current);
>  		return;
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 73763e489e86..b2f081fe4b1a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -408,13 +408,13 @@ bool oom_killer_disabled __read_mostly;
>  static DECLARE_RWSEM(oom_sem);
>  
>  /**
> - * mark_tsk_oom_victim - marks the given task as OOM victim.
> + * mark_oom_victim - mark the given task as OOM victim
>   * @tsk: task to mark
>   *
>   * Has to be called with oom_sem taken for read and never after
>   * oom has been disabled already.
>   */
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +void mark_oom_victim(struct task_struct *tsk)
>  {
>  	WARN_ON(oom_killer_disabled);
>  	/* OOM killer might race with memcg OOM */
> @@ -431,11 +431,9 @@ void mark_tsk_oom_victim(struct task_struct *tsk)
>  }
>  
>  /**
> - * unmark_oom_victim - unmarks the current task as OOM victim.
> - *
> - * Wakes up all waiters in oom_killer_disable()
> + * exit_oom_victim - note the exit of an OOM victim
>   */
> -void unmark_oom_victim(void)
> +void exit_oom_victim(void)
>  {
>  	if (!test_and_clear_thread_flag(TIF_MEMDIE))
>  		return;
> @@ -515,7 +513,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 */
>  	task_lock(p);
>  	if (p->mm && task_will_free_mem(p)) {
> -		mark_tsk_oom_victim(p);
> +		mark_oom_victim(p);
>  		task_unlock(p);
>  		put_task_struct(p);
>  		return;
> @@ -570,7 +568,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  
>  	/* mm cannot safely be dereferenced after task_unlock(victim) */
>  	mm = victim->mm;
> -	mark_tsk_oom_victim(victim);
> +	mark_oom_victim(victim);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> @@ -728,7 +726,7 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 */
>  	if (current->mm &&
>  	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> -		mark_tsk_oom_victim(current);
> +		mark_oom_victim(current);
>  		return;
>  	}
>  
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear
  2015-03-25  6:17 ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Johannes Weiner
  2015-03-26  3:31   ` David Rientjes
@ 2015-03-26 11:57   ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 11:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:07, Johannes Weiner wrote:
> exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody
> else can clear it concurrently.  Use clear_thread_flag() directly.

Yeah. This is a leftover from the review process. I originally did the
unmarking unconditionally, but Tejun suggested calling test_thread_flag()
before calling here. So dropping the test_and_clear is safe here.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/oom_kill.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b2f081fe4b1a..4b9547be9170 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -435,8 +435,7 @@ void mark_oom_victim(struct task_struct *tsk)
>   */
>  void exit_oom_victim(void)
>  {
> -	if (!test_and_clear_thread_flag(TIF_MEMDIE))
> -		return;
> +	clear_thread_flag(TIF_MEMDIE);
>  
>  	down_read(&oom_sem);
>  	/*
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim()
  2015-03-25  6:17 ` [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim() Johannes Weiner
@ 2015-03-26 12:53   ` Michal Hocko
  2015-03-26 13:01     ` Michal Hocko
  2015-03-26 15:04     ` Johannes Weiner
  0 siblings, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 12:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:08, Johannes Weiner wrote:
> Disabling the OOM killer needs to exclude allocators from entering,
> not existing victims from exiting.

The idea was that exit_oom_victim doesn't miss a waiter.

exit_oom_victim is doing
	atomic_dec_return(&oom_victims) && oom_killer_disabled)

so there is a full (implicit) memory barrier before the oom_killer_disabled
check. The other part is trickier. oom_killer_disable does:
	oom_killer_disabled = true;
        up_write(&oom_sem);

        wait_event(oom_victims_wait, !atomic_read(&oom_victims));

up_write doesn't guarantee a full memory barrier AFAICS in
Documentation/memory-barriers.txt (although the generic and x86
implementations seem to implement it as a full barrier), but wait_event
implies a full memory barrier (prepare_to_wait_event does a spin
lock&unlock) before checking the condition in the slow path. This should
be sufficient and documented...

	/*
	 * We do not need to hold oom_sem here because oom_killer_disable
	 * guarantees that the oom_killer_disabled change is visible before
	 * the waiter is put into sleep (prepare_to_wait_event) so
	 * we cannot miss a wake up.
	 */

in unmark_oom_victim()
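
As a rough user-space sketch of the ordering being described, the waker publishes its decrement (a full barrier) before checking the disabled flag, and the waiter publishes the flag before re-checking the condition under the queue lock, which stands in for prepare_to_wait_event()'s implied barrier. All names here are illustrative stand-ins, not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static atomic_bool disabled;
static atomic_int victims;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;

/* exit_oom_victim() analogue: atomic_fetch_sub is sequentially
 * consistent, so the disabled check cannot be hoisted above it */
static void victim_exits(void)
{
	if (atomic_fetch_sub(&victims, 1) - 1 == 0 &&
	    atomic_load(&disabled)) {
		pthread_mutex_lock(&q_lock);
		pthread_cond_broadcast(&q_cond);
		pthread_mutex_unlock(&q_lock);
	}
}

static void *victim_thread(void *arg)
{
	(void)arg;
	usleep(50 * 1000);
	victim_exits();
	return NULL;
}

/* oom_killer_disable() analogue: set disabled first, then re-check the
 * victim count under the queue lock before sleeping, so a wakeup that
 * races with going to sleep can never be lost */
static void disable_and_wait(void)
{
	atomic_store(&disabled, true);
	pthread_mutex_lock(&q_lock);
	while (atomic_load(&victims) != 0)
		pthread_cond_wait(&q_cond, &q_lock);
	pthread_mutex_unlock(&q_lock);
}
```

Whichever way the race goes, either the waiter observes victims == 0 before sleeping, or the waker observes disabled == true and issues the wakeup under the same lock the waiter sleeps on.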

> Right now the only waiter is suspend code, which achieves quiescence
> by disabling the OOM killer.  But later on we want to add waits that
> hold the lock instead to stop new victims from showing up.

It is not entirely clear what you mean by this from the current context.
exit_oom_victim is not called from any context which would be locked by
any OOM internals so it should be safe to use the locking.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I have nothing against the change as it seems correct but it would be
good to get a better clarification and also document the implicit memory
barriers.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/oom_kill.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4b9547be9170..88aa9ba40fa5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -437,14 +437,12 @@ void exit_oom_victim(void)
>  {
>  	clear_thread_flag(TIF_MEMDIE);
>  
> -	down_read(&oom_sem);
>  	/*
>  	 * There is no need to signal the lasst oom_victim if there
>  	 * is nobody who cares.
>  	 */
>  	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
>  		wake_up_all(&oom_victims_wait);
> -	up_read(&oom_sem);
>  }
>  
>  /**
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim()
  2015-03-26 12:53   ` Michal Hocko
@ 2015-03-26 13:01     ` Michal Hocko
  2015-03-26 15:10       ` Johannes Weiner
  2015-03-26 15:04     ` Johannes Weiner
  1 sibling, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 13:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu 26-03-15 13:53:48, Michal Hocko wrote:
> On Wed 25-03-15 02:17:08, Johannes Weiner wrote:
> > Disabling the OOM killer needs to exclude allocators from entering,
> > not existing victims from exiting.
> 
> The idea was that exit_oom_victim doesn't miss a waiter.
> 
> exit_oom_victim is doing
> 	atomic_dec_return(&oom_victims) && oom_killer_disabled)
> 
> so there is a full (implicit) memory barrier befor oom_killer_disabled
> check. The other part is trickier. oom_killer_disable does:
> 	oom_killer_disabled = true;
>         up_write(&oom_sem);
> 
>         wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> 
> up_write doesn't guarantee a full memory barrier AFAICS in
> Documentation/memory-barriers.txt (although the generic and x86
> implementations seem to implement it as a full barrier) but wait_event
> implies the full memory barrier (prepare_to_wait_event does spin
> lock&unlock) before checking the condition in the slow path. This should
> be sufficient and docummented...
> 
> 	/*
> 	 * We do not need to hold oom_sem here because oom_killer_disable
> 	 * guarantees that oom_killer_disabled chage is visible before
> 	 * the waiter is put into sleep (prepare_to_wait_event) so
> 	 * we cannot miss a wake up.
> 	 */
> 
> in unmark_oom_victim()

OK, I can see that the next patch removes oom_killer_disabled
completely. The dependency won't be there, and neither will the concerns
about the memory barriers.

Is there any reason why the ordering is done this way? The opposite
ordering would seem more logical to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 05/12] mm: oom_kill: generalize OOM progress waitqueue
  2015-03-25  6:17 ` [patch 05/12] mm: oom_kill: generalize OOM progress waitqueue Johannes Weiner
@ 2015-03-26 13:03   ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 13:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:09, Johannes Weiner wrote:
> It turns out that the mechanism to wait for exiting OOM victims is
> less generic than it looks: it won't issue wakeups unless the OOM
> killer is disabled.
> 
> The reason this check was added was the thought that, since only the
> OOM disabling code would wait on this queue, wakeup operations could
> be saved when that specific consumer is known to be absent.
> 
> However, this is quite the handgrenade.  Later attempts to reuse the
> waitqueue for other purposes will lead to completely unexpected bugs
> and the failure mode will appear seemingly illogical.  Generally,
> providers shouldn't make unnecessary assumptions about consumers.
> 
> This could have been replaced with waitqueue_active(), but it only
> saves a few instructions in one of the coldest paths in the kernel.
> Simply remove it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/oom_kill.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 88aa9ba40fa5..d3490b019d46 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -437,11 +437,7 @@ void exit_oom_victim(void)
>  {
>  	clear_thread_flag(TIF_MEMDIE);
>  
> -	/*
> -	 * There is no need to signal the lasst oom_victim if there
> -	 * is nobody who cares.
> -	 */
> -	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +	if (!atomic_dec_return(&oom_victims))
>  		wake_up_all(&oom_victims_wait);
>  }
>  
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-26 11:51     ` Michal Hocko
@ 2015-03-26 13:18       ` Michal Hocko
  2015-03-26 19:30         ` David Rientjes
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 13:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Theodore Ts'o

On Thu 26-03-15 12:51:40, Michal Hocko wrote:
> On Wed 25-03-15 17:51:31, David Rientjes wrote:
> > On Wed, 25 Mar 2015, Johannes Weiner wrote:
> > 
> > > Setting oom_killer_disabled to false is atomic, there is no need for
> > > further synchronization with ongoing allocations trying to OOM-kill.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/oom_kill.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > > 
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 2b665da1b3c9..73763e489e86 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -488,9 +488,7 @@ bool oom_killer_disable(void)
> > >   */
> > >  void oom_killer_enable(void)
> > >  {
> > > -	down_write(&oom_sem);
> > >  	oom_killer_disabled = false;
> > > -	up_write(&oom_sem);
> > >  }
> > >  
> > >  #define K(x) ((x) << (PAGE_SHIFT-10))
> > 
> > I haven't looked through the new disable-oom-killer-for-pm patchset that 
> > was merged, but this oom_killer_disabled thing already looks improperly 
> > handled.  I think any correctness or cleanups in this area would be very 
> > helpful.
> > 
> > I think mark_tsk_oom_victim() in mem_cgroup_out_of_memory() is just 
> > luckily not racing with a call to oom_killer_enable() and triggering the 
>                                     ^^^^^^^^^^
>                                     oom_killer_disable?
> 
> > WARN_ON(oom_killer_disabled) since there's no "oom_sem" held here, and 
> > it's an improper context based on the comment of mark_tsk_oom_victim().
> 
> The OOM killer is disabled only _after_ all user tasks have been
> frozen, so we cannot get any page fault, and thus no race. The
> semaphore is therefore not needed in this path, although the comment
> says otherwise. I can add a comment clarifying this...

I am wrong here! pagefault_out_of_memory takes the lock and so the whole
mem_cgroup_out_of_memory is called under the same lock.
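For context, the call chain in question looks roughly like this (a sketch from reading the code as of this series, not a patch):

```
pagefault_out_of_memory()
    down_read(&oom_sem)                 /* takes the lock */
    mem_cgroup_oom_synchronize(true)
        mem_cgroup_out_of_memory()
            mark_tsk_oom_victim()       /* so this runs under oom_sem */
    up_read(&oom_sem)
```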
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 06/12] mm: oom_kill: simplify OOM killer locking
  2015-03-25  6:17 ` [patch 06/12] mm: oom_kill: simplify OOM killer locking Johannes Weiner
@ 2015-03-26 13:31   ` Michal Hocko
  2015-03-26 15:17     ` Johannes Weiner
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 13:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:10, Johannes Weiner wrote:
> The zonelist locking and the oom_sem are two overlapping locks that
> are used to serialize global OOM killing against different things.
> 
> The historical zonelist locking serializes OOM kills from allocations
> with overlapping zonelists against each other to prevent killing more
> tasks than necessary in the same memory domain.  Only when neither
> tasklists nor zonelists from two concurrent OOM kills overlap (tasks
> in separate memcgs bound to separate nodes) are OOM kills allowed to
> execute in parallel.
> 
> The younger oom_sem is a read-write lock to serialize OOM killing
> against the PM code trying to disable the OOM killer altogether.
> 
> However, the OOM killer is a fairly cold error path, there is really
> no reason to optimize for highly performant and concurrent OOM kills.
> And the oom_sem is just flat-out redundant.
> 
> Replace both locking schemes with a single global mutex serializing
> OOM kills regardless of context.

OK, this is much simpler.

You have missed drivers/tty/sysrq.c, which should take the lock as
well. ZONE_OOM_LOCKED can also be removed, and __out_of_memory in the
kerneldoc comment should be renamed.

[...]
> @@ -795,27 +728,21 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>   */
>  void pagefault_out_of_memory(void)
>  {
> -	struct zonelist *zonelist;
> -
> -	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		goto unlock;
> +		return;

OK, so we are back to what David has asked previously. We do not need
the lock for memcg and oom_killer_disabled because we know that no tasks
(except for potential oom victim) are lurking around at the time
oom_killer_disable() is called. So I guess we want to stick a comment
into mem_cgroup_oom_synchronize before we check for oom_killer_disabled.

After those are fixed, feel free to add
Acked-by: Michal Hocko <mhocko@suse.cz>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 07/12] mm: page_alloc: inline should_alloc_retry()
  2015-03-25  6:17 ` [patch 07/12] mm: page_alloc: inline should_alloc_retry() Johannes Weiner
@ 2015-03-26 14:11   ` Michal Hocko
  2015-03-26 15:18     ` Johannes Weiner
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 14:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:11, Johannes Weiner wrote:
> The should_alloc_retry() function was meant to encapsulate retry
> conditions of the allocator slowpath, but there are still checks
> remaining in the main function, and much of how the retrying is
> performed also depends on the OOM killer progress.  The physical
> separation of those conditions makes the code hard to follow.
> 
> Inline the should_alloc_retry() checks.  Notes:
> 
> - The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
>   replace it with looping on OOM killer progress
> 
> - The pm_suspended_storage() check is meant to skip the OOM killer
>   when reclaim has no IO available, move to __alloc_pages_may_oom()
> 
> - The order < PAGE_ALLOC_COSTLY order is re-united with its original
>   counterpart of checking whether reclaim actually made any progress

it should be order <= PAGE_ALLOC_COSTLY_ORDER
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The resulting code looks much better and logical.

After the COSTLY check is fixed.
Acked-by: Michal Hocko <mhocko@suse.cz>
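As an aside for readers, the retry condition that results from the inlining, with the `<=` fix requested above applied, can be modeled in userspace. The flag and costly-order values below are stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3	/* stand-in value */
#define __GFP_REPEAT		0x400u	/* stand-in value */

/*
 * Low-order allocations retry as long as reclaim makes any progress;
 * __GFP_REPEAT allocations retry until an order's worth of pages has
 * been reclaimed in total.
 */
static bool should_retry(unsigned int gfp_mask, unsigned int order,
			 unsigned long did_some_progress,
			 unsigned long pages_reclaimed)
{
	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1UL << order))
		return true;
	return false;
}
```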

> ---
>  mm/page_alloc.c | 104 +++++++++++++++++---------------------------------------
>  1 file changed, 32 insertions(+), 72 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ebc760187ac..c1224ba45548 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2329,48 +2329,6 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  		show_mem(filter);
>  }
>  
> -static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> -				unsigned long did_some_progress,
> -				unsigned long pages_reclaimed)
> -{
> -	/* Do not loop if specifically requested */
> -	if (gfp_mask & __GFP_NORETRY)
> -		return 0;
> -
> -	/* Always retry if specifically requested */
> -	if (gfp_mask & __GFP_NOFAIL)
> -		return 1;
> -
> -	/*
> -	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
> -	 * making forward progress without invoking OOM. Suspend also disables
> -	 * storage devices so kswapd will not help. Bail if we are suspending.
> -	 */
> -	if (!did_some_progress && pm_suspended_storage())
> -		return 0;
> -
> -	/*
> -	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> -	 * means __GFP_NOFAIL, but that may not be true in other
> -	 * implementations.
> -	 */
> -	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> -		return 1;
> -
> -	/*
> -	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> -	 * specified, then we retry until we no longer reclaim any pages
> -	 * (above), or we've reclaimed an order of pages at least as
> -	 * large as the allocation's order. In both cases, if the
> -	 * allocation still fails, we stop retrying.
> -	 */
> -	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> -		return 1;
> -
> -	return 0;
> -}
> -
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
> @@ -2409,16 +2367,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		/* The OOM killer does not needlessly kill tasks for lowmem */
>  		if (ac->high_zoneidx < ZONE_NORMAL)
>  			goto out;
> -		/* The OOM killer does not compensate for light reclaim */
> +		/* The OOM killer does not compensate for IO-less reclaim */
>  		if (!(gfp_mask & __GFP_FS)) {
>  			/*
>  			 * XXX: Page reclaim didn't yield anything,
>  			 * and the OOM killer can't be invoked, but
> -			 * keep looping as per should_alloc_retry().
> +			 * keep looping as per tradition.
>  			 */
>  			*did_some_progress = 1;
>  			goto out;
>  		}
> +		if (pm_suspended_storage())
> +			goto out;
>  		/* The OOM killer may not free memory on a specific node */
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
> @@ -2801,40 +2761,40 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (page)
>  		goto got_pg;
>  
> -	/* Check if we should retry the allocation */
> +	/* Do not loop if specifically requested */
> +	if (gfp_mask & __GFP_NORETRY)
> +		goto noretry;
> +
> +	/* Keep reclaiming pages as long as there is reasonable progress */
>  	pages_reclaimed += did_some_progress;
> -	if (should_alloc_retry(gfp_mask, order, did_some_progress,
> -						pages_reclaimed)) {
> -		/*
> -		 * If we fail to make progress by freeing individual
> -		 * pages, but the allocation wants us to keep going,
> -		 * start OOM killing tasks.
> -		 */
> -		if (!did_some_progress) {
> -			page = __alloc_pages_may_oom(gfp_mask, order, ac,
> -							&did_some_progress);
> -			if (page)
> -				goto got_pg;
> -			if (!did_some_progress)
> -				goto nopage;
> -		}
> +	if ((did_some_progress && order < PAGE_ALLOC_COSTLY_ORDER) ||
> +	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
>  		/* Wait for some write requests to complete then retry */
>  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
>  		goto retry;
> -	} else {
> -		/*
> -		 * High-order allocations do not necessarily loop after
> -		 * direct reclaim and reclaim/compaction depends on compaction
> -		 * being called after reclaim so call directly if necessary
> -		 */
> -		page = __alloc_pages_direct_compact(gfp_mask, order,
> -					alloc_flags, ac, migration_mode,
> -					&contended_compaction,
> -					&deferred_compaction);
> -		if (page)
> -			goto got_pg;
>  	}
>  
> +	/* Reclaim has failed us, start killing things */
> +	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> +	if (page)
> +		goto got_pg;
> +
> +	/* Retry as long as the OOM killer is making progress */
> +	if (did_some_progress)
> +		goto retry;
> +
> +noretry:
> +	/*
> +	 * High-order allocations do not necessarily loop after
> +	 * direct reclaim and reclaim/compaction depends on compaction
> +	 * being called after reclaim so call directly if necessary
> +	 */
> +	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> +					    ac, migration_mode,
> +					    &contended_compaction,
> +					    &deferred_compaction);
> +	if (page)
> +		goto got_pg;
>  nopage:
>  	warn_alloc_failed(gfp_mask, order, NULL);
>  got_pg:
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-26 11:24     ` Johannes Weiner
@ 2015-03-26 14:32       ` Michal Hocko
  2015-03-26 15:23         ` Johannes Weiner
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 14:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, linux-kernel, torvalds,
	akpm, ying.huang, aarcange, david, tytso

On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > Johannes Weiner wrote:
[...]
> > >  	/*
> > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > -	 * making progress for us.
> > > +	 * This allocating task can become the OOM victim itself at
> > > +	 * any point before acquiring the lock.  In that case, exit
> > > +	 * quickly and don't block on the lock held by another task
> > > +	 * waiting for us to exit.
> > >  	 */
> > > -	if (!mutex_trylock(&oom_lock)) {
> > > -		*did_some_progress = 1;
> > > -		schedule_timeout_uninterruptible(1);
> > > -		return NULL;
> > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > +		goto alloc;
> > >  	}
> > 
> > When a thread group has 1000 threads and most of them are doing memory
> > allocation requests, all of them will get fatal_signal_pending() == true
> > when one of them is chosen by the OOM killer.
> > This code will allow most of them to access memory reserves, won't it?
> 
> Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> any dying thread.  Thanks, I'll fix it in v2.

Do you plan to post this v2 here for review?

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 11/12] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
  2015-03-25  6:17 ` [patch 11/12] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM Johannes Weiner
@ 2015-03-26 14:50   ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 14:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:15, Johannes Weiner wrote:
> GFP_NOFS allocations are not allowed to invoke the OOM killer since
> their reclaim abilities are severely diminished.  However, without the
> OOM killer available there is no hope of progress once the reclaimable
> pages have been exhausted.
> 
> Don't risk hanging these allocations.  Leave it to the allocation site
> to implement the fallback policy for failing allocations.

I fully support this. We need at least
http://marc.info/?l=linux-mm&m=142669354424905&w=2 for this to work
properly, which I am planning to post soon.

I am not sure the RO remount issues in ext4, seen in the previous round
of a similar change, have been addressed yet.

So it might be safer to route this separately from the previous OOM
enhancements.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/page_alloc.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 832ad1c7cd4f..9e45e97aa934 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,15 +2367,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		if (ac->high_zoneidx < ZONE_NORMAL)
>  			goto out;
>  		/* The OOM killer does not compensate for IO-less reclaim */
> -		if (!(gfp_mask & __GFP_FS)) {
> -			/*
> -			 * XXX: Page reclaim didn't yield anything,
> -			 * and the OOM killer can't be invoked, but
> -			 * keep looping as per tradition.
> -			 */
> -			*did_some_progress = 1;
> +		if (!(gfp_mask & __GFP_FS))
>  			goto out;
> -		}
>  		if (pm_suspended_storage())
>  			goto out;
>  		/* The OOM killer may not free memory on a specific node */
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim()
  2015-03-26 12:53   ` Michal Hocko
  2015-03-26 13:01     ` Michal Hocko
@ 2015-03-26 15:04     ` Johannes Weiner
  1 sibling, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 15:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu, Mar 26, 2015 at 01:53:48PM +0100, Michal Hocko wrote:
> On Wed 25-03-15 02:17:08, Johannes Weiner wrote:
> > Right now the only waiter is suspend code, which achieves quiescence
> > by disabling the OOM killer.  But later on we want to add waits that
> > hold the lock instead to stop new victims from showing up.
> 
> It is not entirely clear what you mean by this from the current context.
> exit_oom_victim is not called from any context which would be locked by
> any OOM internals so it should be safe to use the locking.

A later patch will add another wait_event() to wait for oom victims to
drop to zero.  But that new consumer won't be disabling the OOM killer
to prevent new victims from showing up, it will just hold the lock to
exclude OOM kills.  So the exiting victims shouldn't get stuck on that
lock which the guy that is waiting for them is holding.
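In other words, the later consumer would look roughly like this (a pseudocode sketch of the intended usage, not a patch from this series):

```
/* waiter: excludes new OOM kills, then waits out existing victims */
mutex_lock(&oom_lock);
wait_event(oom_victims_wait, !atomic_read(&oom_victims));
/* ... no OOM kills in flight here ... */
mutex_unlock(&oom_lock);

/* exiting victim: only wakes the waiter, must never take oom_lock,
   or it would deadlock against the waiter that holds it */
exit_oom_victim();
```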

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim()
  2015-03-26 13:01     ` Michal Hocko
@ 2015-03-26 15:10       ` Johannes Weiner
  0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu, Mar 26, 2015 at 02:01:06PM +0100, Michal Hocko wrote:
> On Thu 26-03-15 13:53:48, Michal Hocko wrote:
> > On Wed 25-03-15 02:17:08, Johannes Weiner wrote:
> > > Disabling the OOM killer needs to exclude allocators from entering,
> > > not existing victims from exiting.
> > 
> > The idea was that exit_oom_victim doesn't miss a waiter.
> > 
> > exit_oom_victim is doing
> > 	atomic_dec_return(&oom_victims) && oom_killer_disabled)
> > 
> > so there is a full (implicit) memory barrier before oom_killer_disabled
> > check. The other part is trickier. oom_killer_disable does:
> > 	oom_killer_disabled = true;
> >         up_write(&oom_sem);
> > 
> >         wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> > 
> > up_write doesn't guarantee a full memory barrier AFAICS in
> > Documentation/memory-barriers.txt (although the generic and x86
> > implementations seem to implement it as a full barrier) but wait_event
> > implies the full memory barrier (prepare_to_wait_event does spin
> > lock&unlock) before checking the condition in the slow path. This should
> > be sufficient and documented...
> > 
> > 	/*
> > 	 * We do not need to hold oom_sem here because oom_killer_disable
> > 	 * guarantees that oom_killer_disabled change is visible before
> > 	 * the waiter is put into sleep (prepare_to_wait_event) so
> > 	 * we cannot miss a wake up.
> > 	 */
> > 
> > in unmark_oom_victim()
> 
> OK, I can see that the next patch removes oom_killer_disabled
> completely. The dependency won't be there, and with it the concerns
> about the memory barriers go away.
> 
> Is there any reason why the patches are ordered this way? The reverse
> order would sound more logical to me.

I honestly didn't even think about the dependency between the lock and
this check.  They both looked unnecessary to me and I stopped putting
any more thought into it once I had convinced myself that they are.

The order was chosen because the waitqueue generalization seemed like
a bigger deal.  One is just an unnecessary lock, but this extra check
cost me quite some time debugging and is a much more harmful piece of
code, so fixing it felt more pressing.  It's no problem to reorder the
patches, though.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 06/12] mm: oom_kill: simplify OOM killer locking
  2015-03-26 13:31   ` Michal Hocko
@ 2015-03-26 15:17     ` Johannes Weiner
  2015-03-26 16:07       ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 15:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu, Mar 26, 2015 at 02:31:11PM +0100, Michal Hocko wrote:
> On Wed 25-03-15 02:17:10, Johannes Weiner wrote:
> > The zonelist locking and the oom_sem are two overlapping locks that
> > are used to serialize global OOM killing against different things.
> > 
> > The historical zonelist locking serializes OOM kills from allocations
> > with overlapping zonelists against each other to prevent killing more
> > tasks than necessary in the same memory domain.  Only when neither
> > tasklists nor zonelists from two concurrent OOM kills overlap (tasks
> > in separate memcgs bound to separate nodes) are OOM kills allowed to
> > execute in parallel.
> > 
> > The younger oom_sem is a read-write lock to serialize OOM killing
> > against the PM code trying to disable the OOM killer altogether.
> > 
> > However, the OOM killer is a fairly cold error path, there is really
> > no reason to optimize for highly performant and concurrent OOM kills.
> > And the oom_sem is just flat-out redundant.
> > 
> > Replace both locking schemes with a single global mutex serializing
> > OOM kills regardless of context.
> 
> OK, this is much simpler.
> 
> You have missed drivers/tty/sysrq.c which should take the lock as well.
> ZONE_OOM_LOCKED can be removed as well. __out_of_memory in the kerneldoc
> should be renamed.

Argh, an older version had the lock inside out_of_memory() and I never
updated the caller when I changed the rules.  Thanks.  I'll fix both.

> > @@ -795,27 +728,21 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >   */
> >  void pagefault_out_of_memory(void)
> >  {
> > -	struct zonelist *zonelist;
> > -
> > -	down_read(&oom_sem);
> >  	if (mem_cgroup_oom_synchronize(true))
> > -		goto unlock;
> > +		return;
> 
> OK, so we are back to what David has asked previously. We do not need
> the lock for memcg and oom_killer_disabled because we know that no tasks
> (except for potential oom victim) are lurking around at the time
> oom_killer_disable() is called. So I guess we want to stick a comment
> into mem_cgroup_oom_synchronize before we check for oom_killer_disabled.

I would prefer everybody that sets TIF_MEMDIE and kills a task to hold
the lock, including memcg.  Simplicity is one thing, but also a global
OOM kill might not even be necessary when it's racing with the memcg.

> After those are fixed, feel free to add
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 07/12] mm: page_alloc: inline should_alloc_retry()
  2015-03-26 14:11   ` Michal Hocko
@ 2015-03-26 15:18     ` Johannes Weiner
  0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 15:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu, Mar 26, 2015 at 03:11:28PM +0100, Michal Hocko wrote:
> On Wed 25-03-15 02:17:11, Johannes Weiner wrote:
> > The should_alloc_retry() function was meant to encapsulate retry
> > conditions of the allocator slowpath, but there are still checks
> > remaining in the main function, and much of how the retrying is
> > performed also depends on the OOM killer progress.  The physical
> > separation of those conditions makes the code hard to follow.
> > 
> > Inline the should_alloc_retry() checks.  Notes:
> > 
> > - The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
> >   replace it with looping on OOM killer progress
> > 
> > - The pm_suspended_storage() check is meant to skip the OOM killer
> >   when reclaim has no IO available, move to __alloc_pages_may_oom()
> > 
> > - The order < PAGE_ALLOC_COSTLY order is re-united with its original
> >   counterpart of checking whether reclaim actually made any progress
> 
> it should be order <= PAGE_ALLOC_COSTLY_ORDER

Oops, thanks for catching that.  I'll fix it in v2.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> The resulting code looks much better and logical.
> 
> After the COSTLY check is fixed.
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thank you

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-26 14:32       ` Michal Hocko
@ 2015-03-26 15:23         ` Johannes Weiner
  2015-03-26 15:38           ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 15:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, linux-kernel, torvalds,
	akpm, ying.huang, aarcange, david, tytso

On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > Johannes Weiner wrote:
> [...]
> > > >  	/*
> > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > -	 * making progress for us.
> > > > +	 * This allocating task can become the OOM victim itself at
> > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > +	 * quickly and don't block on the lock held by another task
> > > > +	 * waiting for us to exit.
> > > >  	 */
> > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > -		*did_some_progress = 1;
> > > > -		schedule_timeout_uninterruptible(1);
> > > > -		return NULL;
> > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > +		goto alloc;
> > > >  	}
> > > 
> > > > When a thread group has 1000 threads and most of them are doing memory
> > > > allocation requests, all of them will get fatal_signal_pending() == true
> > > > when one of them is chosen by the OOM killer.
> > > This code will allow most of them to access memory reserves, won't it?
> > 
> > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > any dying thread.  Thanks, I'll fix it in v2.
> 
> Do you plan to post this v2 here for review?

Yeah, I was going to wait for feedback to settle before updating the
code.  But I was thinking something like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ce9c4c083a0..106793a75461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * waiting for us to exit.
 	 */
 	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
-		alloc_flags |= ALLOC_NO_WATERMARKS;
+		if (test_thread_flag(TIF_MEMDIE))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
 		goto alloc;
 	}
 

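The effect of that two-line change can be checked in isolation. Here is a userspace model of the flag logic; the ALLOC_NO_WATERMARKS value and the helper function are stand-ins for illustration, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define ALLOC_NO_WATERMARKS	0x04	/* stand-in value */

/*
 * Model of the v2 logic: a task that is already the OOM victim, or
 * whose killable wait for oom_lock was interrupted by a fatal signal,
 * skips the OOM kill -- but only the designated victim (TIF_MEMDIE)
 * may ignore the watermarks and dip into the reserves.
 */
static int oom_alloc_flags(bool tif_memdie, bool lock_interrupted,
			   int alloc_flags)
{
	if (tif_memdie || lock_interrupted) {
		if (tif_memdie)
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}
	return alloc_flags;
}
```

This matches Tetsuo's observation: the dying sibling threads still bail out of the OOM path quickly, but no longer gain reserve access.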
^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 12/12] mm: page_alloc: do not lock up low-order allocations upon OOM
  2015-03-25  6:17 ` [patch 12/12] mm: page_alloc: do not lock up low-order " Johannes Weiner
@ 2015-03-26 15:32   ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:16, Johannes Weiner wrote:
> When both page reclaim and the OOM killer fail to free memory, there
> are no more options for the allocator to make progress on its own.
> 
> Don't risk hanging these allocations.  Leave it to the allocation site
> to implement the fallback policy for failing allocations.

The changelog communicates the impact of this patch _very_ poorly. The
potential regression space is quite large: every syscall that is not
allowed to return ENOMEM yet relies on an allocation would have to be
audited, or a common mechanism to catch such failures deployed.

I really believe this is a good thing _longterm_, but I still do not
think it is upstream material anytime soon without extensive testing,
which is not even mentioned here.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 19 ++++++-------------
>  1 file changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9e45e97aa934..f2b1a17416c4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2331,12 +2331,10 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> -	const struct alloc_context *ac, unsigned long *did_some_progress)
> +		      const struct alloc_context *ac)
>  {
>  	struct page *page = NULL;
>  
> -	*did_some_progress = 0;
> -
>  	/*
>  	 * This allocating task can become the OOM victim itself at
>  	 * any point before acquiring the lock.  In that case, exit
> @@ -2376,13 +2374,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  			goto out;
>  	}
>  
> -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
> -		*did_some_progress = 1;
> -	} else {
> +	if (!out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false))
>  		/* Oops, these shouldn't happen with the OOM killer disabled */
> -		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> -			*did_some_progress = 1;
> -	}
> +		WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
>  
>  	/*
>  	 * Allocate from the OOM killer reserves.
> @@ -2799,13 +2793,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/* Reclaim has failed us, start killing things */
> -	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
> -				     &did_some_progress);
> +	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto got_pg;
>  
> -	/* Retry as long as the OOM killer is making progress */
> -	if (did_some_progress)
> +	/* Wait for user to order more dimms, cuz these are done */
> +	if (gfp_mask & __GFP_NOFAIL)
>  		goto retry;
>  
>  noretry:
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-26 15:23         ` Johannes Weiner
@ 2015-03-26 15:38           ` Michal Hocko
  2015-03-26 18:17             ` Johannes Weiner
  2015-03-27 14:01             ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Tetsuo Handa
  0 siblings, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 15:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, linux-kernel, torvalds,
	akpm, ying.huang, aarcange, david, tytso

On Thu 26-03-15 11:23:43, Johannes Weiner wrote:
> On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> > On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > > Johannes Weiner wrote:
> > [...]
> > > > >  	/*
> > > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > -	 * making progress for us.
> > > > > +	 * This allocating task can become the OOM victim itself at
> > > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > > +	 * quickly and don't block on the lock held by another task
> > > > > +	 * waiting for us to exit.
> > > > >  	 */
> > > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > > -		*did_some_progress = 1;
> > > > > -		schedule_timeout_uninterruptible(1);
> > > > > -		return NULL;
> > > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > > +		goto alloc;
> > > > >  	}
> > > > 
> > > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > > chosen by OOM killer.
> > > > This code will allow most of them to access memory reserves, won't it?
> > > 
> > > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > > any dying thread.  Thanks, I'll fix it in v2.
> > 
> > Do you plan to post this v2 here for review?
> 
> Yeah, I was going to wait for feedback to settle before updating the
> code.  But I was thinking something like this?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ce9c4c083a0..106793a75461 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	 * waiting for us to exit.
>  	 */
>  	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> -		alloc_flags |= ALLOC_NO_WATERMARKS;
> +		if (test_thread_flag(TIF_MEMDIE))
> +			alloc_flags |= ALLOC_NO_WATERMARKS;
>  		goto alloc;
>  	}

OK, I expected something like this. I understand why you want to
retry inside this function, but I would prefer that gfp_to_alloc_flags()
be used here so that we do not have the TIF_MEMDIE logic duplicated in
two places.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-25  6:17 ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Johannes Weiner
  2015-03-25 14:15   ` Tetsuo Handa
@ 2015-03-26 15:58   ` Michal Hocko
  2015-03-26 18:23     ` Johannes Weiner
  1 sibling, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 15:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:12, Johannes Weiner wrote:
> There is not much point in rushing back to the freelists and burning
> CPU cycles in direct reclaim when somebody else is in the process of
> OOM killing, or right after issuing a kill ourselves, because it could
> take some time for the OOM victim to release memory.

Yes, this makes sense and it is better than what we have now. The
question is how long we should wait. I can see you have gone with HZ.
What is that value based on? Has your testing shown that the OOM victim
manages to die within a second most of the time?

I do not want to get into a which-value-is-best discussion, but I would
expect a larger value. Most OOM victims are not blocked, so they would
wake up soon. This is a safety net for those who are blocked, and I do
not think we have to expedite those rare cases; rather, we should optimize
for "regular" OOM situations. How about 10-30s?

> This is a very cold error path, so there is not much hurry.  Use the
> OOM victim waitqueue to wait for victims to actually exit, which is a
> solid signal that the memory pinned by those tasks has been released.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/oom_kill.c   | 11 +++++++----
>  mm/page_alloc.c | 42 +++++++++++++++++++++++++-----------------
>  2 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5cfda39b3268..e066ac7353a4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		killed = 1;
>  	}
>  out:
> +	if (test_thread_flag(TIF_MEMDIE))
> +		return true;
>  	/*
> -	 * Give the killed threads a good chance of exiting before trying to
> -	 * allocate memory again.
> +	 * Wait for any outstanding OOM victims to die.  In rare cases
> +	 * victims can get stuck behind the allocating tasks, so the
> +	 * wait needs to be bounded.  It's crude alright, but cheaper
> +	 * than keeping a global dependency tree between all tasks.
>  	 */
> -	if (killed)
> -		schedule_timeout_killable(1);
> +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
>  
>  	return true;
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c1224ba45548..9ce9c4c083a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  }
>  
>  static inline struct page *
> -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  
>  	*did_some_progress = 0;
>  
>  	/*
> -	 * Acquire the oom lock.  If that fails, somebody else is
> -	 * making progress for us.
> +	 * This allocating task can become the OOM victim itself at
> +	 * any point before acquiring the lock.  In that case, exit
> +	 * quickly and don't block on the lock held by another task
> +	 * waiting for us to exit.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> -		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> -		return NULL;
> +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +		goto alloc;
>  	}
>  
>  	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure.
> +	 * While we have been waiting for the lock, the previous OOM
> +	 * kill might have released enough memory for the both of us.
>  	 */
> -	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> -					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto out;
>  
> @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> -	/* Exhausted what can be done so it's blamo time */
> -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +
> +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
>  		*did_some_progress = 1;
> +	} else {
> +		/* Oops, these shouldn't happen with the OOM killer disabled */
> +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +			*did_some_progress = 1;
> +	}
>  out:
>  	mutex_unlock(&oom_lock);
> +alloc:
> +	if (!page)
> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +
>  	return page;
>  }
>  
> @@ -2775,7 +2782,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/* Reclaim has failed us, start killing things */
> -	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> +	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
> +				     &did_some_progress);
>  	if (page)
>  		goto got_pg;
>  
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 06/12] mm: oom_kill: simplify OOM killer locking
  2015-03-26 15:17     ` Johannes Weiner
@ 2015-03-26 16:07       ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-26 16:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu 26-03-15 11:17:46, Johannes Weiner wrote:
> On Thu, Mar 26, 2015 at 02:31:11PM +0100, Michal Hocko wrote:
[...]
> > > @@ -795,27 +728,21 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > >   */
> > >  void pagefault_out_of_memory(void)
> > >  {
> > > -	struct zonelist *zonelist;
> > > -
> > > -	down_read(&oom_sem);
> > >  	if (mem_cgroup_oom_synchronize(true))
> > > -		goto unlock;
> > > +		return;
> > 
> > OK, so we are back to what David has asked previously. We do not need
> > the lock for memcg and oom_killer_disabled because we know that no tasks
> > (except for potential oom victim) are lurking around at the time
> > oom_killer_disable() is called. So I guess we want to stick a comment
> > into mem_cgroup_oom_synchronize before we check for oom_killer_disabled.
> 
> I would prefer everybody that sets TIF_MEMDIE and kills a task to hold
> the lock, including memcg.  Simplicity is one thing, but also a global
> OOM kill might not even be necessary when it's racing with the memcg.

Sure, I am fine with that.
 
> > After those are fixed, feel free to add
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-26 15:38           ` Michal Hocko
@ 2015-03-26 18:17             ` Johannes Weiner
  2015-03-27 14:01             ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Tetsuo Handa
  1 sibling, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 18:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, linux-fsdevel, linux-kernel, torvalds,
	akpm, ying.huang, aarcange, david, tytso

On Thu, Mar 26, 2015 at 04:38:47PM +0100, Michal Hocko wrote:
> On Thu 26-03-15 11:23:43, Johannes Weiner wrote:
> > On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> > > On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > > > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > > > Johannes Weiner wrote:
> > > [...]
> > > > > >  	/*
> > > > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > > -	 * making progress for us.
> > > > > > +	 * This allocating task can become the OOM victim itself at
> > > > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > > > +	 * quickly and don't block on the lock held by another task
> > > > > > +	 * waiting for us to exit.
> > > > > >  	 */
> > > > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > > > -		*did_some_progress = 1;
> > > > > > -		schedule_timeout_uninterruptible(1);
> > > > > > -		return NULL;
> > > > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > > > +		goto alloc;
> > > > > >  	}
> > > > > 
> > > > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > > > chosen by OOM killer.
> > > > > This code will allow most of them to access memory reserves, won't it?
> > > > 
> > > > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > > > any dying thread.  Thanks, I'll fix it in v2.
> > > 
> > > Do you plan to post this v2 here for review?
> > 
> > Yeah, I was going to wait for feedback to settle before updating the
> > code.  But I was thinking something like this?
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9ce9c4c083a0..106793a75461 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	 * waiting for us to exit.
> >  	 */
> >  	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > -		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		if (test_thread_flag(TIF_MEMDIE))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  		goto alloc;
> >  	}
> 
> OK, I expected something like this. I understand why you want to
> retry inside this function, but I would prefer that gfp_to_alloc_flags()
> be used here so that we do not have the TIF_MEMDIE logic duplicated in
> two places.

I don't think that's a good idea.  gfp_to_alloc_flags() reinitializes
the entire allocation context from the gfp flags and the task state,
but the only thing we care about, which can actually change here, is
TIF_MEMDIE.  This is perfectly obvious and expected in the OOM kill
allocation function, which makes my code self-documenting, whereas if
you use gfp_to_alloc_flags() you have to explain why it is called.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-26 15:58   ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Michal Hocko
@ 2015-03-26 18:23     ` Johannes Weiner
  0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-26 18:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Thu, Mar 26, 2015 at 04:58:46PM +0100, Michal Hocko wrote:
> On Wed 25-03-15 02:17:12, Johannes Weiner wrote:
> > There is not much point in rushing back to the freelists and burning
> > CPU cycles in direct reclaim when somebody else is in the process of
> > OOM killing, or right after issuing a kill ourselves, because it could
> > take some time for the OOM victim to release memory.
> 
> Yes, this makes sense and it is better than what we have now. The
> question is how long we should wait. I can see you have gone with HZ.
> What is that value based on? Has your testing shown that the OOM victim
> manages to die within a second most of the time?
> 
> I do not want to get into a which-value-is-best discussion, but I would
> expect a larger value. Most OOM victims are not blocked, so they would
> wake up soon. This is a safety net for those who are blocked, and I do
> not think we have to expedite those rare cases; rather, we should optimize
> for "regular" OOM situations. How about 10-30s?

Yup, I agree with that reasoning.  We can certainly go higher than HZ.

However, we should probably try to stay within the thresholds of any
lock/hang detection watchdogs, which on a higher level includes the
user itself, who might get confused if the machine hangs for 30s.

As I replied to Vlastimil, once the OOM victim hangs for several
seconds without a deadlock, failing the allocation wouldn't seem
entirely unreasonable, either.

But yes, something like 5-10s would still sound good to me.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-26 13:18       ` Michal Hocko
@ 2015-03-26 19:30         ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2015-03-26 19:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Theodore Ts'o

On Thu, 26 Mar 2015, Michal Hocko wrote:

> I am wrong here! pagefault_out_of_memory takes the lock and so the whole
> mem_cgroup_out_of_memory is called under the same lock.

If all userspace processes are frozen by the time oom_killer_disable() is 
called, then there shouldn't be any race with the android lmk calling 
mark_tsk_oom_victim() either, so I assume that you're acking this patch?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear
  2015-03-26 11:05     ` Johannes Weiner
@ 2015-03-26 19:50       ` David Rientjes
  2015-03-30 14:48         ` Michal Hocko
  2015-04-02 23:01         ` [patch] android, lmk: avoid setting TIF_MEMDIE if process has already exited David Rientjes
  0 siblings, 2 replies; 69+ messages in thread
From: David Rientjes @ 2015-03-26 19:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Michal Hocko, Theodore Ts'o

On Thu, 26 Mar 2015, Johannes Weiner wrote:

> > > exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody
> > > else can clear it concurrently.  Use clear_thread_flag() directly.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > For the oom killer, that's true because of task_lock(): we always only set 
> > TIF_MEMDIE when there is a valid p->mm and it's cleared in the exit path 
> > after the unlock, acting as a barrier, when p->mm is set to NULL so it's 
> > no longer a valid victim.  So that part is fine.
> > 
> > The problem is the android low memory killer that does 
> > mark_tsk_oom_victim() without the protection of task_lock(), it's just rcu 
> > protected so the reference to the task itself is guaranteed to still be 
> > valid.
> 
> But this is about *setting* it without a lock.  My point was that once
> TIF_MEMDIE is actually set, the task owns it and nobody else can clear
> it for them, so it's safe to test and clear non-atomically from the
> task's own context.  Am I missing something?
> 

Yes, I'm thinking about the following which already exists before your 
patch:

	tskA			tskB
	----			----
	lowmem_scan()
	-> tskB->mm != NULL
	-> selected = tskB
				exit_mm()
				exit_oom_victim()
				-> TIF_MEMDIE not set, return	
	mark_oom_victim(tskB)
	-> set TIF_MEMDIE

And now if tskA fails to exit then the oom killer is going to stall 
forever because we don't check for p->mm != NULL when testing eligible 
processes for TIF_MEMDIE.

So there's nothing wrong with your patch, I'm just digesting all of this 
new mark_oom_victim() stuff.

Acked-by: David Rientjes <rientjes@google.com>

I think the lmk should be doing this, in addition:


android, lmk: avoid setting TIF_MEMDIE if process has already exited

TIF_MEMDIE should not be set on a process if it does not have a valid 
->mm, and this is protected by task_lock().

If TIF_MEMDIE gets set after the mm has detached, and the process fails to 
exit, then the oom killer will defer forever waiting for it to exit.

Make sure that the mm is still valid before setting TIF_MEMDIE by way of 
mark_tsk_oom_victim().

Signed-off-by: David Rientjes <rientjes@google.com>
---
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -156,20 +156,27 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			     p->pid, p->comm, oom_score_adj, tasksize);
 	}
 	if (selected) {
-		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
-			     selected->pid, selected->comm,
-			     selected_oom_score_adj, selected_tasksize);
-		lowmem_deathpending_timeout = jiffies + HZ;
+		task_lock(selected);
+		if (!selected->mm) {
+			/* Already exited, cannot do mark_tsk_oom_victim() */
+			task_unlock(selected);
+			goto out;
+		}
 		/*
 		 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
 		 * infrastructure. There is no real reason why the selected
 		 * task should have access to the memory reserves.
 		 */
 		mark_tsk_oom_victim(selected);
+		task_unlock(selected);
+		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
+			     selected->pid, selected->comm,
+			     selected_oom_score_adj, selected_tasksize);
+		lowmem_deathpending_timeout = jiffies + HZ;
 		send_sig(SIGKILL, selected, 0);
 		rem += selected_tasksize;
 	}
-
+out:
 	lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
 		     sc->nr_to_scan, sc->gfp_mask, rem);
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
                   ` (11 preceding siblings ...)
  2015-03-25  6:17 ` [patch 12/12] mm: page_alloc: do not lock up low-order " Johannes Weiner
@ 2015-03-26 19:58 ` Dave Chinner
  2015-03-27 15:05   ` Johannes Weiner
  12 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-03-26 19:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Michal Hocko, Theodore Ts'o

On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> Hi everybody,
> 
> in the recent past we've had several reports and discussions on how to
> deal with allocations hanging in the allocator upon OOM.
> 
> The idea of this series is mainly to make the mechanism of detecting
> OOM situations reliable enough that we can be confident about failing
> allocations, and then leave the fallback strategy to the caller rather
> than looping forever in the allocator.
> 
> The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> least for the short term while we don't have a reservation system yet.

A valid goal, but I think this series goes about it the wrong way.
i.e. it forces us to use __GFP_NOFAIL rather than providing us a
valid fallback mechanism to access reserves.

....

>  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> 
> An exacerbation of the victim-stuck-behind-allocation scenario are
> __GFP_NOFAIL allocations, because they will actually deadlock.  To
> avoid this, or try to, give __GFP_NOFAIL allocations access to not
> just the OOM reserves but also the system's emergency reserves.
> 
> This is basically a poor man's reservation system, which could or
> should be replaced later on with an explicit reservation system that
> e.g. filesystems have control over for use by transactions.
> 
> It's obviously not bulletproof and might still lock up, but it should
> greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> been using this successfully for some time.

So, if we want GFP_NOFS allocations to be able to dip into a
small extra reservation to make progress at ENOMEM, we have to use
__GFP_NOFAIL, because looping ourselves won't allow use of these
extra reserves?

>  mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
> 
> Another hang that was reported was from NOFS allocations.  The trouble
> with these is that they can't issue or wait for writeback during page
> reclaim, and so we don't want to OOM kill on their behalf.  However,
> with such restrictions on making progress, they are prone to hangs.

And because this effectively means GFP_NOFS allocations are
going to fail much more often, we're either going to have to loop
ourselves or use __GFP_NOFAIL...

> This patch makes NOFS allocations fail if reclaim can't free anything.
> 
> It would be good if the filesystem people could weigh in on whether
> they can deal with failing GFP_NOFS allocations, or annotate the
> exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> ground is required that allows using the OOM killer before giving up.

... which looks to me like a catch-22 situation for us: We
have reserves, but callers need to use __GFP_NOFAIL to access them.
GFP_NOFS is going to fail more often, so callers need to handle that
in some way, either by looping or erroring out.

But if we loop manually because we try to handle ENOMEM situations
gracefully (e.g. try a number of times before erroring out) we can't
dip into the reserves because the only semantics being provided are
"try-once-without-reserves" or "try-forever-with-reserves".  i.e.
what we actually need here is "try-once-with-reserves" semantics so
that we can make progress after a failing GFP_NOFS
"try-once-without-reserves" allocation.

IOWS, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |
__GFP_USE_RESERVE that we need on the failure fallback path. Which,
incidentally, is trivial to add to the XFS allocation code. Indeed,
I'll request that you test series like this on metadata intensive
filesystem workloads on XFS under memory stress and quantify how
many new "XFS: possible deadlock in memory allocation" warnings are
emitted. If the patch set floods the system with such warnings, then
it means the proposed fallback for "caller handles allocation
failure" is not making progress.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable()
  2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
  2015-03-26  0:51   ` David Rientjes
  2015-03-26 11:43   ` Michal Hocko
@ 2015-03-26 20:05   ` David Rientjes
  2 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2015-03-26 20:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Michal Hocko, Theodore Ts'o

On Wed, 25 Mar 2015, Johannes Weiner wrote:

> Setting oom_killer_disabled to false is atomic, there is no need for
> further synchronization with ongoing allocations trying to OOM-kill.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying
  2015-03-26 15:38           ` Michal Hocko
  2015-03-26 18:17             ` Johannes Weiner
@ 2015-03-27 14:01             ` Tetsuo Handa
  1 sibling, 0 replies; 69+ messages in thread
From: Tetsuo Handa @ 2015-03-27 14:01 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: linux-mm, linux-fsdevel, linux-kernel, torvalds, akpm,
	ying.huang, aarcange, david, tytso

Michal Hocko wrote:
> On Thu 26-03-15 11:23:43, Johannes Weiner wrote:
> > On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> > > On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > > > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > > > Johannes Weiner wrote:
> > > [...]
> > > > > >  	/*
> > > > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > > -	 * making progress for us.
> > > > > > +	 * This allocating task can become the OOM victim itself at
> > > > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > > > +	 * quickly and don't block on the lock held by another task
> > > > > > +	 * waiting for us to exit.
> > > > > >  	 */
> > > > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > > > -		*did_some_progress = 1;
> > > > > > -		schedule_timeout_uninterruptible(1);
> > > > > > -		return NULL;
> > > > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > > > +		goto alloc;
> > > > > >  	}
> > > > > 
> > > > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > > > chosen by OOM killer.
> > > > > This code will allow most of them to access memory reserves, won't it?
> > > > 
> > > > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > > > any dying thread.  Thanks, I'll fix it in v2.
> > > 
> > > Do you plan to post this v2 here for review?
> > 
> > Yeah, I was going to wait for feedback to settle before updating the
> > code.  But I was thinking something like this?
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9ce9c4c083a0..106793a75461 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	 * waiting for us to exit.
> >  	 */
> >  	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > -		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		if (test_thread_flag(TIF_MEMDIE))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  		goto alloc;
> >  	}
> 
> OK, I expected something like this. I understand why you want to
> retry inside this function, but I would prefer that gfp_to_alloc_flags()
> be used here so that we do not have the TIF_MEMDIE logic duplicated in
> two places.

I thought we expected something like

-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
-		return NULL;
+		if (test_thread_flag(TIF_MEMDIE))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else
+			schedule_timeout_uninterruptible(1);
+		goto alloc;
 	}

or

 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
+		if (test_thread_flag(TIF_MEMDIE)) {
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+			goto alloc;
+		}
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}

because jumping to

  return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

without modifying alloc_flags and without setting *did_some_progress to 1
will lead to immediate allocation failures for !TIF_MEMDIE threads.

I don't like allowing only TIF_MEMDIE to get reserve access, because a
thread which really needs memory to safely terminate without failing
allocations from do_exit() can be one of the !TIF_MEMDIE threads. Rather,
why not discontinue the TIF_MEMDIE handling and allow access to the
private memory reserves for all fatal_signal_pending() threads (i.e.
replacing WMARK_OOM with WMARK_KILLED in "[patch 09/12] mm: page_alloc:
private memory reserves for OOM-killing allocations")?

> > > @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (gfp_mask & __GFP_THISNODE)
> > >  			goto out;
> > >  	}
> > > -	/* Exhausted what can be done so it's blamo time */
> > > -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> > > -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > > +
> > > +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
> > >  		*did_some_progress = 1;
> > > +	} else {
> > > +		/* Oops, these shouldn't happen with the OOM killer disabled */
> > > +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > > +			*did_some_progress = 1;
> > > +	}
> > 
> > I think GFP_NOFAIL allocations need to involve the OOM killer rather than
> > pretending as if forward progress is made. If all of the in-flight
> > allocation requests are GFP_NOFAIL, the system will lock up.
> 
> Hm?  They do involve the OOM killer, but once userspace is frozen for
> suspend/hibernate we shouldn't kill and thaw random tasks anymore as
> that might corrupt the memory snapshot, so nofail allocations are a
> bug at this point.
> 

Aren't there still kernel threads which might do GFP_NOFAIL allocations?
I think corrupting the memory snapshot by involving the OOM killer is the
correct behavior rather than a bug.

> > After all, if we wait for OOM killer progress before retrying, I think
> > we should involve OOM killer after some bounded timeout regardless of
> > gfp flags, and let OOM killer kill more threads after another bounded
> > timeout. Otherwise, the corner cases will lock up the system.
> 
> Giving nofail allocations access to emergency reserves targets this
> problem, but I agree with you that it's still possible for the system
> to lock up if they have been consumed and still no task made enough
> forward progress to release memory.  It is unlikely but possible.
> 
> I will probably come back to the OOM victim timeout patch some time in
> the future as that seems more robust.  It would also drastically
> simplify memcg OOM handling.  But that patch was controversial in the
> past and seemed beyond the scope of this patch set.
> 

A timeout for recovering from WMARK_KILLED to WMARK_MIN could be used for
detecting whether we need to trigger additional OOM-killing?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-03-26 19:58 ` [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Dave Chinner
@ 2015-03-27 15:05   ` Johannes Weiner
  2015-03-30  0:32     ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-03-27 15:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Michal Hocko, Theodore Ts'o

On Fri, Mar 27, 2015 at 06:58:22AM +1100, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> > Hi everybody,
> > 
> > in the recent past we've had several reports and discussions on how to
> > deal with allocations hanging in the allocator upon OOM.
> > 
> > The idea of this series is mainly to make the mechanism of detecting
> > OOM situations reliable enough that we can be confident about failing
> > allocations, and then leave the fallback strategy to the caller rather
> > than looping forever in the allocator.
> > 
> > The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> > least for the short term while we don't have a reservation system yet.
> 
> A valid goal, but I think this series goes about it the wrong way.
> i.e. it forces us to use __GFP_NOFAIL rather than providing us a
> valid fallback mechanism to access reserves.

I think you misunderstood the goal.

While I agree that reserves would be the optimal fallback strategy,
this series is about avoiding deadlocks in existing callsites that
currently can not fail.  This is about getting the best out of our
existing mechanisms until we have universal reservation coverage,
which will take time to devise and transition our codebase to.

GFP_NOFS sites are currently one of the sites that can deadlock inside
the allocator, even though many of them seem to have fallback code.
My reasoning here is that if you *have* an exit strategy for failing
allocations that is smarter than hanging, we should probably use that.

> >  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> > 
> > An exacerbation of the victim-stuck-behind-allocation scenario are
> > __GFP_NOFAIL allocations, because they will actually deadlock.  To
> > avoid this, or try to, give __GFP_NOFAIL allocations access to not
> > just the OOM reserves but also the system's emergency reserves.
> > 
> > This is basically a poor man's reservation system, which could or
> > should be replaced later on with an explicit reservation system that
> > e.g. filesystems have control over for use by transactions.
> > 
> > It's obviously not bulletproof and might still lock up, but it should
> > greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> > been using this successfully for some time.
> 
> So, if we want GFP_NOFS allocations to be able to dip into a
> small extra reservation to make progress at ENOMEM, we have to use
> use __GFP_NOFAIL because looping ourselves won't allow use of these
> extra reserves?

As I said, this series is not about providing reserves just yet.  It
is about using the fallback strategies you already implemented.  And
where you don't have any, it's about making the allocator's last way
of forward progress, the OOM killer, more reliable.

If you have an allocation site that is endlessly looping around calls
to the allocator, it means you DON'T have a fallback strategy.  In
that case, it would be in your interest to tell the allocator, such
that it can take measures to break the infinite loop.

However, those measures are not without their own risk and they need
to be carefully sequenced to reduce the risk for deadlocks.  E.g. we
can not give __GFP_NOFAIL allocations access to the statically-sized
emergency reserves without taking steps to free memory at the same
time, because then we'd just trade forward progress of that allocation
against forward progress of some memory reclaimer later on which finds
the emergency reserves exhausted.

> >  mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
> > 
> > Another hang that was reported was from NOFS allocations.  The trouble
> > with these is that they can't issue or wait for writeback during page
> > reclaim, and so we don't want to OOM kill on their behalf.  However,
> > with such restrictions on making progress, they are prone to hangs.
> 
> And because this effectively means GFP_NOFS allocations are
> going to fail much more often, we're either going to have to loop
> ourselves or use __GFP_NOFAIL...
> 
> > This patch makes NOFS allocations fail if reclaim can't free anything.
> > 
> > It would be good if the filesystem people could weigh in on whether
> > they can deal with failing GFP_NOFS allocations, or annotate the
> > exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> > ground is required that allows using the OOM killer before giving up.
> 
> ... which looks to me like a catch-22 situation for us: We
> have reserves, but callers need to use __GFP_NOFAIL to access them.
> GFP_NOFS is going to fail more often, so callers need to handle that
> in some way, either by looping or erroring out.
> 
> But if we loop manually because we try to handle ENOMEM situations
> gracefully (e.g. try a number of times before erroring out) we can't
> dip into the reserves because the only semantics being provided are
> "try-once-without-reserves" or "try-forever-with-reserves".  i.e.
> what we actually need here is "try-once-with-reserves" semantics so
> that we can make progress after a failing GFP_NOFS
> "try-once-without-reserves" allocation.
>
> IOWS, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |
> __GFP_USE_RESERVE that we need on the failure fallback path. Which,
> incidentally, is trivial to add to the XFS allocation code. Indeed,
> I'll request that you test series like this on metadata intensive
> filesystem workloads on XFS under memory stress and quantify how
> many new "XFS: possible deadlock in memory allocation" warnings are
> emitted. If the patch set floods the system with such warnings, then
> it means the proposed fallback for "caller handles
> allocation failure" is not making progress.

Again, we don't have reserves with this series, we only have a small
pool to compensate for OOM kills getting stuck behind the allocation.
They are an extension of the OOM killer that can not live on their own
right now, so the best we could do at this point is give you a way to
annotate NOFS allocations to OOM kill (and then try the OOM reserves)
before failing the allocation.

However, since that can still fail, what would be your ultimate course
of action?  The reason I'm asking is because the message you are
quoting is from this piece of code:

void *
kmem_alloc(size_t size, xfs_km_flags_t flags)
{
	int	retries = 0;
	gfp_t	lflags = kmem_flags_convert(flags);
	void	*ptr;

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
			return ptr;
		if (!(++retries % 100))
			xfs_err(NULL,
		"possible memory allocation deadlock in %s (mode:0x%x)",
					__func__, lflags);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
	} while (1);
}

and that does not implement a fallback strategy.  The only way to not
trigger those warnings is continuing to loop in the allocator, so you
might as well use __GFP_NOFAIL here.  This is not the sort of callsite
that "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
had in mind, because it will continue to lock up on OOM either way.
Only instead of in the allocator, it will lock up in the xfs code.

This is a NOFAIL caller, so it would benefit from those changes in the
series that make __GFP_NOFAIL more reliable.
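For illustration, the suggested __GFP_NOFAIL variant of that loop might look
roughly like this. This is a hedged userspace sketch: kmalloc() and the flag
values are stubbed, kmem_flags_convert() is elided, and the names merely
mirror the thread, not real kernel code:

```c
/* Hedged sketch of kmem_alloc() with the open-coded retry loop
 * replaced by __GFP_NOFAIL, as suggested above.  kmalloc() and the
 * flag values are stubbed so the logic runs outside the kernel. */
#include <stdlib.h>

#define KM_MAYFAIL	0x1u
#define KM_NOSLEEP	0x2u
#define __GFP_NOFAIL	0x4u

typedef unsigned int xfs_km_flags_t;
typedef unsigned int gfp_t;

void *kmalloc(size_t size, gfp_t flags)
{
	(void)flags;
	return malloc(size);	/* userspace stand-in for the allocator */
}

void *kmem_alloc_sketch(size_t size, xfs_km_flags_t flags)
{
	gfp_t lflags = 0;	/* kmem_flags_convert(flags) elided */

	/* callers that can handle failure still get a single try ... */
	if (flags & (KM_MAYFAIL | KM_NOSLEEP))
		return kmalloc(size, lflags);

	/* ... everyone else asks the allocator to never fail, instead
	 * of open-coding the retry loop around congestion_wait() */
	return kmalloc(size, lflags | __GFP_NOFAIL);
}
```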

But what about your other NOFS callers?  Are there any that have
actual fallback code?  Those that the allocator should fail rather
than hang if it runs out of memory?  Those are what 11/12 is about.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-03-27 15:05   ` Johannes Weiner
@ 2015-03-30  0:32     ` Dave Chinner
  2015-03-30 19:31       ` Johannes Weiner
  2015-04-01 15:19       ` Michal Hocko
  0 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2015-03-30  0:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Michal Hocko, Theodore Ts'o

On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> On Fri, Mar 27, 2015 at 06:58:22AM +1100, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> > > Hi everybody,
> > > 
> > > in the recent past we've had several reports and discussions on how to
> > > deal with allocations hanging in the allocator upon OOM.
> > > 
> > > The idea of this series is mainly to make the mechanism of detecting
> > > OOM situations reliable enough that we can be confident about failing
> > > allocations, and then leave the fallback strategy to the caller rather
> > > than looping forever in the allocator.
> > > 
> > > The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> > > least for the short term while we don't have a reservation system yet.
> > 
> > A valid goal, but I think this series goes about it the wrong way.
> > i.e. it forces us to use __GFP_NOFAIL rather than providing us a
> > valid fallback mechanism to access reserves.
> 
> I think you misunderstood the goal.
> 
> While I agree that reserves would be the optimal fallback strategy,
> this series is about avoiding deadlocks in existing callsites that
> currently can not fail.  This is about getting the best out of our
> existing mechanisms until we have universal reservation coverage,
> which will take time to devise and transition our codebase to.

That might be the goal, but it looks like the wrong path to me.

> GFP_NOFS sites are currently one of the sites that can deadlock inside
> the allocator, even though many of them seem to have fallback code.
> My reasoning here is that if you *have* an exit strategy for failing
> allocations that is smarter than hanging, we should probably use that.

We already do that for allocations where we can handle failure in
GFP_NOFS conditions. It is, however, somewhat useless if we can't
tell the allocator to try really hard if we've already had a failure
and we are already in memory reclaim conditions (e.g. a shrinker
trying to clean dirty objects so they can be reclaimed).

From that perspective, I think that this patch set aims to force us
away from handling fallbacks ourselves because a) it makes GFP_NOFS
more likely to fail, and b) provides no mechanism to "try harder"
when we really need the allocation to succeed.

> > >  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> > > 
> > > An exacerbation of the victim-stuck-behind-allocation scenario are
> > > __GFP_NOFAIL allocations, because they will actually deadlock.  To
> > > avoid this, or try to, give __GFP_NOFAIL allocations access to not
> > > just the OOM reserves but also the system's emergency reserves.
> > > 
> > > This is basically a poor man's reservation system, which could or
> > > should be replaced later on with an explicit reservation system that
> > > e.g. filesystems have control over for use by transactions.
> > > 
> > > It's obviously not bulletproof and might still lock up, but it should
> > > greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> > > been using this successfully for some time.
> > 
> > So, if we want GFP_NOFS allocations to be able to dip into a
> > small extra reservation to make progress at ENOMEM, we have to use
> > use __GFP_NOFAIL because looping ourselves won't allow use of these
> > extra reserves?
> 
> As I said, this series is not about providing reserves just yet.  It
> is about using the fallback strategies you already implemented.  And
> where you don't have any, it's about making the allocator's last way
> of forward progress, the OOM killer, more reliable.

Sure - but you're doing that by adding a special reserve for
GFP_NOFAIL allocations to dip into when the OOM killer is active.
That can only be accessed by GFP_NOFAIL allocations - anyone who
has a fallback but really needs the allocation to succeed if at all
possible (i.e. should only fail to avoid a deadlock situation) can't
communicate that fact to the allocator....

....

> > > This patch makes NOFS allocations fail if reclaim can't free anything.
> > > 
> > > It would be good if the filesystem people could weigh in on whether
> > > they can deal with failing GFP_NOFS allocations, or annotate the
> > > exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> > > ground is required that allows using the OOM killer before giving up.
> > 
> > ... which looks to me like a catch-22 situation for us: We
> > have reserves, but callers need to use __GFP_NOFAIL to access them.
> > GFP_NOFS is going to fail more often, so callers need to handle that
> > in some way, either by looping or erroring out.
> > 
> > But if we loop manually because we try to handle ENOMEM situations
> > gracefully (e.g. try a number of times before erroring out) we can't
> > dip into the reserves because the only semantics being provided are
> > "try-once-without-reserves" or "try-forever-with-reserves".  i.e.
> > what we actually need here is "try-once-with-reserves" semantics so
> > that we can make progress after a failing GFP_NOFS
> > "try-once-without-reserves" allocation.
> >
> > IOWS, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |
> > __GFP_USE_RESERVE that we need on the failure fallback path. Which,
> > incidentally, is trivial to add to the XFS allocation code. Indeed,
> > I'll request that you test series like this on metadata intensive
> > filesystem workloads on XFS under memory stress and quantify how
> > many new "XFS: possible deadlock in memory allocation" warnings are
> > emitted. If the patch set floods the system with such warnings, then
> > it means the proposed fallback for "caller handles
> > allocation failure" is not making progress.
> 
> Again, we don't have reserves with this series, we only have a small
> pool to compensate for OOM kills getting stuck behind the allocation.

Which is, in effect, a reserve pool to make progress and prevent
deadlocks.

> They are an extension of the OOM killer that can not live on their own
> right now, so the best we could do at this point is give you a way to
> annotate NOFS allocations to OOM kill (and then try the OOM reserves)
> before failing the allocation.

Well, yes, that's exactly what I'm saying we need, especially if you
are making GFP_NOFS more likely to fail. And the patchset that makes
GFP_NOFS more likely to fail also needs to add those "retry harder"
flags to subsystems that are adversely affected by making GFP_NOFS
fail more easily.

> However, since that can still fail, what would be your ultimate course
> of action?

Ultimately, that is my problem and not yours. The allocation/reclaim
subsystem provides mechanism, not policy.

> The reason I'm asking is because the message you are
> quoting is from this piece of code:
> 
> void *
> kmem_alloc(size_t size, xfs_km_flags_t flags)
> {
> 	int	retries = 0;
> 	gfp_t	lflags = kmem_flags_convert(flags);
> 	void	*ptr;
> 
> 	do {
> 		ptr = kmalloc(size, lflags);
> 		if (ptr || (flags & (KM_MAYFAIL|KM_NOSLEEP)))
> 			return ptr;
> 		if (!(++retries % 100))
> 			xfs_err(NULL,
> 		"possible memory allocation deadlock in %s (mode:0x%x)",
> 					__func__, lflags);
> 		congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	} while (1);
> }
> 
> and that does not implement a fallback strategy.

You're wrong - there's a clear fallback strategy in that code.

Look at the (flags & (KM_MAYFAIL|KM_NOSLEEP)) check - KM_MAYFAIL is
the *XFS annotation* that tells the *XFS allocator* that failure can
be handled by the caller. There are quite a few places where it is
used and we are slowly adding more as we add code to handle ENOMEM
into the higher layers of XFS.

Indeed, a prime example is this code in xfs_iflush_cluster():

	ilist = kmem_alloc(ilist_size, KM_MAYFAIL|KM_NOFS);
	if (!ilist)
		goto out_put;

That allocation context is used because it can be called by the
inode reclaim shrinker to clean dirty inodes that need to be
reclaimed. We need to be in GFP_NOFS context because we can't
recurse into FS reclaim again, but we can allow failures because the
shrinker will come back and try to reclaim the inode again later.

i.e. this is a case where we are using GFP_NOFS to prevent reclaim
recursion deadlocks rather than a place where we are using GFP_NOFS
because we hold filesystem locks or are in a filesystem transaction
context.

However, because of its context in the memory reclaim path, we need
to be able to make use of all possible memory reserves here because
the only memory that is reclaimable might be dirty inodes that need
writeback. Hence making GFP_NOFS allocations fail more easily will
make it harder to write back dirty inodes when there is severe
memory pressure. Therefore we really need to be able to fail, then
try harder (i.e. the "use oom reserves" flags), and then, if that
fails, return failure.
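That "fail, then try harder, then fail for real" ladder could be sketched
like this. __GFP_USE_RESERVE is the hypothetical flag proposed earlier in
the thread, not an existing kernel flag, and alloc_page_stub() is a stand-in
for the real page allocator:

```c
/* Sketch of the "fail, then try harder, then fail for real" ladder,
 * against a stubbed allocator.  __GFP_USE_RESERVE is hypothetical. */
#include <stddef.h>

#define GFP_NOFS		0x1u
#define __GFP_USE_RESERVE	0x2u	/* hypothetical "try harder" flag */

/* stub: pretend normal memory is exhausted but reserves still work */
void *alloc_page_stub(unsigned int gfp)
{
	static char reserve_page[4096];

	if (gfp & __GFP_USE_RESERVE)
		return reserve_page;	/* reserves can still satisfy us */
	return NULL;			/* ordinary attempt fails */
}

/* try-once-without-reserves, then try-once-with-reserves, then report
 * failure to the caller instead of looping forever in the allocator */
void *nofs_alloc_with_fallback(void)
{
	void *p = alloc_page_stub(GFP_NOFS);

	if (p)
		return p;
	return alloc_page_stub(GFP_NOFS | __GFP_USE_RESERVE);
}
```

If the second attempt also fails, the caller (here the fs) is the one that
decides whether to error out, retry later, or shut down.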

FWIW, in the IRC discussion on #xfs between xfs developers, we
talked about configurable error behaviour (e.g thru sysfs) for
handling different types of errors. That came about through this RFC
for handling IO errors in different ways (e.g. for optimising error
handling on thinp devices):

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

ENOMEM handling was another error we talked about in this context,
allowing the admin of the system to determine how many allocation
retries should be allowed before we consider the allocation to have
failed, and hence should fail and shut down the fs rather than try
forever. Hence admins can choose to loop forever and hope that it
doesn't deadlock, or if deadlocks are occurring they can be broken
by allowing the fs to fail and potentially shut down. i.e. the admin
can choose the lesser evil for their production system workloads....

i.e. yet another potential fallback strategy, but one that is also
reliant on being able to ask the mm subsystem to only fail if there
really is no memory available at all....

> The only way to not
> trigger those warnings is continuing to loop in the allocator, so you
> might as well use __GFP_NOFAIL here.

You're assuming we want those warnings to go away. We don't - we
want to see those warnings because it's a telltale sign to XFS
developers during problem triage that low memory problems are
occurring and may be caused by or related to filesystem issues...

> This is not the sort of callsite
> that "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> had in mind, because it will continue to lock up on OOM either way.
> Only instead of in the allocator, it will lock up in the xfs code.

Precisely why I said that you need to test changes like this
against filesystem workloads to determine if you're making things
worse or not. Changing GFP_NOFS allocation behaviour will have
unexpected side effects for a lot of people and subsystems, so it
should not be done without a *lot* of testing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear
  2015-03-26 19:50       ` David Rientjes
@ 2015-03-30 14:48         ` Michal Hocko
  2015-04-02 23:01         ` [patch] android, lmk: avoid setting TIF_MEMDIE if process has already exited David Rientjes
  1 sibling, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-03-30 14:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Dave Chinner, Theodore Ts'o

On Thu 26-03-15 12:50:20, David Rientjes wrote:
[...]
> android, lmk: avoid setting TIF_MEMDIE if process has already exited
> 
> TIF_MEMDIE should not be set on a process if it does not have a valid 
> ->mm, and this is protected by task_lock().
> 
> If TIF_MEMDIE gets set after the mm has detached, and the process fails to 
> exit, then the oom killer will defer forever waiting for it to exit.
> 
> Make sure that the mm is still valid before setting TIF_MEMDIE by way of 
> mark_tsk_oom_victim().

I would prefer if lmk didn't abuse mark_tsk_oom_victim much more. But
this is correct as well.

> Signed-off-by: David Rientjes <rientjes@google.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -156,20 +156,27 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
>  			     p->pid, p->comm, oom_score_adj, tasksize);
>  	}
>  	if (selected) {
> -		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
> -			     selected->pid, selected->comm,
> -			     selected_oom_score_adj, selected_tasksize);
> -		lowmem_deathpending_timeout = jiffies + HZ;
> +		task_lock(selected);
> +		if (!selected->mm) {
> +			/* Already exited, cannot do mark_tsk_oom_victim() */
> +			task_unlock(selected);
> +			goto out;
> +		}
>  		/*
>  		 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
>  		 * infrastructure. There is no real reason why the selected
>  		 * task should have access to the memory reserves.
>  		 */
>  		mark_tsk_oom_victim(selected);
> +		task_unlock(selected);
> +		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
> +			     selected->pid, selected->comm,
> +			     selected_oom_score_adj, selected_tasksize);
> +		lowmem_deathpending_timeout = jiffies + HZ;
>  		send_sig(SIGKILL, selected, 0);
>  		rem += selected_tasksize;
>  	}
> -
> +out:
>  	lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
>  		     sc->nr_to_scan, sc->gfp_mask, rem);
>  	rcu_read_unlock();

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-03-30  0:32     ` Dave Chinner
@ 2015-03-30 19:31       ` Johannes Weiner
  2015-04-01 15:19       ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-03-30 19:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Michal Hocko, Theodore Ts'o

On Mon, Mar 30, 2015 at 11:32:40AM +1100, Dave Chinner wrote:
> On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > the allocator, even though many of them seem to have fallback code.
> > My reasoning here is that if you *have* an exit strategy for failing
> > allocations that is smarter than hanging, we should probably use that.
> 
> We already do that for allocations where we can handle failure in
> GFP_NOFS conditions. It is, however, somewhat useless if we can't
> tell the allocator to try really hard if we've already had a failure
> and we are already in memory reclaim conditions (e.g. a shrinker
> trying to clean dirty objects so they can be reclaimed).

What do you mean you already do that?  These allocations currently
won't fail.  They loop forever in the allocator.  Fallback code is
dead code right now.  (Unless you do order-4 and up, which I doubt.)

> From that perspective, I think that this patch set aims to force us
> away from handling fallbacks ourselves because a) it makes GFP_NOFS
> more likely to fail, and b) provides no mechanism to "try harder"
> when we really need the allocation to succeed.

If by "more likely" you mean "at all possible", then yes.

However, as far as trying harder goes, that sounds like a good idea.
It should be possible for NOFS contexts to use the OOM killer and its
reserves.  But still, they should be allowed to propagate allocation
failures rather than just hanging in the allocator.

> > > >  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> > > > 
> > > > An exacerbation of the victim-stuck-behind-allocation scenario are
> > > > __GFP_NOFAIL allocations, because they will actually deadlock.  To
> > > > avoid this, or try to, give __GFP_NOFAIL allocations access to not
> > > > just the OOM reserves but also the system's emergency reserves.
> > > > 
> > > > This is basically a poor man's reservation system, which could or
> > > > should be replaced later on with an explicit reservation system that
> > > > e.g. filesystems have control over for use by transactions.
> > > > 
> > > > It's obviously not bulletproof and might still lock up, but it should
> > > > greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> > > > been using this successfully for some time.
> > > 
> > > So, if we want GFP_NOFS allocations to be able to dip into a
> > > small extra reservation to make progress at ENOMEM, we have to use
> > > use __GFP_NOFAIL because looping ourselves won't allow use of these
> > > extra reserves?
> > 
> > As I said, this series is not about providing reserves just yet.  It
> > is about using the fallback strategies you already implemented.  And
> > where you don't have any, it's about making the allocator's last way
> > of forward progress, the OOM killer, more reliable.
> 
> Sure - but you're doing that by adding a special reserve for
> GFP_NOFAIL allocations to dip into when the OOM killer is active.
> That can only be accessed by GFP_NOFAIL allocations - anyone who
> has a fallback but really needs the allocation to succeed if at all
> possible (i.e. should only fail to avoid a deadlock situation) can't
> communicate that fact to the allocator....

Hm?  It's not restricted to NOFAIL at all, look closer at my patch
series.  What you are describing is exactly how I propose the
allocator should handle all regular allocations: exhaust reclaimable
pages, use the OOM killer, dip into OOM reserves, but ultimately fail.
The only thing __GFP_NOFAIL does in *addition* to that is use the last
emergency reserves of the system in an attempt to avoid deadlocking.

[ Once those reserves are depleted, however, the system will deadlock,
  so we can only give them to allocations that would otherwise lock up
  anyway, i.e. __GFP_NOFAIL.  It would be silly to risk a system
  deadlock for an allocation that has a fallback strategy.  That is
  why you have to let the allocator know whether you can fall back. ]
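The escalation order described above can be sketched as a toy decision
ladder. All steps are stubbed booleans with illustrative names of my own;
only the ordering reflects the proposal:

```c
/* Toy decision ladder for the escalation order described above:
 * reclaim, then the OOM killer, then the OOM reserves, and, for
 * __GFP_NOFAIL only, the last emergency reserves before failing. */
#include <stdbool.h>

struct mm_state {
	bool reclaim_succeeds;
	bool oom_kill_frees_memory;
	bool oom_reserves_available;
	bool emergency_reserves_available;
};

/* returns true if the allocation would succeed, false if it fails */
bool try_alloc(const struct mm_state *s, bool nofail)
{
	if (s->reclaim_succeeds)
		return true;		/* direct reclaim made progress */
	if (s->oom_kill_frees_memory)
		return true;		/* OOM killer made progress */
	if (s->oom_reserves_available)
		return true;		/* OOM reserves */
	if (nofail && s->emergency_reserves_available)
		return true;		/* NOFAIL-only last resort */
	return false;			/* ultimately fail */
}
```

Note that the only difference __GFP_NOFAIL makes in this model is the one
extra rung at the bottom; everything above it is common to all allocations.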

The notable exception to this behavior are NOFS callers because of its
current OOM kill restrictions.  But as I said, I'm absolutely open to
addressing this and either let them generally use the OOM killer after
some time, or provide you with another annotation that lets you come
back to try harder.  I don't really care which way, that depends on
your requirements.

> > > > This patch makes NOFS allocations fail if reclaim can't free anything.
> > > > 
> > > > It would be good if the filesystem people could weigh in on whether
> > > > they can deal with failing GFP_NOFS allocations, or annotate the
> > > > exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> > > > ground is required that allows using the OOM killer before giving up.
> > > 
> > > ... which looks to me like a catch-22 situation for us: We
> > > have reserves, but callers need to use __GFP_NOFAIL to access them.
> > > GFP_NOFS is going to fail more often, so callers need to handle that
> > > in some way, either by looping or erroring out.
> > > 
> > > But if we loop manually because we try to handle ENOMEM situations
> > > gracefully (e.g. try a number of times before erroring out) we can't
> > > dip into the reserves because the only semantics being provided are
> > > "try-once-without-reserves" or "try-forever-with-reserves".  i.e.
> > > what we actually need here is "try-once-with-reserves" semantics so
> > > that we can make progress after a failing GFP_NOFS
> > > "try-once-without-reserves" allocation.
> > >
> > > IOWS, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |
> > > __GFP_USE_RESERVE that we need on the failure fallback path. Which,
> > > incidentally, is trivial to add to the XFS allocation code. Indeed,
> > > I'll request that you test series like this on metadata intensive
> > > filesystem workloads on XFS under memory stress and quantify how
> > > many new "XFS: possible deadlock in memory allocation" warnings are
> > > emitted. If the patch set floods the system with such warnings, then
> > > it means the proposed fallback for "caller handles
> > > allocation failure" is not making progress.
> > 
> > Again, we don't have reserves with this series, we only have a small
> > pool to compensate for OOM kills getting stuck behind the allocation.
> 
> Which is, in effect, a reserve pool to make progress and prevent
> deadlocks.

No, it's absolutely not.  They are meant to serve single allocations
when the allocation contexts themselves are blocking the OOM killer.
They can not guarantee forward progress for a series of allocations
from a single context, because they are not sized to that worst case.

This is only about increasing the probability of success.  In terms
of semantics, nothing has changed: our only options continue to be to
either fail an allocation or risk deadlocking it.

> > They are an extension of the OOM killer that can not live on their own
> > right now, so the best we could do at this point is give you a way to
> > annotate NOFS allocations to OOM kill (and then try the OOM reserves)
> > before failing the allocation.
> 
> Well, yes, that's exactly what I'm saying we need, especially if you
> are making GFP_NOFS more likely to fail. And the patchset that makes
> GFP_NOFS more likely to fail also needs to add those "retry harder"
> flags to subsystems that are adversely affected by making GFP_NOFS
> fail more easily.

Okay, I think we agree on that.

> > However, since that can still fail, what would be your ultimate course
> > of action?
> 
> Ultimately, that is my problem and not yours. The allocation/reclaim
> subsystem provides mechanism, not policy.

I'm trying to hand back that control to you, but that means you have
to actually deal with allocation failures and not just emit warnings
in an endless loop.  In case of a deadlock, simply retrying without
dropping any locks doesn't magically make the allocation work, so
whether you warn after one or after 1000 loops doesn't matter.

You can see how it could be hard to sanitize the allocator behavior
when your proposed litmus test punishes the allocator for propagating
an allocation failure, and rewards hanging in the allocator instead?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-03-30  0:32     ` Dave Chinner
  2015-03-30 19:31       ` Johannes Weiner
@ 2015-04-01 15:19       ` Michal Hocko
  2015-04-01 21:39         ` Dave Chinner
  2015-04-07 14:18         ` Johannes Weiner
  1 sibling, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-01 15:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
[...]
> > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > the allocator, even though many of them seem to have fallback code.
> > My reasoning here is that if you *have* an exit strategy for failing
> > allocations that is smarter than hanging, we should probably use that.
> 
> We already do that for allocations where we can handle failure in
> GFP_NOFS conditions. It is, however, somewhat useless if we can't
> tell the allocator to try really hard if we've already had a failure
> and we are already in memory reclaim conditions (e.g. a shrinker
> trying to clean dirty objects so they can be reclaimed).
> 
> From that perspective, I think that this patch set aims to force us
> away from handling fallbacks ourselves because a) it makes GFP_NOFS
> more likely to fail, and b) provides no mechanism to "try harder"
> when we really need the allocation to succeed.

You can ask for this "try harder" by __GFP_HIGH flag. Would that help
in your fallback case?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-01 15:19       ` Michal Hocko
@ 2015-04-01 21:39         ` Dave Chinner
  2015-04-02  7:29           ` Michal Hocko
  2015-04-07 14:18         ` Johannes Weiner
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-04-01 21:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> [...]
> > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > the allocator, even though many of them seem to have fallback code.
> > > My reasoning here is that if you *have* an exit strategy for failing
> > > allocations that is smarter than hanging, we should probably use that.
> > 
> > We already do that for allocations where we can handle failure in
> > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > tell the allocator to try really hard if we've already had a failure
> > and we are already in memory reclaim conditions (e.g. a shrinker
> > trying to clean dirty objects so they can be reclaimed).
> > 
> > From that perspective, I think that this patch set aims to force us
> > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > more likely to fail, and b) provides no mechanism to "try harder"
> > when we really need the allocation to succeed.
> 
> You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> in your fallback case?

That dips into GFP_ATOMIC reserves, right? What is the impact on the
GFP_ATOMIC allocations that need it? We typically see network cards
fail GFP_ATOMIC allocations before XFS starts complaining about
allocation failures, so I suspect that this might just make things
worse rather than better...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-01 21:39         ` Dave Chinner
@ 2015-04-02  7:29           ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-02  7:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Thu 02-04-15 08:39:02, Dave Chinner wrote:
> On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > [...]
> > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > the allocator, even though many of them seem to have fallback code.
> > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > allocations that is smarter than hanging, we should probably use that.
> > > 
> > > We already do that for allocations where we can handle failure in
> > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > tell the allocator to try really hard if we've already had a failure
> > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > trying to clean dirty objects so they can be reclaimed).
> > > 
> > > From that perspective, I think that this patch set aims to force us
> > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > more likely to fail, and b) provides no mechanism to "try harder"
> > > when we really need the allocation to succeed.
> > 
> > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > in your fallback case?
> 
> That dips into GFP_ATOMIC reserves, right? What is the impact on the
> GFP_ATOMIC allocations that need it?

Yes, the memory reserve is shared, but the flag would be used only after
a previous GFP_NOFS allocation has failed, which means that the system
is close to OOM and the chances of success for GFP_ATOMIC allocations
(which are GFP_NOWAIT and cannot perform any reclaim) are quite low already.

> We typically see network cards fail GFP_ATOMIC allocations before XFS
> starts complaining about allocation failures, so I suspect that this
> might just make things worse rather than better...

My understanding is that a GFP_ATOMIC allocation would fall back to a
GFP_WAIT type of allocation in a deferred context in the networking
code. There would be some performance hit, but again we are talking
about close-to-OOM conditions here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch] android, lmk: avoid setting TIF_MEMDIE if process has already exited
  2015-03-26 19:50       ` David Rientjes
  2015-03-30 14:48         ` Michal Hocko
@ 2015-04-02 23:01         ` David Rientjes
  2015-04-28 22:50           ` [patch resend] " David Rientjes
  1 sibling, 1 reply; 69+ messages in thread
From: David Rientjes @ 2015-04-02 23:01 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Arve Hjønnevåg, Riley Andrews, Andrew Morton,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm, devel

TIF_MEMDIE should not be set on a process if it does not have a valid 
->mm, and this is protected by task_lock().

If TIF_MEMDIE gets set after the mm has detached, and the process fails to 
exit, then the oom killer will defer forever waiting for it to exit.

Make sure that the mm is still valid before setting TIF_MEMDIE by way of 
mark_tsk_oom_victim().

Cc: "Arve Hjønnevåg" <arve@android.com>
Cc: Riley Andrews <riandrews@android.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/staging/android/lowmemorykiller.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -156,20 +156,27 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			     p->pid, p->comm, oom_score_adj, tasksize);
 	}
 	if (selected) {
-		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
-			     selected->pid, selected->comm,
-			     selected_oom_score_adj, selected_tasksize);
-		lowmem_deathpending_timeout = jiffies + HZ;
+		task_lock(selected);
+		if (!selected->mm) {
+			/* Already exited, cannot do mark_tsk_oom_victim() */
+			task_unlock(selected);
+			goto out;
+		}
 		/*
 		 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
 		 * infrastructure. There is no real reason why the selected
 		 * task should have access to the memory reserves.
 		 */
 		mark_tsk_oom_victim(selected);
+		task_unlock(selected);
+		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
+			     selected->pid, selected->comm,
+			     selected_oom_score_adj, selected_tasksize);
+		lowmem_deathpending_timeout = jiffies + HZ;
 		send_sig(SIGKILL, selected, 0);
 		rem += selected_tasksize;
 	}
-
+out:
 	lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
 		     sc->nr_to_scan, sc->gfp_mask, rem);
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-01 15:19       ` Michal Hocko
  2015-04-01 21:39         ` Dave Chinner
@ 2015-04-07 14:18         ` Johannes Weiner
  2015-04-11  7:29           ` Tetsuo Handa
  2015-04-13 12:46           ` Michal Hocko
  1 sibling, 2 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-04-07 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> [...]
> > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > the allocator, even though many of them seem to have fallback code.
> > > My reasoning here is that if you *have* an exit strategy for failing
> > > allocations that is smarter than hanging, we should probably use that.
> > 
> > We already do that for allocations where we can handle failure in
> > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > tell the allocator to try really hard if we've already had a failure
> > and we are already in memory reclaim conditions (e.g. a shrinker
> > trying to clean dirty objects so they can be reclaimed).
> > 
> > From that perspective, I think that this patch set aims to force us
> > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > more likely to fail, and b) provides no mechanism to "try harder"
> > when we really need the allocation to succeed.
> 
> You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> in your fallback case?

I would think __GFP_REPEAT would be more suitable here.  From the doc:

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.

so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
are allowed to use the OOM killer and dip into the OOM reserves.

My question here would be: are there any NOFS allocations that *don't*
want this behavior?  Does it even make sense to require this separate
annotation or should we just make it the default?

The argument here was always that NOFS allocations are very limited in
their reclaim powers and will trigger OOM prematurely.  However, the
way we limit dirty memory these days forces most cache to be clean at
all times, and direct reclaim in general hasn't been allowed to issue
page writeback for quite some time.  So these days, NOFS reclaim isn't
really weaker than regular direct reclaim.  The only exception is that
it might block writeback, so we'd go OOM if the only reclaimables left
were dirty pages against that filesystem.  That should be acceptable.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47981c5e54c3..fe3cb2b0b85b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-07 14:18         ` Johannes Weiner
@ 2015-04-11  7:29           ` Tetsuo Handa
  2015-04-13 12:49             ` Michal Hocko
  2015-04-13 12:46           ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Tetsuo Handa @ 2015-04-11  7:29 UTC (permalink / raw)
  To: hannes, mhocko
  Cc: david, linux-mm, linux-fsdevel, linux-kernel, torvalds, akpm,
	ying.huang, aarcange, tytso

Johannes Weiner wrote:
> The argument here was always that NOFS allocations are very limited in
> their reclaim powers and will trigger OOM prematurely.  However, the
> way we limit dirty memory these days forces most cache to be clean at
> all times, and direct reclaim in general hasn't been allowed to issue
> page writeback for quite some time.  So these days, NOFS reclaim isn't
> really weaker than regular direct reclaim.  The only exception is that
> it might block writeback, so we'd go OOM if the only reclaimables left
> were dirty pages against that filesystem.  That should be acceptable.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47981c5e54c3..fe3cb2b0b85b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		/* The OOM killer does not needlessly kill tasks for lowmem */
>  		if (ac->high_zoneidx < ZONE_NORMAL)
>  			goto out;
> -		/* The OOM killer does not compensate for IO-less reclaim */
> -		if (!(gfp_mask & __GFP_FS)) {
> -			/*
> -			 * XXX: Page reclaim didn't yield anything,
> -			 * and the OOM killer can't be invoked, but
> -			 * keep looping as per tradition.
> -			 */
> -			*did_some_progress = 1;
> -			goto out;
> -		}
>  		if (pm_suspended_storage())
>  			goto out;
>  		/* The OOM killer may not free memory on a specific node */
> 

I think this change will allow calling out_of_memory() which results in
"oom_kill_process() is trivially called via pagefault_out_of_memory()"
problem described in https://lkml.org/lkml/2015/3/18/219 .

I myself think that we should trigger the OOM killer for !__GFP_FS
allocations in order to make forward progress in case the OOM victim is
blocked.  So, my question about this change is whether we can accept
invoking the OOM killer from the page fault path, no matter how
trivially the OOM killer ends up killing some process.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-07 14:18         ` Johannes Weiner
  2015-04-11  7:29           ` Tetsuo Handa
@ 2015-04-13 12:46           ` Michal Hocko
  2015-04-14  0:11             ` Dave Chinner
  2015-04-14 10:36             ` Johannes Weiner
  1 sibling, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-13 12:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

[Sorry for a late reply]

On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > [...]
> > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > the allocator, even though many of them seem to have fallback code.
> > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > allocations that is smarter than hanging, we should probably use that.
> > > 
> > > We already do that for allocations where we can handle failure in
> > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > tell the allocator to try really hard if we've already had a failure
> > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > trying to clean dirty objects so they can be reclaimed).
> > > 
> > > From that perspective, I think that this patch set aims to force us
> > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > more likely to fail, and b) provides no mechanism to "try harder"
> > > when we really need the allocation to succeed.
> > 
> > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > in your fallback case?
> 
> I would think __GFP_REPEAT would be more suitable here.  From the doc:
> 
>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  * _might_ fail.  This depends upon the particular VM implementation.
> 
> so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
> are allowed to use the OOM killer and dip into the OOM reserves.

__GFP_REPEAT is quite subtle already. It makes a difference only for
high order allocations and it is not clear to me why it should imply OOM
killer for small orders now. Or did you suggest making it special only
with GFP_NOFS? That sounds even more ugly.

AFAIU, David wasn't asking for the OOM killer as much as he was
interested in getting access to a small amount of reserves in order to
make a progress. __GFP_HIGH is there for this purpose.

> My question here would be: are there any NOFS allocations that *don't*
> want this behavior?  Does it even make sense to require this separate
> annotation or should we just make it the default?
> 
> The argument here was always that NOFS allocations are very limited in
> their reclaim powers and will trigger OOM prematurely.  However, the
> way we limit dirty memory these days forces most cache to be clean at
> all times, and direct reclaim in general hasn't been allowed to issue
> page writeback for quite some time.  So these days, NOFS reclaim isn't
> really weaker than regular direct reclaim. 

What about [di]cache and some others fs specific shrinkers (and heavy
metadata loads)?

> The only exception is that
> it might block writeback, so we'd go OOM if the only reclaimables left
> were dirty pages against that filesystem.  That should be acceptable.

OOM killer is hardly acceptable by most users I've heard from. OOM
killer is the _last_ resort and if the allocation is restricted then
we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
get access to memory reserves if it can fail or __GFP_NOFAIL if it
cannot. With your patches the NOFAIL case would get an access to memory
reserves as well. So I do not really see a reason to change GFP_NOFS vs.
OOM killer semantic.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47981c5e54c3..fe3cb2b0b85b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		/* The OOM killer does not needlessly kill tasks for lowmem */
>  		if (ac->high_zoneidx < ZONE_NORMAL)
>  			goto out;
> -		/* The OOM killer does not compensate for IO-less reclaim */
> -		if (!(gfp_mask & __GFP_FS)) {
> -			/*
> -			 * XXX: Page reclaim didn't yield anything,
> -			 * and the OOM killer can't be invoked, but
> -			 * keep looping as per tradition.
> -			 */
> -			*did_some_progress = 1;
> -			goto out;
> -		}
>  		if (pm_suspended_storage())
>  			goto out;
>  		/* The OOM killer may not free memory on a specific node */

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-11  7:29           ` Tetsuo Handa
@ 2015-04-13 12:49             ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-13 12:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, david, linux-mm, linux-fsdevel, linux-kernel, torvalds,
	akpm, ying.huang, aarcange, tytso

On Sat 11-04-15 16:29:26, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim.  The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 47981c5e54c3..fe3cb2b0b85b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  		/* The OOM killer does not needlessly kill tasks for lowmem */
> >  		if (ac->high_zoneidx < ZONE_NORMAL)
> >  			goto out;
> > -		/* The OOM killer does not compensate for IO-less reclaim */
> > -		if (!(gfp_mask & __GFP_FS)) {
> > -			/*
> > -			 * XXX: Page reclaim didn't yield anything,
> > -			 * and the OOM killer can't be invoked, but
> > -			 * keep looping as per tradition.
> > -			 */
> > -			*did_some_progress = 1;
> > -			goto out;
> > -		}
> >  		if (pm_suspended_storage())
> >  			goto out;
> >  		/* The OOM killer may not free memory on a specific node */
> > 
> 
> I think this change will allow calling out_of_memory() which results in
> "oom_kill_process() is trivially called via pagefault_out_of_memory()"
> problem described in https://lkml.org/lkml/2015/3/18/219 .
> 
> I myself think that we should trigger OOM killer for !__GFP_FS allocation
> in order to make forward progress in case the OOM victim is blocked.
> So, my question about this change is whether we can accept involving OOM
> killer from page fault, no matter how trivially OOM killer will kill some
> process?

We have triggered the OOM killer from the page fault path for ages. In
fact, memcg will trigger the memcg OOM killer _only_ from the page
fault path, because that context is safe: we do not sit on any locks at
the time.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-13 12:46           ` Michal Hocko
@ 2015-04-14  0:11             ` Dave Chinner
  2015-04-14  7:20               ` Michal Hocko
  2015-04-14 10:36             ` Johannes Weiner
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-04-14  0:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim. 
> 
> What about [di]cache and some others fs specific shrinkers (and heavy
> metadata loads)?

We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
either.

*HOWEVER*

The shrinker reclaim we can not execute is deferred to the next
context that can do the reclaim, which is usually kswapd. So the
reclaim gets done according to the GFP_NOFS memory pressure that is
occurring, it is just done in a different context...

> > The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> 
> OOM killer is hardly acceptable by most users I've heard from. OOM
> killer is the _last_ resort and if the allocation is restricted then
> we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> get access to memory reserves if it can fail or __GFP_NOFAIL if it
> cannot. With your patches the NOFAIL case would get an access to memory
> reserves as well. So I do not really see a reason to change GFP_NOFS vs.
> OOM killer semantic.

So, really, what we want is something like:

#define __GFP_USE_LOWMEM_RESERVE	__GFP_HIGH

So that it documents the code that is using it effectively and we
can find them easily with cscope/grep?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-14  0:11             ` Dave Chinner
@ 2015-04-14  7:20               ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-14  7:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Tue 14-04-15 10:11:18, Dave Chinner wrote:
> On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> > [Sorry for a late reply]
> > 
> > On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > > My question here would be: are there any NOFS allocations that *don't*
> > > want this behavior?  Does it even make sense to require this separate
> > > annotation or should we just make it the default?
> > > 
> > > The argument here was always that NOFS allocations are very limited in
> > > their reclaim powers and will trigger OOM prematurely.  However, the
> > > way we limit dirty memory these days forces most cache to be clean at
> > > all times, and direct reclaim in general hasn't been allowed to issue
> > > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > > really weaker than regular direct reclaim. 
> > 
> > What about [di]cache and some others fs specific shrinkers (and heavy
> > metadata loads)?
> 
> We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
> either.

Yeah but we invoke fs shrinkers for the _regular_ direct reclaim (with
__GFP_FS), which was the point I've tried to make here.

> *HOWEVER*
> 
> The shrinker reclaim we can not execute is deferred to the next
> context that can do the reclaim, which is usually kswapd. So the
> reclaim gets done according to the GFP_NOFS memory pressure that is
> occurring, it is just done in a different context...

Right, deferring to kswapd is the reason why I think the direct reclaim
shouldn't invoke OOM killer in this context because that would be
premature - as kswapd still can make some progress. Sorry for not being
more clear.

> > > The only exception is that
> > > it might block writeback, so we'd go OOM if the only reclaimables left
> > > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > OOM killer is hardly acceptable by most users I've heard from. OOM
> > killer is the _last_ resort and if the allocation is restricted then
> > we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> > get access to memory reserves if it can fail or __GFP_NOFAIL if it
> > cannot. With your patches the NOFAIL case would get an access to memory
> > reserves as well. So I do not really see a reason to change GFP_NOFS vs.
> > OOM killer semantic.
> 
> So, really, what we want is something like:
> 
> #define __GFP_USE_LOWMEM_RESERVE	__GFP_HIGH
> 
> So that it documents the code that is using it effectively and we
> can find them easily with cscope/grep?

I wouldn't be opposed. To be honest I was never fond of __GFP_HIGH. The
naming is counterintuitive. So I would rather go with renaming it. We do
not have that many users in the tree.
git grep "GFP_HIGH\>" | wc -l
40
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-13 12:46           ` Michal Hocko
  2015-04-14  0:11             ` Dave Chinner
@ 2015-04-14 10:36             ` Johannes Weiner
  2015-04-14 14:23               ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2015-04-14 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > > [...]
> > > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > > the allocator, even though many of them seem to have fallback code.
> > > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > > allocations that is smarter than hanging, we should probably use that.
> > > > 
> > > > We already do that for allocations where we can handle failure in
> > > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > > tell the allocator to try really hard if we've already had a failure
> > > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > > trying to clean dirty objects so they can be reclaimed).
> > > > 
> > > > From that perspective, I think that this patch set aims force us
> > > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > > more likely to fail, and b) provides no mechanism to "try harder"
> > > > when we really need the allocation to succeed.
> > > 
> > > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > > in your fallback case?
> > 
> > I would think __GFP_REPEAT would be more suitable here.  From the doc:
> > 
> >  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> >  * _might_ fail.  This depends upon the particular VM implementation.
> > 
> > so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
> > are allowed to use the OOM killer and dip into the OOM reserves.
> 
> __GFP_REPEAT is quite subtle already.  It makes a difference only
> for high order allocations

That's an implementation detail, owed to the fact that smaller orders
already imply that behavior.  That doesn't change the semantics.  And
people currently *use* it all over the tree for small orders, because
of how the flag is defined in gfp.h; not because of how it's currently
implemented.

> and it is not clear to me why it should imply OOM killer for small
> orders now.  Or did you suggest making it special only with
> GFP_NOFS?  That sounds even more ugly.

Small orders already invoke the OOM killer.  I suggested using this
flag to override the specialness of GFP_NOFS not OOM killing - in
response to whether we can provide an annotation to make some GFP_NOFS
sites more robust.

This is exactly what __GFP_REPEAT is: try the allocation harder than
you would without this flag.  It identifies a caller that is willing
to put in extra effort or be more aggressive because the allocation is
more important than other allocations of the otherwise same gfp_mask.
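To make the fallback pattern under discussion concrete, here is a minimal userspace sketch. The flag values and the mock allocator are purely hypothetical stand-ins (the real definitions live in include/linux/gfp.h and the real semantics are what this thread is negotiating); only the caller-side shape is the point: try plain GFP_NOFS first, and escalate to __GFP_REPEAT only when there is no smarter exit strategy left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for illustration only; the real definitions
 * are in include/linux/gfp.h and differ between kernel versions. */
#define __GFP_WAIT   0x01u
#define __GFP_IO     0x02u
#define __GFP_REPEAT 0x04u
#define GFP_NOFS     (__GFP_WAIT | __GFP_IO)

/* Stub allocator: pretend a plain GFP_NOFS attempt fails under memory
 * pressure and only the __GFP_REPEAT retry succeeds. */
static void *mock_alloc(unsigned int gfp_mask)
{
	static char page[4096];

	return (gfp_mask & __GFP_REPEAT) ? page : NULL;
}

/* Caller-side fallback: a first attempt that is allowed to fail, then a
 * "try harder" retry for allocations more important than others of the
 * otherwise same gfp_mask. */
static void *alloc_try_harder(void)
{
	void *p = mock_alloc(GFP_NOFS);

	if (!p)
		p = mock_alloc(GFP_NOFS | __GFP_REPEAT);
	return p;
}
```

The design point is that the escalation is an annotation at the call site, not a global change to what GFP_NOFS means.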

> AFAIU, David wasn't asking for the OOM killer as much as he was
> interested in getting access to a small amount of reserves in order to
> make a progress. __GFP_HIGH is there for this purpose.

That's not just any reserve pool available to the generic caller, it's
the reserve pool for interrupts, which can not wait and replenish it.
It relies on kswapd to run soon after the interrupt, or right away on
SMP.  But locks held in the filesystem can hold up kswapd (the reason
we even still perform direct reclaim) so NOFS allocs shouldn't use it.

[hannes@dexter linux]$ git grep '__GFP_HIGH\b' | wc -l
39
[hannes@dexter linux]$ git grep GFP_ATOMIC | wc -l
4324

Interrupts have *no other option*.  It's misguided to deplete their
reserves, cause loss of network packets, loss of input events, from
allocations that can actually perform reclaim and have perfectly
acceptable fallback strategies in the caller.

Generally, for any reserve system there must be a way to replenish it.
For interrupts it's kswapd, for the OOM reserves I proposed it's the
OOM victim exiting soon after the allocation, if not right away.

__GFP_NOFAIL is the odd one out here because accessing the system's
emergency reserves without any prospect of near-future replenishing is
just slightly better than deadlocking right away.  Which is why this
reserve access can not be separated out: if you can do *anything*
better than hanging, do it.  If not, use __GFP_NOFAIL.

> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim. 
> 
> What about [di]cache and some others fs specific shrinkers (and heavy
> metadata loads)?

My bad, I forgot about those.  But it doesn't really change the basic
question of whether we want to change the GFP_NOFS default or merely
annotate individual sites that want to try harder.

> > The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> 
> OOM killer is hardly acceptable by most users I've heard from. OOM
> killer is the _last_ resort and if the allocation is restricted then
> we shouldn't use the big hammer.

We *are* talking about the last resort for these allocations!  There
is nothing else we can do to avoid allocation failure at this point.
Absent a reservation system, we have the choice between failing after
reclaim - which Dave said was too fragile for XFS - or OOM killing.

Or continue to deadlock in the allocator of course if we can't figure
out what the filesystem side actually wants given the current options.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy
  2015-04-14 10:36             ` Johannes Weiner
@ 2015-04-14 14:23               ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-14 14:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Linus Torvalds, Andrew Morton, Tetsuo Handa, Huang Ying,
	Andrea Arcangeli, Theodore Ts'o

On Tue 14-04-15 06:36:25, Johannes Weiner wrote:
> On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
[...]
> > AFAIU, David wasn't asking for the OOM killer as much as he was
> > interested in getting access to a small amount of reserves in order to
> > make a progress. __GFP_HIGH is there for this purpose.
> 
> That's not just any reserve pool available to the generic caller, it's
> the reserve pool for interrupts, which can not wait and replenish it.
> It relies on kswapd to run soon after the interrupt, or right away on
> SMP.  But locks held in the filesystem can hold up kswapd (the reason
> we even still perform direct reclaim) so NOFS allocs shouldn't use it.
> 
> [hannes@dexter linux]$ git grep '__GFP_HIGH\b' | wc -l
> 39
> [hannes@dexter linux]$ git grep GFP_ATOMIC | wc -l
> 4324
> 
> Interrupts have *no other option*. 

Atomic context in general can ALLOC_HARDER so it has an access to
additional reserves wrt. __GFP_HIGH|__GFP_WAIT.
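The distinction being drawn here can be sketched loosely as follows. This is a simplified, userspace rendering of the alloc_flags derivation (the real logic is in gfp_to_alloc_flags() in mm/page_alloc.c and handles more cases, e.g. __GFP_NOMEMALLOC and rt tasks); the flag values are made up for illustration:

```c
#include <assert.h>

/* Hypothetical flag values for illustration only. */
#define __GFP_WAIT      0x01u
#define __GFP_IO        0x02u
#define __GFP_HIGH      0x08u
#define GFP_NOFS        (__GFP_WAIT | __GFP_IO)

#define ALLOC_WMARK_MIN 0x1u
#define ALLOC_HIGH      0x2u	/* __GFP_HIGH: allowed below min watermark */
#define ALLOC_HARDER    0x4u	/* atomic context: an extra slice of reserve */

/* Simplified sketch of the alloc_flags derivation: __GFP_HIGH buys one
 * level of reserve access, and a context that cannot sleep (no
 * __GFP_WAIT) gets ALLOC_HARDER on top of that. */
static unsigned int gfp_to_alloc_flags_sketch(unsigned int gfp_mask)
{
	unsigned int flags = ALLOC_WMARK_MIN;

	if (gfp_mask & __GFP_HIGH)
		flags |= ALLOC_HIGH;
	if (!(gfp_mask & __GFP_WAIT))	/* can't sleep: atomic context */
		flags |= ALLOC_HARDER;
	return flags;
}
```

So a sleeping GFP_NOFS | __GFP_HIGH caller shares only the __GFP_HIGH portion of the reserve, while atomic callers reach deeper.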

> It's misguided to deplete their
> reserves, cause loss of network packets, loss of input events, from
> allocations that can actually perform reclaim and have perfectly
> acceptable fallback strategies in the caller.

OK, I thought that it was clear that the proposed __GFP_HIGH is a
fallback strategy for those paths which cannot do much better. Not a
random solution for "this shouldn't fail too eagerly".

> Generally, for any reserve system there must be a way to replenish it.
> For interrupts it's kswapd, for the OOM reserves I proposed it's the
> OOM victim exiting soon after the allocation, if not right away.

And my understanding was that the fallback mode would be used in the
context which would lead to release of the fs pressure, thus releasing
memory as well.

> __GFP_NOFAIL is the odd one out here because accessing the system's
> emergency reserves without any prospect of near-future replenishing is
> just slightly better than deadlocking right away.  Which is why this
> reserve access can not be separated out: if you can do *anything*
> better than hanging, do it.  If not, use __GFP_NOFAIL.

Agreed.
 
> > > My question here would be: are there any NOFS allocations that *don't*
> > > want this behavior?  Does it even make sense to require this separate
> > > annotation or should we just make it the default?
> > > 
> > > The argument here was always that NOFS allocations are very limited in
> > > their reclaim powers and will trigger OOM prematurely.  However, the
> > > way we limit dirty memory these days forces most cache to be clean at
> > > all times, and direct reclaim in general hasn't been allowed to issue
> > > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > > really weaker than regular direct reclaim. 
> > 
> > What about [di]cache and some others fs specific shrinkers (and heavy
> > metadata loads)?
> 
> My bad, I forgot about those.  But it doesn't really change the basic
> question of whether we want to change the GFP_NOFS default or merely
> annotate individual sites that want to try harder.

My understanding was the latter one. If you look at page cache allocations
which use mapping_gfp_mask (e.g. xfs is using GFP_NOFS for that context
all the time) then those do not really have to try harder.

> > > The only exception is that
> > > it might block writeback, so we'd go OOM if the only reclaimables left
> > > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > OOM killer is hardly acceptable by most users I've heard from. OOM
> > killer is the _last_ resort and if the allocation is restricted then
> > we shouldn't use the big hammer.
> 
> We *are* talking about the last resort for these allocations!  There
> is nothing else we can do to avoid allocation failure at this point.
> Absent a reservation system, we have the choice between failing after
> reclaim - which Dave said was too fragile for XFS - or OOM killing.

As per other emails in this thread (e.g.
http://marc.info/?l=linux-mm&m=142897087230385&w=2), I understood that
the access to a small portion of emergency pool would be sufficient to
release the pressure, and that sounds preferable to me over
destructive reclaim attempts.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations
  2015-03-25  6:17 ` [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations Johannes Weiner
@ 2015-04-14 16:49   ` Michal Hocko
  2015-04-24 19:13     ` Johannes Weiner
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2015-04-14 16:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

[Sorry for the late reply]

On Wed 25-03-15 02:17:13, Johannes Weiner wrote:
> The OOM killer connects random tasks in the system with unknown
> dependencies between them, and the OOM victim might well get blocked
> behind the task that is trying to allocate.  That means that while
> allocations can issue OOM kills to improve the low memory situation,
> which generally frees more than they are going to take out, they can
> not rely on their *own* OOM kills to make forward progress for them.
> 
> Secondly, we want to avoid a racing allocation swooping in to steal
> the work of the OOM killing allocation, causing spurious allocation
> failures.  The one that put in the work must have priority - if its
> efforts are enough to serve both allocations that's fine, otherwise
> concurrent allocations should be forced to issue their own OOM kills.
> 
> Keep some pages below the min watermark reserved for OOM-killing
> allocations to protect them from blocking victims and concurrent
> allocations not pulling their weight.

Yes, this makes a lot of sense. I am just not sure I am happy about some
details.

[...]
> @@ -3274,6 +3290,7 @@ void show_free_areas(unsigned int filter)
>  		show_node(zone);
>  		printk("%s"
>  			" free:%lukB"
> +			" oom:%lukB"
>  			" min:%lukB"
>  			" low:%lukB"
>  			" high:%lukB"
> @@ -3306,6 +3323,7 @@ void show_free_areas(unsigned int filter)
>  			"\n",
>  			zone->name,
>  			K(zone_page_state(zone, NR_FREE_PAGES)),
> +			K(oom_wmark_pages(zone)),
>  			K(min_wmark_pages(zone)),
>  			K(low_wmark_pages(zone)),
>  			K(high_wmark_pages(zone)),

Do we really need to export the new watermark into the userspace?
How would it help user/admin? OK, maybe show_free_areas could be helpful
for oom reports but why to export it in /proc/zoneinfo which is done
further down?

> @@ -5747,17 +5765,18 @@ static void __setup_per_zone_wmarks(void)
>  
>  			min_pages = zone->managed_pages / 1024;
>  			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
> -			zone->watermark[WMARK_MIN] = min_pages;
> +			zone->watermark[WMARK_OOM] = min_pages;
>  		} else {
>  			/*
>  			 * If it's a lowmem zone, reserve a number of pages
>  			 * proportionate to the zone's size.
>  			 */
> -			zone->watermark[WMARK_MIN] = tmp;
> +			zone->watermark[WMARK_OOM] = tmp;
>  		}
>  
> -		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
> -		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
> +		zone->watermark[WMARK_MIN]  = oom_wmark_pages(zone) + (tmp >> 3);
> +		zone->watermark[WMARK_LOW]  = oom_wmark_pages(zone) + (tmp >> 2);
> +		zone->watermark[WMARK_HIGH] = oom_wmark_pages(zone) + (tmp >> 1);

This will basically elevate the min watermark, right? And that might lead
to subtle performance differences even when OOM killer is not invoked
because the direct reclaim will start sooner.
Shouldn't we rather give WMARK_OOM half of WMARK_MIN instead?

>  
>  		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
>  			high_wmark_pages(zone) - low_wmark_pages(zone) -
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1fd0886a389f..a62f16ef524c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1188,6 +1188,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  	seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
>  	seq_printf(m,
>  		   "\n  pages free     %lu"
> +		   "\n        oom      %lu"
>  		   "\n        min      %lu"
>  		   "\n        low      %lu"
>  		   "\n        high     %lu"
> @@ -1196,6 +1197,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n        present  %lu"
>  		   "\n        managed  %lu",
>  		   zone_page_state(zone, NR_FREE_PAGES),
> +		   oom_wmark_pages(zone),
>  		   min_wmark_pages(zone),
>  		   low_wmark_pages(zone),
>  		   high_wmark_pages(zone),
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 10/12] mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
  2015-03-25  6:17 ` [patch 10/12] mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations Johannes Weiner
@ 2015-04-14 16:55   ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2015-04-14 16:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Wed 25-03-15 02:17:14, Johannes Weiner wrote:
> __GFP_NOFAIL allocations can deadlock the OOM killer when they're
> holding locks that the OOM victim might need to exit.  When that
> happens the allocation may never complete, which has disastrous
> effects on things like in-flight filesystem transactions.
> 
> When the system is OOM, allow __GFP_NOFAIL allocations to dip into the
> emergency reserves in the hope that this will allow transactions and
> writeback to complete and the deadlock can be avoided.

This one slipped through. Sorry.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/page_alloc.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c165016175d..832ad1c7cd4f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2403,9 +2403,17 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	 * from exiting.  While allocations can use OOM kills to free
>  	 * memory, they can not necessarily rely on their *own* kills
>  	 * to make forward progress.
> +	 *
> +	 * This last point is crucial for __GFP_NOFAIL allocations.
> +	 * Since they can't quit, they might actually deadlock, so
> +	 * give them hail mary access to the emergency reserves.
>  	 */
> -	alloc_flags &= ~ALLOC_WMARK_MASK;
> -	alloc_flags |= ALLOC_WMARK_OOM;
> +	if (gfp_mask & __GFP_NOFAIL) {
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +	} else {
> +		alloc_flags &= ~ALLOC_WMARK_MASK;
> +		alloc_flags |= ALLOC_WMARK_OOM;
> +	}
>  out:
>  	mutex_unlock(&oom_lock);
>  alloc:
> -- 
> 2.3.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations
  2015-04-14 16:49   ` Michal Hocko
@ 2015-04-24 19:13     ` Johannes Weiner
  0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2015-04-24 19:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, linux-kernel, Linus Torvalds,
	Andrew Morton, Tetsuo Handa, Huang Ying, Andrea Arcangeli,
	Dave Chinner, Theodore Ts'o

On Tue, Apr 14, 2015 at 06:49:40PM +0200, Michal Hocko wrote:
> On Wed 25-03-15 02:17:13, Johannes Weiner wrote:
> > @@ -5747,17 +5765,18 @@ static void __setup_per_zone_wmarks(void)
> >  
> >  			min_pages = zone->managed_pages / 1024;
> >  			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
> > -			zone->watermark[WMARK_MIN] = min_pages;
> > +			zone->watermark[WMARK_OOM] = min_pages;
> >  		} else {
> >  			/*
> >  			 * If it's a lowmem zone, reserve a number of pages
> >  			 * proportionate to the zone's size.
> >  			 */
> > -			zone->watermark[WMARK_MIN] = tmp;
> > +			zone->watermark[WMARK_OOM] = tmp;
> >  		}
> >  
> > -		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
> > -		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
> > +		zone->watermark[WMARK_MIN]  = oom_wmark_pages(zone) + (tmp >> 3);
> > +		zone->watermark[WMARK_LOW]  = oom_wmark_pages(zone) + (tmp >> 2);
> > +		zone->watermark[WMARK_HIGH] = oom_wmark_pages(zone) + (tmp >> 1);
> 
> This will basically elevate the min watermark, right? And that might lead
> to subtle performance differences even when OOM killer is not invoked
> because the direct reclaim will start sooner.

It will move the min watermark a bit closer to the kswapd watermarks,
so I guess the risk of entering direct reclaim when kswapd won't wake
up fast enough before concurrent allocator slowpaths deplete the zone
from low to min is marginally increased.  That seems like a farfetched
worry, especially given that waking up a sleeping kswapd is not a high
frequency event in the first place.
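To make the change in watermark spacing concrete, the arithmetic from the quoted __setup_per_zone_wmarks() hunk can be replayed in isolation. This is plain arithmetic in userspace C, not kernel code; "tmp" is the per-zone baseline the function computes, and the sample value below is arbitrary:

```c
#include <assert.h>

/* Watermarks in pages for one zone, given its baseline "tmp". */
struct wmarks {
	unsigned long oom, min, low, high;
};

/* Pre-patch layout: min = tmp, low/high offset from min. */
static struct wmarks wmarks_old(unsigned long tmp)
{
	struct wmarks w = { 0 };

	w.min  = tmp;
	w.low  = w.min + (tmp >> 2);
	w.high = w.min + (tmp >> 1);
	return w;
}

/* Patched layout: oom takes the old min slot, and min is raised by
 * tmp/8 above it, while low and high stay where they were. */
static struct wmarks wmarks_patched(unsigned long tmp)
{
	struct wmarks w;

	w.oom  = tmp;
	w.min  = w.oom + (tmp >> 3);
	w.low  = w.oom + (tmp >> 2);
	w.high = w.oom + (tmp >> 1);
	return w;
}
```

The effect is that low and high are unchanged, but the free-page gap between min and low shrinks from tmp/4 to tmp/8, which is the window in which kswapd must wake before allocators hit direct reclaim.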

> Shouldn't we rather give WMARK_OOM half of WMARK_MIN instead?

I guess conceptually that would work as well, since an OOM killing
task is technically reclaiming memory and this reserve is meant to
help reclaiming tasks make forward progress.

That being said, the separate OOM reserve was designed for when the
allocation can actually fail: deplete our own little reserve before
returning failure.  But it looks like neither the low-order nor the
GFP_NOFS deadlock fixes got any traction, and so right now all OOM
killing allocations still have the potential to deadlock.  Is there a
reason we shouldn't just let them do an ALLOC_NO_WATERMARK allocation
after the OOM victim exited (or timed out)?

Otherwise, I'll just do that in the next iteration.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch resend] android, lmk: avoid setting TIF_MEMDIE if process has already exited
  2015-04-02 23:01         ` [patch] android, lmk: avoid setting TIF_MEMDIE if process has already exited David Rientjes
@ 2015-04-28 22:50           ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2015-04-28 22:50 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Arve Hjønnevåg, Riley Andrews, Andrew Morton,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm, devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2040 bytes --]

TIF_MEMDIE should not be set on a process if it does not have a valid 
->mm, and this is protected by task_lock().

If TIF_MEMDIE gets set after the mm has detached, and the process fails to 
exit, then the oom killer will defer forever waiting for it to exit.

Make sure that the mm is still valid before setting TIF_MEMDIE by way of 
mark_tsk_oom_victim().

Cc: "Arve Hjønnevåg" <arve@android.com>
Cc: Riley Andrews <riandrews@android.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/staging/android/lowmemorykiller.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -156,20 +156,27 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 			     p->pid, p->comm, oom_score_adj, tasksize);
 	}
 	if (selected) {
-		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
-			     selected->pid, selected->comm,
-			     selected_oom_score_adj, selected_tasksize);
-		lowmem_deathpending_timeout = jiffies + HZ;
+		task_lock(selected);
+		if (!selected->mm) {
+			/* Already exited, cannot do mark_tsk_oom_victim() */
+			task_unlock(selected);
+			goto out;
+		}
 		/*
 		 * FIXME: lowmemorykiller shouldn't abuse global OOM killer
 		 * infrastructure. There is no real reason why the selected
 		 * task should have access to the memory reserves.
 		 */
 		mark_tsk_oom_victim(selected);
+		task_unlock(selected);
+		lowmem_print(1, "send sigkill to %d (%s), adj %hd, size %d\n",
+			     selected->pid, selected->comm,
+			     selected_oom_score_adj, selected_tasksize);
+		lowmem_deathpending_timeout = jiffies + HZ;
 		send_sig(SIGKILL, selected, 0);
 		rem += selected_tasksize;
 	}
-
+out:
 	lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
 		     sc->nr_to_scan, sc->gfp_mask, rem);
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2015-04-28 22:50 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-25  6:17 [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Johannes Weiner
2015-03-25  6:17 ` [patch 01/12] mm: oom_kill: remove unnecessary locking in oom_enable() Johannes Weiner
2015-03-26  0:51   ` David Rientjes
2015-03-26 11:51     ` Michal Hocko
2015-03-26 13:18       ` Michal Hocko
2015-03-26 19:30         ` David Rientjes
2015-03-26 11:43   ` Michal Hocko
2015-03-26 20:05   ` David Rientjes
2015-03-25  6:17 ` [patch 02/12] mm: oom_kill: clean up victim marking and exiting interfaces Johannes Weiner
2015-03-26  3:34   ` David Rientjes
2015-03-26 11:54   ` Michal Hocko
2015-03-25  6:17 ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Johannes Weiner
2015-03-26  3:31   ` David Rientjes
2015-03-26 11:05     ` Johannes Weiner
2015-03-26 19:50       ` David Rientjes
2015-03-30 14:48         ` Michal Hocko
2015-04-02 23:01         ` [patch] android, lmk: avoid setting TIF_MEMDIE if process has already exited David Rientjes
2015-04-28 22:50           ` [patch resend] " David Rientjes
2015-03-26 11:57   ` [patch 03/12] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear Michal Hocko
2015-03-25  6:17 ` [patch 04/12] mm: oom_kill: remove unnecessary locking in exit_oom_victim() Johannes Weiner
2015-03-26 12:53   ` Michal Hocko
2015-03-26 13:01     ` Michal Hocko
2015-03-26 15:10       ` Johannes Weiner
2015-03-26 15:04     ` Johannes Weiner
2015-03-25  6:17 ` [patch 05/12] mm: oom_kill: generalize OOM progress waitqueue Johannes Weiner
2015-03-26 13:03   ` Michal Hocko
2015-03-25  6:17 ` [patch 06/12] mm: oom_kill: simplify OOM killer locking Johannes Weiner
2015-03-26 13:31   ` Michal Hocko
2015-03-26 15:17     ` Johannes Weiner
2015-03-26 16:07       ` Michal Hocko
2015-03-25  6:17 ` [patch 07/12] mm: page_alloc: inline should_alloc_retry() Johannes Weiner
2015-03-26 14:11   ` Michal Hocko
2015-03-26 15:18     ` Johannes Weiner
2015-03-25  6:17 ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Johannes Weiner
2015-03-25 14:15   ` Tetsuo Handa
2015-03-25 17:01     ` Vlastimil Babka
2015-03-26 11:28       ` Johannes Weiner
2015-03-26 11:24     ` Johannes Weiner
2015-03-26 14:32       ` Michal Hocko
2015-03-26 15:23         ` Johannes Weiner
2015-03-26 15:38           ` Michal Hocko
2015-03-26 18:17             ` Johannes Weiner
2015-03-27 14:01             ` [patch 08/12] mm: page_alloc: wait for OOM killer progressbefore retrying Tetsuo Handa
2015-03-26 15:58   ` [patch 08/12] mm: page_alloc: wait for OOM killer progress before retrying Michal Hocko
2015-03-26 18:23     ` Johannes Weiner
2015-03-25  6:17 ` [patch 09/12] mm: page_alloc: private memory reserves for OOM-killing allocations Johannes Weiner
2015-04-14 16:49   ` Michal Hocko
2015-04-24 19:13     ` Johannes Weiner
2015-03-25  6:17 ` [patch 10/12] mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations Johannes Weiner
2015-04-14 16:55   ` Michal Hocko
2015-03-25  6:17 ` [patch 11/12] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM Johannes Weiner
2015-03-26 14:50   ` Michal Hocko
2015-03-25  6:17 ` [patch 12/12] mm: page_alloc: do not lock up low-order " Johannes Weiner
2015-03-26 15:32   ` Michal Hocko
2015-03-26 19:58 ` [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Dave Chinner
2015-03-27 15:05   ` Johannes Weiner
2015-03-30  0:32     ` Dave Chinner
2015-03-30 19:31       ` Johannes Weiner
2015-04-01 15:19       ` Michal Hocko
2015-04-01 21:39         ` Dave Chinner
2015-04-02  7:29           ` Michal Hocko
2015-04-07 14:18         ` Johannes Weiner
2015-04-11  7:29           ` Tetsuo Handa
2015-04-13 12:49             ` Michal Hocko
2015-04-13 12:46           ` Michal Hocko
2015-04-14  0:11             ` Dave Chinner
2015-04-14  7:20               ` Michal Hocko
2015-04-14 10:36             ` Johannes Weiner
2015-04-14 14:23               ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).