LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [patch] mm, thp: do not cause memcg oom for thp
@ 2018-03-19 21:10 David Rientjes
  2018-03-20  7:16 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: David Rientjes @ 2018-03-19 21:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Michal Hocko, Vlastimil Babka, linux-kernel,
	linux-mm

Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") changed the page allocator to no longer detect thp
allocations based on __GFP_NORETRY.

It did not, however, modify the mem cgroup try_charge() path to avoid oom
kill for either khugepaged collapsing or thp faulting.  It is never
expected to oom kill a process to allocate a hugepage for thp; reclaim is
governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and
charging) should fallback instead of oom killing processes.

Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/huge_memory.c | 5 +++--
 mm/khugepaged.c  | 8 ++++++--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,7 +555,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, gfp | __GFP_NORETRY, &memcg,
+				  true)) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1316,7 +1317,7 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	}
 
 	if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
-					huge_gfp, &memcg, true))) {
+				huge_gfp | __GFP_NORETRY, &memcg, true))) {
 		put_page(new_page);
 		split_huge_pmd(vma, vmf->pmd, vmf->address);
 		if (page)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -960,7 +960,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 		goto out_nolock;
 	}
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+	/* Do not oom kill for khugepaged charges */
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp | __GFP_NORETRY,
+					   &memcg, true))) {
 		result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out_nolock;
 	}
@@ -1319,7 +1321,9 @@ static void collapse_shmem(struct mm_struct *mm,
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+	/* Do not oom kill for khugepaged charges */
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp | __GFP_NORETRY,
+					   &memcg, true))) {
 		result = SCAN_CGROUP_CHARGE_FAIL;
 		goto out;
 	}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-19 21:10 [patch] mm, thp: do not cause memcg oom for thp David Rientjes
@ 2018-03-20  7:16 ` Michal Hocko
  2018-03-20 20:25   ` David Rientjes
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2018-03-20  7:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Mon 19-03-18 14:10:05, David Rientjes wrote:
> Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> madvised allocations") changed the page allocator to no longer detect thp
> allocations based on __GFP_NORETRY.
> 
> It did not, however, modify the mem cgroup try_charge() path to avoid oom
> kill for either khugepaged collapsing or thp faulting.  It is never
> expected to oom kill a process to allocate a hugepage for thp; reclaim is
> governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and
> charging) should fallback instead of oom killing processes.

For some reason I thought that the charging path simply bails out for
costly orders - effectively the same thing as for the global OOM killer.
But we do not. Is there any reason to not do that though? Why don't we
simply do


diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1a917b5b7b7..08accbcd1a18 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom)
+	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
 		return;
 	/*
 	 * We are in the middle of the charge context here, so we
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-20  7:16 ` Michal Hocko
@ 2018-03-20 20:25   ` David Rientjes
  2018-03-21  8:22     ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: David Rientjes @ 2018-03-20 20:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Tue, 20 Mar 2018, Michal Hocko wrote:

> > Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> > madvised allocations") changed the page allocator to no longer detect thp
> > allocations based on __GFP_NORETRY.
> > 
> > It did not, however, modify the mem cgroup try_charge() path to avoid oom
> > kill for either khugepaged collapsing or thp faulting.  It is never
> > expected to oom kill a process to allocate a hugepage for thp; reclaim is
> > governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and
> > charging) should fallback instead of oom killing processes.
> 
> For some reason I thought that the charging path simply bails out for
> costly orders - effectively the same thing as for the global OOM killer.
> But we do not. Is there any reason to not do that though? Why don't we
> simply do
> 

I'm not sure of the expectation of high-order memcg charging without 
__GFP_NORETRY, I only know that khugepaged can now cause memcg oom kills 
when trying to collapse memory, and then subsequently found that the same 
situation exists for faulting instead of falling back to small pages.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d1a917b5b7b7..08accbcd1a18 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom)
> +	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
>  		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we

That may make sense as an additional patch, but for thp allocations we 
don't want to retry reclaim nr_retries times anyway; we want the old 
behavior of __GFP_NORETRY before commit 2516035499b9.  So the above would 
be a follow-up patch that wouldn't replace mine.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-20 20:25   ` David Rientjes
@ 2018-03-21  8:22     ` Michal Hocko
  2018-03-21 19:37       ` David Rientjes
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2018-03-21  8:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Tue 20-03-18 13:25:23, David Rientjes wrote:
> On Tue, 20 Mar 2018, Michal Hocko wrote:
> 
> > > Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> > > madvised allocations") changed the page allocator to no longer detect thp
> > > allocations based on __GFP_NORETRY.
> > > 
> > > It did not, however, modify the mem cgroup try_charge() path to avoid oom
> > > kill for either khugepaged collapsing or thp faulting.  It is never
> > > expected to oom kill a process to allocate a hugepage for thp; reclaim is
> > > governed by the thp defrag mode and MADV_HUGEPAGE, but allocations (and
> > > charging) should fallback instead of oom killing processes.
> > 
> > For some reason I thought that the charging path simply bails out for
> > costly orders - effectively the same thing as for the global OOM killer.
> > But we do not. Is there any reason to not do that though? Why don't we
> > simply do
> > 
> 
> I'm not sure of the expectation of high-order memcg charging without 
> __GFP_NORETRY,

It should be semantically compatible with the allocation path.

> I only know that khugepaged can now cause memcg oom kills 
> when trying to collapse memory, and then subsequently found that the same 
> situation exists for faulting instead of falling back to small pages.

And that is clearly a bug because page allocator doesn't oom kill while
the memcg charge does for the same gfp flag. That should be fixed.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d1a917b5b7b7..08accbcd1a18 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> >  
> >  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  {
> > -	if (!current->memcg_may_oom)
> > +	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> >  		return;
> >  	/*
> >  	 * We are in the middle of the charge context here, so we
> 
> That may make sense as an additional patch, but for thp allocations we 
> don't want to retry reclaim nr_retries times anyway; we want the old 
> behavior of __GFP_NORETRY before commit 2516035499b9.

Why? Allocation and the charge path should use the same gfp mask unless
there is a strong reason for it. If you have one then please mention it
in the changelog.

> So the above would be a follow-up patch that wouldn't replace mine.

Unless there is a strong reason to use different gfp mask for the
allocation and the charge then your fix is actually wrong.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-21  8:22     ` Michal Hocko
@ 2018-03-21 19:37       ` David Rientjes
  2018-03-21 20:53         ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: David Rientjes @ 2018-03-21 19:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Wed, 21 Mar 2018, Michal Hocko wrote:

> > I'm not sure of the expectation of high-order memcg charging without 
> > __GFP_NORETRY,
> 
> It should be semantically compatible with the allocation path.
> 

That doesn't make sense, the allocation path needs to allocate contiguous 
memory for the high order, the charging path just needs to charge a number 
of pages.  Why would the allocation and charging path be compatible when 
one needs to reclaim contiguous memory or compact memory and the the other 
just needs to reclaim any memory?

> > I only know that khugepaged can now cause memcg oom kills 
> > when trying to collapse memory, and then subsequently found that the same 
> > situation exists for faulting instead of falling back to small pages.
> 
> And that is clearly a bug because page allocator doesn't oom kill while
> the memcg charge does for the same gfp flag. That should be fixed.
> 

It's fixed with my patch, yes.  The page allocator doesn't oom kill for 
orders over PAGE_ALLOC_COSTLY_ORDER only because it is unlikely to free 
order-4 and higher contiguous memory as a result; it's in the name, it's a 
costly order for the page allocator.  Using it as a heuristic in the memcg 
charging path seems strange.

> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index d1a917b5b7b7..08accbcd1a18 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1493,7 +1493,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > >  
> > >  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > >  {
> > > -	if (!current->memcg_may_oom)
> > > +	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> > >  		return;
> > >  	/*
> > >  	 * We are in the middle of the charge context here, so we
> > 
> > That may make sense as an additional patch, but for thp allocations we 
> > don't want to retry reclaim nr_retries times anyway; we want the old 
> > behavior of __GFP_NORETRY before commit 2516035499b9.
> 
> Why? Allocation and the charge path should use the same gfp mask unless
> there is a strong reason for it. If you have one then please mention it
> in the changelog.
> 

It shouldn't use the same gfp mask for thp allocations because the page 
allocator needs to allocate contiguous memory and mem cgroup just needs to 
charge a number of pages.  Khugepaged will fail the allocation without 
reclaim or compaction if its defrag setting does not allow it.  If defrag 
is allowed, the page allocator policy is that oom kill is unlikely to free 
order-4 and above contiguous memory without killing multiple victims.  
That's not the case with the memcg charging path: oom killing a process 
will always uncharge memory, it need not be contiguous.  When we lost 
__GFP_NORETRY because of a page allocator change to better distinguish thp 
allocations, it left the door open to oom killing for thp through the 
charge path when fallback is possible.

Specifying __GFP_NORETRY for the page allocator for thp allocations would 
prematurely cause them to fail depending on the defrag settings.  The page 
allocator implementation always prevents oom kill for these allocations 
with or without the bit.  Specifying it for the charging path allows it to 
fail without oom kill and relies specifically on the bit.  Trying to 
introduce a page allocator-like heuristic to the charge path, which 
doesn't require contiguous memory, based on order so it wouldn't need 
__GFP_NORETRY would be a separate change.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-21 19:37       ` David Rientjes
@ 2018-03-21 20:53         ` Michal Hocko
  2018-03-21 21:27           ` David Rientjes
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2018-03-21 20:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Wed 21-03-18 12:37:10, David Rientjes wrote:
> On Wed, 21 Mar 2018, Michal Hocko wrote:
> 
> > > I'm not sure of the expectation of high-order memcg charging without 
> > > __GFP_NORETRY,
> > 
> > It should be semantically compatible with the allocation path.
> > 
> 
> That doesn't make sense, the allocation path needs to allocate contiguous 
> memory for the high order, the charging path just needs to charge a number 
> of pages.  Why would the allocation and charging path be compatible when 
> one needs to reclaim contiguous memory or compact memory and the the other 
> just needs to reclaim any memory?

Because you do not want to see surprises. E.g. seeing unexpected OOMs
for large allocatations. Just think about it. Do you really want to have
a different reclaim policy for the allocation and charging for all
allocating paths? THP is by no means special. We do have different gfp
masks for THP to express how hard to try. Why should the charge path
behave any different?

You are right that the allocation path involves compaction and that is
different from the charging path. But that is an implementation detail
of the current implementation. Semantically, it is the gfp mask to tell
how hard to try and treating just because of how the current code works
is simply wrong.

> > > I only know that khugepaged can now cause memcg oom kills 
> > > when trying to collapse memory, and then subsequently found that the same 
> > > situation exists for faulting instead of falling back to small pages.
> > 
> > And that is clearly a bug because page allocator doesn't oom kill while
> > the memcg charge does for the same gfp flag. That should be fixed.
> > 
> 
> It's fixed with my patch, yes.

Your patch only fixes up the current situation. Anytime a new THP
allocation emerges that code path has to be careful to add
__GFP_NORETRY to not regress again. That is just too error prone.

> The page allocator doesn't oom kill for 
> orders over PAGE_ALLOC_COSTLY_ORDER only because it is unlikely to free 
> order-4 and higher contiguous memory as a result; it's in the name, it's a 
> costly order for the page allocator.  Using it as a heuristic in the memcg 
> charging path seems strange.

It is not strange at all. We have the concept that large allocations are
OK to fail rather than cause disruptive actions. And the same applies
for charges as well. There is no reason to over reclaim or even OOM kill
for a large charge if we have a fallback.

Seriously. Making different polices to the allocation and the memcg
charge will lead to both unexpected behavior and a maintenance mess.
And there is no good reason for that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-21 20:53         ` Michal Hocko
@ 2018-03-21 21:27           ` David Rientjes
  2018-03-22  8:11             ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: David Rientjes @ 2018-03-21 21:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Wed, 21 Mar 2018, Michal Hocko wrote:

> > That doesn't make sense, the allocation path needs to allocate contiguous 
> > memory for the high order, the charging path just needs to charge a number 
> > of pages.  Why would the allocation and charging path be compatible when 
> > one needs to reclaim contiguous memory or compact memory and the the other 
> > just needs to reclaim any memory?
> 
> Because you do not want to see surprises. E.g. seeing unexpected OOMs
> for large allocatations. Just think about it. Do you really want to have
> a different reclaim policy for the allocation and charging for all
> allocating paths?

It depends on the use of __GFP_NORETRY.  If the high-order charge is 
__GFP_NORETRY, it does not oom kill.  It is left to the caller.  Just 
because thp allocations have been special cased in the page allocator to 
be able to remove __GFP_NORETRY without fixing the memcg charge path does 
not mean memcg needs a special heuristic for high-order memory when it 
does not require contiguous memory.  You say you don't want any surprises, 
but now you are changing behavior needlessly for all charges with
order > PAGE_ALLOC_COSTLY_ORDER that do not use __GFP_NORETRY.

> You are right that the allocation path involves compaction and that is
> different from the charging path. But that is an implementation detail
> of the current implementation.
> 

Lol, the fact that the page allocator requires contiguous memory is not an 
implementation detail of the current implementation.

> Your patch only fixes up the current situation. Anytime a new THP
> allocation emerges that code path has to be careful to add
> __GFP_NORETRY to not regress again. That is just too error prone.
> 

We could certainly handle it by adding helpers similar to 
alloc_hugepage_direct_gfpmask() and alloc_hugepage_khugepaged_gfpmask() 
which are employed for the same purpose for the page allocator gfp mask.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [patch] mm, thp: do not cause memcg oom for thp
  2018-03-21 21:27           ` David Rientjes
@ 2018-03-22  8:11             ` Michal Hocko
  0 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2018-03-22  8:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Kirill A. Shutemov, Vlastimil Babka, linux-kernel,
	linux-mm

On Wed 21-03-18 14:27:10, David Rientjes wrote:
> On Wed, 21 Mar 2018, Michal Hocko wrote:
> 
> > > That doesn't make sense, the allocation path needs to allocate contiguous 
> > > memory for the high order, the charging path just needs to charge a number 
> > > of pages.  Why would the allocation and charging path be compatible when 
> > > one needs to reclaim contiguous memory or compact memory and the the other 
> > > just needs to reclaim any memory?
> > 
> > Because you do not want to see surprises. E.g. seeing unexpected OOMs
> > for large allocatations. Just think about it. Do you really want to have
> > a different reclaim policy for the allocation and charging for all
> > allocating paths?
> 
> It depends on the use of __GFP_NORETRY.  If the high-order charge is 
> __GFP_NORETRY, it does not oom kill.  It is left to the caller.

How does the caller say it when the charge path is hidden inside the
allocator - e.g. inside kmalloc?

> Just 
> because thp allocations have been special cased in the page allocator to 
> be able to remove __GFP_NORETRY without fixing the memcg charge path does 
> not mean memcg needs a special heuristic for high-order memory when it 
> does not require contiguous memory.  You say you don't want any surprises, 
> but now you are changing behavior needlessly for all charges with
> order > PAGE_ALLOC_COSTLY_ORDER that do not use __GFP_NORETRY.

Not really. Only the #PF path is allowed to trigger the oom killer now
so high order allocations (mostly coming from kmalloc) do not trigger
OOM killer anyway. But this is the thing that might change in future and
therefore I think it is essential to have a different oom behavior than
the allocator.

> > You are right that the allocation path involves compaction and that is
> > different from the charging path. But that is an implementation detail
> > of the current implementation.
> > 
> 
> Lol, the fact that the page allocator requires contiguous memory is not an 
> implementation detail of the current implementation.

The underlying mechanism might be different in future. So your lol is
not really appropriate.

> > Your patch only fixes up the current situation. Anytime a new THP
> > allocation emerges that code path has to be careful to add
> > __GFP_NORETRY to not regress again. That is just too error prone.
> > 
> 
> We could certainly handle it by adding helpers similar to 
> alloc_hugepage_direct_gfpmask() and alloc_hugepage_khugepaged_gfpmask() 
> which are employed for the same purpose for the page allocator gfp mask.

This doesn't solve the problem in general (e.g. kmalloc).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-03-22  8:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-19 21:10 [patch] mm, thp: do not cause memcg oom for thp David Rientjes
2018-03-20  7:16 ` Michal Hocko
2018-03-20 20:25   ` David Rientjes
2018-03-21  8:22     ` Michal Hocko
2018-03-21 19:37       ` David Rientjes
2018-03-21 20:53         ` Michal Hocko
2018-03-21 21:27           ` David Rientjes
2018-03-22  8:11             ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).