LKML Archive on lore.kernel.org
* Re: [PATCH v1] mm: hwpoison: disable memory error handling on 1GB hugepage
       [not found] <1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com>
@ 2018-01-29  6:30 ` Naoya Horiguchi
  2018-01-29  9:54   ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2018-01-29  6:30 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi

My apologies, I forgot to CC the mailing lists.

On Mon, Jan 29, 2018 at 03:28:03PM +0900, Naoya Horiguchi wrote:
> Recently the following BUG was reported:
> 
>     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
>     Memory failure: 0x3c0000: recovery action for huge page: Recovered
>     BUG: unable to handle kernel paging request at ffff8dfcc0003000
>     IP: gup_pgd_range+0x1f0/0xc20
>     PGD 17ae72067 P4D 17ae72067 PUD 0
>     Oops: 0000 [#1] SMP PTI
>     ...
>     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> 
> You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
> a 1GB hugepage. This happens because get_user_pages_fast() is not aware
> of a migration entry on pud that was created in the 1st madvise() event.
> 
> I think that conversion to pud-aligned migration entry is working,
> but other MM code walking over page table isn't prepared for it.
> We need some time and effort to make all this work properly, so
> this patch avoids the reported bug by just disabling error handling
> for 1GB hugepage.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
>  include/linux/mm.h  | 1 +
>  mm/memory-failure.c | 7 +++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
> index 63f7ba1..166864e 100644
> --- v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h
> +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
> @@ -2607,6 +2607,7 @@ enum mf_action_page_type {
>  	MF_MSG_POISONED_HUGE,
>  	MF_MSG_HUGE,
>  	MF_MSG_FREE_HUGE,
> +	MF_MSG_GIGANTIC,
>  	MF_MSG_UNMAP_FAILED,
>  	MF_MSG_DIRTY_SWAPCACHE,
>  	MF_MSG_CLEAN_SWAPCACHE,
> diff --git v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
> index d530ac1..c497588 100644
> --- v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c
> +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
> @@ -508,6 +508,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> +	[MF_MSG_GIGANTIC]		= "gigantic page",
>  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
> @@ -1090,6 +1091,12 @@ static int memory_failure_hugetlb(unsigned long pfn, int trapno, int flags)
>  		return 0;
>  	}
>  
> +	if (hstate_is_gigantic(page_hstate(head))) {
> +		action_result(pfn, MF_MSG_GIGANTIC, MF_IGNORED);
> +		res = -EBUSY;
> +		goto out;
> +	}
> +
>  	if (!hwpoison_user_mappings(p, pfn, trapno, flags, &head)) {
>  		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
>  		res = -EBUSY;
> -- 
> 2.7.0
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-01-29  6:30 ` [PATCH v1] mm: hwpoison: disable memory error handling on 1GB hugepage Naoya Horiguchi
@ 2018-01-29  9:54   ` Michal Hocko
  2018-01-29 18:08     ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2018-01-29  9:54 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: Andrew Morton, Mike Kravetz, linux-mm, linux-kernel

On Mon 29-01-18 06:30:55, Naoya Horiguchi wrote:
> My apology, I forgot to CC to the mailing lists.
> 
> On Mon, Jan 29, 2018 at 03:28:03PM +0900, Naoya Horiguchi wrote:
> > Recently the following BUG was reported:
> > 
> >     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
> >     Memory failure: 0x3c0000: recovery action for huge page: Recovered
> >     BUG: unable to handle kernel paging request at ffff8dfcc0003000
> >     IP: gup_pgd_range+0x1f0/0xc20
> >     PGD 17ae72067 P4D 17ae72067 PUD 0
> >     Oops: 0000 [#1] SMP PTI
> >     ...
> >     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
> >     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> > 
> > You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
> > a 1GB hugepage. This happens because get_user_pages_fast() is not aware
> > of a migration entry on pud that was created in the 1st madvise() event.

Do pgd size pages work properly?

> > I think that conversion to pud-aligned migration entry is working,
> > but other MM code walking over page table isn't prepared for it.
> > We need some time and effort to make all this work properly, so
> > this patch avoids the reported bug by just disabling error handling
> > for 1GB hugepage.

Can we also get some documentation which describes all the requirements
for HWPoison pages to work properly, please?

> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Acked-by: Michal Hocko <mhocko@suse.com>

We probably want a backport to stable as well. Although a regular process
cannot get giga pages easily without admin help, it is still not nice to
oops like this.

> > ---
> >  include/linux/mm.h  | 1 +
> >  mm/memory-failure.c | 7 +++++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
> > index 63f7ba1..166864e 100644
> > --- v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h
> > +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
> > @@ -2607,6 +2607,7 @@ enum mf_action_page_type {
> >  	MF_MSG_POISONED_HUGE,
> >  	MF_MSG_HUGE,
> >  	MF_MSG_FREE_HUGE,
> > +	MF_MSG_GIGANTIC,
> >  	MF_MSG_UNMAP_FAILED,
> >  	MF_MSG_DIRTY_SWAPCACHE,
> >  	MF_MSG_CLEAN_SWAPCACHE,
> > diff --git v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
> > index d530ac1..c497588 100644
> > --- v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c
> > +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
> > @@ -508,6 +508,7 @@ static const char * const action_page_types[] = {
> >  	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
> >  	[MF_MSG_HUGE]			= "huge page",
> >  	[MF_MSG_FREE_HUGE]		= "free huge page",
> > +	[MF_MSG_GIGANTIC]		= "gigantic page",
> >  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
> >  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
> >  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
> > @@ -1090,6 +1091,12 @@ static int memory_failure_hugetlb(unsigned long pfn, int trapno, int flags)
> >  		return 0;
> >  	}
> >  
> > +	if (hstate_is_gigantic(page_hstate(head))) {
> > +		action_result(pfn, MF_MSG_GIGANTIC, MF_IGNORED);
> > +		res = -EBUSY;
> > +		goto out;
> > +	}
> > +
> >  	if (!hwpoison_user_mappings(p, pfn, trapno, flags, &head)) {
> >  		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
> >  		res = -EBUSY;
> > -- 
> > 2.7.0
> > 
> > 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-01-29  9:54   ` Michal Hocko
@ 2018-01-29 18:08     ` Mike Kravetz
  2018-01-30  1:39       ` Naoya Horiguchi
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-01-29 18:08 UTC (permalink / raw)
  To: Michal Hocko, Naoya Horiguchi
  Cc: Andrew Morton, linux-mm, linux-kernel, Anshuman Khandual,
	Aneesh Kumar K.V

On 01/29/2018 01:54 AM, Michal Hocko wrote:
> On Mon 29-01-18 06:30:55, Naoya Horiguchi wrote:
>> My apology, I forgot to CC to the mailing lists.
>>
>> On Mon, Jan 29, 2018 at 03:28:03PM +0900, Naoya Horiguchi wrote:
>>> Recently the following BUG was reported:
>>>
>>>     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
>>>     Memory failure: 0x3c0000: recovery action for huge page: Recovered
>>>     BUG: unable to handle kernel paging request at ffff8dfcc0003000
>>>     IP: gup_pgd_range+0x1f0/0xc20
>>>     PGD 17ae72067 P4D 17ae72067 PUD 0
>>>     Oops: 0000 [#1] SMP PTI
>>>     ...
>>>     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
>>>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
>>>
>>> You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
>>> a 1GB hugepage. This happens because get_user_pages_fast() is not aware
>>> of a migration entry on pud that was created in the 1st madvise() event.
> 
> Do pgd size pages work properly?

Adding Anshuman and Aneesh as they added pgd support for power.  And,
this patch will disable that as well IIUC.

This patch makes sense for x86.  My only concern/question is for other
archs which may have huge page sizes defined which are > MAX_ORDER and
< PUD_SIZE.  These would also be classified as gigantic and impacted
by this patch.  Do these also have the same issue?

-- 
Mike Kravetz

>>> I think that conversion to pud-aligned migration entry is working,
>>> but other MM code walking over page table isn't prepared for it.
>>> We need some time and effort to make all this work properly, so
>>> this patch avoids the reported bug by just disabling error handling
>>> for 1GB hugepage.
> 
> Can we also get some documentation which would describe all requirements
> for HWPoison pages to work properly please?
> 
>>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> We probably want a backport to stable as well. Although regular process
> cannot get giga pages easily without admin help it is still not nice to
> oops like this.
> 
>>> ---
>>>  include/linux/mm.h  | 1 +
>>>  mm/memory-failure.c | 7 +++++++
>>>  2 files changed, 8 insertions(+)
>>>
>>> diff --git v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
>>> index 63f7ba1..166864e 100644
>>> --- v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h
>>> +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
>>> @@ -2607,6 +2607,7 @@ enum mf_action_page_type {
>>>  	MF_MSG_POISONED_HUGE,
>>>  	MF_MSG_HUGE,
>>>  	MF_MSG_FREE_HUGE,
>>> +	MF_MSG_GIGANTIC,
>>>  	MF_MSG_UNMAP_FAILED,
>>>  	MF_MSG_DIRTY_SWAPCACHE,
>>>  	MF_MSG_CLEAN_SWAPCACHE,
>>> diff --git v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
>>> index d530ac1..c497588 100644
>>> --- v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c
>>> +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
>>> @@ -508,6 +508,7 @@ static const char * const action_page_types[] = {
>>>  	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
>>>  	[MF_MSG_HUGE]			= "huge page",
>>>  	[MF_MSG_FREE_HUGE]		= "free huge page",
>>> +	[MF_MSG_GIGANTIC]		= "gigantic page",
>>>  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>>>  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>>>  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
>>> @@ -1090,6 +1091,12 @@ static int memory_failure_hugetlb(unsigned long pfn, int trapno, int flags)
>>>  		return 0;
>>>  	}
>>>  
>>> +	if (hstate_is_gigantic(page_hstate(head))) {
>>> +		action_result(pfn, MF_MSG_GIGANTIC, MF_IGNORED);
>>> +		res = -EBUSY;
>>> +		goto out;
>>> +	}
>>> +
>>>  	if (!hwpoison_user_mappings(p, pfn, trapno, flags, &head)) {
>>>  		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
>>>  		res = -EBUSY;
>>> -- 
>>> 2.7.0
>>>
>>>
> 


* Re: [PATCH v1] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-01-29 18:08     ` Mike Kravetz
@ 2018-01-30  1:39       ` Naoya Horiguchi
  2018-01-30  3:54         ` [PATCH v2] " Naoya Horiguchi
  0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2018-01-30  1:39 UTC (permalink / raw)
  To: Michal Hocko, Mike Kravetz
  Cc: Andrew Morton, linux-mm, linux-kernel, Anshuman Khandual,
	Aneesh Kumar K.V

Hi Michal, Mike,

On Mon, Jan 29, 2018 at 10:08:53AM -0800, Mike Kravetz wrote:
> On 01/29/2018 01:54 AM, Michal Hocko wrote:
> > On Mon 29-01-18 06:30:55, Naoya Horiguchi wrote:
> >> My apology, I forgot to CC to the mailing lists.
> >>
> >> On Mon, Jan 29, 2018 at 03:28:03PM +0900, Naoya Horiguchi wrote:
> >>> Recently the following BUG was reported:
> >>>
> >>>     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
> >>>     Memory failure: 0x3c0000: recovery action for huge page: Recovered
> >>>     BUG: unable to handle kernel paging request at ffff8dfcc0003000
> >>>     IP: gup_pgd_range+0x1f0/0xc20
> >>>     PGD 17ae72067 P4D 17ae72067 PUD 0
> >>>     Oops: 0000 [#1] SMP PTI
> >>>     ...
> >>>     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
> >>>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> >>>
> >>> You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
> >>> a 1GB hugepage. This happens because get_user_pages_fast() is not aware
> >>> of a migration entry on pud that was created in the 1st madvise() event.
> > 
> > Do pgd size pages work properly?

PGD-sized pages are unsupported now too, and this patch disables handling for that size as well.

> 
> Adding Anshuman and Aneesh as they added pgd support for power.  And,
> this patch will disable that as well IIUC.

Thanks Mike, I would like to get some feedback from the PowerPC developers too.

> 
> This patch makes sense for x86.  My only concern/question is for other
> archs which may have huge page sizes defined which are > MAX_ORDER and
> < PUD_SIZE.  These would also be classified as gigantic and impacted
> by this patch.  Do these also have the same issue?

Maybe one clearer way is to use a more explicit condition like "page size > PMD_SIZE".

> 
> -- 
> Mike Kravetz
> 
> >>> I think that conversion to pud-aligned migration entry is working,
> >>> but other MM code walking over page table isn't prepared for it.
> >>> We need some time and effort to make all this work properly, so
> >>> this patch avoids the reported bug by just disabling error handling
> >>> for 1GB hugepage.
> > 
> > Can we also get some documentation which would describe all requirements
> > for HWPoison pages to work properly please?

OK, I'll add this.

> > 
> >>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > We probably want a backport to stable as well. Although regular process
> > cannot get giga pages easily without admin help it is still not nice to
> > oops like this.

I'll add CC to stable.

Thanks,
Naoya Horiguchi


* [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-01-30  1:39       ` Naoya Horiguchi
@ 2018-01-30  3:54         ` Naoya Horiguchi
  2018-01-30 23:56           ` Mike Kravetz
  2018-02-05 15:05           ` Punit Agrawal
  0 siblings, 2 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2018-01-30  3:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Michal Hocko, Mike Kravetz, Aneesh Kumar K.V,
	Anshuman Khandual, linux-kernel

Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage. This happens because get_user_pages_fast() is not aware
of the migration entry on the pud that was created by the first
madvise() call.

I think that the conversion to a pud-aligned migration entry works, but
other MM code walking over page tables isn't prepared for it. We need
some time and effort to make all of this work properly, so this patch
avoids the reported bug by simply disabling error handling for 1GB
hugepages.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com> // for v1
Cc: <stable@vger.kernel.org>
---
ChangeLog v1 -> v2:
- add comment about what we need to support hwpoison for pud-sized hugetlb
- use "page size > PMD_SIZE" condition instead of hstate_is_gigantic()
---
 include/linux/mm.h  |  1 +
 mm/memory-failure.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
index 63f7ba1..6b3df81 100644
--- v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h
+++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
@@ -2607,6 +2607,7 @@ enum mf_action_page_type {
 	MF_MSG_POISONED_HUGE,
 	MF_MSG_HUGE,
 	MF_MSG_FREE_HUGE,
+	MF_MSG_NON_PMD_HUGE,
 	MF_MSG_UNMAP_FAILED,
 	MF_MSG_DIRTY_SWAPCACHE,
 	MF_MSG_CLEAN_SWAPCACHE,
diff --git v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
index d530ac1..264e020 100644
--- v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c
+++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
@@ -508,6 +508,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
 	[MF_MSG_HUGE]			= "huge page",
 	[MF_MSG_FREE_HUGE]		= "free huge page",
+	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
 	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
 	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
 	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
@@ -1090,6 +1091,21 @@ static int memory_failure_hugetlb(unsigned long pfn, int trapno, int flags)
 		return 0;
 	}
 
+	/*
+	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
+	 * simply disable it. In order to make it work properly, we need to
+	 * make sure that:
+	 *  - conversion of a pud that maps an error hugetlb into hwpoison
+	 *    entry properly works, and
+	 *  - other mm code walking over page table is aware of pud-aligned
+	 *    hwpoison entries.
+	 */
+	if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
+		action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
+		res = -EBUSY;
+		goto out;
+	}
+
 	if (!hwpoison_user_mappings(p, pfn, trapno, flags, &head)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
-- 
2.7.0


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-01-30  3:54         ` [PATCH v2] " Naoya Horiguchi
@ 2018-01-30 23:56           ` Mike Kravetz
  2018-02-05 15:05           ` Punit Agrawal
  1 sibling, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2018-01-30 23:56 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel

On 01/29/2018 07:54 PM, Naoya Horiguchi wrote:
> Recently the following BUG was reported:
> 
>     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
>     Memory failure: 0x3c0000: recovery action for huge page: Recovered
>     BUG: unable to handle kernel paging request at ffff8dfcc0003000
>     IP: gup_pgd_range+0x1f0/0xc20
>     PGD 17ae72067 P4D 17ae72067 PUD 0
>     Oops: 0000 [#1] SMP PTI
>     ...
>     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> 
> You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
> a 1GB hugepage. This happens because get_user_pages_fast() is not aware
> of a migration entry on pud that was created in the 1st madvise() event.
> 
> I think that conversion to pud-aligned migration entry is working,
> but other MM code walking over page table isn't prepared for it.
> We need some time and effort to make all this work properly, so
> this patch avoids the reported bug by just disabling error handling
> for 1GB hugepage.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Michal Hocko <mhocko@suse.com> // for v1
> Cc: <stable@vger.kernel.org>
> ---
> ChangeLog v1 -> v2:
> - add comment about what we need to support hwpoison for pud-sized hugetlb
> - use "page size > PMD_SIZE" condition instead of hstate_is_gigantic()
> ---
>  include/linux/mm.h  |  1 +
>  mm/memory-failure.c | 16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
> index 63f7ba1..6b3df81 100644
> --- v4.15-rc8-mmotm-2018-01-18-16-31/include/linux/mm.h
> +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/include/linux/mm.h
> @@ -2607,6 +2607,7 @@ enum mf_action_page_type {
>  	MF_MSG_POISONED_HUGE,
>  	MF_MSG_HUGE,
>  	MF_MSG_FREE_HUGE,
> +	MF_MSG_NON_PMD_HUGE,
>  	MF_MSG_UNMAP_FAILED,
>  	MF_MSG_DIRTY_SWAPCACHE,
>  	MF_MSG_CLEAN_SWAPCACHE,
> diff --git v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
> index d530ac1..264e020 100644
> --- v4.15-rc8-mmotm-2018-01-18-16-31/mm/memory-failure.c
> +++ v4.15-rc8-mmotm-2018-01-18-16-31_patched/mm/memory-failure.c
> @@ -508,6 +508,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> +	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
>  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
> @@ -1090,6 +1091,21 @@ static int memory_failure_hugetlb(unsigned long pfn, int trapno, int flags)
>  		return 0;
>  	}
>  
> +	/*
> +	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
> +	 * simply disable it. In order to make it work properly, we need to
> +	 * make sure that:
> +	 *  - conversion of a pud that maps an error hugetlb into hwpoison
> +	 *    entry properly works, and
> +	 *  - other mm code walking over page table is aware of pud-aligned
> +	 *    hwpoison entries.
> +	 */
> +	if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
> +		action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
> +		res = -EBUSY;
> +		goto out;
> +	}
> +
>  	if (!hwpoison_user_mappings(p, pfn, trapno, flags, &head)) {
>  		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
>  		res = -EBUSY;
> 

Thanks, that does catch all those other huge page sizes.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

It would really be helpful to get some comments from the powerpc folks
as this does seem to impact them most.  Perhaps arm64 as well?

-- 
Mike Kravetz


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-01-30  3:54         ` [PATCH v2] " Naoya Horiguchi
  2018-01-30 23:56           ` Mike Kravetz
@ 2018-02-05 15:05           ` Punit Agrawal
  2018-02-07  1:14             ` Naoya Horiguchi
  1 sibling, 1 reply; 19+ messages in thread
From: Punit Agrawal @ 2018-02-05 15:05 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Anshuman Khandual, linux-kernel

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Recently the following BUG was reported:
>
>     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
>     Memory failure: 0x3c0000: recovery action for huge page: Recovered
>     BUG: unable to handle kernel paging request at ffff8dfcc0003000
>     IP: gup_pgd_range+0x1f0/0xc20
>     PGD 17ae72067 P4D 17ae72067 PUD 0
>     Oops: 0000 [#1] SMP PTI
>     ...
>     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
>
> You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
> a 1GB hugepage. This happens because get_user_pages_fast() is not aware
> of a migration entry on pud that was created in the 1st madvise() event.

Maybe I'm doing something wrong but I wasn't able to reproduce the issue
using the test at the end. I get -

    $ sudo ./hugepage

    Poisoning page...once
    [  121.295771] Injecting memory failure for pfn 0x8300000 at process virtual address 0x400000000000
    [  121.386450] Memory failure: 0x8300000: recovery action for huge page: Recovered

    Poisoning page...once again
    madvise: Bad address

What am I missing?


--------- >8 ---------
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	int flags = MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE;
	int prot = PROT_READ | PROT_WRITE;
	size_t hugepage_sz;
	void *hugepage;
	int ret;

	hugepage_sz = 1024 * 1024 * 1024; /* 1GB */
	hugepage = mmap(NULL, hugepage_sz, prot, flags, -1, 0);
	if (hugepage == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(hugepage, 'b', hugepage_sz);
	getchar();

	printf("Poisoning page...once\n");
	ret = madvise(hugepage, hugepage_sz, MADV_HWPOISON);
	if (ret) {
		perror("madvise");
		return 1;
	}
	getchar();

	printf("Poisoning page...once again\n");
	ret = madvise(hugepage, hugepage_sz, MADV_HWPOISON);
	if (ret) {
		perror("madvise");
		return 1;
	}
	getchar();

	memset(hugepage, 'c', hugepage_sz);
	ret = munmap(hugepage, hugepage_sz);
	if (ret) {
		perror("munmap");
		return 1;
	}
	
	return 0;
}


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-02-05 15:05           ` Punit Agrawal
@ 2018-02-07  1:14             ` Naoya Horiguchi
  2018-02-08 12:30               ` Punit Agrawal
  0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2018-02-07  1:14 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: linux-mm, Andrew Morton, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Anshuman Khandual, linux-kernel

Hi Punit,

On Mon, Feb 05, 2018 at 03:05:43PM +0000, Punit Agrawal wrote:
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
> > Recently the following BUG was reported:
> >
> >     Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
> >     Memory failure: 0x3c0000: recovery action for huge page: Recovered
> >     BUG: unable to handle kernel paging request at ffff8dfcc0003000
> >     IP: gup_pgd_range+0x1f0/0xc20
> >     PGD 17ae72067 P4D 17ae72067 PUD 0
> >     Oops: 0000 [#1] SMP PTI
> >     ...
> >     CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
> >     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> >
> > You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
> > a 1GB hugepage. This happens because get_user_pages_fast() is not aware
> > of a migration entry on pud that was created in the 1st madvise() event.
> 
> Maybe I'm doing something wrong but I wasn't able to reproduce the issue
> using the test at the end. I get -
> 
>     $ sudo ./hugepage
> 
>     Poisoning page...once
>     [  121.295771] Injecting memory failure for pfn 0x8300000 at process virtual address 0x400000000000
>     [  121.386450] Memory failure: 0x8300000: recovery action for huge page: Recovered
> 
>     Poisoning page...once again
>     madvise: Bad address
> 
> What am I missing?

The test program below is exactly what I intended, so you tested it
correctly. Let me try to guess what could happen. The related code looks
like this:

  static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                           int write, struct page **pages, int *nr)
  {
          ...
          do {
                  pud_t pud = READ_ONCE(*pudp);

                  next = pud_addr_end(addr, end);
                  if (pud_none(pud))
                          return 0;
                  if (unlikely(pud_huge(pud))) {
                          if (!gup_huge_pud(pud, pudp, addr, next, write,
                                            pages, nr))
                                  return 0;

pud_none() always returns false for a hwpoison entry on any arch.
I guess that pud_huge() could behave in an undefined manner for a hwpoison
entry, because pud_huge() assumes that a given pud has the present bit set,
which is not true for a hwpoison entry. As a result, pud_huge() checks an
irrelevant bit used for some other purpose, depending on each arch's
non-present page table format.
If pud_huge() returns false for a hwpoison entry, we try to descend to the
lower level and the kernel is highly likely to crash. So I guess your kernel
fell back to the slow path and somehow ended up returning EFAULT.

So I don't think that the above test result means that errors are properly
handled, and the proposed patch should help for arm64.

Thanks,
Naoya Horiguchi

> 
> 
> --------- >8 ---------
> #include <stdio.h>
> #include <string.h>
> #include <sys/mman.h>
> 
> int main(int argc, char *argv[])
> {
> 	int flags = MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE;
> 	int prot = PROT_READ | PROT_WRITE;
> 	size_t hugepage_sz;
> 	void *hugepage;
> 	int ret;
> 
> 	hugepage_sz = 1024 * 1024 * 1024; /* 1GB */
> 	hugepage = mmap(NULL, hugepage_sz, prot, flags, -1, 0);
> 	if (hugepage == MAP_FAILED) {
> 		perror("mmap");
> 		return 1;
> 	}
> 
> 	memset(hugepage, 'b', hugepage_sz);
> 	getchar();
> 
> 	printf("Poisoning page...once\n");
> 	ret = madvise(hugepage, hugepage_sz, MADV_HWPOISON);
> 	if (ret) {
> 		perror("madvise");
> 		return 1;
> 	}
> 	getchar();
> 
> 	printf("Poisoning page...once again\n");
> 	ret = madvise(hugepage, hugepage_sz, MADV_HWPOISON);
> 	if (ret) {
> 		perror("madvise");
> 		return 1;
> 	}
> 	getchar();
> 
> 	memset(hugepage, 'c', hugepage_sz);
> 	ret = munmap(hugepage, hugepage_sz);
> 	if (ret) {
> 		perror("munmap");
> 		return 1;
> 	}
> 	
> 	return 0;
> }
> 


* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-02-07  1:14             ` Naoya Horiguchi
@ 2018-02-08 12:30               ` Punit Agrawal
  2018-02-08 20:17                 ` Andrew Morton
  2018-02-09  1:17                 ` Naoya Horiguchi
  0 siblings, 2 replies; 19+ messages in thread
From: Punit Agrawal @ 2018-02-08 12:30 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Anshuman Khandual, linux-kernel

Horiguchi-san,

Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:

> Hi Punit,
>
> On Mon, Feb 05, 2018 at 03:05:43PM +0000, Punit Agrawal wrote:
>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
>> 

[...]

>> >
>> > You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
>> > a 1GB hugepage. This happens because get_user_pages_fast() is not aware
>> > of a migration entry on pud that was created in the 1st madvise() event.
>> 
>> Maybe I'm doing something wrong but I wasn't able to reproduce the issue
>> using the test at the end. I get -
>> 
>>     $ sudo ./hugepage
>> 
>>     Poisoning page...once
>>     [  121.295771] Injecting memory failure for pfn 0x8300000 at process virtual address 0x400000000000
>>     [  121.386450] Memory failure: 0x8300000: recovery action for huge page: Recovered
>> 
>>     Poisoning page...once again
>>     madvise: Bad address
>> 
>> What am I missing?
>
> The test program below is exactly what I intended, so you did right
> testing.

Thanks for the confirmation, and for the flow outline below.

> I try to guess what could happen. The related code is like below:
>
>   static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                            int write, struct page **pages, int *nr)
>   {
>           ...
>           do {
>                   pud_t pud = READ_ONCE(*pudp);
>
>                   next = pud_addr_end(addr, end);
>                   if (pud_none(pud))
>                           return 0;
>                   if (unlikely(pud_huge(pud))) {
>                           if (!gup_huge_pud(pud, pudp, addr, next, write,
>                                             pages, nr))
>                                   return 0;
>
> pud_none() always returns false for hwpoison entry in any arch.
> I guess that pud_huge() could behave in undefined manner for hwpoison entry
> because pud_huge() assumes that a given pud has the present bit set, which
> is not true for hwpoison entry.

This is where the arm64 helpers behave differently (though more by
chance than design). A poisoned pud passes pud_huge(), as the helper
doesn't explicitly check for the present bit.

    int pud_huge(pud_t pud)
    {
            return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
    }


This doesn't lead to a crash, as the first thing gup_huge_pud() does is
call pud_access_permitted(), which does check for the present bit.

I was able to crash the kernel by changing pud_huge() to check for the
present bit.

> As a result, pud_huge() checks an irrelevant bit used for other
> purpose depending on non-present page table format of each arch. If
> pud_huge() returns false for hwpoison entry, we try to go to the lower
> level and the kernel highly likely to crash. So I guess your kernel
> fell back the slow path and somehow ended up with returning EFAULT.

Makes sense. Due to the difference above on arm64, it ends up falling
back to the slow path which eventually returns -EFAULT (via
follow_hugetlb_page) for poisoned pages.

>
> So I don't think that the above test result means that errors are properly
> handled, and the proposed patch should help for arm64.

Although the deviation in pud_huge() avoids a kernel crash, the code
would be easier to maintain and reason about if the arm64 helpers were
consistent with the expectations of core code.

I'll look to update the arm64 helpers once this patch gets merged. But it
would be helpful if there were a clear statement of the semantics of
pud_huge() for the various cases. Is there a version that can be used as a
reference?

Also, do you know what the plans are for re-enabling the hugepage poisoning
that is disabled here?

Thanks,
Punit

[...]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-02-08 12:30               ` Punit Agrawal
@ 2018-02-08 20:17                 ` Andrew Morton
       [not found]                   ` <87wozhvc49.fsf@concordia.ellerman.id.au>
  2018-02-09  1:17                 ` Naoya Horiguchi
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2018-02-08 20:17 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Naoya Horiguchi, linux-mm, Michal Hocko, Mike Kravetz,
	Aneesh Kumar K.V, Anshuman Khandual, linux-kernel,
	Benjamin Herrenschmidt, Michael Ellerman

On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:

> >
> > So I don't think that the above test result means that errors are properly
> > handled, and the proposed patch should help for arm64.
> 
> Although, the deviation of pud_huge() avoids a kernel crash the code
> would be easier to maintain and reason about if arm64 helpers are
> consistent with expectations by core code.
> 
> I'll look to update the arm64 helpers once this patch gets merged. But
> it would be helpful if there was a clear expression of semantics for
> pud_huge() for various cases. Is there any version that can be used as
> reference?

Is that an ack or tested-by?

Mike keeps plaintively asking the powerpc developers to take a look,
but they remain steadfastly in hiding.

Folks, this patch fixes a BUG and is marked for -stable.  Can we please
prioritize it?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-02-08 12:30               ` Punit Agrawal
  2018-02-08 20:17                 ` Andrew Morton
@ 2018-02-09  1:17                 ` Naoya Horiguchi
  1 sibling, 0 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2018-02-09  1:17 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Michal Hocko,
	Mike Kravetz, Aneesh Kumar K.V, Anshuman Khandual, linux-kernel

On 02/08/2018 09:30 PM, Punit Agrawal wrote:
> Horiguchi-san,
> 
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
> 
>> Hi Punit,
>>
>> On Mon, Feb 05, 2018 at 03:05:43PM +0000, Punit Agrawal wrote:
>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> writes:
>>>
> 
> [...]
> 
>>>>
>>>> You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
>>>> a 1GB hugepage. This happens because get_user_pages_fast() is not aware
>>>> of a migration entry on pud that was created in the 1st madvise() event.
>>>
>>> Maybe I'm doing something wrong but I wasn't able to reproduce the issue
>>> using the test at the end. I get -
>>>
>>>     $ sudo ./hugepage
>>>
>>>     Poisoning page...once
>>>     [  121.295771] Injecting memory failure for pfn 0x8300000 at process virtual address 0x400000000000
>>>     [  121.386450] Memory failure: 0x8300000: recovery action for huge page: Recovered
>>>
>>>     Poisoning page...once again
>>>     madvise: Bad address
>>>
>>> What am I missing?
>>
>> The test program below is exactly what I intended, so you did right
>> testing.
> 
> Thanks for the confirmation. And the flow outline below. 
> 
>> I try to guess what could happen. The related code is like below:
>>
>>   static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>>                            int write, struct page **pages, int *nr)
>>   {
>>           ...
>>           do {
>>                   pud_t pud = READ_ONCE(*pudp);
>>
>>                   next = pud_addr_end(addr, end);
>>                   if (pud_none(pud))
>>                           return 0;
>>                   if (unlikely(pud_huge(pud))) {
>>                           if (!gup_huge_pud(pud, pudp, addr, next, write,
>>                                             pages, nr))
>>                                   return 0;
>>
>> pud_none() always returns false for hwpoison entry in any arch.
>> I guess that pud_huge() could behave in undefined manner for hwpoison entry
>> because pud_huge() assumes that a given pud has the present bit set, which
>> is not true for hwpoison entry.
> 
> This is where the arm64 helpers behaves differently (though more by
> chance then design). A poisoned pud passes pud_huge() as it doesn't seem
> to be explicitly checking for the present bit.
> 
>     int pud_huge(pud_t pud)
>     {
>             return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
>     }
> 
> 
> This doesn't lead to a crash as the first thing gup_huge_pud() does is
> check for pud_access_permitted() which does check for the present bit.
> 
> I was able to crash the kernel by changing pud_huge() to check for the
> present bit.
> 
>> As a result, pud_huge() checks an irrelevant bit used for other
>> purpose depending on non-present page table format of each arch. If
>> pud_huge() returns false for hwpoison entry, we try to go to the lower
>> level and the kernel highly likely to crash. So I guess your kernel
>> fell back the slow path and somehow ended up with returning EFAULT.
> 
> Makes sense. Due to the difference above on arm64, it ends up falling
> back to the slow path which eventually returns -EFAULT (via
> follow_hugetlb_page) for poisoned pages.
> 
>>
>> So I don't think that the above test result means that errors are properly
>> handled, and the proposed patch should help for arm64.
> 
> Although, the deviation of pud_huge() avoids a kernel crash the code
> would be easier to maintain and reason about if arm64 helpers are
> consistent with expectations by core code.
> 
> I'll look to update the arm64 helpers once this patch gets merged. But
> it would be helpful if there was a clear expression of semantics for
> pud_huge() for various cases. Is there any version that can be used as
> reference?

Sorry if I misunderstand you, but with this patch there are no non-present
pud entries, so I feel that you don't have to change pud_huge() on arm64.

When we get to have non-present pud entries (by enabling hwpoison or 1GB
hugepage migration), we need to explicitly check pud_present() in every page
table walk. So I think the intended semantics are like:

  if (pud_none(pud))
          /* skip this entry */
  else if (pud_huge(pud))
          /* do something for pud-hugetlb */
  else
          /* go to next (pmd) level */

and after enabling hwpoison or migration:

  if (pud_none(pud))
          /* skip this entry */
  else if (!pud_present(pud))
          /* do what we need to handle peculiar cases */
  else if (pud_huge(pud))
          /* do something for pud-hugetlb */
  else
          /* go to next (pmd) level */

What we did at the pmd level can also serve as a reference for what to do
for pud.

> 
> Also, do you know what the plans are for re-enabling hugepage poisoning
> disabled here?

I'd like to say yes, but there is no specific plan yet, because breaking up
a pud isn't an easy/simple task. But 1GB hugetlb is becoming more
important, so we might have to have code for it.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
       [not found]                   ` <87wozhvc49.fsf@concordia.ellerman.id.au>
@ 2018-02-13 22:33                     ` Mike Kravetz
  2019-05-28  9:49                       ` Wanpeng Li
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-02-13 22:33 UTC (permalink / raw)
  To: Michael Ellerman, Andrew Morton, Punit Agrawal
  Cc: Naoya Horiguchi, linux-mm, Michal Hocko, Aneesh Kumar K.V,
	Anshuman Khandual, linux-kernel, Benjamin Herrenschmidt,
	linuxppc-dev

On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
>>
>>>>
>>>> So I don't think that the above test result means that errors are properly
>>>> handled, and the proposed patch should help for arm64.
>>>
>>> Although, the deviation of pud_huge() avoids a kernel crash the code
>>> would be easier to maintain and reason about if arm64 helpers are
>>> consistent with expectations by core code.
>>>
>>> I'll look to update the arm64 helpers once this patch gets merged. But
>>> it would be helpful if there was a clear expression of semantics for
>>> pud_huge() for various cases. Is there any version that can be used as
>>> reference?
>>
>> Is that an ack or tested-by?
>>
>> Mike keeps plaintively asking the powerpc developers to take a look,
>> but they remain steadfastly in hiding.
> 
> Cc'ing linuxppc-dev is always a good idea :)
> 

Thanks Michael,

I was mostly concerned about use cases for soft/hard offline of huge pages
larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
huge pages, and soft/hard offline support was specifically added for this.
See commit 94310cbcaa3c ("mm/madvise: enable (soft|hard) offline of HugeTLB
pages at PGD level").

This patch will disable that functionality.  So, at a minimum, this is a
'heads up'.  If there are actual use cases that depend on it, then more
work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
support, I cannot tell whether there is a real use case or this is just a
'nice to have'.

-- 
Mike Kravetz

>> Folks, this patch fixes a BUG and is marked for -stable.  Can we please
>> prioritize it?
> 
> It's not crashing for me (on 4.16-rc1):
> 
>   # ./huge-poison 
>   Poisoning page...once
>   Poisoning page...once again
>   madvise: Bad address
> 
> And I guess the above is the expected behaviour?
> 
> Looking at the function trace it looks like the 2nd madvise is going
> down reasonable code paths, but I don't know for sure:
> 
>   8)               |  SyS_madvise() {
>   8)               |    capable() {
>   8)               |      ns_capable_common() {
>   8)   0.094 us    |        cap_capable();
>   8)   0.516 us    |      }
>   8)   1.052 us    |    }
>   8)               |    get_user_pages_fast() {
>   8)   0.354 us    |      gup_pgd_range();
>   8)               |      get_user_pages_unlocked() {
>   8)   0.050 us    |        down_read();
>   8)               |        __get_user_pages() {
>   8)               |          find_extend_vma() {
>   8)               |            find_vma() {
>   8)   0.148 us    |              vmacache_find();
>   8)   0.622 us    |            }
>   8)   1.064 us    |          }
>   8)   0.028 us    |          arch_vma_access_permitted();
>   8)               |          follow_hugetlb_page() {
>   8)               |            huge_pte_offset() {
>   8)   0.128 us    |              __find_linux_pte();
>   8)   0.580 us    |            }
>   8)   0.048 us    |            _raw_spin_lock();
>   8)               |            hugetlb_fault() {
>   8)               |              huge_pte_offset() {
>   8)   0.034 us    |                __find_linux_pte();
>   8)   0.434 us    |              }
>   8)   0.028 us    |              is_hugetlb_entry_migration();
>   8)   0.032 us    |              is_hugetlb_entry_hwpoisoned();
>   8)   2.118 us    |            }
>   8)   4.940 us    |          }
>   8)   7.468 us    |        }
>   8)   0.056 us    |        up_read();
>   8)   8.722 us    |      }
>   8) + 10.264 us   |    }
>   8) + 12.212 us   |  }
> 
> 
> cheers
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2018-02-13 22:33                     ` Mike Kravetz
@ 2019-05-28  9:49                       ` Wanpeng Li
  2019-05-29 23:31                         ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Wanpeng Li @ 2019-05-28  9:49 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michael Ellerman, Andrew Morton, Punit Agrawal, Naoya Horiguchi,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

Cc Paolo,
Hi all,
On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> >
> >> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> >>
> >>>>
> >>>> So I don't think that the above test result means that errors are properly
> >>>> handled, and the proposed patch should help for arm64.
> >>>
> >>> Although, the deviation of pud_huge() avoids a kernel crash the code
> >>> would be easier to maintain and reason about if arm64 helpers are
> >>> consistent with expectations by core code.
> >>>
> >>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>> it would be helpful if there was a clear expression of semantics for
> >>> pud_huge() for various cases. Is there any version that can be used as
> >>> reference?
> >>
> >> Is that an ack or tested-by?
> >>
> >> Mike keeps plaintively asking the powerpc developers to take a look,
> >> but they remain steadfastly in hiding.
> >
> > Cc'ing linuxppc-dev is always a good idea :)
> >
>
> Thanks Michael,
>
> I was mostly concerned about use cases for soft/hard offline of huge pages
> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> huge pages, and soft/hard offline support was specifically added for this.
> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> at PGD level"
>
> This patch will disable that functionality.  So, at a minimum this is a
> 'heads up'.  If there are actual use cases that depend on this, then more
> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> support, I can not tell if there is a real use case or this is just a
> 'nice to have'.

1GB hugetlbfs pages are used by DPDK and VMs in cloud deployments, and we
have encountered gup_pud_range() panics several times in production
environments. Is there any plan to re-enable this functionality and fix the
arch code?

In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
The memory in the guest can be mapped as 1GB/2MB/4KB pages even though the
host backing memory is 1GB hugetlbfs pages. After the above PUD panic is
fixed, try_to_unmap(), which is called in the MCA recovery path, will mark
the PUD with a hwpoison entry. The guest will then vmexit and retry
endlessly when accessing any guest memory backed by this poisoned 1GB
hugetlbfs page. We have a plan to split the 1GB hugetlbfs page into 2MB
hugetlbfs pages or 4KB pages, perhaps by remapping the file to a virtual
address range at 2MB/4KB page granularity, and also to split the KVM MMU
1GB SPTE into 2MB/4KB SPTEs and mark the offending SPTE with a hwpoison
flag; a SIGBUS would then be delivered to the VM at the next page fault on
the offending SPTE. Is this proposal acceptable?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-05-28  9:49                       ` Wanpeng Li
@ 2019-05-29 23:31                         ` Mike Kravetz
  2019-06-10 23:50                           ` Naoya Horiguchi
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2019-05-29 23:31 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Michael Ellerman, Andrew Morton, Punit Agrawal, Naoya Horiguchi,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

On 5/28/19 2:49 AM, Wanpeng Li wrote:
> Cc Paolo,
> Hi all,
> On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
>>> Andrew Morton <akpm@linux-foundation.org> writes:
>>>
>>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
>>>>
>>>>>>
>>>>>> So I don't think that the above test result means that errors are properly
>>>>>> handled, and the proposed patch should help for arm64.
>>>>>
>>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
>>>>> would be easier to maintain and reason about if arm64 helpers are
>>>>> consistent with expectations by core code.
>>>>>
>>>>> I'll look to update the arm64 helpers once this patch gets merged. But
>>>>> it would be helpful if there was a clear expression of semantics for
>>>>> pud_huge() for various cases. Is there any version that can be used as
>>>>> reference?
>>>>
>>>> Is that an ack or tested-by?
>>>>
>>>> Mike keeps plaintively asking the powerpc developers to take a look,
>>>> but they remain steadfastly in hiding.
>>>
>>> Cc'ing linuxppc-dev is always a good idea :)
>>>
>>
>> Thanks Michael,
>>
>> I was mostly concerned about use cases for soft/hard offline of huge pages
>> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
>> huge pages, and soft/hard offline support was specifically added for this.
>> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
>> at PGD level"
>>
>> This patch will disable that functionality.  So, at a minimum this is a
>> 'heads up'.  If there are actual use cases that depend on this, then more
>> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
>> support, I can not tell if there is a real use case or this is just a
>> 'nice to have'.
> 
> 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> encounter gup_pud_range() panic several times in product environment.
> Is there any plan to reenable and fix arch codes?

I too am aware of slightly more interest in 1GB huge pages.  I suspect that
as Intel MMU capacity increases to handle more TLB entries, there will be
more and more interest.

Personally, I am not looking at this issue.  Perhaps Naoya will comment, as
he knows the most about this code.

> In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> are 1GB hugetlbfs pages, after above PUD panic is fixed,
> try_to_unmap() which is called in MCA recovery path will mark the PUD
> hwpoison entry. The guest will vmexit and retry endlessly when
> accessing any memory in the guest which is backed by this 1GB poisoned
> hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> will be delivered to VM at page fault next time for the offensive
> SPTE. Is this proposal acceptable?

I am not sure of the error handling design, but this does sound reasonable.
That block of code which potentially dissolves a huge page on a memory
error is hard to understand, and I'm not sure if that is even the 'normal'
functionality.  Certainly, we would hate to waste/poison an entire 1GB page
for an error in a small subsection.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-05-29 23:31                         ` Mike Kravetz
@ 2019-06-10 23:50                           ` Naoya Horiguchi
  2019-06-11  8:42                             ` Wanpeng Li
  2019-08-20  7:03                             ` Wanpeng Li
  0 siblings, 2 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2019-06-10 23:50 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Wanpeng Li, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > Cc Paolo,
> > Hi all,
> > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> >>> Andrew Morton <akpm@linux-foundation.org> writes:
> >>>
> >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> >>>>
> >>>>>>
> >>>>>> So I don't think that the above test result means that errors are properly
> >>>>>> handled, and the proposed patch should help for arm64.
> >>>>>
> >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> >>>>> would be easier to maintain and reason about if arm64 helpers are
> >>>>> consistent with expectations by core code.
> >>>>>
> >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>>>> it would be helpful if there was a clear expression of semantics for
> >>>>> pud_huge() for various cases. Is there any version that can be used as
> >>>>> reference?
> >>>>
> >>>> Is that an ack or tested-by?
> >>>>
> >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> >>>> but they remain steadfastly in hiding.
> >>>
> >>> Cc'ing linuxppc-dev is always a good idea :)
> >>>
> >>
> >> Thanks Michael,
> >>
> >> I was mostly concerned about use cases for soft/hard offline of huge pages
> >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> >> huge pages, and soft/hard offline support was specifically added for this.
> >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> >> at PGD level"
> >>
> >> This patch will disable that functionality.  So, at a minimum this is a
> >> 'heads up'.  If there are actual use cases that depend on this, then more
> >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> >> support, I can not tell if there is a real use case or this is just a
> >> 'nice to have'.
> > 
> > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > encounter gup_pud_range() panic several times in product environment.
> > Is there any plan to reenable and fix arch codes?
> 
> I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> Intel MMU capacity increases to handle more TLB entries there will be more
> and more interest.
> 
> Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> he know most about this code.

Thanks for forwarding this to me. It seems that memory error handling
on 1GB hugepages is in demand as a real use case.

> 
> > In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > hwpoison entry. The guest will vmexit and retry endlessly when
> > accessing any memory in the guest which is backed by this 1GB poisoned
> > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > will be delivered to VM at page fault next time for the offensive
> > SPTE. Is this proposal acceptable?
> 
> I am not sure of the error handling design, but this does sound reasonable.

I agree that that's better.

> That block of code which potentially dissolves a huge page on memory error
> is hard to understand and I'm not sure if that is even the 'normal'
> functionality.  Certainly, we would hate to waste/poison an entire 1G page
> for an error on a small subsection.

Yes, that's not practical, so we need to first establish the code base for
2MB hugetlb splitting and then extend it to 1GB.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-06-10 23:50                           ` Naoya Horiguchi
@ 2019-06-11  8:42                             ` Wanpeng Li
  2019-08-20  7:03                             ` Wanpeng Li
  1 sibling, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2019-06-11  8:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu

On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > Cc Paolo,
> > > Hi all,
> > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >>
> > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > >>> Andrew Morton <akpm@linux-foundation.org> writes:
> > >>>
> > >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@arm.com> wrote:
> > >>>>
> > >>>>>>
> > >>>>>> So I don't think that the above test result means that errors are properly
> > >>>>>> handled, and the proposed patch should help for arm64.
> > >>>>>
> > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > >>>>> consistent with expectations by core code.
> > >>>>>
> > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > >>>>> it would be helpful if there was a clear expression of semantics for
> > >>>>> pud_huge() for various cases. Is there any version that can be used as
> > >>>>> reference?
> > >>>>
> > >>>> Is that an ack or tested-by?
> > >>>>
> > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > >>>> but they remain steadfastly in hiding.
> > >>>
> > >>> Cc'ing linuxppc-dev is always a good idea :)
> > >>>
> > >>
> > >> Thanks Michael,
> > >>
> > >> I was mostly concerned about use cases for soft/hard offline of huge pages
> > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > >> huge pages, and soft/hard offline support was specifically added for this.
> > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> > >> at PGD level"
> > >>
> > >> This patch will disable that functionality.  So, at a minimum this is a
> > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > >> support, I can not tell if there is a real use case or this is just a
> > >> 'nice to have'.
> > >
> > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > encounter gup_pud_range() panic several times in product environment.
> > > Is there any plan to reenable and fix arch codes?
> >
> > I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> > Intel MMU capacity increases to handle more TLB entries there will be more
> > and more interest.
> >
> > Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> > he know most about this code.
>
> Thanks for forwarding this to me, I'm feeling that memory error handling
> on 1GB hugepage is demanded as real use case.
>
> >
> > > In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > > will be delivered to VM at page fault next time for the offensive
> > > SPTE. Is this proposal acceptable?
> >
> > I am not sure of the error handling design, but this does sound reasonable.
>
> I agree that that's better.
>
> > That block of code which potentially dissolves a huge page on memory error
> > is hard to understand and I'm not sure if that is even the 'normal'
> > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > for an error on a small subsection.
>
> Yes, that's not practical, so we need to first establish the code base for
> 2MB hugetlb splitting and then extend it to 1GB next.

I'm working on this, thanks for the inputs.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-06-10 23:50                           ` Naoya Horiguchi
  2019-06-11  8:42                             ` Wanpeng Li
@ 2019-08-20  7:03                             ` Wanpeng Li
  2019-08-21  5:39                               ` ##freemail## " Naoya Horiguchi
  1 sibling, 1 reply; 19+ messages in thread
From: Wanpeng Li @ 2019-08-20  7:03 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu, Mel Gorman,
	Kirill A. Shutemov, Hansen, Dave, Hugh Dickins

Cc Mel Gorman, Kirill, Dave Hansen,
On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> [...]
> Yes, that's not practical, so we need to first establish the code base for
> 2MB hugetlb splitting and then extend it to 1GB next.

I found it is not easy to split. There is a single hugetlb page size
associated with a mounted hugetlbfs filesystem, so remapping the file at
2MB/4KB granularity would break this. How about hard offlining the 1GB
hugetlb page the same way soft offline already works: replace the
corrupted 1GB page with a new 1GB page through page migration. The
offending/corrupted area in the original 1GB page doesn't need to be
copied into the new page; that area in the new page can stay all zero,
just as it is cleared during a hugetlb page fault, and the other
sub-pages of the original 1GB page can be freed to the buddy system. A
SIGBUS signal is sent to userspace with the offending/corrupted virtual
address and signal code; userspace should take care of this.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ##freemail## Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-08-20  7:03                             ` Wanpeng Li
@ 2019-08-21  5:39                               ` Naoya Horiguchi
  2019-08-21  7:15                                 ` Wanpeng Li
  0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2019-08-21  5:39 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu, Mel Gorman,
	Kirill A. Shutemov, Hansen, Dave, Hugh Dickins

On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> Cc Mel Gorman, Kirill, Dave Hansen,
> [...]
> 
> I found it is not easy to split. There is a single hugetlb page size
> associated with a mounted hugetlbfs filesystem, so remapping the file at
> 2MB/4KB granularity would break this. How about hard offlining the 1GB
> hugetlb page the same way soft offline already works: replace the
> corrupted 1GB page with a new 1GB page through page migration. The
> offending/corrupted area in the original 1GB page doesn't need to be
> copied into the new page; that area in the new page can stay all zero,
> just as it is cleared during a hugetlb page fault, and the other
> sub-pages of the original 1GB page can be freed to the buddy system. A
> SIGBUS signal is sent to userspace with the offending/corrupted virtual
> address and signal code; userspace should take care of this.

Splitting hugetlb is simply hard, IMHO. THP splitting took years of effort
by many great kernel developers, and I don't think doing similar
development on hugetlb is a good idea.  I thought of converting hugetlb
into THP, but maybe that's not an easy task either.
The "hard offlining via soft offlining" approach sounds new and promising to me.
I guess we don't need a large patchset to do this. So, thanks for the idea!

- Naoya

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: ##freemail## Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage
  2019-08-21  5:39                               ` ##freemail## " Naoya Horiguchi
@ 2019-08-21  7:15                                 ` Wanpeng Li
  0 siblings, 0 replies; 19+ messages in thread
From: Wanpeng Li @ 2019-08-21  7:15 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Mike Kravetz, Michael Ellerman, Andrew Morton, Punit Agrawal,
	linux-mm, Michal Hocko, Aneesh Kumar K.V, Anshuman Khandual,
	linux-kernel, Benjamin Herrenschmidt, linuxppc-dev, kvm,
	Paolo Bonzini, Xiao Guangrong, lidongchen, yongkaiwu, Mel Gorman,
	Kirill A. Shutemov, Hansen, Dave, Hugh Dickins

On Wed, 21 Aug 2019 at 13:41, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> [...]
>
> Splitting hugetlb is simply hard, IMHO. THP splitting took years of effort
> by many great kernel developers, and I don't think doing similar
> development on hugetlb is a good idea.  I thought of converting hugetlb
> into THP, but maybe that's not an easy task either.
> The "hard offlining via soft offlining" approach sounds new and promising to me.
> I guess we don't need a large patchset to do this. So, thanks for the idea!

Good, I will wait a while and start to cook the patches if there are no
opposing voices.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-08-21  7:15 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com>
2018-01-29  6:30 ` [PATCH v1] mm: hwpoison: disable memory error handling on 1GB hugepage Naoya Horiguchi
2018-01-29  9:54   ` Michal Hocko
2018-01-29 18:08     ` Mike Kravetz
2018-01-30  1:39       ` Naoya Horiguchi
2018-01-30  3:54         ` [PATCH v2] " Naoya Horiguchi
2018-01-30 23:56           ` Mike Kravetz
2018-02-05 15:05           ` Punit Agrawal
2018-02-07  1:14             ` Naoya Horiguchi
2018-02-08 12:30               ` Punit Agrawal
2018-02-08 20:17                 ` Andrew Morton
     [not found]                   ` <87wozhvc49.fsf@concordia.ellerman.id.au>
2018-02-13 22:33                     ` Mike Kravetz
2019-05-28  9:49                       ` Wanpeng Li
2019-05-29 23:31                         ` Mike Kravetz
2019-06-10 23:50                           ` Naoya Horiguchi
2019-06-11  8:42                             ` Wanpeng Li
2019-08-20  7:03                             ` Wanpeng Li
2019-08-21  5:39                               ` ##freemail## " Naoya Horiguchi
2019-08-21  7:15                                 ` Wanpeng Li
2018-02-09  1:17                 ` Naoya Horiguchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).