LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
@ 2020-02-16 19:18 Rafael Aquini
  2020-02-16 23:32 ` Mel Gorman
  2020-03-07  2:40 ` Qian Cai
  0 siblings, 2 replies; 24+ messages in thread
From: Rafael Aquini @ 2020-02-16 19:18 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, mgorman, akpm, mhocko, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
  A user reported a bug against a distribution kernel while running
  a proprietary workload described as "memory intensive that is not
  swapping" that is expected to apply to mainline kernels. The workload
  is read/write/modifying ranges of memory and checking the contents. They
  reported that within a few hours that a bad PMD would be reported followed
  by a memory corruption where expected data was all zeros.  A partial report
  of the bad PMD looked like

  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
  [ 5195.341184] ------------[ cut here ]------------
  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
  ....
  [ 5195.410033] Call Trace:
  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10

  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
  was a false detection. The bug does not trigger if automatic NUMA balancing
  or transparent huge pages is disabled.

  The bug is due a race in change_pmd_range between a pmd_trans_huge and
  pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
  check, a parallel protection update under lock can have cleared the PMD
  and filled it with a prot_numa entry between the transhuge check and the
  pmd_none_or_clear_bad check.

  While this could be fixed with heavy locking, it's only necessary to
  make a copy of the PMD on the stack during change_pmd_range and avoid
  races. A new helper is created for this as the check if quite subtle and the
  existing similar helpful is not suitable. This passed 154 hours of testing
  (usually triggers between 20 minutes and 24 hours) without detecting bad
  PMDs or corruption. A basic test of an autonuma-intensive workload showed
  no significant change in behaviour.

Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
the race window aforementioned is still open, and we have reports of Linpack test reporting bad
residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.

 host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
 ....
 host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

The issue is observed on a v4.18-based distribution kernel, but the race window is
expected to be applicable to mainline kernels, as well.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7a8e84f86831..9ea8cc0ab2fd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return pages;
 }
 
+/*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible a parallel update to occur
+		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
+		 * check leading to a false positive and clearing.
+		 * Hence, it's ecessary to atomically read the PMD value
+		 * for all the checks.
+		 */
+		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-02-16 19:18 [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa Rafael Aquini
@ 2020-02-16 23:32 ` Mel Gorman
  2020-03-07  2:40 ` Qian Cai
  1 sibling, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2020-02-16 23:32 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: linux-mm, linux-kernel, akpm, mhocko, vbabka, kirill.shutemov

On Sun, Feb 16, 2020 at 02:18:00PM -0500, Rafael Aquini wrote:
> From: Mel Gorman <mgorman@techsingularity.net>
>   A user reported a bug against a distribution kernel while running
>   a proprietary workload described as "memory intensive that is not
>   swapping" that is expected to apply to mainline kernels. The workload
>   is read/write/modifying ranges of memory and checking the contents. They
>   reported that within a few hours that a bad PMD would be reported followed
>   by a memory corruption where expected data was all zeros.  A partial report
>   of the bad PMD looked like
> 
>   [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
>   [ 5195.341184] ------------[ cut here ]------------
>   [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
>   ....
>   [ 5195.410033] Call Trace:
>   [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
>   [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>   [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>   [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
>   [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
>   [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
>   [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
>   Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
>   was a false detection. The bug does not trigger if automatic NUMA balancing
>   or transparent huge pages is disabled.
> 
>   The bug is due a race in change_pmd_range between a pmd_trans_huge and
>   pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
>   check, a parallel protection update under lock can have cleared the PMD
>   and filled it with a prot_numa entry between the transhuge check and the
>   pmd_none_or_clear_bad check.
> 
>   While this could be fixed with heavy locking, it's only necessary to
>   make a copy of the PMD on the stack during change_pmd_range and avoid
>   races. A new helper is created for this as the check if quite subtle and the
>   existing similar helpful is not suitable. This passed 154 hours of testing
>   (usually triggers between 20 minutes and 24 hours) without detecting bad
>   PMDs or corruption. A basic test of an autonuma-intensive workload showed
>   no significant change in behaviour.
> 
> Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
> 
>  host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
>  ....
>  host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
>  host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
> 
> The issue is observed on a v4.18-based distribution kernel, but the race window is
> expected to be applicable to mainline kernels, as well.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org
> Signed-off-by: Rafael Aquini <aquini@redhat.com>

It's curious that it took so long for this to be caught again.
Unfortunately I cannot find exactly what it's racing against but maybe
it's not worth chasing down and the patch is simply the safer option :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-02-16 19:18 [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa Rafael Aquini
  2020-02-16 23:32 ` Mel Gorman
@ 2020-03-07  2:40 ` Qian Cai
  2020-03-07  3:05   ` Rafael Aquini
  1 sibling, 1 reply; 24+ messages in thread
From: Qian Cai @ 2020-03-07  2:40 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Linux Memory Management List, linux-kernel, mgorman, akpm,
	mhocko, vbabka



> On Feb 16, 2020, at 2:18 PM, Rafael Aquini <aquini@redhat.com> wrote:
> 
> From: Mel Gorman <mgorman@techsingularity.net>
>  A user reported a bug against a distribution kernel while running
>  a proprietary workload described as "memory intensive that is not
>  swapping" that is expected to apply to mainline kernels. The workload
>  is read/write/modifying ranges of memory and checking the contents. They
>  reported that within a few hours that a bad PMD would be reported followed
>  by a memory corruption where expected data was all zeros.  A partial report
>  of the bad PMD looked like
> 
>  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
>  [ 5195.341184] ------------[ cut here ]------------
>  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
>  ....
>  [ 5195.410033] Call Trace:
>  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
>  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
>  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
>  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
>  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
>  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
>  was a false detection. The bug does not trigger if automatic NUMA balancing
>  or transparent huge pages is disabled.
> 
>  The bug is due a race in change_pmd_range between a pmd_trans_huge and
>  pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
>  check, a parallel protection update under lock can have cleared the PMD
>  and filled it with a prot_numa entry between the transhuge check and the
>  pmd_none_or_clear_bad check.
> 
>  While this could be fixed with heavy locking, it's only necessary to
>  make a copy of the PMD on the stack during change_pmd_range and avoid
>  races. A new helper is created for this as the check if quite subtle and the
>  existing similar helpful is not suitable. This passed 154 hours of testing
>  (usually triggers between 20 minutes and 24 hours) without detecting bad
>  PMDs or corruption. A basic test of an autonuma-intensive workload showed
>  no significant change in behaviour.
> 
> Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
> 
> host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
> ....
> host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
> host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
> 
> The issue is observed on a v4.18-based distribution kernel, but the race window is
> expected to be applicable to mainline kernels, as well.

After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
with swapping on linux-next 20200306,

The trace is a bit off for some reasons, but it is basically,

kernel BUG at include/linux/page-flags.h:317!

end_swap_bio_write()
   SetPageError(page)
       VM_BUG_ON_PAGE(1 && PageCompound(page)) 

The page was allocated in,

do_huge_pmd_anonymous_page()

[ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
[ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
[ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
[ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
[ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
[ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
[ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
[ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
[ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
[ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
[ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
[ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
[ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
[ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
[ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
[ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
[ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
[ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
[ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
[ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
[ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
[ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
[ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
[ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
[ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
[ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
[ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
[ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
[ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
[ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
[ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
[ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
[ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
[ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
[ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
[ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
[ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
[ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
[ 9692.845443][   C25] page_owner tracks the page as allocated 
[ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
[ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
[ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
[ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
[ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
[ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
[ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
[ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
[ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
[ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
[ 9692.845786][  T213] page_owner tracks the page as allocated 
[ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
[ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
[ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
[ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
[ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
[ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
[ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
[ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
[ 9692.847250][   C25]  bio_endio+0x297/0x560 
[ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
[ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
[ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
[ 9692.847276][   C25]  blk_update_request+0x201/0x920 
[ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
[ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
[ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
[ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
[ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
[ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
[ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
[ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
[ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
[ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
[ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
[ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
[ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
[ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
[ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
[ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
[ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
[ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
[ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
[ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
[ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
[ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
[ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
[ 9692.849884][  T213]  libphy 
[ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
[ 9692.849952][  T213]  firmware_class 
[ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
[ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
[ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
[ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
[ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
[ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
[ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
[ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
[ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
[ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
[ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  

> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 7a8e84f86831..9ea8cc0ab2fd 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> 	return pages;
> }
> 
> +/*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> +	pmd_t pmdval = pmd_read_atomic(pmd);
> +
> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	barrier();
> +#endif
> +
> +	if (pmd_none(pmdval))
> +		return 1;
> +	if (pmd_trans_huge(pmdval))
> +		return 0;
> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> 		pud_t *pud, unsigned long addr, unsigned long end,
> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> 		unsigned long this_pages;
> 
> 		next = pmd_addr_end(addr, end);
> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> -				&& pmd_none_or_clear_bad(pmd))
> +
> +		/*
> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> +		 * held for read. It's possible a parallel update to occur
> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> +		 * check leading to a false positive and clearing.
> +		 * Hence, it's ecessary to atomically read the PMD value
> +		 * for all the checks.
> +		 */
> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> 			goto next;
> 
> 		/* invoke the mmu notifier if the pmd is populated */
> -- 
> 2.24.1
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-03-07  2:40 ` Qian Cai
@ 2020-03-07  3:05   ` Rafael Aquini
  2020-03-08  3:20     ` Qian Cai
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Aquini @ 2020-03-07  3:05 UTC (permalink / raw)
  To: Qian Cai
  Cc: Linux Memory Management List, linux-kernel, mgorman, akpm,
	mhocko, vbabka

On Fri, Mar 06, 2020 at 09:40:10PM -0500, Qian Cai wrote:
> 
> 
> > On Feb 16, 2020, at 2:18 PM, Rafael Aquini <aquini@redhat.com> wrote:
> > 
> > From: Mel Gorman <mgorman@techsingularity.net>
> >  A user reported a bug against a distribution kernel while running
> >  a proprietary workload described as "memory intensive that is not
> >  swapping" that is expected to apply to mainline kernels. The workload
> >  is read/write/modifying ranges of memory and checking the contents. They
> >  reported that within a few hours that a bad PMD would be reported followed
> >  by a memory corruption where expected data was all zeros.  A partial report
> >  of the bad PMD looked like
> > 
> >  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> >  [ 5195.341184] ------------[ cut here ]------------
> >  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> >  ....
> >  [ 5195.410033] Call Trace:
> >  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> >  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> >  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> >  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> >  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> >  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> >  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> > 
> >  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> >  was a false detection. The bug does not trigger if automatic NUMA balancing
> >  or transparent huge pages is disabled.
> > 
> >  The bug is due a race in change_pmd_range between a pmd_trans_huge and
> >  pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> >  check, a parallel protection update under lock can have cleared the PMD
> >  and filled it with a prot_numa entry between the transhuge check and the
> >  pmd_none_or_clear_bad check.
> > 
> >  While this could be fixed with heavy locking, it's only necessary to
> >  make a copy of the PMD on the stack during change_pmd_range and avoid
> >  races. A new helper is created for this as the check if quite subtle and the
> >  existing similar helpful is not suitable. This passed 154 hours of testing
> >  (usually triggers between 20 minutes and 24 hours) without detecting bad
> >  PMDs or corruption. A basic test of an autonuma-intensive workload showed
> >  no significant change in behaviour.
> > 
> > Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> > the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> > residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> > non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
> > 
> > host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
> > ....
> > host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
> > host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
> > 
> > The issue is observed on a v4.18-based distribution kernel, but the race window is
> > expected to be applicable to mainline kernels, as well.
> 
> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
> with swapping on linux-next 20200306,
>

Would you mind sharing your .config ? Also it would be nice to know your
system and workload details so I can try to reproduce.

Cheers,
-- Rafael

> The trace is a bit off for some reasons, but it is basically,
> 
> kernel BUG at include/linux/page-flags.h:317!
> 
> end_swap_bio_write()
>    SetPageError(page)
>        VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> 
> The page was allocated in,
> 
> do_huge_pmd_anonymous_page()
> 
> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
> [ 9692.845443][   C25] page_owner tracks the page as allocated 
> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
> [ 9692.845786][  T213] page_owner tracks the page as allocated 
> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
> [ 9692.849884][  T213]  libphy 
> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
> [ 9692.849952][  T213]  firmware_class 
> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
> 
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > ---
> > mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 36 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 7a8e84f86831..9ea8cc0ab2fd 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > 	return pages;
> > }
> > 
> > +/*
> > + * Used when setting automatic NUMA hinting protection where it is
> > + * critical that a numa hinting PMD is not confused with a bad PMD.
> > + */
> > +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> > +{
> > +	pmd_t pmdval = pmd_read_atomic(pmd);
> > +
> > +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	barrier();
> > +#endif
> > +
> > +	if (pmd_none(pmdval))
> > +		return 1;
> > +	if (pmd_trans_huge(pmdval))
> > +		return 0;
> > +	if (unlikely(pmd_bad(pmdval))) {
> > +		pmd_clear_bad(pmd);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > 		pud_t *pud, unsigned long addr, unsigned long end,
> > 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> > @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > 		unsigned long this_pages;
> > 
> > 		next = pmd_addr_end(addr, end);
> > -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> > -				&& pmd_none_or_clear_bad(pmd))
> > +
> > +		/*
> > +		 * Automatic NUMA balancing walks the tables with mmap_sem
> > +		 * held for read. It's possible a parallel update to occur
> > +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> > +		 * check leading to a false positive and clearing.
> > +		 * Hence, it's ecessary to atomically read the PMD value
> > +		 * for all the checks.
> > +		 */
> > +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> > +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> > 			goto next;
> > 
> > 		/* invoke the mmu notifier if the pmd is populated */
> > -- 
> > 2.24.1
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-03-07  3:05   ` Rafael Aquini
@ 2020-03-08  3:20     ` Qian Cai
  2020-03-08 23:14       ` Rafael Aquini
  0 siblings, 1 reply; 24+ messages in thread
From: Qian Cai @ 2020-03-08  3:20 UTC (permalink / raw)
  To: Rafael Aquini, linus Torvalds
  Cc: Linux Memory Management List, LKML, mgorman, Andrew Morton,
	Michal Hocko, Vlastimil Babka

Not sure if it is really worth hurrying this into -rc5 now given...

> On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
>> 
>> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
>> with swapping on linux-next 20200306,
>> 
> 
> Would you mind sharing your .config ? Also it would be nice to know your
> system and workload details so I can try to reproduce.

LTP oom02 in a tight loop,

# i=0; while :; echo $((i++)); ./oom02; sleep 10; done

HPE ProLiant DL385 Gen10
AMD EPYC 7601 32-Core Processor
65536 MB memory, 400 GB disk space

Processors	128
Cores	64
Sockets	2

linux-next 20200306

> 
> Cheers,
> -- Rafael
> 
>> The trace is a bit off for some reasons, but it is basically,
>> 
>> kernel BUG at include/linux/page-flags.h:317!
>> 
>> end_swap_bio_write()
>>   SetPageError(page)
>>       VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>> 
>> The page was allocated in,
>> 
>> do_huge_pmd_anonymous_page()
>> 
>> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
>> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
>> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
>> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
>> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
>> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
>> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
>> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
>> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
>> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
>> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
>> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
>> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
>> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
>> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
>> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
>> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
>> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
>> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
>> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
>> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
>> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
>> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
>> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
>> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
>> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
>> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
>> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
>> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
>> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
>> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
>> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
>> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
>> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
>> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
>> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
>> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
>> [ 9692.845443][   C25] page_owner tracks the page as allocated 
>> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
>> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
>> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
>> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
>> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
>> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
>> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
>> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
>> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
>> [ 9692.845786][  T213] page_owner tracks the page as allocated 
>> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
>> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
>> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
>> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
>> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
>> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
>> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
>> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
>> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
>> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
>> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
>> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
>> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
>> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
>> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
>> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
>> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
>> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
>> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
>> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
>> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
>> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
>> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
>> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
>> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
>> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
>> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
>> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
>> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
>> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
>> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
>> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
>> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
>> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
>> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
>> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
>> [ 9692.849884][  T213]  libphy 
>> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
>> [ 9692.849952][  T213]  firmware_class 
>> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
>> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
>> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
>> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
>> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
>> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
>> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
>> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
>> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
>> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
>> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
>> 
>>> 
>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Rafael Aquini <aquini@redhat.com>
>>> ---
>>> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 36 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 7a8e84f86831..9ea8cc0ab2fd 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>> 	return pages;
>>> }
>>> 
>>> +/*
>>> + * Used when setting automatic NUMA hinting protection where it is
>>> + * critical that a numa hinting PMD is not confused with a bad PMD.
>>> + */
>>> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
>>> +{
>>> +	pmd_t pmdval = pmd_read_atomic(pmd);
>>> +
>>> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +	barrier();
>>> +#endif
>>> +
>>> +	if (pmd_none(pmdval))
>>> +		return 1;
>>> +	if (pmd_trans_huge(pmdval))
>>> +		return 0;
>>> +	if (unlikely(pmd_bad(pmdval))) {
>>> +		pmd_clear_bad(pmd);
>>> +		return 1;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>> 		pud_t *pud, unsigned long addr, unsigned long end,
>>> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
>>> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>> 		unsigned long this_pages;
>>> 
>>> 		next = pmd_addr_end(addr, end);
>>> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>>> -				&& pmd_none_or_clear_bad(pmd))
>>> +
>>> +		/*
>>> +		 * Automatic NUMA balancing walks the tables with mmap_sem
>>> +		 * held for read. It's possible a parallel update to occur
>>> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
>>> +		 * check leading to a false positive and clearing.
>>> +		 * Hence, it's ecessary to atomically read the PMD value
>>> +		 * for all the checks.
>>> +		 */
>>> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
>>> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
>>> 			goto next;
>>> 
>>> 		/* invoke the mmu notifier if the pmd is populated */
>>> -- 
>>> 2.24.1
>>> 
>>> 
>> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-03-08  3:20     ` Qian Cai
@ 2020-03-08 23:14       ` Rafael Aquini
  2020-03-09  3:27         ` Qian Cai
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Aquini @ 2020-03-08 23:14 UTC (permalink / raw)
  To: Qian Cai
  Cc: linus Torvalds, Linux Memory Management List, LKML, mgorman,
	Andrew Morton, Michal Hocko, Vlastimil Babka

On Sat, Mar 07, 2020 at 10:20:20PM -0500, Qian Cai wrote:
> Not sure if it is really worth hurrying this into -rc5 now given...
> 
> > On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
> >> 
> >> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
> >> with swapping on linux-next 20200306,
> >> 
> > 
> > Would you mind sharing your .config ? Also it would be nice to know your
> > system and workload details so I can try to reproduce.
> 
> LTP oom02 in a tight loop,
> 
> # i=0; while :; echo $((i++)); ./oom02; sleep 10; done
> 
> HPE ProLiant DL385 Gen10
> AMD EPYC 7601 32-Core Processor
> 65536 MB memory, 400 GB disk space
> 
> Processors	128
> Cores	64
> Sockets	2
> 
> linux-next 20200306
>

Thx! what about your .config?

 

Cheers,
-- Rafael

> >> The trace is a bit off for some reasons, but it is basically,
> >> 
> >> kernel BUG at include/linux/page-flags.h:317!
> >> 
> >> end_swap_bio_write()
> >>   SetPageError(page)
> >>       VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >> 
> >> The page was allocated in,
> >> 
> >> do_huge_pmd_anonymous_page()
> >> 
> >> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
> >> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
> >> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
> >> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
> >> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
> >> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
> >> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
> >> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
> >> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
> >> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
> >> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
> >> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
> >> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
> >> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
> >> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
> >> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
> >> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
> >> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
> >> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
> >> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
> >> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
> >> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
> >> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
> >> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
> >> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
> >> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
> >> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
> >> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
> >> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
> >> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
> >> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> >> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
> >> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
> >> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
> >> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
> >> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> >> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
> >> [ 9692.845443][   C25] page_owner tracks the page as allocated 
> >> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
> >> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> >> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
> >> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
> >> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
> >> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
> >> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
> >> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
> >> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
> >> [ 9692.845786][  T213] page_owner tracks the page as allocated 
> >> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
> >> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> >> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
> >> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
> >> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
> >> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
> >> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
> >> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
> >> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
> >> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
> >> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
> >> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
> >> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
> >> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
> >> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
> >> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
> >> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
> >> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
> >> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
> >> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
> >> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
> >> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
> >> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
> >> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
> >> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
> >> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
> >> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
> >> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
> >> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
> >> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
> >> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
> >> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
> >> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
> >> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
> >> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
> >> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
> >> [ 9692.849884][  T213]  libphy 
> >> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
> >> [ 9692.849952][  T213]  firmware_class 
> >> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
> >> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
> >> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
> >> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
> >> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
> >> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
> >> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
> >> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
> >> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
> >> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
> >> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
> >> 
> >>> 
> >>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >>> Cc: stable@vger.kernel.org
> >>> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> >>> ---
> >>> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> >>> 1 file changed, 36 insertions(+), 2 deletions(-)
> >>> 
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index 7a8e84f86831..9ea8cc0ab2fd 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >>> 	return pages;
> >>> }
> >>> 
> >>> +/*
> >>> + * Used when setting automatic NUMA hinting protection where it is
> >>> + * critical that a numa hinting PMD is not confused with a bad PMD.
> >>> + */
> >>> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> >>> +{
> >>> +	pmd_t pmdval = pmd_read_atomic(pmd);
> >>> +
> >>> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +	barrier();
> >>> +#endif
> >>> +
> >>> +	if (pmd_none(pmdval))
> >>> +		return 1;
> >>> +	if (pmd_trans_huge(pmdval))
> >>> +		return 0;
> >>> +	if (unlikely(pmd_bad(pmdval))) {
> >>> +		pmd_clear_bad(pmd);
> >>> +		return 1;
> >>> +	}
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>> 		pud_t *pud, unsigned long addr, unsigned long end,
> >>> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> >>> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>> 		unsigned long this_pages;
> >>> 
> >>> 		next = pmd_addr_end(addr, end);
> >>> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> >>> -				&& pmd_none_or_clear_bad(pmd))
> >>> +
> >>> +		/*
> >>> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> >>> +		 * held for read. It's possible a parallel update to occur
> >>> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> >>> +		 * check leading to a false positive and clearing.
> >>> +		 * Hence, it's ecessary to atomically read the PMD value
> >>> +		 * for all the checks.
> >>> +		 */
> >>> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> >>> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> >>> 			goto next;
> >>> 
> >>> 		/* invoke the mmu notifier if the pmd is populated */
> >>> -- 
> >>> 2.24.1
> >>> 
> >>> 
> >> 
> > 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-03-08 23:14       ` Rafael Aquini
@ 2020-03-09  3:27         ` Qian Cai
  2020-03-09 15:05           ` Rafael Aquini
  0 siblings, 1 reply; 24+ messages in thread
From: Qian Cai @ 2020-03-09  3:27 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Linus Torvalds, Linux Memory Management List, LKML, mgorman,
	Andrew Morton, Michal Hocko, Vlastimil Babka



> On Mar 8, 2020, at 7:14 PM, Rafael Aquini <aquini@redhat.com> wrote:
> 
> On Sat, Mar 07, 2020 at 10:20:20PM -0500, Qian Cai wrote:
>> Not sure if it is really worth hurrying this into -rc5 now given...
>> 
>>> On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
>>>> 
>>>> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
>>>> with swapping on linux-next 20200306,
>>>> 
>>> 
>>> Would you mind sharing your .config ? Also it would be nice to know your
>>> system and workload details so I can try to reproduce.
>> 
>> LTP oom02 in a tight loop,
>> 
>> # i=0; while :; echo $((i++)); ./oom02; sleep 10; done
>> 
>> HPE ProLiant DL385 Gen10
>> AMD EPYC 7601 32-Core Processor
>> 65536 MB memory, 400 GB disk space
>> 
>> Processors	128
>> Cores	64
>> Sockets	2
>> 
>> linux-next 20200306
>> 
> 
> Thx! what about your .config?

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

My theory is this race,

From 73d4e8dad6cb248970bade5049cc072cd9090d0f Mon Sep 17 00:00:00 2001
From: Qian Cai <cai@lca.pw>
Date: Sun, 8 Mar 2020 17:37:51 -0400
Subject: [PATCH] mm/mprotect: fix a race in change_pmd_range()

The recent commit increased the race window between,

CPU0:
task_numa_work()
change_prot_numa()
change_protection()
change_protection_range()
change_pmd_range()
pmd_none_or_clear_bad_unless_trans_huge(pmd) <-- PMD is NULL.
		goto next;
				CPU1:
				do_huge_pmd_anonymous_page()
				__do_huge_pmd_anonymous_page()
				PMD is transhuge.

where it will not calling __split_huge_pmd() properly and trigger a
crash later on,

kernel BUG at include/linux/page-flags.h:317!

end_swap_bio_write()
  SetPageError(page)
      VM_BUG_ON_PAGE(1 && PageCompound(page))

Fix it by using a similar check in pmd_trans_unstable().

Fixes: 8b272b3cbbb5 ("mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/mprotect.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 311c0dadf71c..00dd2c66a434 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,7 +165,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
  * Used when setting automatic NUMA hinting protection where it is
  * critical that a numa hinting PMD is not confused with a bad PMD.
  */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+static inline int pmd_clear_bad_unless_none_or_trans_huge(pmd_t *pmd)
 {
 	pmd_t pmdval = pmd_read_atomic(pmd);
 
@@ -174,9 +174,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	barrier();
 #endif
 
-	if (pmd_none(pmdval))
-		return 1;
-	if (pmd_trans_huge(pmdval))
+	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
 		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
@@ -213,7 +211,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * for all the checks.
 		 */
 		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
+		     pmd_clear_bad_unless_none_or_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
-- 
2.21.0 (Apple Git-122.2)


> 
> 
> 
> Cheers,
> -- Rafael
> 
>>>> The trace is a bit off for some reasons, but it is basically,
>>>> 
>>>> kernel BUG at include/linux/page-flags.h:317!
>>>> 
>>>> end_swap_bio_write()
>>>>  SetPageError(page)
>>>>      VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>>>> 
>>>> The page was allocated in,
>>>> 
>>>> do_huge_pmd_anonymous_page()
>>>> 
>>>> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
>>>> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
>>>> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
>>>> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
>>>> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
>>>> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
>>>> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
>>>> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
>>>> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
>>>> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
>>>> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
>>>> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
>>>> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
>>>> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
>>>> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
>>>> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
>>>> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
>>>> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
>>>> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
>>>> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
>>>> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
>>>> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
>>>> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>>>> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
>>>> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>>>> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
>>>> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
>>>> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
>>>> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
>>>> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
>>>> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
>>>> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
>>>> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
>>>> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
>>>> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
>>>> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
>>>> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
>>>> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>>>> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
>>>> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
>>>> [ 9692.845443][   C25] page_owner tracks the page as allocated 
>>>> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
>>>> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
>>>> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
>>>> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
>>>> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
>>>> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
>>>> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>>>> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
>>>> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
>>>> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
>>>> [ 9692.845786][  T213] page_owner tracks the page as allocated 
>>>> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
>>>> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
>>>> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
>>>> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
>>>> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
>>>> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
>>>> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
>>>> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
>>>> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
>>>> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
>>>> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
>>>> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
>>>> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
>>>> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
>>>> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
>>>> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
>>>> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
>>>> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
>>>> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
>>>> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
>>>> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
>>>> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
>>>> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
>>>> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
>>>> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
>>>> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
>>>> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
>>>> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
>>>> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
>>>> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
>>>> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
>>>> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
>>>> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
>>>> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
>>>> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
>>>> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
>>>> [ 9692.849884][  T213]  libphy 
>>>> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
>>>> [ 9692.849952][  T213]  firmware_class 
>>>> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
>>>> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
>>>> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
>>>> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
>>>> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
>>>> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
>>>> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
>>>> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
>>>> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
>>>> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
>>>> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
>>>> 
>>>>> 
>>>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>>>> Cc: stable@vger.kernel.org
>>>>> Signed-off-by: Rafael Aquini <aquini@redhat.com>
>>>>> ---
>>>>> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
>>>>> 1 file changed, 36 insertions(+), 2 deletions(-)
>>>>> 
>>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>>> index 7a8e84f86831..9ea8cc0ab2fd 100644
>>>>> --- a/mm/mprotect.c
>>>>> +++ b/mm/mprotect.c
>>>>> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>>>> 	return pages;
>>>>> }
>>>>> 
>>>>> +/*
>>>>> + * Used when setting automatic NUMA hinting protection where it is
>>>>> + * critical that a numa hinting PMD is not confused with a bad PMD.
>>>>> + */
>>>>> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
>>>>> +{
>>>>> +	pmd_t pmdval = pmd_read_atomic(pmd);
>>>>> +
>>>>> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +	barrier();
>>>>> +#endif
>>>>> +
>>>>> +	if (pmd_none(pmdval))
>>>>> +		return 1;
>>>>> +	if (pmd_trans_huge(pmdval))
>>>>> +		return 0;
>>>>> +	if (unlikely(pmd_bad(pmdval))) {
>>>>> +		pmd_clear_bad(pmd);
>>>>> +		return 1;
>>>>> +	}
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>>>> 		pud_t *pud, unsigned long addr, unsigned long end,
>>>>> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
>>>>> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>>>> 		unsigned long this_pages;
>>>>> 
>>>>> 		next = pmd_addr_end(addr, end);
>>>>> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>>>>> -				&& pmd_none_or_clear_bad(pmd))
>>>>> +
>>>>> +		/*
>>>>> +		 * Automatic NUMA balancing walks the tables with mmap_sem
>>>>> +		 * held for read. It's possible a parallel update to occur
>>>>> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
>>>>> +		 * check leading to a false positive and clearing.
>>>>> +		 * Hence, it's ecessary to atomically read the PMD value
>>>>> +		 * for all the checks.
>>>>> +		 */
>>>>> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
>>>>> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
>>>>> 			goto next;
>>>>> 
>>>>> 		/* invoke the mmu notifier if the pmd is populated */
>>>>> -- 
>>>>> 2.24.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-03-09  3:27         ` Qian Cai
@ 2020-03-09 15:05           ` Rafael Aquini
  2020-03-11  0:04             ` Qian Cai
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Aquini @ 2020-03-09 15:05 UTC (permalink / raw)
  To: Qian Cai
  Cc: Linus Torvalds, Linux Memory Management List, LKML, mgorman,
	Andrew Morton, Michal Hocko, Vlastimil Babka

On Sun, Mar 08, 2020 at 11:27:22PM -0400, Qian Cai wrote:
> 
> 
> > On Mar 8, 2020, at 7:14 PM, Rafael Aquini <aquini@redhat.com> wrote:
> > 
> > On Sat, Mar 07, 2020 at 10:20:20PM -0500, Qian Cai wrote:
> >> Not sure if it is really worth hurrying this into -rc5 now given...
> >> 
> >>> On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
> >>>> 
> >>>> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
> >>>> with swapping on linux-next 20200306,
> >>>> 
> >>> 
> >>> Would you mind sharing your .config ? Also it would be nice to know your
> >>> system and workload details so I can try to reproduce.
> >> 
> >> LTP oom02 in a tight loop,
> >> 
> >> # i=0; while :; echo $((i++)); ./oom02; sleep 10; done
> >> 
> >> HPE ProLiant DL385 Gen10
> >> AMD EPYC 7601 32-Core Processor
> >> 65536 MB memory, 400 GB disk space
> >> 
> >> Processors	128
> >> Cores	64
> >> Sockets	2
> >> 
> >> linux-next 20200306
> >> 
> > 
> > Thx! what about your .config?
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> 
> My theory is this race,
> 
> From 73d4e8dad6cb248970bade5049cc072cd9090d0f Mon Sep 17 00:00:00 2001
> From: Qian Cai <cai@lca.pw>
> Date: Sun, 8 Mar 2020 17:37:51 -0400
> Subject: [PATCH] mm/mprotect: fix a race in change_pmd_range()
> 
> The recent commit increased the race window between,
> 
> CPU0:
> task_numa_work()
> change_prot_numa()
> change_protection()
> change_protection_range()
> change_pmd_range()
> pmd_none_or_clear_bad_unless_trans_huge(pmd) <-- PMD is NULL.
> 		goto next;
> 				CPU1:
> 				do_huge_pmd_anonymous_page()
> 				__do_huge_pmd_anonymous_page()
> 				PMD is transhuge.
> 
> where it will not calling __split_huge_pmd() properly and trigger a
> crash later on,
> 
> kernel BUG at include/linux/page-flags.h:317!
> 
> end_swap_bio_write()
>   SetPageError(page)
>       VM_BUG_ON_PAGE(1 && PageCompound(page))
> 
> Fix it by using a similar check in pmd_trans_unstable().
> 
> Fixes: 8b272b3cbbb5 ("mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa")
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  mm/mprotect.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 311c0dadf71c..00dd2c66a434 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -165,7 +165,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   * Used when setting automatic NUMA hinting protection where it is
>   * critical that a numa hinting PMD is not confused with a bad PMD.
>   */
> -static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +static inline int pmd_clear_bad_unless_none_or_trans_huge(pmd_t *pmd)
>  {
>  	pmd_t pmdval = pmd_read_atomic(pmd);
>  
> @@ -174,9 +174,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
>  	barrier();
>  #endif
>  
> -	if (pmd_none(pmdval))
> -		return 1;
> -	if (pmd_trans_huge(pmdval))
> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))


I'm still waiting on a similar system to become available, so I can
work on your reproducer case, as well as to dig and wrap my head around it.

I still don't think that skipping the pmd_none() in the change-protection walk 
should cause a big fuss like you observed here (yet, it seems it does), and 
the fact that we need that race window to take the __split_huge_pmd() suggests,
at least to me, that we might be missing this proper split somewhere else.

>  	if (unlikely(pmd_bad(pmdval))) {
>  		pmd_clear_bad(pmd);
> @@ -213,7 +211,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		 * for all the checks.
>  		 */
>  		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> -		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> +		     pmd_clear_bad_unless_none_or_trans_huge(pmd))
>  			goto next;
>  
>  		/* invoke the mmu notifier if the pmd is populated */
> -- 
> 2.21.0 (Apple Git-122.2)
> 
> 
> > 
> > 
> > 
> > Cheers,
> > -- Rafael
> > 
> >>>> The trace is a bit off for some reasons, but it is basically,
> >>>> 
> >>>> kernel BUG at include/linux/page-flags.h:317!
> >>>> 
> >>>> end_swap_bio_write()
> >>>>  SetPageError(page)
> >>>>      VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >>>> 
> >>>> The page was allocated in,
> >>>> 
> >>>> do_huge_pmd_anonymous_page()
> >>>> 
> >>>> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
> >>>> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
> >>>> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
> >>>> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
> >>>> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
> >>>> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
> >>>> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
> >>>> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
> >>>> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
> >>>> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
> >>>> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
> >>>> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
> >>>> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
> >>>> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
> >>>> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
> >>>> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
> >>>> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
> >>>> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
> >>>> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
> >>>> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
> >>>> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
> >>>> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
> >>>> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >>>> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
> >>>> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >>>> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
> >>>> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
> >>>> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
> >>>> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
> >>>> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
> >>>> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
> >>>> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
> >>>> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> >>>> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
> >>>> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
> >>>> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
> >>>> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
> >>>> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >>>> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> >>>> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
> >>>> [ 9692.845443][   C25] page_owner tracks the page as allocated 
> >>>> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
> >>>> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> >>>> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
> >>>> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
> >>>> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
> >>>> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
> >>>> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >>>> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
> >>>> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
> >>>> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
> >>>> [ 9692.845786][  T213] page_owner tracks the page as allocated 
> >>>> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
> >>>> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> >>>> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
> >>>> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
> >>>> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
> >>>> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
> >>>> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
> >>>> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
> >>>> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
> >>>> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
> >>>> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
> >>>> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
> >>>> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
> >>>> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
> >>>> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
> >>>> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
> >>>> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
> >>>> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
> >>>> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
> >>>> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
> >>>> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
> >>>> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
> >>>> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
> >>>> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
> >>>> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
> >>>> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
> >>>> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
> >>>> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
> >>>> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
> >>>> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
> >>>> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
> >>>> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
> >>>> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
> >>>> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
> >>>> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
> >>>> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
> >>>> [ 9692.849884][  T213]  libphy 
> >>>> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
> >>>> [ 9692.849952][  T213]  firmware_class 
> >>>> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
> >>>> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
> >>>> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
> >>>> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
> >>>> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
> >>>> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
> >>>> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
> >>>> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
> >>>> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
> >>>> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
> >>>> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
> >>>> 
> >>>>> 
> >>>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >>>>> Cc: stable@vger.kernel.org
> >>>>> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> >>>>> ---
> >>>>> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> >>>>> 1 file changed, 36 insertions(+), 2 deletions(-)
> >>>>> 
> >>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>>>> index 7a8e84f86831..9ea8cc0ab2fd 100644
> >>>>> --- a/mm/mprotect.c
> >>>>> +++ b/mm/mprotect.c
> >>>>> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >>>>> 	return pages;
> >>>>> }
> >>>>> 
> >>>>> +/*
> >>>>> + * Used when setting automatic NUMA hinting protection where it is
> >>>>> + * critical that a numa hinting PMD is not confused with a bad PMD.
> >>>>> + */
> >>>>> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> >>>>> +{
> >>>>> +	pmd_t pmdval = pmd_read_atomic(pmd);
> >>>>> +
> >>>>> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> >>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>> +	barrier();
> >>>>> +#endif
> >>>>> +
> >>>>> +	if (pmd_none(pmdval))
> >>>>> +		return 1;
> >>>>> +	if (pmd_trans_huge(pmdval))
> >>>>> +		return 0;
> >>>>> +	if (unlikely(pmd_bad(pmdval))) {
> >>>>> +		pmd_clear_bad(pmd);
> >>>>> +		return 1;
> >>>>> +	}
> >>>>> +
> >>>>> +	return 0;
> >>>>> +}
> >>>>> +
> >>>>> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>>>> 		pud_t *pud, unsigned long addr, unsigned long end,
> >>>>> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> >>>>> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>>>> 		unsigned long this_pages;
> >>>>> 
> >>>>> 		next = pmd_addr_end(addr, end);
> >>>>> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> >>>>> -				&& pmd_none_or_clear_bad(pmd))
> >>>>> +
> >>>>> +		/*
> >>>>> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> >>>>> +		 * held for read. It's possible a parallel update to occur
> >>>>> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> >>>>> +		 * check leading to a false positive and clearing.
> >>>>> +		 * Hence, it's ecessary to atomically read the PMD value
> >>>>> +		 * for all the checks.
> >>>>> +		 */
> >>>>> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> >>>>> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> >>>>> 			goto next;
> >>>>> 
> >>>>> 		/* invoke the mmu notifier if the pmd is populated */
> >>>>> -- 
> >>>>> 2.24.1
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2020-03-09 15:05           ` Rafael Aquini
@ 2020-03-11  0:04             ` Qian Cai
  0 siblings, 0 replies; 24+ messages in thread
From: Qian Cai @ 2020-03-11  0:04 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Linus Torvalds, Linux Memory Management List, LKML, mgorman,
	Andrew Morton, Michal Hocko, Vlastimil Babka



> On Mar 9, 2020, at 11:05 AM, Rafael Aquini <aquini@redhat.com> wrote:
> I'm still waiting on a similar system to become available, so I can
> work on your reproducer case, as well as to dig and wrap my head around it.
> 
> I still don't think that skipping the pmd_none() in the change-protection walk 
> should cause a big fuss like you observed here (yet, it seems it does), and 
> the fact that we need that race window to take the __split_huge_pmd() suggests,
> at least to me, that we might be missing this proper split somewhere else.

I have sent out another patch which should be more correct,

https://lore.kernel.org/lkml/20200310235846.1319-1-cai@lca.pw/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-11  6:35             ` Vlastimil Babka
@ 2017-04-11 21:44               ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2017-04-11 21:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Zi Yan, Mel Gorman, Kirill A. Shutemov, Andrea Arcangeli,
	Rik van Riel, Michal Hocko, linux-mm, linux-kernel

On Tue, 11 Apr 2017 08:35:02 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> >> I have Kirrill's
> >>
> >> thp-reduce-indentation-level-in-change_huge_pmd.patch
> >> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
> >> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
> >> thp-fix-madv_dontneed-vs-madv_free-race.patch
> >> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
> >> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
> >>
> >> scheduled for 4.12-rc1.  It sounds like
> >> thp-fix-madv_dontneed-vs-madv_free-race.patch and
> >> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
> >> 4.11 and stable?
> > 
> > thp-fix-madv_dontneed-vs-numa-balancing-race.patch is the fix for
> > numa balancing problem reported in this thread.
> > 
> > mm-drop-unused-pmdp_huge_get_and_clear_notify.patch,
> > thp-fix-madv_dontneed-vs-madv_free-race.patch,
> > thp-fix-madv_dontneed-vs-madv_free-race-fix.patch, and
> > thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
> > 
> > are the fixes for other potential race problems similar to this one.
> > 
> > I think it is better to have all these patches applied.
> 
> Yeah we should get all such fixes to stable IMHO (after review :). It's
> not the first time that a fix for MADV_DONTNEED turned out to also fix a
> race that involved "normal operation" with THP, without such syscalls.

The presence of thp-reduce-indentation-level-in-change_huge_pmd.patch
is a pain in the ass but I've decided to keep it rather than churning
all the patches at a late stage.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 22:09         ` Andrew Morton
  2017-04-10 22:28           ` Zi Yan
@ 2017-04-11  8:29           ` Mel Gorman
  1 sibling, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2017-04-11  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zi Yan, Kirill A. Shutemov, Andrea Arcangeli, Rik van Riel,
	Michal Hocko, Vlastimil Babka, linux-mm, linux-kernel

On Mon, Apr 10, 2017 at 03:09:03PM -0700, Andrew Morton wrote:
> On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
> > > On 10 Apr 2017, at 12:20, Mel Gorman wrote:
> > > 
> > > > On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> > > >>> While this could be fixed with heavy locking, it's only necessary to
> > > >>> make a copy of the PMD on the stack during change_pmd_range and avoid
> > > >>> races. A new helper is created for this as the check if quite subtle and the
> > > >>> existing similar helpful is not suitable. This passed 154 hours of testing
> > > >>> (usually triggers between 20 minutes and 24 hours) without detecting bad
> > > >>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > > >>> no significant change in behaviour.
> > > >>>
> > > >>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > >>> Cc: stable@vger.kernel.org
> > > >>
> > > >> Does this patch fix the same problem fixed by Kirill's patch here?
> > > >> https://lkml.org/lkml/2017/3/2/347
> > > >>
> > > >
> > > > I don't think so. The race I'm concerned with is due to locks not being
> > > > held and is in a different path.
> > > 
> > > I do not agree. Kirill's patch is fixing the same race problem but in
> > > zap_pmd_range().
> > > 
> > > The original autoNUMA code first clears PMD then sets it to protnone entry.
> > > pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
> > > pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
> > > Is this the problem you are trying solve?
> > > 
> > > Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
> > > so pmd_trans_huge() will return TRUE. In this case, it also fixes
> > > your race problem in change_pmd_range().
> > > 
> > > Let me know if I miss anything.
> > > 
> > 
> > Ok, now I see. I think you're correct and I withdraw the patch.
> 
> I have Kirrill's
> 
> thp-reduce-indentation-level-in-change_huge_pmd.patch
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
> thp-fix-madv_dontneed-vs-madv_free-race.patch
> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
> 
> scheduled for 4.12-rc1.  It sounds like
> thp-fix-madv_dontneed-vs-madv_free-race.patch and
> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
> 4.11 and stable?

Arguably all of them deal with different classes of race. The first two
should be tagged for any stable kernel after 3.15 because that's the
only one I know for certain occurs in the field albeit not on a mainline
kernel. There will be conflicts on older kernels due to changes in the
PMD locking API and it'll be up to the tree maintainers and patch owners
if they want to backport or not.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 22:28           ` Zi Yan
@ 2017-04-11  6:35             ` Vlastimil Babka
  2017-04-11 21:44               ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2017-04-11  6:35 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton
  Cc: Mel Gorman, Kirill A. Shutemov, Andrea Arcangeli, Rik van Riel,
	Michal Hocko, linux-mm, linux-kernel

On 04/11/2017 12:28 AM, Zi Yan wrote:
> On 10 Apr 2017, at 17:09, Andrew Morton wrote:
> 
>> On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>>
>>> On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
>>>> On 10 Apr 2017, at 12:20, Mel Gorman wrote:
>>>>
>>>>> On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
>>>>>>> While this could be fixed with heavy locking, it's only necessary to
>>>>>>> make a copy of the PMD on the stack during change_pmd_range and avoid
>>>>>>> races. A new helper is created for this as the check if quite subtle and the
>>>>>>> existing similar helpful is not suitable. This passed 154 hours of testing
>>>>>>> (usually triggers between 20 minutes and 24 hours) without detecting bad
>>>>>>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
>>>>>>> no significant change in behaviour.
>>>>>>>
>>>>>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>
>>>>>> Does this patch fix the same problem fixed by Kirill's patch here?
>>>>>> https://lkml.org/lkml/2017/3/2/347
>>>>>>
>>>>>
>>>>> I don't think so. The race I'm concerned with is due to locks not being
>>>>> held and is in a different path.
>>>>
>>>> I do not agree. Kirill's patch is fixing the same race problem but in
>>>> zap_pmd_range().
>>>>
>>>> The original autoNUMA code first clears PMD then sets it to protnone entry.
>>>> pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
>>>> pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
>>>> Is this the problem you are trying solve?
>>>>
>>>> Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
>>>> so pmd_trans_huge() will return TRUE. In this case, it also fixes
>>>> your race problem in change_pmd_range().
>>>>
>>>> Let me know if I miss anything.
>>>>
>>>
>>> Ok, now I see. I think you're correct and I withdraw the patch.
>>
>> I have Kirrill's
>>
>> thp-reduce-indentation-level-in-change_huge_pmd.patch
>> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
>> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
>> thp-fix-madv_dontneed-vs-madv_free-race.patch
>> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
>> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
>>
>> scheduled for 4.12-rc1.  It sounds like
>> thp-fix-madv_dontneed-vs-madv_free-race.patch and
>> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
>> 4.11 and stable?
> 
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch is the fix for
> numa balancing problem reported in this thread.
> 
> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch,
> thp-fix-madv_dontneed-vs-madv_free-race.patch,
> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch, and
> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
> 
> are the fixes for other potential race problems similar to this one.
> 
> I think it is better to have all these patches applied.

Yeah we should get all such fixes to stable IMHO (after review :). It's
not the first time that a fix for MADV_DONTNEED turned out to also fix a
race that involved "normal operation" with THP, without such syscalls.

> --
> Best Regards
> Yan Zi
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 22:09         ` Andrew Morton
@ 2017-04-10 22:28           ` Zi Yan
  2017-04-11  6:35             ` Vlastimil Babka
  2017-04-11  8:29           ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Zi Yan @ 2017-04-10 22:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Kirill A. Shutemov, Andrea Arcangeli, Rik van Riel,
	Michal Hocko, Vlastimil Babka, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2882 bytes --]

On 10 Apr 2017, at 17:09, Andrew Morton wrote:

> On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
>> On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
>>> On 10 Apr 2017, at 12:20, Mel Gorman wrote:
>>>
>>>> On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
>>>>>> While this could be fixed with heavy locking, it's only necessary to
>>>>>> make a copy of the PMD on the stack during change_pmd_range and avoid
>>>>>> races. A new helper is created for this as the check if quite subtle and the
>>>>>> existing similar helpful is not suitable. This passed 154 hours of testing
>>>>>> (usually triggers between 20 minutes and 24 hours) without detecting bad
>>>>>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
>>>>>> no significant change in behaviour.
>>>>>>
>>>>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>>>>> Cc: stable@vger.kernel.org
>>>>>
>>>>> Does this patch fix the same problem fixed by Kirill's patch here?
>>>>> https://lkml.org/lkml/2017/3/2/347
>>>>>
>>>>
>>>> I don't think so. The race I'm concerned with is due to locks not being
>>>> held and is in a different path.
>>>
>>> I do not agree. Kirill's patch is fixing the same race problem but in
>>> zap_pmd_range().
>>>
>>> The original autoNUMA code first clears PMD then sets it to protnone entry.
>>> pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
>>> pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
>>> Is this the problem you are trying solve?
>>>
>>> Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
>>> so pmd_trans_huge() will return TRUE. In this case, it also fixes
>>> your race problem in change_pmd_range().
>>>
>>> Let me know if I miss anything.
>>>
>>
>> Ok, now I see. I think you're correct and I withdraw the patch.
>
> I have Kirrill's
>
> thp-reduce-indentation-level-in-change_huge_pmd.patch
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
> thp-fix-madv_dontneed-vs-madv_free-race.patch
> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
>
> scheduled for 4.12-rc1.  It sounds like
> thp-fix-madv_dontneed-vs-madv_free-race.patch and
> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
> 4.11 and stable?

thp-fix-madv_dontneed-vs-numa-balancing-race.patch is the fix for
numa balancing problem reported in this thread.

mm-drop-unused-pmdp_huge_get_and_clear_notify.patch,
thp-fix-madv_dontneed-vs-madv_free-race.patch,
thp-fix-madv_dontneed-vs-madv_free-race-fix.patch, and
thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch

are the fixes for other potential race problems similar to this one.

I think it is better to have all these patches applied.

--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 18:07       ` Mel Gorman
@ 2017-04-10 22:09         ` Andrew Morton
  2017-04-10 22:28           ` Zi Yan
  2017-04-11  8:29           ` Mel Gorman
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2017-04-10 22:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Zi Yan, Kirill A. Shutemov, Andrea Arcangeli, Rik van Riel,
	Michal Hocko, Vlastimil Babka, linux-mm, linux-kernel

On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
> > On 10 Apr 2017, at 12:20, Mel Gorman wrote:
> > 
> > > On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> > >>> While this could be fixed with heavy locking, it's only necessary to
> > >>> make a copy of the PMD on the stack during change_pmd_range and avoid
> > >>> races. A new helper is created for this as the check if quite subtle and the
> > >>> existing similar helpful is not suitable. This passed 154 hours of testing
> > >>> (usually triggers between 20 minutes and 24 hours) without detecting bad
> > >>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > >>> no significant change in behaviour.
> > >>>
> > >>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > >>> Cc: stable@vger.kernel.org
> > >>
> > >> Does this patch fix the same problem fixed by Kirill's patch here?
> > >> https://lkml.org/lkml/2017/3/2/347
> > >>
> > >
> > > I don't think so. The race I'm concerned with is due to locks not being
> > > held and is in a different path.
> > 
> > I do not agree. Kirill's patch is fixing the same race problem but in
> > zap_pmd_range().
> > 
> > The original autoNUMA code first clears PMD then sets it to protnone entry.
> > pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
> > pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
> > Is this the problem you are trying solve?
> > 
> > Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
> > so pmd_trans_huge() will return TRUE. In this case, it also fixes
> > your race problem in change_pmd_range().
> > 
> > Let me know if I miss anything.
> > 
> 
> Ok, now I see. I think you're correct and I withdraw the patch.

I have Kirrill's

thp-reduce-indentation-level-in-change_huge_pmd.patch
thp-fix-madv_dontneed-vs-numa-balancing-race.patch
mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
thp-fix-madv_dontneed-vs-madv_free-race.patch
thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch

scheduled for 4.12-rc1.  It sounds like
thp-fix-madv_dontneed-vs-madv_free-race.patch and
thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
4.11 and stable?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 17:49     ` Zi Yan
@ 2017-04-10 18:07       ` Mel Gorman
  2017-04-10 22:09         ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2017-04-10 18:07 UTC (permalink / raw)
  To: Zi Yan
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Rik van Riel, Michal Hocko, Vlastimil Babka, linux-mm,
	linux-kernel

On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
> On 10 Apr 2017, at 12:20, Mel Gorman wrote:
> 
> > On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> >>> While this could be fixed with heavy locking, it's only necessary to
> >>> make a copy of the PMD on the stack during change_pmd_range and avoid
> >>> races. A new helper is created for this as the check if quite subtle and the
> >>> existing similar helpful is not suitable. This passed 154 hours of testing
> >>> (usually triggers between 20 minutes and 24 hours) without detecting bad
> >>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> >>> no significant change in behaviour.
> >>>
> >>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >>> Cc: stable@vger.kernel.org
> >>
> >> Does this patch fix the same problem fixed by Kirill's patch here?
> >> https://lkml.org/lkml/2017/3/2/347
> >>
> >
> > I don't think so. The race I'm concerned with is due to locks not being
> > held and is in a different path.
> 
> I do not agree. Kirill's patch is fixing the same race problem but in
> zap_pmd_range().
> 
> The original autoNUMA code first clears PMD then sets it to protnone entry.
> pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
> pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
> Is this the problem you are trying solve?
> 
> Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
> so pmd_trans_huge() will return TRUE. In this case, it also fixes
> your race problem in change_pmd_range().
> 
> Let me know if I miss anything.
> 

Ok, now I see. I think you're correct and I withdraw the patch.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 17:20   ` Mel Gorman
@ 2017-04-10 17:49     ` Zi Yan
  2017-04-10 18:07       ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2017-04-10 17:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Rik van Riel, Michal Hocko, Vlastimil Babka, linux-mm,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1533 bytes --]

On 10 Apr 2017, at 12:20, Mel Gorman wrote:

> On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
>>> While this could be fixed with heavy locking, it's only necessary to
>>> make a copy of the PMD on the stack during change_pmd_range and avoid
>>> races. A new helper is created for this as the check if quite subtle and the
>>> existing similar helpful is not suitable. This passed 154 hours of testing
>>> (usually triggers between 20 minutes and 24 hours) without detecting bad
>>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
>>> no significant change in behaviour.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>> Cc: stable@vger.kernel.org
>>
>> Does this patch fix the same problem fixed by Kirill's patch here?
>> https://lkml.org/lkml/2017/3/2/347
>>
>
> I don't think so. The race I'm concerned with is due to locks not being
> held and is in a different path.

I do not agree. Kirill's patch is fixing the same race problem but in
zap_pmd_range().

The original autoNUMA code first clears PMD then sets it to protnone entry.
pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
Is this the problem you are trying solve?

Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
so pmd_trans_huge() will return TRUE. In this case, it also fixes
your race problem in change_pmd_range().

Let me know if I miss anything.

Thanks.

--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 13:53 ` Michal Hocko
@ 2017-04-10 17:38   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2017-04-10 17:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, Vlastimil Babka,
	linux-mm, linux-kernel

On Mon, Apr 10, 2017 at 03:53:42PM +0200, Michal Hocko wrote:
> > While this could be fixed with heavy locking, it's only necessary to
> > make a copy of the PMD on the stack during change_pmd_range and avoid
> > races. A new helper is created for this as the check if quite subtle and the
> > existing similar helpful is not suitable. This passed 154 hours of testing
> 
> s@helpful@helper@ I suspect
> 

Yes. I'll wait to see if there is more feedback and if not, resend unless
Andrew decides to pick it up and correct the mistake directly.

> > (usually triggers between 20 minutes and 24 hours) without detecting bad
> > PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > no significant change in behaviour.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Cc: stable@vger.kernel.org
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Thanks.

> you will probably win the_longest_function_name_contest but I do not
> have much better suggestion.
> 

I know, it's not a type of function that yields a snappy name.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 16:45 ` Zi Yan
@ 2017-04-10 17:20   ` Mel Gorman
  2017-04-10 17:49     ` Zi Yan
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2017-04-10 17:20 UTC (permalink / raw)
  To: Zi Yan
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Rik van Riel, Michal Hocko, Vlastimil Babka, linux-mm,
	linux-kernel

On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> > While this could be fixed with heavy locking, it's only necessary to
> > make a copy of the PMD on the stack during change_pmd_range and avoid
> > races. A new helper is created for this as the check if quite subtle and the
> > existing similar helpful is not suitable. This passed 154 hours of testing
> > (usually triggers between 20 minutes and 24 hours) without detecting bad
> > PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > no significant change in behaviour.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Cc: stable@vger.kernel.org
> 
> Does this patch fix the same problem fixed by Kirill's patch here?
> https://lkml.org/lkml/2017/3/2/347
> 

I don't think so. The race I'm concerned with is due to locks not being
held and is in a different path.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10  9:48 [PATCH] mm, numa: Fix " Mel Gorman
                   ` (2 preceding siblings ...)
  2017-04-10 13:53 ` Michal Hocko
@ 2017-04-10 16:45 ` Zi Yan
  2017-04-10 17:20   ` Mel Gorman
  3 siblings, 1 reply; 24+ messages in thread
From: Zi Yan @ 2017-04-10 16:45 UTC (permalink / raw)
  To: Mel Gorman, Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	Vlastimil Babka, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2512 bytes --]

On 10 Apr 2017, at 4:48, Mel Gorman wrote:

> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours that a bad PMD would be reported followed
> by a memory corruption where expected data was all zeros.  A partial report
> of the bad PMD looked like
>
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
>
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages is disabled.
>
> The bug is due a race in change_pmd_range between a pmd_trans_huge and
> pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
>
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check if quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing
> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org

Does this patch fix the same problem fixed by Kirill's patch here?
https://lkml.org/lkml/2017/3/2/347

--
Best Regards
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 496 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10  9:48 [PATCH] mm, numa: Fix " Mel Gorman
  2017-04-10 10:03 ` Vlastimil Babka
  2017-04-10 12:38 ` Rik van Riel
@ 2017-04-10 13:53 ` Michal Hocko
  2017-04-10 17:38   ` Mel Gorman
  2017-04-10 16:45 ` Zi Yan
  3 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-04-10 13:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, Vlastimil Babka,
	linux-mm, linux-kernel

On Mon 10-04-17 10:48:25, Mel Gorman wrote:
> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours that a bad PMD would be reported followed
> by a memory corruption where expected data was all zeros.  A partial report
> of the bad PMD looked like
> 
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages is disabled.
> 
> The bug is due a race in change_pmd_range between a pmd_trans_huge and
> pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
> 
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check if quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing

s@helpful@helper@ I suspect

> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org

Acked-by: Michal Hocko <mhocko@suse.com>

you will probably win the_longest_function_name_contest but I do not
have much better suggestion.

> ---
>  include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
>  mm/mprotect.c                 | 12 ++++++++++--
>  2 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1fad160f35de..597fa482cd4a 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  }
>  
>  /*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> +	pmd_t pmdval = pmd_read_atomic(pmd);
> +
> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	barrier();
> +#endif
> +
> +	if (pmd_none(pmdval))
> +		return 1;
> +	if (pmd_trans_huge(pmdval))
> +		return 0;
> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +
> +/*
>   * This is a noop if Transparent Hugepage Support is not built into
>   * the kernel. Otherwise it is equivalent to
>   * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..821ff2904cdb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		unsigned long this_pages;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> -				&& pmd_none_or_clear_bad(pmd))
> +
> +		/*
> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> +		 * held for read. It's possible a parallel update
> +		 * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
> +		 * check leading to a false positive and clearing. Hence, it's
> +		 * necessary to atomically read the PMD value for all the
> +		 * checks.
> +		 */
> +		if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
>  			continue;
>  
>  		/* invoke the mmu notifier if the pmd is populated */

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10  9:48 [PATCH] mm, numa: Fix " Mel Gorman
  2017-04-10 10:03 ` Vlastimil Babka
@ 2017-04-10 12:38 ` Rik van Riel
  2017-04-10 13:53 ` Michal Hocko
  2017-04-10 16:45 ` Zi Yan
  3 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2017-04-10 12:38 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Andrea Arcangeli, Michal Hocko, Vlastimil Babka, linux-mm, linux-kernel

On Mon, 2017-04-10 at 10:48 +0100, Mel Gorman wrote:
> 
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check if quite subtle
> and the
> existing similar helpful is not suitable. This passed 154 hours of
> testing
> (usually triggers between 20 minutes and 24 hours) without detecting
> bad
> PMDs or corruption. A basic test of an autonuma-intensive workload
> showed
> no significant change in behaviour.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10 10:03 ` Vlastimil Babka
@ 2017-04-10 12:19   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2017-04-10 12:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel, Michal Hocko,
	linux-mm, linux-kernel

On Mon, Apr 10, 2017 at 12:03:20PM +0200, Vlastimil Babka wrote:
> On 04/10/2017 11:48 AM, Mel Gorman wrote:
> > A user reported a bug against a distribution kernel while running
> > a proprietary workload described as "memory intensive that is not
> > swapping" that is expected to apply to mainline kernels. The workload
> > is read/write/modifying ranges of memory and checking the contents. They
> > reported that within a few hours that a bad PMD would be reported followed
> > by a memory corruption where expected data was all zeros.  A partial report
> > of the bad PMD looked like
> > 
> > [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> > [ 5195.341184] ------------[ cut here ]------------
> > [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> > ....
> > [ 5195.410033] Call Trace:
> > [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> > [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> > [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> > [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> > [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> > [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> > [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> > 
> > Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> > was a false detection. The bug does not trigger if automatic NUMA balancing
> > or transparent huge pages is disabled.
> > 
> > The bug is due a race in change_pmd_range between a pmd_trans_huge and
> > pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> > check, a parallel protection update under lock can have cleared the PMD
> > and filled it with a prot_numa entry between the transhuge check and the
> > pmd_none_or_clear_bad check.
> > 
> > While this could be fixed with heavy locking, it's only necessary to
> > make a copy of the PMD on the stack during change_pmd_range and avoid
> > races. A new helper is created for this as the check if quite subtle and the
> > existing similar helpful is not suitable. This passed 154 hours of testing
> > (usually triggers between 20 minutes and 24 hours) without detecting bad
> > PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > no significant change in behaviour.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Cc: stable@vger.kernel.org
> 
> It would be better if there was a Fixes: tag, or at least version hint. Assuming
> it's since autonuma balancing was merged?
> 

Fair point. It's all the way back to 3.15 rather than all the way back to
the introduction of automatic NUMA balancing so

Cc: stable@vger.kernel.org # 3.15+

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
  2017-04-10  9:48 [PATCH] mm, numa: Fix " Mel Gorman
@ 2017-04-10 10:03 ` Vlastimil Babka
  2017-04-10 12:19   ` Mel Gorman
  2017-04-10 12:38 ` Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2017-04-10 10:03 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Andrea Arcangeli, Rik van Riel, Michal Hocko, linux-mm, linux-kernel

On 04/10/2017 11:48 AM, Mel Gorman wrote:
> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours that a bad PMD would be reported followed
> by a memory corruption where expected data was all zeros.  A partial report
> of the bad PMD looked like
> 
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages is disabled.
> 
> The bug is due a race in change_pmd_range between a pmd_trans_huge and
> pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
> 
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check if quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing
> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org

It would be better if there was a Fixes: tag, or at least version hint. Assuming
it's since autonuma balancing was merged?

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
>  mm/mprotect.c                 | 12 ++++++++++--
>  2 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1fad160f35de..597fa482cd4a 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
>  }
>  
>  /*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> +	pmd_t pmdval = pmd_read_atomic(pmd);
> +
> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	barrier();
> +#endif
> +
> +	if (pmd_none(pmdval))
> +		return 1;
> +	if (pmd_trans_huge(pmdval))
> +		return 0;
> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +
> +/*
>   * This is a noop if Transparent Hugepage Support is not built into
>   * the kernel. Otherwise it is equivalent to
>   * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..821ff2904cdb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		unsigned long this_pages;
>  
>  		next = pmd_addr_end(addr, end);
> -		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> -				&& pmd_none_or_clear_bad(pmd))
> +
> +		/*
> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> +		 * held for read. It's possible a parallel update
> +		 * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
> +		 * check leading to a false positive and clearing. Hence, it's
> +		 * necessary to atomically read the PMD value for all the
> +		 * checks.
> +		 */
> +		if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
>  			continue;
>  
>  		/* invoke the mmu notifier if the pmd is populated */
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa
@ 2017-04-10  9:48 Mel Gorman
  2017-04-10 10:03 ` Vlastimil Babka
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Mel Gorman @ 2017-04-10  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Rik van Riel, Michal Hocko, Vlastimil Babka,
	linux-mm, linux-kernel

A user reported a bug against a distribution kernel while running
a proprietary workload described as "memory intensive that is not
swapping" that is expected to apply to mainline kernels. The workload
is read/write/modifying ranges of memory and checking the contents. They
reported that within a few hours that a bad PMD would be reported followed
by a memory corruption where expected data was all zeros.  A partial report
of the bad PMD looked like

[ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
[ 5195.341184] ------------[ cut here ]------------
[ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
....
[ 5195.410033] Call Trace:
[ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
[ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
[ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
[ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
[ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
[ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
[ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10

Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
was a false detection. The bug does not trigger if automatic NUMA balancing
or transparent huge pages is disabled.

The bug is due a race in change_pmd_range between a pmd_trans_huge and
pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
check, a parallel protection update under lock can have cleared the PMD
and filled it with a prot_numa entry between the transhuge check and the
pmd_none_or_clear_bad check.

While this could be fixed with heavy locking, it's only necessary to
make a copy of the PMD on the stack during change_pmd_range and avoid
races. A new helper is created for this as the check if quite subtle and the
existing similar helpful is not suitable. This passed 154 hours of testing
(usually triggers between 20 minutes and 24 hours) without detecting bad
PMDs or corruption. A basic test of an autonuma-intensive workload showed
no significant change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org
---
 include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
 mm/mprotect.c                 | 12 ++++++++++--
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1fad160f35de..597fa482cd4a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 }
 
 /*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+	return 0;
+}
+
+
+/*
  * This is a noop if Transparent Hugepage Support is not built into
  * the kernel. Otherwise it is equivalent to
  * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..821ff2904cdb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible a parallel update
+		 * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
+		 * check leading to a false positive and clearing. Hence, it's
+		 * necessary to atomically read the PMD value for all the
+		 * checks.
+		 */
+		if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			continue;
 
 		/* invoke the mmu notifier if the pmd is populated */

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-03-11  0:04 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-16 19:18 [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa Rafael Aquini
2020-02-16 23:32 ` Mel Gorman
2020-03-07  2:40 ` Qian Cai
2020-03-07  3:05   ` Rafael Aquini
2020-03-08  3:20     ` Qian Cai
2020-03-08 23:14       ` Rafael Aquini
2020-03-09  3:27         ` Qian Cai
2020-03-09 15:05           ` Rafael Aquini
2020-03-11  0:04             ` Qian Cai
  -- strict thread matches above, loose matches on Subject: below --
2017-04-10  9:48 [PATCH] mm, numa: Fix " Mel Gorman
2017-04-10 10:03 ` Vlastimil Babka
2017-04-10 12:19   ` Mel Gorman
2017-04-10 12:38 ` Rik van Riel
2017-04-10 13:53 ` Michal Hocko
2017-04-10 17:38   ` Mel Gorman
2017-04-10 16:45 ` Zi Yan
2017-04-10 17:20   ` Mel Gorman
2017-04-10 17:49     ` Zi Yan
2017-04-10 18:07       ` Mel Gorman
2017-04-10 22:09         ` Andrew Morton
2017-04-10 22:28           ` Zi Yan
2017-04-11  6:35             ` Vlastimil Babka
2017-04-11 21:44               ` Andrew Morton
2017-04-11  8:29           ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).