LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [RFC PATCH 0/8] Drop mmap_sem during unmapping large map
@ 2018-03-20 21:31 Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel


Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large
memory space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784). So, this
series of patches are aimed to solve the mmap_sem issue by adopting akpm's
suggestion.


Approach:
A couple of approaches were explored.
#1. Unmap large map by section in vm_munmap(). It works, but just sys_munmap()
can benefit from this change.

#2. Do unmapping in deeper place of the call chain, i.e. zap_pmd_range().
    In this way, I don't have to define a magic size for unmapping. But, there
    are two major issues:
      * mmap_sem may be acquired by down_write() or down_read() in all the
        possible call paths. So, the call path has to be checked to determine
        to use which variants, either _write or _read. It increases the
        complexity significantly.
      * The below race condition might be introduced:
          CPU A                         CPU B 
       ----------                     ---------- 
       do_munmap
     zap_pmd_range 
       up_write                         do_munmap
                                        down_write 
                                        ...... 
                                        remove_vma_list 
                                        up_write 
      down_write 
     access vmas  <-- use-after-free bug

        And, unmapping by section requires splitting vma, so the code has to
        deal with partial unmapped vma, it also increase the complexity
        significantly. 

#3. Do it in do_munmap(). I can keep splitting vma/unmap region/free pagetables
    /free vmas sequence atomic for every section. And, not only sys_munmap()
    can benefit, but also mremap and sysv shm. The only problem is it may not
    want to drop mmap_sem from some call paths. So, an extra parameter, called
    "atomic", is introduced to do_munmap(). The caller can pass "true" or "false"
    to tell do_munmap() if dropping mmap_sem is expected or not. "True" means not
    drop, "false" means drop. Since all callers to do_munmap() acquire mmap_sem
    by _write, so I just need deal with one variant. And, when re-acquiring
    mmap_sem, just use down_write() for now since dealing with the return value
    of down_write_killable() sounds unnecessary.

    Other than these, a magic section size has to be defined explicitly, now
    HPAGE_PUD_SIZE is used. According to my test, HPAGE_PUD_SIZE sounds good
    enough. This is also why down_write() is used for re-acquiring mmap_sem
    instead of down_write_killable(). Smaller size looks have to much overhead.

Regression and performance data:
Test is run on a machine with 32 cores of E5-2680 @ 2.70GHz and 384GB memory

Full LTP test is done, no regression issue.

Measurement of SyS_munmap() execution time:
  size        pristine        patched        delta
  80GB       5008377 us      4905841 us       -2%
  160GB      9129243 us      9145306 us       +0.18%
  320GB      17915310 us     17990174 us      +0.42%

Throughput of page faults (#/s) with vm-scalability:
                    pristine         patched         delta
mmap-pread-seq       554894           563517         +1.6%
mmap-pread-seq-mt    581232           580772         -0.079%
mmap-xread-seq-mt    99182            105400         +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf
--timeout 600s
        pristine         patched          delta
         100165           108396          +8.2%


There are 8 patches in this series.
1/8:
  Introduce “atomic” parameter and define do_munmap_range(), modify
  do_munmap() to call do_munmap() to unmap memory by section
2/8 - 6/8:
  modify do_munmap() call sites in mm/mmap.c, mm/mremap.c,
  fs/proc/vmcore.c, ipc/shm.c and mm/nommu.c to adopt "atomic" parameter
7/8 - 8/8:
  modify the do_munmap() call sites in arch/x86 to adopt "atomic" parameter


Yang Shi (8):
      mm: mmap: unmap large mapping by section
      mm: mmap: pass atomic parameter to do_munmap() call sites
      mm: mremap: pass atomic parameter to do_munmap()
      mm: nommu: add atomic parameter to do_munmap()
      ipc: shm: pass atomic parameter to do_munmap()
      fs: proc/vmcore: pass atomic parameter to do_munmap()
      x86: mpx: pass atomic parameter to do_munmap()
      x86: vma: pass atomic parameter to do_munmap()

 arch/x86/entry/vdso/vma.c |  2 +-
 arch/x86/mm/mpx.c         |  2 +-
 fs/proc/vmcore.c          |  4 ++--
 include/linux/mm.h        |  2 +-
 ipc/shm.c                 |  9 ++++++---
 mm/mmap.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++------
 mm/mremap.c               | 10 ++++++----
 mm/nommu.c                |  5 +++--
 8 files changed, 62 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-21 13:08   ` Michal Hocko
  2018-03-21 13:14   ` Michal Hocko
  2018-03-20 21:31 ` [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites Yang Shi
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem from very beginning to all the way
down to the end, and doesn't release it in the middle. When unmapping
large mapping, it may take long time (take ~18 seconds to unmap 320GB
mapping with every single page mapped on an idle machine).

Since unmapping does't require any atomicity, so here unmap large
mapping (> HPAGE_PUD_SIZE) section by section, and release mmap_sem for
unmapping every HPAGE_PUD_SIZE if mmap_sem is contended and the call
path is fine to be interrupted controlled by "atomic", newly added
parameter to do_munmap(). "false" means it is fine to do unlock/relock
to mmap_sem in the middle.

Not only munmap may benefit from this change, but also
mremap/shm since they all call do_munmap() to do the real work.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Measurement of SyS_munmap() execution time:
  size        pristine        patched        delta
  80GB       5008377 us      4905841 us       -2%
  160GB      9129243 us      9145306 us       +0.18%
  320GB      17915310 us     17990174 us      +0.42%

Throughput of page faults (#/s) with vm-scalability:
                    pristine         patched         delta
mmap-pread-seq       554894           563517         +1.6%
mmap-pread-seq-mt    581232           580772         -0.079%
mmap-xread-seq-mt    99182            105400         +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf
--timeout 600s
        pristine         patched          delta
         100165           108396          +8.2%

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 40 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..2e447d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2212,7 +2212,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
 	struct list_head *uf);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
-		     struct list_head *uf);
+		     struct list_head *uf, bool atomic);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021..ad6ae7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2632,8 +2632,8 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-	      struct list_head *uf)
+static int do_munmap_range(struct mm_struct *mm, unsigned long start,
+			   size_t len, struct list_head *uf)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
@@ -2733,6 +2733,42 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf, bool atomic)
+{
+	int ret = 0;
+	size_t step = HPAGE_PUD_SIZE;
+
+	/*
+	 * unmap large mapping (> huge pud size) section by section
+	 * in order to give mmap_sem waiters a chance to acquire it.
+	 */
+	if (len <= step)
+		ret = do_munmap_range(mm, start, len, uf);
+	else {
+		do {
+			ret = do_munmap_range(mm, start, step, uf);
+			if (ret < 0)
+				break;
+
+			if (rwsem_is_contended(&mm->mmap_sem) && !atomic &&
+			    need_resched()) {
+				VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+				up_write(&mm->mmap_sem);
+				cond_resched();
+				down_write(&mm->mmap_sem);
+			}
+
+			start += step;
+			len -= step;
+			if (len <= step)
+				step = len;
+		} while (len > 0);
+	}
+
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap() Yang Shi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

It looks safe to release mmap_sem in the middle for vm_munmap and brk,
so passing "false" to do_munmap() call.
However it sounds not safe to mmap_region() which is called by
SyS_mmap().

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ad6ae7a..374e4ec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -225,7 +225,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 
 	/* Always allow shrinking brk. */
 	if (brk <= mm->brk) {
-		if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf))
+		if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf, false))
 			goto set_brk;
 		goto out;
 	}
@@ -1643,7 +1643,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Clear old maps */
 	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
 			      &rb_parent)) {
-		if (do_munmap(mm, addr, len, uf))
+		if (do_munmap(mm, addr, len, uf, true))
 			return -ENOMEM;
 	}
 
@@ -2778,7 +2778,7 @@ int vm_munmap(unsigned long start, size_t len)
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 
-	ret = do_munmap(mm, start, len, &uf);
+	ret = do_munmap(mm, start, len, &uf, false);
 	up_write(&mm->mmap_sem);
 	userfaultfd_unmap_complete(mm, &uf);
 	return ret;
@@ -2945,7 +2945,7 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
 	 */
 	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
 			      &rb_parent)) {
-		if (do_munmap(mm, addr, len, uf))
+		if (do_munmap(mm, addr, len, uf, false))
 			return -ENOMEM;
 	}
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap()
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 4/8] mm: nommu: add " Yang Shi
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

It sounds safe to do unlock/relock to mmap_sem in mremap, so passing
"false" here.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mremap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470a..5f8fca4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -353,7 +353,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (unlikely(vma->vm_flags & VM_PFNMAP))
 		untrack_pfn_moved(vma);
 
-	if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
+	if (do_munmap(mm, old_addr, old_len, uf_unmap, false) < 0) {
 		/* OOM: unable to split vma, just get accounts right */
 		vm_unacct_memory(excess >> PAGE_SHIFT);
 		excess = 0;
@@ -462,12 +462,13 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (addr + old_len > new_addr && new_addr + new_len > addr)
 		goto out;
 
-	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
+	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early, false);
 	if (ret)
 		goto out;
 
 	if (old_len >= new_len) {
-		ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
+		ret = do_munmap(mm, addr+new_len, old_len - new_len,
+				uf_unmap, false);
 		if (ret && old_len != new_len)
 			goto out;
 		old_len = new_len;
@@ -568,7 +569,8 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
 	 * do_munmap does all the needed commit accounting
 	 */
 	if (old_len >= new_len) {
-		ret = do_munmap(mm, addr+new_len, old_len - new_len, &uf_unmap);
+		ret = do_munmap(mm, addr+new_len, old_len - new_len,
+				&uf_unmap, false);
 		if (ret && old_len != new_len)
 			goto out;
 		ret = addr;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 4/8] mm: nommu: add atomic parameter to do_munmap()
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
                   ` (2 preceding siblings ...)
  2018-03-20 21:31 ` [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap() Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 5/8] ipc: shm: pass " Yang Shi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

Just add atomic parameter to keep consistent with the API change and
pass "true" to the call site. Nommu code doesn't do the mmap_sem
unlock/relock.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/nommu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index ebb6e61..5954c08 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1578,7 +1578,8 @@ static int shrink_vma(struct mm_struct *mm,
  * - under NOMMU conditions the chunk to be unmapped must be backed by a single
  *   VMA, though it need not cover the whole VMA
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf)
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *ufi, bool atomic)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
@@ -1644,7 +1645,7 @@ int vm_munmap(unsigned long addr, size_t len)
 	int ret;
 
 	down_write(&mm->mmap_sem);
-	ret = do_munmap(mm, addr, len, NULL);
+	ret = do_munmap(mm, addr, len, NULL, true);
 	up_write(&mm->mmap_sem);
 	return ret;
 }
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 5/8] ipc: shm: pass atomic parameter to do_munmap()
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
                   ` (3 preceding siblings ...)
  2018-03-20 21:31 ` [RFC PATCH 4/8] mm: nommu: add " Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 6/8] fs: proc/vmcore: " Yang Shi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

It looks safe to do unlock/relock mmap_sem in the middle of shmat(), so
passing "false" here.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 ipc/shm.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 4643865..1617523 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1537,7 +1537,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			 */
 			file = vma->vm_file;
 			size = i_size_read(file_inode(vma->vm_file));
-			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+				  NULL, false);
 			/*
 			 * We discovered the size of the shm segment, so
 			 * break out of here and fall through to the next
@@ -1564,7 +1565,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 		if ((vma->vm_ops == &shm_vm_ops) &&
 		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
 		    (vma->vm_file == file))
-			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+				  NULL, false);
 		vma = next;
 	}
 
@@ -1573,7 +1575,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	 * given
 	 */
 	if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
-		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+			  NULL, false);
 		retval = 0;
 	}
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 6/8] fs: proc/vmcore: pass atomic parameter to do_munmap()
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
                   ` (4 preceding siblings ...)
  2018-03-20 21:31 ` [RFC PATCH 5/8] ipc: shm: pass " Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 7/8] x86: mpx: " Yang Shi
  2018-03-20 21:31 ` [RFC PATCH 8/8] x86: vma: " Yang Shi
  7 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

Just pass "true" here since vmcore map is not a hot path there is not
too much gain to release mmap_sem in the middle.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 fs/proc/vmcore.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af..02683eb 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -388,7 +388,7 @@ static int remap_oldmem_pfn_checked(struct vm_area_struct *vma,
 	}
 	return 0;
 fail:
-	do_munmap(vma->vm_mm, from, len, NULL);
+	do_munmap(vma->vm_mm, from, len, NULL, true);
 	return -EAGAIN;
 }
 
@@ -481,7 +481,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 
 	return 0;
 fail:
-	do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
+	do_munmap(vma->vm_mm, vma->vm_start, len, NULL, true);
 	return -EAGAIN;
 }
 #else
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
                   ` (5 preceding siblings ...)
  2018-03-20 21:31 ` [RFC PATCH 6/8] fs: proc/vmcore: " Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 22:35   ` Thomas Gleixner
  2018-03-20 21:31 ` [RFC PATCH 8/8] x86: vma: " Yang Shi
  7 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm
  Cc: yang.shi, linux-mm, linux-kernel, Ricardo Neri, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86

Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
manipulating mpx map.

This is API change only.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: x86@kernel.org
---
 arch/x86/mm/mpx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index e500949..a180e94 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
 	 * avoid recursion, do_munmap() will check whether it comes
 	 * from one bounds table through VM_MPX flag.
 	 */
-	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
+	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);
 }
 
 static int try_unmap_single_bt(struct mm_struct *mm,
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC PATCH 8/8] x86: vma: pass atomic parameter to do_munmap()
  2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
                   ` (6 preceding siblings ...)
  2018-03-20 21:31 ` [RFC PATCH 7/8] x86: mpx: " Yang Shi
@ 2018-03-20 21:31 ` Yang Shi
  7 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm
  Cc: yang.shi, linux-mm, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

It sounds unnecessary to release mmap_sem in the middle of unmmaping
vdso map.

This is API change only.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
---
 arch/x86/entry/vdso/vma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 5b8b556..6359ae5 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -192,7 +192,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
-		do_munmap(mm, text_start, image->size, NULL);
+		do_munmap(mm, text_start, image->size, NULL, true);
 	} else {
 		current->mm->context.vdso = (void __user *)text_start;
 		current->mm->context.vdso_image = image;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` [RFC PATCH 7/8] x86: mpx: " Yang Shi
@ 2018-03-20 22:35   ` Thomas Gleixner
  2018-03-21 16:53     ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2018-03-20 22:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: akpm, linux-mm, linux-kernel, Ricardo Neri, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, x86

On Wed, 21 Mar 2018, Yang Shi wrote:

Please CC everyone involved on the full patch set next time. I had to dig
the rest out from my lkml archive to get the context.

> Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
> manipulating mpx map.

> This is API change only.

This is wrong. You cannot change the function in one patch and then clean
up the users. That breaks bisectability.

Depending on the number of callers this wants to be a single patch changing
both the function and the callers or you need to create a new function
which has the extra argument and switch all users over to it and then
remove the old function.

> @@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
>  	 * avoid recursion, do_munmap() will check whether it comes
>  	 * from one bounds table through VM_MPX flag.
>  	 */
> -	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
> +	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);

But looking at the full context this is the wrong approach.

First of all the name of that parameter 'atomic' is completely
misleading. It suggests that this happens in fully atomic context, which is
not the case.

Secondly, conditional locking is frowned upon in general and rightfully so.

So the right thing to do is to leave do_munmap() alone and add a new
function do_munmap_huge() or whatever sensible name you come up with. Then
convert the places which are considered to be safe one by one with a proper
changelog which explains WHY this is safe.

That way you avoid the chasing game of all existing do_munmap() callers and
just use the new 'free in chunks' approach where it is appropriate and
safe. No suprises, no bisectability issues....

While at it please add proper kernel doc documentation to both do_munmap()
and the new function which explains the intricacies.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
@ 2018-03-21 13:08   ` Michal Hocko
  2018-03-21 16:31     ` Yang Shi
  2018-03-21 13:14   ` Michal Hocko
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2018-03-21 13:08 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 05:31:19, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).

Yes, this definitely sucks. One way to work that around is to split the
unmap to two phases. One to drop all the pages. That would only need
mmap_sem for read and then tear down the mapping with the mmap_sem for
write. This wouldn't help for parallel mmap_sem writers but those really
need a different approach (e.g. the range locking).

> Since unmapping does't require any atomicity, so here unmap large

How come? Could you be more specific why? Once you drop the lock the
address space might change under your feet and you might be unmapping a
completely different vma. That would require userspace doing nasty
things of course (e.g. MAP_FIXED) but I am worried that userspace really
depends on mmap/munmap atomicity these days.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
  2018-03-21 13:08   ` Michal Hocko
@ 2018-03-21 13:14   ` Michal Hocko
  2018-03-21 16:50     ` Yang Shi
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2018-03-21 13:14 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 05:31:19, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0

Slightly off-topic:
Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
of
	arg_start = mm->arg_start;
	arg_end = mm->arg_end;
	env_start = mm->env_start;
	env_end = mm->env_end;

change after exec or while the pid is already visible in proc? If yes
maybe we can use a dedicated lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 13:08   ` Michal Hocko
@ 2018-03-21 16:31     ` Yang Shi
  2018-03-21 17:29       ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-21 16:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 6:08 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
> Yes, this definitely sucks. One way to work that around is to split the
> unmap to two phases. One to drop all the pages. That would only need
> mmap_sem for read and then tear down the mapping with the mmap_sem for
> write. This wouldn't help for parallel mmap_sem writers but those really
> need a different approach (e.g. the range locking).

page fault might sneak in to map a page which has been unmapped before?

range locking should help a lot on manipulating small sections of a 
large mapping in parallel or multiple small mappings. It may not achieve 
too much for single large mapping.

>
>> Since unmapping does't require any atomicity, so here unmap large
> How come? Could you be more specific why? Once you drop the lock the
> address space might change under your feet and you might be unmapping a
> completely different vma. That would require userspace doing nasty
> things of course (e.g. MAP_FIXED) but I am worried that userspace really
> depends on mmap/munmap atomicity these days.

Sorry for the ambiguity. The statement does look misleading. munmap does 
need certain atomicity, particularly for the below sequence:

splitting vma
unmap region
free pagetables
free vmas

Otherwise it may run into the below race condition:

           CPU A                             CPU B
        ----------                     ----------
        do_munmap
      zap_pmd_range
        up_write                        do_munmap
                                             down_write
                                             ......
                                             remove_vma_list
                                             up_write
       down_write
      access vmas  <-- use-after-free bug

This is why I do the range unmap in do_munmap() rather than doing it in 
deeper location, i.e. zap_pmd_range(). I elaborated this in the cover 
letter.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 13:14   ` Michal Hocko
@ 2018-03-21 16:50     ` Yang Shi
  2018-03-21 17:16       ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-21 16:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 6:14 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> Slightly off-topic:
> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> of
> 	arg_start = mm->arg_start;
> 	arg_end = mm->arg_end;
> 	env_start = mm->env_start;
> 	env_end = mm->env_end;
>
> change after exec or while the pid is already visible in proc? If yes
> maybe we can use a dedicated lock.

Actually, Alexey Dobriyan had the same comment when he reviewed my very 
first patch (which changes down_read to down_read_killable at that place).

Those 4 values might be changed by prctl_set_mm() and prctl_set_mm_map() 
concurrently. They used to use down_read() to protect the change, but it 
looks not good enough to protect concurrent writing. So, Mateusz Guzik's 
commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem 
for writing to protect against others") change it to down_write().

It seems mmap_sem can be replaced to a dedicated lock. How about 
defining a rwlock in mm_struct to protect those data? I will come up 
with a RFC patch for this.

However, this dedicated lock just can work around this specific case. I 
believe solving mmap_sem scalability issue aimed by the patch series is 
still our consensus.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()
  2018-03-20 22:35   ` Thomas Gleixner
@ 2018-03-21 16:53     ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-21 16:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: akpm, linux-mm, linux-kernel, Ricardo Neri, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, x86



On 3/20/18 3:35 PM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Yang Shi wrote:
>
> Please CC everyone involved on the full patch set next time. I had to dig
> the rest out from my lkml archive to get the context.

Sorry for the inconvenience. Will pay attention to it next time.

>
>> Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
>> manipulating mpx map.
>> This is API change only.
> This is wrong. You cannot change the function in one patch and then clean
> up the users. That breaks bisectability.
>
> Depending on the number of callers this wants to be a single patch changing
> both the function and the callers or you need to create a new function
> which has the extra argument and switch all users over to it and then
> remove the old function.
>
>> @@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
>>   	 * avoid recursion, do_munmap() will check whether it comes
>>   	 * from one bounds table through VM_MPX flag.
>>   	 */
>> -	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
>> +	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);
> But looking at the full context this is the wrong approach.
>
> First of all the name of that parameter 'atomic' is completely
> misleading. It suggests that this happens in fully atomic context, which is
> not the case.
>
> Secondly, conditional locking is frowned upon in general and rightfully so.
>
> So the right thing to do is to leave do_munmap() alone and add a new
> function do_munmap_huge() or whatever sensible name you come up with. Then
> convert the places which are considered to be safe one by one with a proper
> changelog which explains WHY this is safe.
>
> That way you avoid the chasing game of all existing do_munmap() callers and
> just use the new 'free in chunks' approach where it is appropriate and
> safe. No suprises, no bisectability issues....
>
> While at it please add proper kernel doc documentation to both do_munmap()
> and the new function which explains the intricacies.

Thanks a lot for the suggestion. Absolutely agree. Will fix the problems 
in newer version.

Yang

>
> Thanks,
>
> 	tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 16:50     ` Yang Shi
@ 2018-03-21 17:16       ` Yang Shi
  2018-03-21 21:23         ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-21 17:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 9:50 AM, Yang Shi wrote:
>
>
> On 3/21/18 6:14 AM, Michal Hocko wrote:
>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>> 300GB), the below hung task issue may happen occasionally.
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>>   ps              D    0 14018      1 0x00000004
>>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>   Call Trace:
>>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> Slightly off-topic:
>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>> of
>>     arg_start = mm->arg_start;
>>     arg_end = mm->arg_end;
>>     env_start = mm->env_start;
>>     env_end = mm->env_end;
>>
>> change after exec or while the pid is already visible in proc? If yes
>> maybe we can use a dedicated lock.

BTW, this is not the only place to acquire mmap_sem in 
proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire 
mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.

Yang

>
> Actually, Alexey Dobriyan had the same comment when he reviewed my 
> very first patch (which changes down_read to down_read_killable at 
> that place).
>
> Those 4 values might be changed by prctl_set_mm() and 
> prctl_set_mm_map() concurrently. They used to use down_read() to 
> protect the change, but it looks not good enough to protect concurrent 
> writing. So, Mateusz Guzik's commit 
> ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for 
> writing to protect against others") change it to down_write().
>
> It seems mmap_sem can be replaced to a dedicated lock. How about 
> defining a rwlock in mm_struct to protect those data? I will come up 
> with a RFC patch for this.
>
> However, this dedicated lock just can work around this specific case. 
> I believe solving mmap_sem scalability issue aimed by the patch series 
> is still our consensus.
>
> Thanks,
> Yang
>
>
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 16:31     ` Yang Shi
@ 2018-03-21 17:29       ` Matthew Wilcox
  2018-03-21 21:45         ` Yang Shi
                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-21 17:29 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> On 3/21/18 6:08 AM, Michal Hocko wrote:
> > Yes, this definitely sucks. One way to work that around is to split the
> > unmap to two phases. One to drop all the pages. That would only need
> > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > write. This wouldn't help for parallel mmap_sem writers but those really
> > need a different approach (e.g. the range locking).
> 
> page fault might sneak in to map a page which has been unmapped before?
> 
> range locking should help a lot on manipulating small sections of a large
> mapping in parallel or multiple small mappings. It may not achieve too much
> for single large mapping.

I don't think we need range locking.  What if we do munmap this way:

Take the mmap_sem for write
Find the VMA
  If the VMA is large(*)
    Mark the VMA as deleted
    Drop the mmap_sem
    zap all of the entries
    Take the mmap_sem
  Else
    zap all of the entries
Continue finding VMAs
Drop the mmap_sem

Now we need to change everywhere which looks up a VMA to see if it needs
to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
does not care; munmap will need to wait for the existing munmap operation
to complete), but it gives us the atomicity, at least on a per-VMA basis.

We could also do:

Take the mmap_sem for write
Mark all VMAs in the range as deleted & modify any partial VMAs
Drop mmap_sem
zap pages from deleted VMAs

That would give us the same atomicity that we have today.

Deleted VMAs would need a pointer to a completion, so operations that
need to wait can queue themselves up.  I'd recommend we use the low bit
of vm_file and treat it as a pointer to a struct completion if set.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:16       ` Yang Shi
@ 2018-03-21 21:23         ` Michal Hocko
  2018-03-21 22:36           ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2018-03-21 21:23 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 10:16:41, Yang Shi wrote:
> 
> 
> On 3/21/18 9:50 AM, Yang Shi wrote:
> > 
> > 
> > On 3/21/18 6:14 AM, Michal Hocko wrote:
> > > On Wed 21-03-18 05:31:19, Yang Shi wrote:
> > > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > > 300GB), the below hung task issue may happen occasionally.
> > > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > >         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> > > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > message.
> > > >   ps              D    0 14018      1 0x00000004
> > > >    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > > >    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > > >    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > > >   Call Trace:
> > > >    [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > > >    [<ffffffff817159e6>] schedule+0x36/0x80
> > > >    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > > >    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > > >    [<ffffffff81717db0>] down_read+0x20/0x40
> > > >    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > > Slightly off-topic:
> > > Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> > > of
> > >     arg_start = mm->arg_start;
> > >     arg_end = mm->arg_end;
> > >     env_start = mm->env_start;
> > >     env_end = mm->env_end;
> > > 
> > > change after exec or while the pid is already visible in proc? If yes
> > > maybe we can use a dedicated lock.
> 
> BTW, this is not the only place to acquire mmap_sem in
> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.

Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
to remove that. munmap should perform much better. How to do that safely
is a different question. I am not yet convinced that tearing down a vma
in batches is safe. The vast majority of time is spent on tearing down
pages and that is quite easy to move out of the write lock. That would
be an improvement already and it should be risk safe. If even that is
not sufficient then using range locking should help a lot. There
shouldn't be really any other address space operations within the range
most of the time so this would be basically non-contended access.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:29       ` Matthew Wilcox
@ 2018-03-21 21:45         ` Yang Shi
  2018-03-21 22:15           ` Matthew Wilcox
  2018-03-21 22:46           ` Matthew Wilcox
  2018-03-22 17:34         ` Yang Shi
  2018-03-24 18:24         ` Jerome Glisse
  2 siblings, 2 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-21 21:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>> Yes, this definitely sucks. One way to work that around is to split the
>>> unmap to two phases. One to drop all the pages. That would only need
>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>> need a different approach (e.g. the range locking).
>> page fault might sneak in to map a page which has been unmapped before?
>>
>> range locking should help a lot on manipulating small sections of a large
>> mapping in parallel or multiple small mappings. It may not achieve too much
>> for single large mapping.
> I don't think we need range locking.  What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
>    If the VMA is large(*)
>      Mark the VMA as deleted
>      Drop the mmap_sem
>      zap all of the entries
>      Take the mmap_sem
>    Else
>      zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap

Marking vma as deleted sounds good. The problem for my current approach 
is the concurrent page fault may succeed if it access the not yet 
unmapped section. Marking deleted vma could tell page fault the vma is 
not valid anymore, then return SIGSEGV.

> does not care; munmap will need to wait for the existing munmap operation

Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

Thanks,
Yang

> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>
> We could also do:
>
> Take the mmap_sem for write
> Mark all VMAs in the range as deleted & modify any partial VMAs
> Drop mmap_sem
> zap pages from deleted VMAs
>
> That would give us the same atomicity that we have today.
>
> Deleted VMAs would need a pointer to a completion, so operations that
> need to wait can queue themselves up.  I'd recommend we use the low bit
> of vm_file and treat it as a pointer to a struct completion if set.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 21:45         ` Yang Shi
@ 2018-03-21 22:15           ` Matthew Wilcox
  2018-03-21 22:40             ` Yang Shi
  2018-03-21 22:46           ` Matthew Wilcox
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-21 22:15 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> > > On 3/21/18 6:08 AM, Michal Hocko wrote:
> > > > Yes, this definitely sucks. One way to work that around is to split the
> > > > unmap to two phases. One to drop all the pages. That would only need
> > > > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > > > write. This wouldn't help for parallel mmap_sem writers but those really
> > > > need a different approach (e.g. the range locking).
> > > page fault might sneak in to map a page which has been unmapped before?
> > > 
> > > range locking should help a lot on manipulating small sections of a large
> > > mapping in parallel or multiple small mappings. It may not achieve too much
> > > for single large mapping.
> > I don't think we need range locking.  What if we do munmap this way:
> > 
> > Take the mmap_sem for write
> > Find the VMA
> >    If the VMA is large(*)
> >      Mark the VMA as deleted
> >      Drop the mmap_sem
> >      zap all of the entries
> >      Take the mmap_sem
> >    Else
> >      zap all of the entries
> > Continue finding VMAs
> > Drop the mmap_sem
> > 
> > Now we need to change everywhere which looks up a VMA to see if it needs
> > to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> 
> Marking vma as deleted sounds good. The problem for my current approach is
> the concurrent page fault may succeed if it access the not yet unmapped
> section. Marking deleted vma could tell page fault the vma is not valid
> anymore, then return SIGSEGV.
> 
> > does not care; munmap will need to wait for the existing munmap operation
> 
> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

Oh, I forgot about MAP_FIXED.  Yes, MAP_FIXED should wait for the munmap
to finish.  But a regular mmap can just pretend that it happened before
the munmap call and avoid the deleted VMAs.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 21:23         ` Michal Hocko
@ 2018-03-21 22:36           ` Yang Shi
  2018-03-22  9:10             ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-21 22:36 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 2:23 PM, Michal Hocko wrote:
> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>
>> On 3/21/18 9:50 AM, Yang Shi wrote:
>>>
>>> On 3/21/18 6:14 AM, Michal Hocko wrote:
>>>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>>>> 300GB), the below hung task issue may happen occasionally.
>>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>>          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>>>> message.
>>>>>    ps              D    0 14018      1 0x00000004
>>>>>     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>>     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>>     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>>    Call Trace:
>>>>>     [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>>     [<ffffffff817159e6>] schedule+0x36/0x80
>>>>>     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>>     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>>     [<ffffffff81717db0>] down_read+0x20/0x40
>>>>>     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>> Slightly off-topic:
>>>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>>>> of
>>>>      arg_start = mm->arg_start;
>>>>      arg_end = mm->arg_end;
>>>>      env_start = mm->env_start;
>>>>      env_end = mm->env_end;
>>>>
>>>> change after exec or while the pid is already visible in proc? If yes
>>>> maybe we can use a dedicated lock.
>> BTW, this is not the only place to acquire mmap_sem in
>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> to remove that. munmap should perform much better. How to do that safely

Yes, agree. We are on the same page.

> is a different question. I am not yet convinced that tearing down a vma
> in batches is safe. The vast majority of time is spent on tearing down

You can try my patches. I did full LTP test and running multiple kernel 
build in parallel. It survives.

> pages and that is quite easy to move out of the write lock. That would
> be an improvement already and it should be risk safe. If even that is
> not sufficient then using range locking should help a lot. There
> shouldn't be really any other address space operations within the range
> most of the time so this would be basically non-contended access.

It might depend on how the range is defined. Too big range may lead to 
surprisingly more contention, but too small range may bring in too much 
lock/unlock operations.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 22:15           ` Matthew Wilcox
@ 2018-03-21 22:40             ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-21 22:40 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/21/18 3:15 PM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>>>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>>>> Yes, this definitely sucks. One way to work that around is to split the
>>>>> unmap to two phases. One to drop all the pages. That would only need
>>>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>>>> need a different approach (e.g. the range locking).
>>>> page fault might sneak in to map a page which has been unmapped before?
>>>>
>>>> range locking should help a lot on manipulating small sections of a large
>>>> mapping in parallel or multiple small mappings. It may not achieve too much
>>>> for single large mapping.
>>> I don't think we need range locking.  What if we do munmap this way:
>>>
>>> Take the mmap_sem for write
>>> Find the VMA
>>>     If the VMA is large(*)
>>>       Mark the VMA as deleted
>>>       Drop the mmap_sem
>>>       zap all of the entries
>>>       Take the mmap_sem
>>>     Else
>>>       zap all of the entries
>>> Continue finding VMAs
>>> Drop the mmap_sem
>>>
>>> Now we need to change everywhere which looks up a VMA to see if it needs
>>> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
>> Marking vma as deleted sounds good. The problem for my current approach is
>> the concurrent page fault may succeed if it access the not yet unmapped
>> section. Marking deleted vma could tell page fault the vma is not valid
>> anymore, then return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> Oh, I forgot about MAP_FIXED.  Yes, MAP_FIXED should wait for the munmap
> to finish.  But a regular mmap can just pretend that it happened before
> the munmap call and avoid the deleted VMAs.

But, my test shows race condition for reduced size mmap which calls 
do_munmap(). It may need wait for the munmap finish too.

So, in my patches, I just make the do_munmap() called from mmap() hold 
mmap_sem all the time.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 21:45         ` Yang Shi
  2018-03-21 22:15           ` Matthew Wilcox
@ 2018-03-21 22:46           ` Matthew Wilcox
  2018-03-22 15:32             ` Laurent Dufour
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-21 22:46 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> Marking vma as deleted sounds good. The problem for my current approach is
> the concurrent page fault may succeed if it access the not yet unmapped
> section. Marking deleted vma could tell page fault the vma is not valid
> anymore, then return SIGSEGV.
> 
> > does not care; munmap will need to wait for the existing munmap operation
> 
> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

The other thing about MAP_FIXED that we'll need to handle is unmapping
conflicts atomically.  Say a program has a 200GB mapping and then
mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
are also going to have to wait for deleted vmas (then retry the fault)
rather than immediately raising SIGSEGV.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 22:36           ` Yang Shi
@ 2018-03-22  9:10             ` Michal Hocko
  2018-03-22 16:06               ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2018-03-22  9:10 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 15:36:12, Yang Shi wrote:
> 
> 
> On 3/21/18 2:23 PM, Michal Hocko wrote:
> > On Wed 21-03-18 10:16:41, Yang Shi wrote:
> > > 
> > > On 3/21/18 9:50 AM, Yang Shi wrote:
> > > > 
> > > > On 3/21/18 6:14 AM, Michal Hocko wrote:
> > > > > On Wed 21-03-18 05:31:19, Yang Shi wrote:
> > > > > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > > > > 300GB), the below hung task issue may happen occasionally.
> > > > > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > > > >          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> > > > > >    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > > > message.
> > > > > >    ps              D    0 14018      1 0x00000004
> > > > > >     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > > > > >     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > > > > >     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > > > > >    Call Trace:
> > > > > >     [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > > > > >     [<ffffffff817159e6>] schedule+0x36/0x80
> > > > > >     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > > > > >     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > > > > >     [<ffffffff81717db0>] down_read+0x20/0x40
> > > > > >     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > > > > Slightly off-topic:
> > > > > Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> > > > > of
> > > > >      arg_start = mm->arg_start;
> > > > >      arg_end = mm->arg_end;
> > > > >      env_start = mm->env_start;
> > > > >      env_end = mm->env_end;
> > > > > 
> > > > > change after exec or while the pid is already visible in proc? If yes
> > > > > maybe we can use a dedicated lock.
> > > BTW, this is not the only place to acquire mmap_sem in
> > > proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> > > mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> > Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> > to remove that. munmap should perform much better. How to do that safely
> 
> Yes, agree. We are on the same page.
> 
> > is a different question. I am not yet convinced that tearing down a vma
> > in batches is safe. The vast majority of time is spent on tearing down
> 
> You can try my patches. I did full LTP test and running multiple kernel
> build in parallel. It survives.

Which doesn't really mean anything. Those tests are likely to not hit
corner cases where an application silently depends on the mmap locking
and unmap atomicity.
 
> > pages and that is quite easy to move out of the write lock. That would
> > be an improvement already and it should be risk safe. If even that is
> > not sufficient then using range locking should help a lot. There
> > shouldn't be really any other address space operations within the range
> > most of the time so this would be basically non-contended access.
> 
> It might depend on how the range is defined. Too big range may lead to
> surprisingly more contention, but too small range may bring in too much
> lock/unlock operations.

The full vma will have to be range locked. So there is nothing small or large.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 22:46           ` Matthew Wilcox
@ 2018-03-22 15:32             ` Laurent Dufour
  2018-03-22 15:40               ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Laurent Dufour @ 2018-03-22 15:32 UTC (permalink / raw)
  To: Matthew Wilcox, Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 21/03/2018 23:46, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> Marking vma as deleted sounds good. The problem for my current approach is
>> the concurrent page fault may succeed if it access the not yet unmapped
>> section. Marking deleted vma could tell page fault the vma is not valid
>> anymore, then return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>>
>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> 
> The other thing about MAP_FIXED that we'll need to handle is unmapping
> conflicts atomically.  Say a program has a 200GB mapping and then
> mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
> are also going to have to wait for deleted vmas (then retry the fault)
> rather than immediately raising SIGSEGV.

Regarding the page fault, why not relying on the PTE locking ?

When munmap() will unset the PTE it will have to held the PTE lock, so this
will serialize the access.
If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 15:32             ` Laurent Dufour
@ 2018-03-22 15:40               ` Matthew Wilcox
  2018-03-22 15:54                 ` Laurent Dufour
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-22 15:40 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
> On 21/03/2018 23:46, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> >> Marking vma as deleted sounds good. The problem for my current approach is
> >> the concurrent page fault may succeed if it access the not yet unmapped
> >> section. Marking deleted vma could tell page fault the vma is not valid
> >> anymore, then return SIGSEGV.
> >>
> >>> does not care; munmap will need to wait for the existing munmap operation
> >>
> >> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> > 
> > The other thing about MAP_FIXED that we'll need to handle is unmapping
> > conflicts atomically.  Say a program has a 200GB mapping and then
> > mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
> > are also going to have to wait for deleted vmas (then retry the fault)
> > rather than immediately raising SIGSEGV.
> 
> Regarding the page fault, why not relying on the PTE locking ?
> 
> When munmap() will unset the PTE it will have to held the PTE lock, so this
> will serialize the access.
> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.

The page fault handler will walk the VMA tree to find the correct
VMA and then find that the VMA is marked as deleted.  If it assumes
that the VMA has been deleted because of munmap(), then it can raise
SIGSEGV immediately.  But if the VMA is marked as deleted because of
mmap(MAP_FIXED), it must wait until the new VMA is in place.

I think I was wrong to describe VMAs as being *deleted*.  I think we
instead need the concept of a *locked* VMA that page faults will block on.
Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
an rwsem since the only reason to write-lock the VMA is because it is
being deleted.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 15:40               ` Matthew Wilcox
@ 2018-03-22 15:54                 ` Laurent Dufour
  2018-03-22 16:05                   ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Laurent Dufour @ 2018-03-22 15:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel



On 22/03/2018 16:40, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>> On 21/03/2018 23:46, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>>>> Marking vma as deleted sounds good. The problem for my current approach is
>>>> the concurrent page fault may succeed if it access the not yet unmapped
>>>> section. Marking deleted vma could tell page fault the vma is not valid
>>>> anymore, then return SIGSEGV.
>>>>
>>>>> does not care; munmap will need to wait for the existing munmap operation
>>>>
>>>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
>>>
>>> The other thing about MAP_FIXED that we'll need to handle is unmapping
>>> conflicts atomically.  Say a program has a 200GB mapping and then
>>> mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
>>> are also going to have to wait for deleted vmas (then retry the fault)
>>> rather than immediately raising SIGSEGV.
>>
>> Regarding the page fault, why not relying on the PTE locking ?
>>
>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>> will serialize the access.
>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
> 
> The page fault handler will walk the VMA tree to find the correct
> VMA and then find that the VMA is marked as deleted.  If it assumes
> that the VMA has been deleted because of munmap(), then it can raise
> SIGSEGV immediately.  But if the VMA is marked as deleted because of
> mmap(MAP_FIXED), it must wait until the new VMA is in place.

I'm wondering if such a complexity is required.
If the user space process try to access the page being overwritten through
mmap(MAP_FIXED) by another thread, there is no guarantee that it will
manipulate the *old* page or *new* one.
I'd think this is up to the user process to handle that concurrency.
What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
are no more there, which is done through the mmap_sem and PTE locking.

> I think I was wrong to describe VMAs as being *deleted*.  I think we
> instead need the concept of a *locked* VMA that page faults will block on.
> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> an rwsem since the only reason to write-lock the VMA is because it is
> being deleted.

Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
the VMA is removed there is no need to wait. Isn't it ?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 15:54                 ` Laurent Dufour
@ 2018-03-22 16:05                   ` Matthew Wilcox
  2018-03-22 16:18                     ` Laurent Dufour
  2018-03-22 16:49                     ` Yang Shi
  0 siblings, 2 replies; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-22 16:05 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
> On 22/03/2018 16:40, Matthew Wilcox wrote:
> > On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
> >> Regarding the page fault, why not relying on the PTE locking ?
> >>
> >> When munmap() will unset the PTE it will have to held the PTE lock, so this
> >> will serialize the access.
> >> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
> >> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
> > 
> > The page fault handler will walk the VMA tree to find the correct
> > VMA and then find that the VMA is marked as deleted.  If it assumes
> > that the VMA has been deleted because of munmap(), then it can raise
> > SIGSEGV immediately.  But if the VMA is marked as deleted because of
> > mmap(MAP_FIXED), it must wait until the new VMA is in place.
> 
> I'm wondering if such a complexity is required.
> If the user space process try to access the page being overwritten through
> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
> manipulate the *old* page or *new* one.

Right; but it must return one or the other, it can't segfault.

> I'd think this is up to the user process to handle that concurrency.
> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
> are no more there, which is done through the mmap_sem and PTE locking.

Yes, and allowing the fault handler to return the *old* page risks the
old page being reinserted into the page tables after the unmapping task
has done its work.

It's *really* rare to page-fault on a VMA which is in the middle of
being replaced.  Why are you trying to optimise it?

> > I think I was wrong to describe VMAs as being *deleted*.  I think we
> > instead need the concept of a *locked* VMA that page faults will block on.
> > Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> > an rwsem since the only reason to write-lock the VMA is because it is
> > being deleted.
> 
> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
> the VMA is removed there is no need to wait. Isn't it ?

I can't think of another reason.  I suppose we could mark the VMA as
locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
use of our time.  Just always have the pagefault sleep for a deleted VMA.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22  9:10             ` Michal Hocko
@ 2018-03-22 16:06               ` Yang Shi
  2018-03-22 16:12                 ` Michal Hocko
  2018-03-22 16:13                 ` Matthew Wilcox
  0 siblings, 2 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-22 16:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/22/18 2:10 AM, Michal Hocko wrote:
> On Wed 21-03-18 15:36:12, Yang Shi wrote:
>>
>> On 3/21/18 2:23 PM, Michal Hocko wrote:
>>> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>>> On 3/21/18 9:50 AM, Yang Shi wrote:
>>>>> On 3/21/18 6:14 AM, Michal Hocko wrote:
>>>>>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>>>>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>>>>>> 300GB), the below hung task issue may happen occasionally.
>>>>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>>>>           Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>>>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>>>>>> message.
>>>>>>>     ps              D    0 14018      1 0x00000004
>>>>>>>      ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>>>>      ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>>>>      00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>>>>     Call Trace:
>>>>>>>      [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>>>>      [<ffffffff817159e6>] schedule+0x36/0x80
>>>>>>>      [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>>>>      [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>>>>      [<ffffffff81717db0>] down_read+0x20/0x40
>>>>>>>      [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>>>> Slightly off-topic:
>>>>>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>>>>>> of
>>>>>>       arg_start = mm->arg_start;
>>>>>>       arg_end = mm->arg_end;
>>>>>>       env_start = mm->env_start;
>>>>>>       env_end = mm->env_end;
>>>>>>
>>>>>> change after exec or while the pid is already visible in proc? If yes
>>>>>> maybe we can use a dedicated lock.
>>>> BTW, this is not the only place to acquire mmap_sem in
>>>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>>>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
>>> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
>>> to remove that. munmap should perform much better. How to do that safely
>> Yes, agree. We are on the same page.
>>
>>> is a different question. I am not yet convinced that tearing down a vma
>>> in batches is safe. The vast majority of time is spent on tearing down
>> You can try my patches. I did full LTP test and running multiple kernel
>> build in parallel. It survives.
> Which doesn't really mean anything. Those tests are likely to not hit
> corner cases where an application silently depends on the mmap locking
> and unmap atomicity.

They definitely can't cover all corner cases. But, they do give us 
somehow confidence that the most part works. The mmap stress tests in 
LTP did discover some race conditions when I tried different approaches.

>   
>>> pages and that is quite easy to move out of the write lock. That would
>>> be an improvement already and it should be risk safe. If even that is
>>> not sufficient then using range locking should help a lot. There
>>> shouldn't be really any other address space operations within the range
>>> most of the time so this would be basically non-contended access.
>> It might depend on how the range is defined. Too big range may lead to
>> surprisingly more contention, but too small range may bring in too much
>> lock/unlock operations.
> The full vma will have to be range locked. So there is nothing small or large.

It sounds not helpful to a single large vma case since just one range 
lock for the vma, it sounds equal to mmap_sem.

Yang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:06               ` Yang Shi
@ 2018-03-22 16:12                 ` Michal Hocko
  2018-03-22 16:13                 ` Matthew Wilcox
  1 sibling, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2018-03-22 16:12 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Thu 22-03-18 09:06:14, Yang Shi wrote:
> 
> 
> On 3/22/18 2:10 AM, Michal Hocko wrote:
> > On Wed 21-03-18 15:36:12, Yang Shi wrote:
> > > 
> > > On 3/21/18 2:23 PM, Michal Hocko wrote:
[...]
> > > > pages and that is quite easy to move out of the write lock. That would
> > > > be an improvement already and it should be risk safe. If even that is
> > > > not sufficient then using range locking should help a lot. There
> > > > shouldn't be really any other address space operations within the range
> > > > most of the time so this would be basically non-contended access.
> > > It might depend on how the range is defined. Too big range may lead to
> > > surprisingly more contention, but too small range may bring in too much
> > > lock/unlock operations.
> > The full vma will have to be range locked. So there is nothing small or large.
> 
> It sounds not helpful to a single large vma case since just one range lock
> for the vma, it sounds equal to mmap_sem.

This is not how the range locking works. If we have a range lock per mm
then exclusive ranges are not contending. So if you are unmapping one
vma and want to create a new mapping or fault into a different range
then you are basically lockless.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:06               ` Yang Shi
  2018-03-22 16:12                 ` Michal Hocko
@ 2018-03-22 16:13                 ` Matthew Wilcox
  2018-03-22 16:28                   ` Laurent Dufour
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-22 16:13 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
> On 3/22/18 2:10 AM, Michal Hocko wrote:
> > On Wed 21-03-18 15:36:12, Yang Shi wrote:
> > > On 3/21/18 2:23 PM, Michal Hocko wrote:
> > > > On Wed 21-03-18 10:16:41, Yang Shi wrote:
> > > > > proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> > > > > mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> > > > Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> > > > to remove that. munmap should perform much better. How to do that safely
> > The full vma will have to be range locked. So there is nothing small or large.
> 
> It sounds not helpful to a single large vma case since just one range lock
> for the vma, it sounds equal to mmap_sem.

But splitting mmap_sem into pieces is beneficial for this case.  Imagine
we have a spinlock / rwlock to protect the rbtree / arg_start / arg_end
/ ...  and then each VMA has a rwsem (or equivalent).  access_remote_vm()
would walk the tree and grab the VMA's rwsem for read while reading
out the arguments.  The munmap code would have a completely different
VMA write-locked.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:05                   ` Matthew Wilcox
@ 2018-03-22 16:18                     ` Laurent Dufour
  2018-03-22 16:46                       ` Yang Shi
  2018-03-22 16:51                       ` Matthew Wilcox
  2018-03-22 16:49                     ` Yang Shi
  1 sibling, 2 replies; 41+ messages in thread
From: Laurent Dufour @ 2018-03-22 16:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel



On 22/03/2018 17:05, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>
>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>
>> I'm wondering if such a complexity is required.
>> If the user space process try to access the page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or *new* one.
> 
> Right; but it must return one or the other, it can't segfault.

Good point, I missed that...

> 
>> I'd think this is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>> are no more there, which is done through the mmap_sem and PTE locking.
> 
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.

The PTE locking should prevent that.

> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced.  Why are you trying to optimise it?

I was not trying to optimize it, but to not wait in the page fault handler.
This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
done and before the waiting page fault got woken up. This means that the
removed VMA structure will have to remain until all the waiters are woken up
which implies ref_count or similar.

> 
>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>>
>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>> the VMA is removed there is no need to wait. Isn't it ?
> 
> I can't think of another reason.  I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time.  Just always have the pagefault sleep for a deleted VMA.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:13                 ` Matthew Wilcox
@ 2018-03-22 16:28                   ` Laurent Dufour
  2018-03-22 16:36                     ` David Laight
  0 siblings, 1 reply; 41+ messages in thread
From: Laurent Dufour @ 2018-03-22 16:28 UTC (permalink / raw)
  To: Matthew Wilcox, Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On 22/03/2018 17:13, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
>> On 3/22/18 2:10 AM, Michal Hocko wrote:
>>> On Wed 21-03-18 15:36:12, Yang Shi wrote:
>>>> On 3/21/18 2:23 PM, Michal Hocko wrote:
>>>>> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>>>>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>>>>>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
>>>>> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
>>>>> to remove that. munmap should perform much better. How to do that safely
>>> The full vma will have to be range locked. So there is nothing small or large.
>>
>> It sounds not helpful to a single large vma case since just one range lock
>> for the vma, it sounds equal to mmap_sem.
> 
> But splitting mmap_sem into pieces is beneficial for this case.  Imagine
> we have a spinlock / rwlock to protect the rbtree 

Which is more or less what I'm proposing in the speculative page fault series:
https://lkml.org/lkml/2018/3/13/1158

This being said, having a per VMA lock could lead to tricky dead lock case,
when merging multiple VMA happens in parallel since multiple VMA will have to
be locked at the same time, grabbing those lock in a fine order will be required.

> ... / arg_start / arg_end
> / ...  and then each VMA has a rwsem (or equivalent).  access_remote_vm()
> would walk the tree and grab the VMA's rwsem for read while reading
> out the arguments.  The munmap code would have a completely different
> VMA write-locked.
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:28                   ` Laurent Dufour
@ 2018-03-22 16:36                     ` David Laight
  0 siblings, 0 replies; 41+ messages in thread
From: David Laight @ 2018-03-22 16:36 UTC (permalink / raw)
  To: 'Laurent Dufour', Matthew Wilcox, Yang Shi
  Cc: Michal Hocko, akpm, linux-mm, linux-kernel

From:  Laurent Dufour
> Sent: 22 March 2018 16:29
...
> This being said, having a per VMA lock could lead to tricky dead lock case,
> when merging multiple VMA happens in parallel since multiple VMA will have to
> be locked at the same time, grabbing those lock in a fine order will be required.

You could have a global lock and per VMA locks.
Anything that only accesses one VMA could release the global lock after
acquiring the per VMA lock.
If code needs multiple VMA 'locked' it can lock and unlock each VMA
in turn, then keep the global lock held.

	David

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:18                     ` Laurent Dufour
@ 2018-03-22 16:46                       ` Yang Shi
  2018-03-23 13:03                         ` Laurent Dufour
  2018-03-22 16:51                       ` Matthew Wilcox
  1 sibling, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-22 16:46 UTC (permalink / raw)
  To: Laurent Dufour, Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/22/18 9:18 AM, Laurent Dufour wrote:
>
> On 22/03/2018 17:05, Matthew Wilcox wrote:
>> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>>
>>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>>> will serialize the access.
>>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>> The page fault handler will walk the VMA tree to find the correct
>>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>>> that the VMA has been deleted because of munmap(), then it can raise
>>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>> I'm wondering if such a complexity is required.
>>> If the user space process try to access the page being overwritten through
>>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>>> manipulate the *old* page or *new* one.
>> Right; but it must return one or the other, it can't segfault.
> Good point, I missed that...
>
>>> I'd think this is up to the user process to handle that concurrency.
>>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>>> are no more there, which is done through the mmap_sem and PTE locking.
>> Yes, and allowing the fault handler to return the *old* page risks the
>> old page being reinserted into the page tables after the unmapping task
>> has done its work.
> The PTE locking should prevent that.
>
>> It's *really* rare to page-fault on a VMA which is in the middle of
>> being replaced.  Why are you trying to optimise it?
> I was not trying to optimize it, but to not wait in the page fault handler.
> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
> done and before the waiting page fault got woken up. This means that the
> removed VMA structure will have to remain until all the waiters are woken up
> which implies ref_count or similar.

We may not need ref_count. After removing "locked-for-deletion" vmas 
when mmap(MAP_FIXED) is done, just wake up page fault to re-lookup vma, 
then it will find the new vma installed by mmap(MAP_FIXED), right?

I'm not sure if completion can do this or not since I'm not quite 
familiar with it :-(

Yang

>
>>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>>> instead need the concept of a *locked* VMA that page faults will block on.
>>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>>> an rwsem since the only reason to write-lock the VMA is because it is
>>>> being deleted.
>>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>>> the VMA is removed there is no need to wait. Isn't it ?
>> I can't think of another reason.  I suppose we could mark the VMA as
>> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
>> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
>> use of our time.  Just always have the pagefault sleep for a deleted VMA.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:05                   ` Matthew Wilcox
  2018-03-22 16:18                     ` Laurent Dufour
@ 2018-03-22 16:49                     ` Yang Shi
  1 sibling, 0 replies; 41+ messages in thread
From: Yang Shi @ 2018-03-22 16:49 UTC (permalink / raw)
  To: Matthew Wilcox, Laurent Dufour; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/22/18 9:05 AM, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>
>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>> I'm wondering if such a complexity is required.
>> If the user space process try to access the page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or *new* one.
> Right; but it must return one or the other, it can't segfault.
>
>> I'd think this is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>> are no more there, which is done through the mmap_sem and PTE locking.
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.
>
> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced.  Why are you trying to optimise it?
>
>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>> the VMA is removed there is no need to wait. Isn't it ?
> I can't think of another reason.  I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time.  Just always have the pagefault sleep for a deleted VMA.

It sounds worth to me. If we have every page fault sleep to wait for vma 
deletion is done, it sounds equal to wait for mmap_sem all the time, right?

Yang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:18                     ` Laurent Dufour
  2018-03-22 16:46                       ` Yang Shi
@ 2018-03-22 16:51                       ` Matthew Wilcox
  1 sibling, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-22 16:51 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 05:18:55PM +0100, Laurent Dufour wrote:
> > It's *really* rare to page-fault on a VMA which is in the middle of
> > being replaced.  Why are you trying to optimise it?
> 
> I was not trying to optimize it, but to not wait in the page fault handler.
> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
> done and before the waiting page fault got woken up. This means that the
> removed VMA structure will have to remain until all the waiters are woken up
> which implies ref_count or similar.

Yes, that's why we don't want an actual rwsem.  What I had in mind was
a struct completion on the stack of the caller of munmap(), and a pointer
to it from the vma.  The page fault handler grabs the VMA tree lock, walks
the VMA tree and finds a VMA.  If the VMA is marked as locked, it waits
for the completion.  Upon wakeup *it does not look at the VMA*, instead it
restarts the page fault.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:29       ` Matthew Wilcox
  2018-03-21 21:45         ` Yang Shi
@ 2018-03-22 17:34         ` Yang Shi
  2018-03-22 18:48           ` Matthew Wilcox
  2018-03-24 18:24         ` Jerome Glisse
  2 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2018-03-22 17:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>> Yes, this definitely sucks. One way to work that around is to split the
>>> unmap to two phases. One to drop all the pages. That would only need
>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>> need a different approach (e.g. the range locking).
>> page fault might sneak in to map a page which has been unmapped before?
>>
>> range locking should help a lot on manipulating small sections of a large
>> mapping in parallel or multiple small mappings. It may not achieve too much
>> for single large mapping.
> I don't think we need range locking.  What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
>    If the VMA is large(*)
>      Mark the VMA as deleted
>      Drop the mmap_sem
>      zap all of the entries
>      Take the mmap_sem
>    Else
>      zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> does not care; munmap will need to wait for the existing munmap operation

The other question is why munmap need wait? If the other parallel munmap 
finds the vma has been marked as "deleted", it just need return 0 as it 
doesn't find vma.

Currently do_munmap() does the below logic:
     vma = find_vma(mm, start);
     if (!vma)
         return 0;

Yang

> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>
> We could also do:
>
> Take the mmap_sem for write
> Mark all VMAs in the range as deleted & modify any partial VMAs
> Drop mmap_sem
> zap pages from deleted VMAs
>
> That would give us the same atomicity that we have today.
>
> Deleted VMAs would need a pointer to a completion, so operations that
> need to wait can queue themselves up.  I'd recommend we use the low bit
> of vm_file and treat it as a pointer to a struct completion if set.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 17:34         ` Yang Shi
@ 2018-03-22 18:48           ` Matthew Wilcox
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2018-03-22 18:48 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 10:34:08AM -0700, Yang Shi wrote:
> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> > Take the mmap_sem for write
> > Find the VMA
> >    If the VMA is large(*)
> >      Mark the VMA as deleted
> >      Drop the mmap_sem
> >      zap all of the entries
> >      Take the mmap_sem
> >    Else
> >      zap all of the entries
> > Continue finding VMAs
> > Drop the mmap_sem
> > 
> > Now we need to change everywhere which looks up a VMA to see if it needs
> > to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> > does not care; munmap will need to wait for the existing munmap operation
> 
> The other question is why munmap need wait? If the other parallel munmap
> finds the vma has been marked as "deleted", it just need return 0 as it
> doesn't find vma.
> 
> Currently do_munmap() does the below logic:
>     vma = find_vma(mm, start);
>     if (!vma)
>         return 0;

At the point a munmap() returns, the area should be available for reuse.
If another thread is still unmapping, it won't actually be available yet.
We should wait for the other thread to finish before returning.

There may be some other corner cases; like what to do if there's a partial
unmap of a VMA, or a MAP_FIXED over part of an existing VMA.  It's going
to be safer to just wait for any conflicts to die down.  It's not like
real programs call munmap() on conflicting areas at the same time.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:46                       ` Yang Shi
@ 2018-03-23 13:03                         ` Laurent Dufour
  0 siblings, 0 replies; 41+ messages in thread
From: Laurent Dufour @ 2018-03-23 13:03 UTC (permalink / raw)
  To: Yang Shi, Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On 22/03/2018 17:46, Yang Shi wrote:
> 
> 
> On 3/22/18 9:18 AM, Laurent Dufour wrote:
>>
>> On 22/03/2018 17:05, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>>>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>>>
>>>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>>>> will serialize the access.
>>>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>>> The page fault handler will walk the VMA tree to find the correct
>>>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>>>> that the VMA has been deleted because of munmap(), then it can raise
>>>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>>> I'm wondering if such a complexity is required.
>>>> If the user space process try to access the page being overwritten through
>>>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>>>> manipulate the *old* page or *new* one.
>>> Right; but it must return one or the other, it can't segfault.
>> Good point, I missed that...
>>
>>>> I'd think this is up to the user process to handle that concurrency.
>>>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>>>> are no more there, which is done through the mmap_sem and PTE locking.
>>> Yes, and allowing the fault handler to return the *old* page risks the
>>> old page being reinserted into the page tables after the unmapping task
>>> has done its work.
>> The PTE locking should prevent that.
>>
>>> It's *really* rare to page-fault on a VMA which is in the middle of
>>> being replaced.  Why are you trying to optimise it?
>> I was not trying to optimize it, but to not wait in the page fault handler.
>> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
>> done and before the waiting page fault got woken up. This means that the
>> removed VMA structure will have to remain until all the waiters are woken up
>> which implies ref_count or similar.
> 
> We may not need ref_count. After removing "locked-for-deletion" vmas when
> mmap(MAP_FIXED) is done, just wake up page fault to re-lookup vma, then it will
> find the new vma installed by mmap(MAP_FIXED), right?

I do agree, as far as waking up would not require access to the VMA.

> I'm not sure if completion can do this or not since I'm not quite familiar with
> it :-(

I don't know either :/

Laurent.

> Yang
> 
>>
>>>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>>>> instead need the concept of a *locked* VMA that page faults will block on.
>>>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>>>> an rwsem since the only reason to write-lock the VMA is because it is
>>>>> being deleted.
>>>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>>>> the VMA is removed there is no need to wait. Isn't it ?
>>> I can't think of another reason.  I suppose we could mark the VMA as
>>> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
>>> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
>>> use of our time.  Just always have the pagefault sleep for a deleted VMA.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:29       ` Matthew Wilcox
  2018-03-21 21:45         ` Yang Shi
  2018-03-22 17:34         ` Yang Shi
@ 2018-03-24 18:24         ` Jerome Glisse
  2 siblings, 0 replies; 41+ messages in thread
From: Jerome Glisse @ 2018-03-24 18:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel, Laurent Dufour

On Wed, Mar 21, 2018 at 10:29:32AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> > On 3/21/18 6:08 AM, Michal Hocko wrote:
> > > Yes, this definitely sucks. One way to work that around is to split the
> > > unmap to two phases. One to drop all the pages. That would only need
> > > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > > write. This wouldn't help for parallel mmap_sem writers but those really
> > > need a different approach (e.g. the range locking).
> > 
> > page fault might sneak in to map a page which has been unmapped before?
> > 
> > range locking should help a lot on manipulating small sections of a large
> > mapping in parallel or multiple small mappings. It may not achieve too much
> > for single large mapping.
> 
> I don't think we need range locking.  What if we do munmap this way:
> 
> Take the mmap_sem for write
> Find the VMA
>   If the VMA is large(*)
>     Mark the VMA as deleted
>     Drop the mmap_sem
>     zap all of the entries
>     Take the mmap_sem
>   Else
>     zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
> 
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> does not care; munmap will need to wait for the existing munmap operation
> to complete), but it gives us the atomicity, at least on a per-VMA basis.
> 

What about something that should fix all issues:
    struct list_head to_free_puds;
    ...
    down_write(&mm->mmap_sem);
    ...
    unmap_vmas(&tlb, vma, start, end, &to_free_puds);
    arch_unmap(mm, vma, start, end);
    /* Fix up all other VM information */
    remove_vma_list(mm, vma);
    ...
    up_write(&mm->mmap_sem);
    ...
    zap_pud_list(rss_update_info, to_free_puds);
    update_rss(rss_update_info)

We collect pud that need to be free/zap we update the page table PUD
entry to pud_none under the write sem and CPU page table lock, add the
pud to the list that need zapping. We only collect pud fully cover,
and usual business for partialy covered pud.

Everything behave as today except that we do not free memory. Care
must be take with the anon vma and we should probably not free the
vma struct either before the zap but all other mm struct can be
updated. The rss_counter would also to be updated post zap pud.

We would need special code to zap pud list, no need to take lock or
do special arch tlb flushing, ptep_get_clear, ... when walking down
those puds. So it should scale a lot better too.

Did i miss something ?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2018-03-24 18:24 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
2018-03-21 13:08   ` Michal Hocko
2018-03-21 16:31     ` Yang Shi
2018-03-21 17:29       ` Matthew Wilcox
2018-03-21 21:45         ` Yang Shi
2018-03-21 22:15           ` Matthew Wilcox
2018-03-21 22:40             ` Yang Shi
2018-03-21 22:46           ` Matthew Wilcox
2018-03-22 15:32             ` Laurent Dufour
2018-03-22 15:40               ` Matthew Wilcox
2018-03-22 15:54                 ` Laurent Dufour
2018-03-22 16:05                   ` Matthew Wilcox
2018-03-22 16:18                     ` Laurent Dufour
2018-03-22 16:46                       ` Yang Shi
2018-03-23 13:03                         ` Laurent Dufour
2018-03-22 16:51                       ` Matthew Wilcox
2018-03-22 16:49                     ` Yang Shi
2018-03-22 17:34         ` Yang Shi
2018-03-22 18:48           ` Matthew Wilcox
2018-03-24 18:24         ` Jerome Glisse
2018-03-21 13:14   ` Michal Hocko
2018-03-21 16:50     ` Yang Shi
2018-03-21 17:16       ` Yang Shi
2018-03-21 21:23         ` Michal Hocko
2018-03-21 22:36           ` Yang Shi
2018-03-22  9:10             ` Michal Hocko
2018-03-22 16:06               ` Yang Shi
2018-03-22 16:12                 ` Michal Hocko
2018-03-22 16:13                 ` Matthew Wilcox
2018-03-22 16:28                   ` Laurent Dufour
2018-03-22 16:36                     ` David Laight
2018-03-20 21:31 ` [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites Yang Shi
2018-03-20 21:31 ` [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap() Yang Shi
2018-03-20 21:31 ` [RFC PATCH 4/8] mm: nommu: add " Yang Shi
2018-03-20 21:31 ` [RFC PATCH 5/8] ipc: shm: pass " Yang Shi
2018-03-20 21:31 ` [RFC PATCH 6/8] fs: proc/vmcore: " Yang Shi
2018-03-20 21:31 ` [RFC PATCH 7/8] x86: mpx: " Yang Shi
2018-03-20 22:35   ` Thomas Gleixner
2018-03-21 16:53     ` Yang Shi
2018-03-20 21:31 ` [RFC PATCH 8/8] x86: vma: " Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).