LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/9] Free user PTE page table pages
@ 2021-08-19  3:18 Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
                   ` (9 more replies)
  0 siblings, 10 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The beginning of this story is that some malloc libraries(e.g. jemalloc or
tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those VAs.
They will use madvise(MADV_DONTNEED) to free physical memory if they want.
But the page tables do not be freed by madvise(), so it can produce many
page tables when the process touches an enormous virtual address space.

The following figures are a memory usage snapshot of one process which actually
happened on our server:

	VIRT:  55t
	RES:   590g
	VmPTE: 110g

As we can see, the PTE page tables size is 110g, while the RES is 590g. In
theory, the process only need 1.2g PTE page tables to map those physical
memory. The reason why PTE page tables occupy a lot of memory is that
madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
doesn't free the PTE page table pages. So we can free those empty PTE page
tables to save memory. In the above cases, we can save memory about 108g(best
case). And the larger the difference between the size of VIRT and RES, the
more memory we save.

In this patch series, we add a pte_refcount field to the struct page of page
table to track how many users of PTE page table. Similar to the mechanism of
page refcount, the user of PTE page table should hold a refcount to it before
accessing. The PTE page table page will be freed when the last refcount is
dropped.

Testing:

The following code snippet can show the effect of optimization:

	mmap 50G
	while (1) {
		for (; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}

As we can see, the memory usage of VmPTE is reduced:

			before		                after
VIRT		       50.0 GB			      50.0 GB
RES		        3.1 MB			       3.6 MB
VmPTE		     102640 kB			       248 kB

I also have tested the stability by LTP[1] for several weeks. I have not seen
any crash so far.

The performance of page fault can be affected because of the allocation/freeing
of PTE page table pages. The following is the test result by using a micro
benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads         before (pf/min)                     after (pf/min)
    1                32,085,255                         31,880,833 (-0.64%)
    8               101,674,967                        100,588,311 (-1.17%)
   16               113,207,000                        112,801,832 (-0.36%)

(The "pfn/min" means how many page faults in one minute.)

The performance of page fault is ~1% slower than before.

This series is based on next-20210812.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/patchwork/comment/296794/

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typo and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
   and [PATCH v2 7/9].

Qi Zheng (9):
  mm: introduce pmd_install() helper
  mm: remove redundant smp_wmb()
  mm: rework the parameter of lock_page_or_retry()
  mm: move pte_alloc{,_map,_map_lock}() to a separate file
  mm: pte_refcount infrastructure
  mm: free user PTE page table pages
  mm: add THP support for pte_ref
  mm: free PTE page table by using rcu mechanism
  mm: use mmu_gather to free PTE page table

 arch/arm/mm/pgd.c             |   1 +
 arch/arm64/mm/hugetlbpage.c   |   1 +
 arch/ia64/mm/hugetlbpage.c    |   1 +
 arch/parisc/mm/hugetlbpage.c  |   1 +
 arch/powerpc/mm/hugetlbpage.c |   1 +
 arch/s390/mm/gmap.c           |   1 +
 arch/s390/mm/pgtable.c        |   1 +
 arch/sh/mm/hugetlbpage.c      |   1 +
 arch/sparc/mm/hugetlbpage.c   |   1 +
 arch/x86/Kconfig              |   2 +-
 fs/proc/task_mmu.c            |  22 +++-
 fs/userfaultfd.c              |   2 +
 include/linux/mm.h            |  12 +-
 include/linux/mm_types.h      |   8 +-
 include/linux/pagemap.h       |   8 +-
 include/linux/pgtable.h       |   3 +-
 include/linux/pte_ref.h       | 300 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/rmap.h          |   4 +-
 kernel/events/uprobes.c       |   3 +
 mm/Kconfig                    |   4 +
 mm/Makefile                   |   3 +-
 mm/filemap.c                  |  57 ++++----
 mm/gup.c                      |   7 +
 mm/hmm.c                      |   4 +
 mm/internal.h                 |   2 +
 mm/khugepaged.c               |   9 ++
 mm/ksm.c                      |   4 +
 mm/madvise.c                  |  18 ++-
 mm/memcontrol.c               |  11 +-
 mm/memory.c                   | 276 ++++++++++++++++++++++++--------------
 mm/mempolicy.c                |   4 +-
 mm/migrate.c                  |  18 +--
 mm/mincore.c                  |   5 +-
 mm/mlock.c                    |   1 +
 mm/mmu_gather.c               |  40 +++---
 mm/mprotect.c                 |  10 +-
 mm/mremap.c                   |  12 +-
 mm/page_vma_mapped.c          |   4 +
 mm/pagewalk.c                 |  16 ++-
 mm/pgtable-generic.c          |   2 +
 mm/pte_ref.c                  | 143 ++++++++++++++++++++
 mm/rmap.c                     |  13 ++
 mm/sparse-vmemmap.c           |   2 +-
 mm/swapfile.c                 |   3 +-
 mm/userfaultfd.c              |  18 ++-
 45 files changed, 849 insertions(+), 210 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 1/9] mm: introduce pmd_install() helper
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-24 16:26   ` David Hildenbrand
  2021-08-19  3:18 ` [PATCH v2 2/9] mm: remove redundant smp_wmb() Qi Zheng
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

Currently we have three times the same few lines repeated in the
code. Deduplicate them by newly introduced pmd_install() helper.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h |  1 +
 mm/filemap.c       | 11 ++---------
 mm/memory.c        | 34 ++++++++++++++++------------------
 3 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce8fc0fd6d6e..57e48217bd71 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2471,6 +2471,7 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
+extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 extern void __init pagecache_init(void);
 extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index 53913fced7ae..9f773059c6dc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3210,15 +3210,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	    }
 	}
 
-	if (pmd_none(*vmf->pmd)) {
-		vmf->ptl = pmd_lock(mm, vmf->pmd);
-		if (likely(pmd_none(*vmf->pmd))) {
-			mm_inc_nr_ptes(mm);
-			pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
-			vmf->prealloc_pte = NULL;
-		}
-		spin_unlock(vmf->ptl);
-	}
+	if (pmd_none(*vmf->pmd))
+		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
 	/* See comment in handle_pte_fault() */
 	if (pmd_devmap_trans_unstable(vmf->pmd)) {
diff --git a/mm/memory.c b/mm/memory.c
index 39e7a1495c3c..ef7b1762e996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -433,9 +433,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+{
+	spinlock_t *ptl = pmd_lock(mm, pmd);
+
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+		mm_inc_nr_ptes(mm);
+		pmd_populate(mm, pmd, *pte);
+		*pte = NULL;
+	}
+	spin_unlock(ptl);
+}
+
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
-	spinlock_t *ptl;
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
@@ -455,13 +466,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 	 */
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
-	ptl = pmd_lock(mm, pmd);
-	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
-		mm_inc_nr_ptes(mm);
-		pmd_populate(mm, pmd, new);
-		new = NULL;
-	}
-	spin_unlock(ptl);
+	pmd_install(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
 	return 0;
@@ -4027,17 +4032,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				return ret;
 		}
 
-		if (vmf->prealloc_pte) {
-			vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-			if (likely(pmd_none(*vmf->pmd))) {
-				mm_inc_nr_ptes(vma->vm_mm);
-				pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
-				vmf->prealloc_pte = NULL;
-			}
-			spin_unlock(vmf->ptl);
-		} else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
+		if (vmf->prealloc_pte)
+			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
 			return VM_FAULT_OOM;
-		}
 	}
 
 	/* See comment in handle_pte_fault() */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 2/9] mm: remove redundant smp_wmb()
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry() Qi Zheng
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

The smp_wmb() which is in the __pte_alloc() is used to
ensure all ptes setup is visible before the pte is made
visible to other CPUs by being put into page tables. We
only need this when the pte is actually populated, so
move it to pte_install(). __pte_alloc_kernel(),
__p4d_alloc(), __pud_alloc() and __pmd_alloc() are similar
to this case.

We can also defer smp_wmb() to the place where the pmd entry
is really populated by preallocated pte. There are two kinds
of user of preallocated pte, one is filemap & finish_fault(),
another is THP. The former does not need another smp_wmb()
because the smp_wmb() has been done by pte_install().
Fortunately, the latter also does not need another smp_wmb()
because there is already a smp_wmb() before populating the
new pte when the THP uses a preallocated pte to split a huge
pmd.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memory.c         | 47 ++++++++++++++++++++---------------------------
 mm/sparse-vmemmap.c |  2 +-
 2 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ef7b1762e996..9c7534187454 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -439,6 +439,20 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm_inc_nr_ptes(mm);
+		/*
+		 * Ensure all pte setup (eg. pte page lock and page clearing) are
+		 * visible before the pte is made visible to other CPUs by being
+		 * put into page tables.
+		 *
+		 * The other side of the story is the pointer chasing in the page
+		 * table walking code (when walking the page table without locking;
+		 * ie. most of the time). Fortunately, these data accesses consist
+		 * of a chain of data-dependent loads, meaning most CPUs (alpha
+		 * being the notable exception) will already guarantee loads are
+		 * seen in-order. See the alpha page table accessors for the
+		 * smp_rmb() barriers in page table walking code.
+		 */
+		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 		pmd_populate(mm, pmd, *pte);
 		*pte = NULL;
 	}
@@ -451,21 +465,6 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 	if (!new)
 		return -ENOMEM;
 
-	/*
-	 * Ensure all pte setup (eg. pte page lock and page clearing) are
-	 * visible before the pte is made visible to other CPUs by being
-	 * put into page tables.
-	 *
-	 * The other side of the story is the pointer chasing in the page
-	 * table walking code (when walking the page table without locking;
-	 * ie. most of the time). Fortunately, these data accesses consist
-	 * of a chain of data-dependent loads, meaning most CPUs (alpha
-	 * being the notable exception) will already guarantee loads are
-	 * seen in-order. See the alpha page table accessors for the
-	 * smp_rmb() barriers in page table walking code.
-	 */
-	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
-
 	pmd_install(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
@@ -478,10 +477,9 @@ int __pte_alloc_kernel(pmd_t *pmd)
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+		smp_wmb(); /* See comment in pmd_install() */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
 	}
@@ -3857,7 +3855,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
 	ret = vma->vm_ops->fault(vmf);
@@ -3919,7 +3916,6 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -4144,7 +4140,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
@@ -4819,13 +4814,13 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	spin_lock(&mm->page_table_lock);
 	if (pgd_present(*pgd))		/* Another has populated it */
 		p4d_free(mm, new);
-	else
+	else {
+		smp_wmb(); /* See comment in pmd_install() */
 		pgd_populate(mm, pgd, new);
+	}
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
@@ -4842,11 +4837,10 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	spin_lock(&mm->page_table_lock);
 	if (!p4d_present(*p4d)) {
 		mm_inc_nr_puds(mm);
+		smp_wmb(); /* See comment in pmd_install() */
 		p4d_populate(mm, p4d, new);
 	} else	/* Another has populated it */
 		pud_free(mm, new);
@@ -4867,11 +4861,10 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	ptl = pud_lock(mm, pud);
 	if (!pud_present(*pud)) {
 		mm_inc_nr_pmds(mm);
+		smp_wmb(); /* See comment in pmd_install() */
 		pud_populate(mm, pud, new);
 	} else	/* Another has populated it */
 		pmd_free(mm, new);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bdce883f9286..db6df27c852a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
 
-	/* Make pte visible before pmd. See comment in __pte_alloc(). */
+	/* Make pte visible before pmd. See comment in pmd_install(). */
 	smp_wmb();
 	pmd_populate_kernel(&init_mm, pmd, pgtable);
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry()
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 2/9] mm: remove redundant smp_wmb() Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file Qi Zheng
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

we need the vmf in lock_page_or_retry() in the subsequent patch,
so pass in it directly.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pagemap.h | 8 +++-----
 mm/filemap.c            | 6 ++++--
 mm/memory.c             | 4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c36292181a21..dbb4dd6bb608 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -710,8 +710,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-				unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -773,14 +772,13 @@ static inline int lock_page_killable(struct page *page)
  * Return value and mmap_lock implications depend on flags; see
  * __folio_lock_or_retry().
  */
-static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
-				     unsigned int flags)
+static inline bool lock_page_or_retry(struct page *page, struct vm_fault *vmf)
 {
 	struct folio *folio;
 	might_sleep();
 
 	folio = page_folio(page);
-	return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+	return folio_trylock(folio) || __folio_lock_or_retry(folio, vmf);
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f773059c6dc..eeac0e119cf5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1688,9 +1688,11 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return 1
  * with the folio locked and the mmap_lock unperturbed.
  */
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-			 unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 {
+	unsigned int flags = vmf->flags;
+	struct mm_struct *mm = vmf->vma->vm_mm;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index 9c7534187454..d2aaa85b840c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3451,7 +3451,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
 
-	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vmf))
 		return VM_FAULT_RETRY;
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
 				vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3584,7 +3584,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+	locked = lock_page_or_retry(page, vmf);
 
 	delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
 	if (!locked) {
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (2 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry() Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 5/9] mm: pte_refcount infrastructure Qi Zheng
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

Subsequent patches will modify pte_alloc{,_map,_map_lock}(),
which are the allocate function related to user PTE page
table pages, so move those to a separate file in advance.

This patch contains no functional changes, only some
preparatory work.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/arm/mm/pgd.c             |  1 +
 arch/arm64/mm/hugetlbpage.c   |  1 +
 arch/ia64/mm/hugetlbpage.c    |  1 +
 arch/parisc/mm/hugetlbpage.c  |  1 +
 arch/powerpc/mm/hugetlbpage.c |  1 +
 arch/s390/mm/gmap.c           |  1 +
 arch/s390/mm/pgtable.c        |  1 +
 arch/sh/mm/hugetlbpage.c      |  1 +
 arch/sparc/mm/hugetlbpage.c   |  1 +
 include/linux/mm.h            | 11 ----------
 include/linux/pte_ref.h       | 29 ++++++++++++++++++++++++++
 mm/Makefile                   |  3 ++-
 mm/internal.h                 |  1 +
 mm/memory.c                   | 38 ----------------------------------
 mm/pte_ref.c                  | 48 +++++++++++++++++++++++++++++++++++++++++++
 15 files changed, 89 insertions(+), 50 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index f8e9bc58a84f..dcac1124e8bd 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -8,6 +8,7 @@
 #include <linux/gfp.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/pte_ref.h>
 
 #include <asm/cp15.h>
 #include <asm/pgalloc.h>
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 23505fc35324..6c8177647053 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -14,6 +14,7 @@
 #include <linux/pagemap.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
+#include <linux/pte_ref.h>
 #include <asm/mman.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index f993cb36c062..d343639db1ad 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -17,6 +17,7 @@
 #include <linux/module.h>
 #include <linux/sysctl.h>
 #include <linux/log2.h>
+#include <linux/pte_ref.h>
 #include <asm/mman.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d1d3990b83f6..4f6044d6b4bd 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -13,6 +13,7 @@
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/sysctl.h>
+#include <linux/pte_ref.h>
 
 #include <asm/mman.h>
 #include <asm/tlb.h>
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9a75ba078e1b..1afb8f552bcf 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -19,6 +19,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
+#include <linux/pte_ref.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/setup.h>
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..c1826d74773a 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -18,6 +18,7 @@
 #include <linux/ksm.h>
 #include <linux/mman.h>
 #include <linux/pgtable.h>
+#include <linux/pte_ref.h>
 
 #include <asm/pgalloc.h>
 #include <asm/gmap.h>
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index eec3a9d7176e..594a6d6888e3 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -18,6 +18,7 @@
 #include <linux/sysctl.h>
 #include <linux/ksm.h>
 #include <linux/mman.h>
+#include <linux/pte_ref.h>
 
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 999ab5916e69..6a593ab248b7 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -15,6 +15,7 @@
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/sysctl.h>
+#include <linux/pte_ref.h>
 
 #include <asm/mman.h>
 #include <asm/tlb.h>
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 0f49fada2093..4b9119825264 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -11,6 +11,7 @@
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/sysctl.h>
+#include <linux/pte_ref.h>
 
 #include <asm/mman.h>
 #include <asm/pgalloc.h>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 57e48217bd71..369d4283de49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2245,7 +2245,6 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
 static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
 int __pte_alloc_kernel(pmd_t *pmd);
 
 #if defined(CONFIG_MMU)
@@ -2371,15 +2370,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
-
-#define pte_alloc_map(mm, pmd, address)			\
-	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
-
-#define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	(pte_alloc(mm, pmd) ?			\
-		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
-
 #define pte_alloc_kernel(pmd, address)			\
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
 		NULL: pte_offset_kernel(pmd, address))
@@ -2471,7 +2461,6 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
-extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 extern void __init pagecache_init(void);
 extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..60b752dd7846
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free user PTE page table pages
+ *
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#include <linux/mm.h>
+#include <linux/pgtable.h>
+#include <asm/pgalloc.h>
+
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
+
+#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+
+#define pte_alloc_map(mm, pmd, address)			\
+	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+
+#define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
+	(pte_alloc(mm, pmd) ?				\
+		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
+
+#endif
+
diff --git a/mm/Makefile b/mm/Makefile
index 970604ea97dd..5cdbfaa2a7b5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,7 +38,8 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o ioremap.o
+			   pgtable-generic.o rmap.o vmalloc.o ioremap.o \
+			   pte_ref.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/internal.h b/mm/internal.h
index b1001ebeb286..c1e23fa563a7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/pte_ref.h>
 
 /*
  * The set of flags that only affect watermark checking and reclaim
diff --git a/mm/memory.c b/mm/memory.c
index d2aaa85b840c..265b841cc7f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -433,44 +433,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
-{
-	spinlock_t *ptl = pmd_lock(mm, pmd);
-
-	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
-		mm_inc_nr_ptes(mm);
-		/*
-		 * Ensure all pte setup (eg. pte page lock and page clearing) are
-		 * visible before the pte is made visible to other CPUs by being
-		 * put into page tables.
-		 *
-		 * The other side of the story is the pointer chasing in the page
-		 * table walking code (when walking the page table without locking;
-		 * ie. most of the time). Fortunately, these data accesses consist
-		 * of a chain of data-dependent loads, meaning most CPUs (alpha
-		 * being the notable exception) will already guarantee loads are
-		 * seen in-order. See the alpha page table accessors for the
-		 * smp_rmb() barriers in page table walking code.
-		 */
-		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
-		pmd_populate(mm, pmd, *pte);
-		*pte = NULL;
-	}
-	spin_unlock(ptl);
-}
-
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
-{
-	pgtable_t new = pte_alloc_one(mm);
-	if (!new)
-		return -ENOMEM;
-
-	pmd_install(mm, pmd, &new);
-	if (new)
-		pte_free(mm, new);
-	return 0;
-}
-
 int __pte_alloc_kernel(pmd_t *pmd)
 {
 	pte_t *new = pte_alloc_one_kernel(&init_mm);
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..07a73b5521cc
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free user PTE page table pages
+ *
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+
+#include <linux/pte_ref.h>
+
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+{
+	spinlock_t *ptl = pmd_lock(mm, pmd);
+
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+		mm_inc_nr_ptes(mm);
+		/*
+		 * Ensure all pte setup (eg. pte page lock and page clearing) are
+		 * visible before the pte is made visible to other CPUs by being
+		 * put into page tables.
+		 *
+		 * The other side of the story is the pointer chasing in the page
+		 * table walking code (when walking the page table without locking;
+		 * ie. most of the time). Fortunately, these data accesses consist
+		 * of a chain of data-dependent loads, meaning most CPUs (alpha
+		 * being the notable exception) will already guarantee loads are
+		 * seen in-order. See the alpha page table accessors for the
+		 * smp_rmb() barriers in page table walking code.
+		 */
+		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+		pmd_populate(mm, pmd, *pte);
+		*pte = NULL;
+	}
+	spin_unlock(ptl);
+}
+
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+{
+	pgtable_t new = pte_alloc_one(mm);
+	if (!new)
+		return -ENOMEM;
+
+	pmd_install(mm, pmd, &new);
+	if (new)
+		pte_free(mm, new);
+	return 0;
+}
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 5/9] mm: pte_refcount infrastructure
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (3 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

In this patch, we add two fields to the struct page of PTE page
table, the pte_refcount is used to track how many users of PTE
page table, the pmd is used to record the pmd entry which is
populated by the PTE. And we introduce some APIs that will be
used in subsequent patches:

The following are some APIs related to ->pte_refcount:

	pte_get{,_many}()
	pte_put{,_many,_vmf}()
	pte_try_get()

And we add pte_alloc_get{,_map,_map_lock}(), the semantics of
these function is to hold a pte_refcount count when it returns 0,
and its caller is responsible for decrease the pte_refcount by
calling pte_put(), like the following pattern:

	--> pte_alloc_get()

	    do something about pte

	    pte_put()

This patch contains no functional changes, just creates some
infrastructure, and we will use these in the next patch.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_types.h |   8 +-
 include/linux/pgtable.h  |   3 +-
 include/linux/pte_ref.h  | 208 ++++++++++++++++++++++++++++++++++++++++++++++-
 mm/Kconfig               |   4 +
 mm/memory.c              |   8 +-
 mm/pte_ref.c             |  59 ++++++++++++--
 mm/sparse-vmemmap.c      |   2 +-
 7 files changed, 279 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b9aa9aa99924..85c5cc4b95b3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -161,11 +161,17 @@ struct page {
 		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
-			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			union {
+				pgtable_t pmd_huge_pte; /* protected by page->ptl */
+				pmd_t *pmd;             /* PTE page only */
+			};
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+#ifdef CONFIG_FREE_USER_PTE
+				atomic_t pte_refcount;  /* PTE page only */
+#endif
 			};
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..c799475bc5c9 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -331,7 +331,6 @@ static inline pte_t ptep_get_lockless(pte_t *ptep)
 }
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long address,
@@ -342,6 +341,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 	return pmd;
 }
 #endif /* __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
 static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long address,
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 60b752dd7846..47aaeac7507e 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -15,6 +15,8 @@
 
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
+void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd);
 
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
@@ -25,5 +27,209 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
 	(pte_alloc(mm, pmd) ?				\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
-#endif
+#ifdef CONFIG_FREE_USER_PTE
+
+void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);
+
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
+	pte->pmd = pmd;
+	atomic_set(&pte->pte_refcount, count);
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+	return virt_to_page(pte)->pmd;
+}
+
+static inline void pte_migrate_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+	pmd_pgtable(old_pmd)->pmd = new_pmd;
+}
+
+static inline void pte_get_many(pmd_t *pmdp, unsigned int nr)
+{
+	pgtable_t pte = pmd_pgtable(*pmdp);
+
+	VM_BUG_ON(!PageTable(pte));
+	atomic_add(nr, &pte->pte_refcount);
+}
+
+/*
+ * pte_get - Increment ->pte_refcount for the PTE page table.
+ * @pmdp: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Similar to the mechanism of page refcount, the user of PTE page table
+ * should hold a refcount to it before accessing.
+ */
+static inline void pte_get(pmd_t *pmdp)
+{
+	pte_get_many(pmdp, 1);
+}
+
+/*
+ * pte_get_unless_zero - Increment ->pte_refcount for the PTE page table
+ *			 unless it is zero.
+ * @pmdp: a pointer to the pmd entry corresponding to the PTE page table.
+ */
+static inline bool pte_get_unless_zero(pmd_t *pmdp)
+{
+	pgtable_t pte = pmd_pgtable(*pmdp);
 
+	VM_BUG_ON(!PageTable(pte));
+	return atomic_inc_not_zero(&pte->pte_refcount);
+}
+
+/*
+ * pte_try_get - Try to increment ->pte_refcount for the PTE page table.
+ * @mm: the mm_struct of the target address space.
+ * @pmdp: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Return true if the increment succeeded. Otherwise return false.
+ *
+ * Before Operating the PTE page table, we need to hold a ->pte_refcount
+ * to protect against the concurrent release of the PTE page table.
+ */
+static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	bool retval = true;
+	spinlock_t *ptl;
+
+	ptl = pmd_lock(mm, pmdp);
+	if (!pmd_present(*pmdp) || !pte_get_unless_zero(pmdp))
+		retval = false;
+	spin_unlock(ptl);
+
+	return retval;
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp,
+				unsigned long addr, unsigned int nr)
+{
+	pgtable_t pte = pmd_pgtable(*pmdp);
+
+	VM_BUG_ON(!PageTable(pte));
+	if (atomic_sub_and_test(nr, &pte->pte_refcount))
+		free_pte_table(mm, pmdp, addr & PMD_MASK);
+}
+
+/*
+ * pte_put - Decrement ->pte_refcount for the PTE page table.
+ * @mm: the mm_struct of the target address space.
+ * @pmdp: a pointer to the pmd entry corresponding to the PTE page table.
+ * @addr: the start address of the tlb range to be flushed.
+ *
+ * The PTE page table page will be freed when the last refcount is dropped.
+ */
+static inline void pte_put(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
+{
+	pte_put_many(mm, pmdp, addr, 1);
+}
+
+/*
+ * pte_put_vmf - Decrement ->pte_refcount for the PTE page table.
+ * @vmf: fault information
+ *
+ * While we access ->pte_refcount of a PTE page table, any of the following
+ * ensures the pmd entry corresponding to the PTE page table stability:
+ * 	- mmap_lock
+ * 	- anon_lock
+ * 	- i_mmap_lock
+ * 	- parallel threads are excluded by other means which
+ * 	  can make ->pmd stable(e.g. gup case)
+ * But the mmap_lock maybe unlocked in advance in some cases in
+ * handle_pte_fault(), so we should ensure the pte_put() is performed
+ * in the critical section of the mmap_lock.
+ */
+static inline void pte_put_vmf(struct vm_fault *vmf)
+{
+	if (!(vmf->flags & FAULT_FLAG_PTE_GET))
+		return;
+	vmf->flags &= ~FAULT_FLAG_PTE_GET;
+
+	pte_put(vmf->vma->vm_mm, vmf->pmd, vmf->address);
+}
+
+/*
+ * pte_alloc_get - allocate a PTE page table if the pmd entry is none, and
+ *		   hold a ->pte_refcount when returning
+ * @mm: the mm_struct of the target address space.
+ * @pmdp: a pointer to the pmd entry corresponding to the PTE page table.
+ */
+static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	spinlock_t *ptl;
+
+	ptl = pmd_lock(mm, pmdp);
+	if (pmd_none(*pmdp) || !pte_get_unless_zero(pmdp)) {
+		spin_unlock(ptl);
+		return __pte_alloc_get(mm, pmdp);
+	}
+	spin_unlock(ptl);
+	return 0;
+}
+
+#define pte_alloc_get_map(mm, pmd, address)			\
+	(pte_alloc_get(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+
+#define pte_alloc_get_map_lock(mm, pmd, address, ptlp)		\
+	(pte_alloc_get(mm, pmd) ?				\
+		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
+#else
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+	return NULL;
+}
+
+static inline void pte_migrate_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+}
+
+static inline void pte_get_many(pmd_t *pmdp, unsigned int nr)
+{
+}
+
+static inline void pte_get(pmd_t *pmdp)
+{
+}
+
+static inline bool pte_get_unless_zero(pmd_t *pmdp)
+{
+	return true;
+}
+
+static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	return true;
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp,
+				unsigned long addr, unsigned int value)
+{
+}
+
+static inline void pte_put(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
+{
+}
+
+static inline void pte_put_vmf(struct vm_fault *vmf)
+{
+}
+
+static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	return pte_alloc(mm, pmdp);
+}
+
+#define pte_alloc_get_map(mm, pmd, address)			\
+	pte_alloc_map(mm, pmd, address)
+
+#define pte_alloc_get_map_lock(mm, pmd, address, ptlp)		\
+	pte_alloc_map_lock(mm, pmd, address, ptlp)
+#endif /* CONFIG_FREE_USER_PTE */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 50ca602edeb6..fb16ecebcb80 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -894,6 +894,10 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config FREE_USER_PTE
+	def_bool y
+	depends on X86_64 && !TRANSPARENT_HUGEPAGE
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/memory.c b/mm/memory.c
index 265b841cc7f9..d1efb868e682 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -441,7 +441,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
 
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
-		smp_wmb(); /* See comment in pmd_install() */
+		smp_wmb(); /* See comment in pmd_install_get() */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
 	}
@@ -4780,7 +4780,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 	if (pgd_present(*pgd))		/* Another has populated it */
 		p4d_free(mm, new);
 	else {
-		smp_wmb(); /* See comment in pmd_install() */
+		smp_wmb(); /* See comment in pmd_install_get() */
 		pgd_populate(mm, pgd, new);
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -4802,7 +4802,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
 	spin_lock(&mm->page_table_lock);
 	if (!p4d_present(*p4d)) {
 		mm_inc_nr_puds(mm);
-		smp_wmb(); /* See comment in pmd_install() */
+		smp_wmb(); /* See comment in pmd_install_get() */
 		p4d_populate(mm, p4d, new);
 	} else	/* Another has populated it */
 		pud_free(mm, new);
@@ -4826,7 +4826,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 	ptl = pud_lock(mm, pud);
 	if (!pud_present(*pud)) {
 		mm_inc_nr_pmds(mm);
-		smp_wmb(); /* See comment in pmd_install() */
+		smp_wmb(); /* See comment in pmd_install_get() */
 		pud_populate(mm, pud, new);
 	} else	/* Another has populated it */
 		pmd_free(mm, new);
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 07a73b5521cc..630704ae26db 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -8,12 +8,47 @@
  */
 
 #include <linux/pte_ref.h>
+#include <linux/hugetlb.h>
+#include <asm/tlbflush.h>
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+#ifdef CONFIG_DEBUG_VM
+static void pte_free_debug(pmd_t pmd)
+{
+	pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PTE; i++)
+		BUG_ON(!pte_none(*ptep++));
+}
+#else
+static inline void pte_free_debug(pmd_t pmd)
+{
+}
+#endif
+
+void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	spinlock_t *ptl;
+	pmd_t pmd;
+
+	ptl = pmd_lock(mm, pmdp);
+	pmd = pmdp_huge_get_and_clear(mm, addr, pmdp);
+	spin_unlock(ptl);
+
+	pte_free_debug(pmd);
+	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	mm_dec_nr_ptes(mm);
+	pte_free(mm, pmd_pgtable(pmd));
+}
+
+void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 {
-	spinlock_t *ptl = pmd_lock(mm, pmd);
+	spinlock_t *ptl;
 
-	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+retry:
+	ptl = pmd_lock(mm, pmd);
+	if (likely(pmd_none(*pmd))) {
 		mm_inc_nr_ptes(mm);
 		/*
 		 * Ensure all pte setup (eg. pte page lock and page clearing) are
@@ -30,19 +65,33 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 		 */
 		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 		pmd_populate(mm, pmd, *pte);
+		pte_ref_init(*pte, pmd, 1);
 		*pte = NULL;
+	} else if (!pte_get_unless_zero(pmd)) {
+		spin_unlock(ptl);
+		goto retry;
 	}
 	spin_unlock(ptl);
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+{
+	pmd_install_get(mm, pmd, pte);
+}
+
+int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
-	pmd_install(mm, pmd, &new);
+	pmd_install_get(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
 	return 0;
 }
+
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+{
+	return __pte_alloc_get(mm, pmd);
+}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index db6df27c852a..818c82ee3c6e 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
 
-	/* Make pte visible before pmd. See comment in pmd_install(). */
+	/* Make pte visible before pmd. See comment in pmd_install_get(). */
 	smp_wmb();
 	pmd_populate_kernel(&init_mm, pmd, pgtable);
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 6/9] mm: free user PTE page table pages
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (4 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 5/9] mm: pte_refcount infrastructure Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  7:01   ` David Hildenbrand
  2021-09-01 13:53   ` Jason Gunthorpe
  2021-08-19  3:18 ` [PATCH v2 7/9] mm: add THP support for pte_ref Qi Zheng
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

Some malloc libraries(e.g. jemalloc or tcmalloc) usually
allocate the amount of VAs by mmap() and do not unmap
those VAs. They will use madvise(MADV_DONTNEED) to free
physical memory if they want. But the page tables do not
be freed by madvise(), so it can produce many page tables
when the process touches an enormous virtual address space.

The following figures are a memory usage snapshot of one
process which actually happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables size is 110g, while the
RES is 590g. In theory, the process only need 1.2g PTE page
tables to map those physical memory. The reason why PTE page
tables occupy a lot of memory is that madvise(MADV_DONTNEED)
only empty the PTE and free physical memory but doesn't free
the PTE page table pages. So we can free those empty PTE page
tables to save memory. In the above cases, we can save memory
about 108g(best case). And the larger the difference between
the size of VIRT and RES, the more memory we save.

In this patch series, we add a pte_refcount field to the
struct page of page table to track how many users of PTE page
table. Similar to the mechanism of page refcount, the user of
PTE page table should hold a refcount to it before accessing.
The PTE page table page will be freed when the last refcount
is dropped.

While we access ->pte_refcount of a PTE page table, any of the
following ensures the pmd entry corresponding to the PTE page
table stability:

	- mmap_lock
	- anon_lock
	- i_mmap_lock
	- parallel threads are excluded by other means which
	  can make ->pmd stable(e.g. gup case)

This patch does not support THP temporarily, it will be
supported in the next patch.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c      |  18 ++++-
 fs/userfaultfd.c        |   2 +
 include/linux/mm.h      |   2 +
 include/linux/pte_ref.h |   1 -
 include/linux/rmap.h    |   4 +-
 kernel/events/uprobes.c |   3 +
 mm/filemap.c            |  43 ++++++-----
 mm/gup.c                |   7 ++
 mm/hmm.c                |   4 +
 mm/internal.h           |   1 +
 mm/ksm.c                |   4 +
 mm/madvise.c            |  17 ++++-
 mm/memcontrol.c         |   8 +-
 mm/memory.c             | 194 +++++++++++++++++++++++++++++++++++++++---------
 mm/mempolicy.c          |   4 +-
 mm/migrate.c            |  15 +++-
 mm/mincore.c            |   5 +-
 mm/mlock.c              |   1 +
 mm/mprotect.c           |  10 ++-
 mm/mremap.c             |  12 ++-
 mm/page_vma_mapped.c    |   4 +
 mm/pagewalk.c           |  16 +++-
 mm/pte_ref.c            |   5 --
 mm/rmap.c               |  13 ++++
 mm/swapfile.c           |   4 +-
 mm/userfaultfd.c        |  14 +++-
 26 files changed, 326 insertions(+), 85 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..85ee730ff6ae 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -574,6 +574,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -582,7 +583,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd))
 		goto out;
 	/*
 	 * The mmap_lock held all the way back in m_start() is what
@@ -593,6 +594,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 out:
 	cond_resched();
 	return 0;
@@ -1121,6 +1123,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -1143,7 +1146,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd))
 		return 0;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -1168,6 +1171,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	cond_resched();
 	return 0;
 }
@@ -1407,6 +1411,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 	pte_t *pte, *orig_pte;
 	int err = 0;
+	unsigned long start = addr;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmdp, vma);
@@ -1475,6 +1480,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		return 0;
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+	if (!pte_try_get(walk->mm, pmdp))
+		return 0;
+
 	/*
 	 * We can assume that @vma always points to a valid one and @end never
 	 * goes beyond vma->vm_end.
@@ -1489,6 +1497,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			break;
 	}
 	pte_unmap_unlock(orig_pte, ptl);
+	pte_put(walk->mm, pmdp, start);
 
 	cond_resched();
 
@@ -1795,6 +1804,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	unsigned long start = addr;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
@@ -1812,6 +1822,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
+	if (!pte_try_get(walk->mm, pmd))
+		return 0;
+
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
@@ -1821,6 +1834,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap_unlock(orig_pte, ptl);
+	pte_put(walk->mm, pmd, start);
 	cond_resched();
 	return 0;
 }
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 003f0d31743e..baa1d98a6f87 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -28,6 +28,7 @@
 #include <linux/ioctl.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/pte_ref.h>
 
 int sysctl_unprivileged_userfaultfd __read_mostly;
 
@@ -509,6 +510,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
 						       vmf->address,
 						       vmf->flags, reason);
+	pte_put_vmf(vmf);
 	mmap_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 369d4283de49..5659a483072b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,6 +437,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PTE_GET: This means the pte ->pte_refcount has been got.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -468,6 +469,7 @@ enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_PTE_GET =		1 << 10,
 };
 
 /*
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 47aaeac7507e..c2389d03bb59 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -13,7 +13,6 @@
 #include <linux/pgtable.h>
 #include <asm/pgalloc.h>
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
 void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 221c3c6438a7..3a7ec085d939 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -11,7 +11,7 @@
 #include <linux/rwsem.h>
 #include <linux/memcontrol.h>
 #include <linux/highmem.h>
-
+#include <linux/pte_ref.h>
 #include <linux/refcount.h>
 
 /*
@@ -222,6 +222,8 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);
+	if (pvmw->pte && !PageHuge(pvmw->page))
+		pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address);
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..5762f3abfd55 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -206,6 +206,9 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
 
+	if (!new_page)
+		pte_put(mm, pte_to_pmd(pvmw.pte), addr);
+
 	if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page))
 		munlock_vma_page(old_page);
 	put_page(old_page);
diff --git a/mm/filemap.c b/mm/filemap.c
index eeac0e119cf5..7f2eb64e5c76 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1701,6 +1701,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			return false;
 
+		pte_put_vmf(vmf);
 		mmap_read_unlock(mm);
 		if (flags & FAULT_FLAG_KILLABLE)
 			folio_wait_locked_killable(folio);
@@ -1713,6 +1714,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 
 		ret = __folio_lock_killable(folio);
 		if (ret) {
+			pte_put_vmf(vmf);
 			mmap_read_unlock(mm);
 			return false;
 		}
@@ -3197,32 +3199,30 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	struct mm_struct *mm = vmf->vma->vm_mm;
 
 	/* Huge page is mapped? No need to proceed. */
-	if (pmd_trans_huge(*vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_trans_huge(*vmf->pmd))
+		goto out;
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
-	    vm_fault_t ret = do_set_pmd(vmf, page);
-	    if (!ret) {
-		    /* The page is mapped successfully, reference consumed. */
-		    unlock_page(page);
-		    return true;
-	    }
+		vm_fault_t ret = do_set_pmd(vmf, page);
+		if (!ret) {
+			/* The page is mapped successfully, reference consumed. */
+			unlock_page(page);
+			return true;
+		}
 	}
 
-	if (pmd_none(*vmf->pmd))
-		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+	if (IS_ENABLED(CONFIG_FREE_USER_PTE) || pmd_none(*vmf->pmd))
+		pmd_install_get(mm, vmf->pmd, &vmf->prealloc_pte);
 
 	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_devmap_trans_unstable(vmf->pmd))
+		goto out;
 
 	return false;
+out:
+	unlock_page(page);
+	put_page(page);
+	return true;
 }
 
 static struct page *next_uptodate_page(struct page *page,
@@ -3292,10 +3292,12 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
 	unsigned long addr;
+	unsigned long start;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct page *head, *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	unsigned int nr_get = 0;
 
 	rcu_read_lock();
 	head = first_map_page(mapping, &xas, end_pgoff);
@@ -3307,7 +3309,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	start = addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
 		page = find_subpage(head, xas.xa_index);
@@ -3329,6 +3331,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			ret = VM_FAULT_NOPAGE;
 
 		do_set_pte(vmf, page, addr);
+		nr_get++;
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, addr, vmf->pte);
 		unlock_page(head);
@@ -3338,6 +3341,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		put_page(head);
 	} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_get_many(vmf->pmd, nr_get);
+	pte_put(vma->vm_mm, vmf->pmd, start);
 out:
 	rcu_read_unlock();
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
diff --git a/mm/gup.c b/mm/gup.c
index 2630ed1bb4f4..30757f3b176c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
 
+	if (!pte_try_get(mm, pmd))
+		return no_page_table(vma, flags);
+
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 	pte = *ptep;
 	if (!pte_present(pte)) {
@@ -517,6 +520,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (!is_migration_entry(entry))
 			goto no_page;
 		pte_unmap_unlock(ptep, ptl);
+		pte_put(mm, pmd, address);
 		migration_entry_wait(mm, pmd, address);
 		goto retry;
 	}
@@ -524,6 +528,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
 		pte_unmap_unlock(ptep, ptl);
+		pte_put(mm, pmd, address);
 		return NULL;
 	}
 
@@ -612,9 +617,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 out:
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmd, address);
 	return page;
 no_page:
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmd, address);
 	if (!pte_none(pte))
 		return NULL;
 	return no_page_table(vma, flags);
diff --git a/mm/hmm.c b/mm/hmm.c
index fad6be2bf072..29bb379510cc 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -380,6 +380,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
+	if (!pte_try_get(walk->mm, pmdp))
+		goto again;
+
 	ptep = pte_offset_map(pmdp, addr);
 	for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
 		int r;
@@ -391,6 +394,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		}
 	}
 	pte_unmap(ptep - 1);
+	pte_put(walk->mm, pmdp, start);
 	return 0;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index c1e23fa563a7..5cebb4ee1792 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -466,6 +466,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 	if (fault_flag_allow_retry_first(flags) &&
 	    !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
 		fpin = get_file(vmf->vma->vm_file);
+		pte_put_vmf(vmf);
 		mmap_read_unlock(vmf->vma->vm_mm);
 	}
 	return fpin;
diff --git a/mm/ksm.c b/mm/ksm.c
index 08092e6f0b73..d0d72dd1eaf0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1138,6 +1138,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
+	if (!pte_try_get(mm, pmd))
+		goto out;
+
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
@@ -1187,6 +1190,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	err = 0;
 out_mn:
 	mmu_notifier_invalidate_range_end(&range);
+	pte_put(mm, pmd, addr);
 out:
 	return err;
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..1494da73281c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -191,7 +191,8 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+	if (pmd_none_or_trans_huge_or_clear_bad(pmd) ||
+	    !pte_try_get(vma->vm_mm, pmd))
 		return 0;
 
 	for (index = start; index != end; index += PAGE_SIZE) {
@@ -215,6 +216,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		if (page)
 			put_page(page);
 	}
+	pte_put(vma->vm_mm, pmd, start);
 
 	return 0;
 }
@@ -318,6 +320,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	spinlock_t *ptl;
 	struct page *page = NULL;
 	LIST_HEAD(page_list);
+	unsigned long start = addr;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -392,6 +395,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
+	if (!pte_try_get(vma->vm_mm, pmd))
+		return 0;
+
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -471,6 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	if (pageout)
 		reclaim_pages(&page_list);
 	cond_resched();
@@ -580,14 +587,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	int nr_swap = 0;
 	unsigned long next;
+	unsigned int nr_put = 0;
+	unsigned long start = addr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd) || !pte_try_get(mm, pmd))
 		return 0;
+	nr_put++;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -612,6 +622,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			nr_swap--;
 			free_swap_and_cache(entry);
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			nr_put++;
 			continue;
 		}
 
@@ -696,6 +707,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+	if (nr_put)
+		pte_put_many(mm, pmd, start, nr_put);
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e7c205a1852..f7d203ce14af 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5795,6 +5795,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -5809,13 +5810,14 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd))
 		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
 			mc.precharge++;	/* increment precharge temporarily */
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	cond_resched();
 
 	return 0;
@@ -5995,6 +5997,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -6030,6 +6033,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 retry:
+	if (!pte_try_get(vma->vm_mm, pmd))
+		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
@@ -6080,6 +6085,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		}
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	cond_resched();
 
 	if (addr != end) {
diff --git a/mm/memory.c b/mm/memory.c
index d1efb868e682..8fcef8b67971 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -219,6 +219,17 @@ static void check_sync_rss_stat(struct task_struct *task)
 
 #endif /* SPLIT_RSS_COUNTING */
 
+#ifdef CONFIG_FREE_USER_PTE
+static inline void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+				  unsigned long addr)
+{
+	/*
+	 * We should never reach here since the PTE page tables are
+	 * dynamically freed.
+	 */
+	BUG();
+}
+#else
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
@@ -231,6 +242,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
 	pte_free_tlb(tlb, token, addr);
 	mm_dec_nr_ptes(tlb->mm);
 }
+#endif /* CONFIG_FREE_USER_PTE */
 
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				unsigned long addr, unsigned long end,
@@ -822,6 +834,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_swp_clear_uffd_wp(pte);
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -890,6 +903,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		/* Uffd-wp needs to be delivered to dest pte as well */
 		pte = pte_wrprotect(pte_mkuffd_wp(pte));
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -942,6 +956,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		pte = pte_clear_uffd_wp(pte);
 
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -983,7 +998,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	progress = 0;
 	init_rss_vec(rss);
 
-	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
+	dst_pte = pte_alloc_get_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte) {
 		ret = -ENOMEM;
 		goto out;
@@ -1071,8 +1086,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		goto out;
 	} else if (ret ==  -EAGAIN) {
 		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
-		if (!prealloc)
-			return -ENOMEM;
+		if (!prealloc) {
+			ret = -ENOMEM;
+			goto out;
+		}
 	} else if (ret) {
 		VM_WARN_ON_ONCE(1);
 	}
@@ -1080,11 +1097,14 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	/* We've captured and resolved the error. Reset, try again. */
 	ret = 0;
 
-	if (addr != end)
+	if (addr != end) {
+		pte_put(dst_mm, dst_pmd, addr);
 		goto again;
+	}
 out:
 	if (unlikely(prealloc))
 		put_page(prealloc);
+	pte_put(dst_mm, dst_pmd, addr);
 	return ret;
 }
 
@@ -1103,9 +1123,13 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
+		pmd_t pmdval;
+
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
-			|| pmd_devmap(*src_pmd)) {
+retry:
+		pmdval = READ_ONCE(*src_pmd);
+		if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval)
+			|| pmd_devmap(pmdval)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1118,9 +1142,14 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
+		if (!pte_try_get(src_mm, src_pmd))
+			goto retry;
 		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
+				   addr, next)) {
+			pte_put(src_mm, src_pmd, addr);
 			return -ENOMEM;
+		}
+		pte_put(src_mm, src_pmd, addr);
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1278,6 +1307,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	unsigned int nr_put = 0;
+	unsigned long start = addr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1310,6 +1341,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			nr_put++;
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1352,6 +1384,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			nr_put++;
 			rss[mm_counter(page)]--;
 
 			if (is_device_private_entry(entry))
@@ -1376,6 +1409,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		nr_put++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -1402,6 +1436,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto again;
 	}
 
+	if (nr_put)
+		pte_put_many(mm, pmd, start, nr_put);
+
 	return addr;
 }
 
@@ -1441,9 +1478,11 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		 * because MADV_DONTNEED holds the mmap_lock in read
 		 * mode.
 		 */
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+		if (pmd_none_or_trans_huge_or_clear_bad(pmd) ||
+		    !pte_try_get(tlb->mm, pmd))
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		pte_put(tlb->mm, pmd, addr);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -1698,7 +1737,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 
 	if (!pmd)
 		return NULL;
-	return pte_alloc_map_lock(mm, pmd, addr, ptl);
+	return pte_alloc_get_map_lock(mm, pmd, addr, ptl);
 }
 
 static int validate_page_before_insert(struct page *page)
@@ -1719,6 +1758,7 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 	inc_mm_counter_fast(mm, mm_counter_file(page));
 	page_add_file_rmap(page, false);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_get(pte_to_pmd(pte));
 	return 0;
 }
 
@@ -1746,6 +1786,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 		goto out;
 	retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pte_to_pmd(pte), addr);
 out:
 	return retval;
 }
@@ -1789,7 +1830,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 
 	/* Allocate the PTE if necessary; takes PMD lock once only. */
 	ret = -ENOMEM;
-	if (pte_alloc(mm, pmd))
+	if (pte_alloc_get(mm, pmd))
 		goto out;
 
 	while (pages_to_write_in_pmd) {
@@ -1816,6 +1857,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 	if (remaining_pages_total)
 		goto more;
 	ret = 0;
+	pte_put(mm, pmd, addr);
 out:
 	*num = remaining_pages_total;
 	return ret;
@@ -2039,10 +2081,12 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	set_pte_at(mm, addr, pte, entry);
+	pte_get(pte_to_pmd(pte));
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pte_to_pmd(pte), addr);
 	return VM_FAULT_NOPAGE;
 }
 
@@ -2246,8 +2290,10 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 	int err = 0;
+	unsigned int nr_get = 0;
+	unsigned long start_addr = addr;
 
-	mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	mapped_pte = pte = pte_alloc_get_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
 	arch_enter_lazy_mmu_mode();
@@ -2258,10 +2304,13 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			break;
 		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
+		nr_get++;
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_get_many(pmd, nr_get);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_put(mm, pmd, start_addr);
 	return err;
 }
 
@@ -2474,13 +2523,17 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
+	unsigned int nr_put = 0;
+	unsigned int nr_get = 0;
+	unsigned long start = addr;
 
 	if (create) {
 		mapped_pte = pte = (mm == &init_mm) ?
 			pte_alloc_kernel_track(pmd, addr, mask) :
-			pte_alloc_map_lock(mm, pmd, addr, &ptl);
+			pte_alloc_get_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
 			return -ENOMEM;
+		nr_put++;
 	} else {
 		mapped_pte = pte = (mm == &init_mm) ?
 			pte_offset_kernel(pmd, addr) :
@@ -2495,17 +2548,30 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		do {
 			if (create || !pte_none(*pte)) {
 				err = fn(pte++, addr, data);
-				if (err)
-					break;
+				if (IS_ENABLED(CONFIG_FREE_USER_PTE) &&
+					mm != &init_mm && !pte_none(*(pte-1)))
+					nr_get++;
+			} else if (!pte_none(*pte)) {
+				err = fn(pte++, addr, data);
+				if (IS_ENABLED(CONFIG_FREE_USER_PTE) &&
+					mm != &init_mm && pte_none(*(pte-1)))
+					nr_put++;
 			}
+			if (err)
+				break;
 		} while (addr += PAGE_SIZE, addr != end);
 	}
 	*mask |= PGTBL_PTE_MODIFIED;
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm != &init_mm) {
 		pte_unmap_unlock(mapped_pte, ptl);
+		pte_get_many(pmd, nr_get);
+		if (nr_put)
+			pte_put_many(mm, pmd, start, nr_put);
+	}
+
 	return err;
 }
 
@@ -2529,6 +2595,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 	}
 	do {
 		next = pmd_addr_end(addr, end);
+retry:
 		if (pmd_none(*pmd) && !create)
 			continue;
 		if (WARN_ON_ONCE(pmd_leaf(*pmd)))
@@ -2538,8 +2605,12 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 				continue;
 			pmd_clear_bad(pmd);
 		}
+		if (!create && !pte_try_get(mm, pmd))
+			goto retry;
 		err = apply_to_pte_range(mm, pmd, addr, next,
 					 fn, data, create, mask);
+		if (!create)
+			pte_put(mm, pmd, addr);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -3689,7 +3760,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	/*
-	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
+	 * Use pte_alloc_get() instead of pte_alloc_map().  We can't run
 	 * pte_offset_map() on pmds where a huge pmd might be created
 	 * from a different thread.
 	 *
@@ -3698,7 +3769,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 *
 	 * Here we only have mmap_read_lock(mm).
 	 */
-	if (pte_alloc(vma->vm_mm, vmf->pmd))
+	if (pte_alloc_get(vma->vm_mm, vmf->pmd))
 		return VM_FAULT_OOM;
 
 	/* See comment in handle_pte_fault() */
@@ -3722,7 +3793,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			return handle_userfault(vmf, VM_UFFD_MISSING);
+			ret = handle_userfault(vmf, VM_UFFD_MISSING);
+			goto put;
 		}
 		goto setpte;
 	}
@@ -3765,7 +3837,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		put_page(page);
-		return handle_userfault(vmf, VM_UFFD_MISSING);
+		ret = handle_userfault(vmf, VM_UFFD_MISSING);
+		goto put;
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3773,19 +3846,23 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	pte_get(vmf->pmd);
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return ret;
+	goto put;
 release:
 	put_page(page);
 	goto unlock;
 oom_free_page:
 	put_page(page);
 oom:
-	return VM_FAULT_OOM;
+	ret = VM_FAULT_OOM;
+put:
+	pte_put(vma->vm_mm, vmf->pmd, vmf->address);
+	return ret;
 }
 
 /*
@@ -3813,7 +3890,7 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	 *				unlock_page(B)
 	 *				# flush A, B to clear the writeback
 	 */
-	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+	if (!vmf->prealloc_pte) {
 		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
@@ -3983,6 +4060,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+retry:
 	if (pmd_none(*vmf->pmd)) {
 		if (PageTransCompound(page)) {
 			ret = do_set_pmd(vmf, page);
@@ -3991,9 +4069,11 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		}
 
 		if (vmf->prealloc_pte)
-			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
-		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
+			pmd_install_get(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+		else if (unlikely(pte_alloc_get(vma->vm_mm, vmf->pmd)))
 			return VM_FAULT_OOM;
+	} else if (!pte_try_get(vma->vm_mm, vmf->pmd)) {
+		goto retry;
 	}
 
 	/* See comment in handle_pte_fault() */
@@ -4004,13 +4084,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_none(*vmf->pte))) {
 		do_set_pte(vmf, page, vmf->address);
-	else
+		pte_get(vmf->pmd);
+	} else {
 		ret = VM_FAULT_NOPAGE;
+	}
 
 	update_mmu_tlb(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_put(vma->vm_mm, vmf->pmd, vmf->address);
 	return ret;
 }
 
@@ -4232,9 +4315,15 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 		 * If we find a migration pmd entry or a none pmd entry, which
 		 * should never happen, return SIGBUS
 		 */
-		if (unlikely(!pmd_present(*vmf->pmd)))
+		if (unlikely(!pmd_present(*vmf->pmd))) {
 			ret = VM_FAULT_SIGBUS;
-		else {
+			goto out;
+		} else {
+			if (!pte_try_get(vma->vm_mm, vmf->pmd)) {
+				ret = VM_FAULT_SIGBUS;
+				goto out;
+			}
+
 			vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
 						       vmf->pmd,
 						       vmf->address,
@@ -4252,6 +4341,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 				ret = VM_FAULT_NOPAGE;
 
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			pte_put(vma->vm_mm, vmf->pmd, vmf->address);
 		}
 	} else if (!(vmf->flags & FAULT_FLAG_WRITE))
 		ret = do_read_fault(vmf);
@@ -4265,6 +4355,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
+out:
 	return ret;
 }
 
@@ -4460,10 +4551,12 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
+	vm_fault_t ret;
 
-	if (unlikely(pmd_none(*vmf->pmd))) {
+retry:
+	if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) {
 		/*
-		 * Leave __pte_alloc() until later: because vm_ops->fault may
+		 * Leave __pte_alloc_get() until later: because vm_ops->fault may
 		 * want to allocate huge page, and if we expose page table
 		 * for an instant, it will be difficult to retract from
 		 * concurrent faults and from rmap lookups.
@@ -4484,6 +4577,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 */
 		if (pmd_devmap_trans_unstable(vmf->pmd))
 			return 0;
+
+		if (!pte_try_get(vmf->vma->vm_mm, vmf->pmd))
+			goto retry;
+
+		if (IS_ENABLED(CONFIG_FREE_USER_PTE))
+			vmf->flags |= FAULT_FLAG_PTE_GET;
+
 		/*
 		 * A regular pmd is established and it can't morph into a huge
 		 * pmd from under us anymore at this point because we hold the
@@ -4505,6 +4605,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		if (pte_none(vmf->orig_pte)) {
 			pte_unmap(vmf->pte);
 			vmf->pte = NULL;
+			pte_put_vmf(vmf);
 		}
 	}
 
@@ -4515,11 +4616,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			return do_fault(vmf);
 	}
 
-	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+	if (!pte_present(vmf->orig_pte)) {
+		ret = do_swap_page(vmf);
+		goto put;
+	}
 
-	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
-		return do_numa_page(vmf);
+	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+		ret = do_numa_page(vmf);
+		goto put;
+	}
 
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
@@ -4529,8 +4634,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
-			return do_wp_page(vmf);
+		if (!pte_write(entry)) {
+			ret = do_wp_page(vmf);
+			goto put;
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
@@ -4552,7 +4659,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	ret = 0;
+put:
+	pte_put_vmf(vmf);
+	return ret;
 }
 
 /*
@@ -4889,13 +4999,22 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 					(address & PAGE_MASK) + PAGE_SIZE);
 		mmu_notifier_invalidate_range_start(range);
 	}
+	if (!pte_try_get(mm, pmd))
+		goto out;
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
 	if (!pte_present(*ptep))
 		goto unlock;
+	/*
+	 * when we reach here, it means that the ->pte_refcount is at least
+	 * one and the contents of the PTE page table are stable until @ptlp is
+	 * released, so we can put pte safely.
+	 */
+	pte_put(mm, pmd, address);
 	*ptepp = ptep;
 	return 0;
 unlock:
 	pte_unmap_unlock(ptep, *ptlp);
+	pte_put(mm, pmd, address);
 	if (range)
 		mmu_notifier_invalidate_range_end(range);
 out:
@@ -5022,6 +5141,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 		return -EINVAL;
 	pte = *ptep;
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(vma->vm_mm, pte_to_pmd(ptep), addr);
 
 	prot = pgprot_val(pte_pgprot(pte));
 	phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9d3afa015fac..acef7df762a9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -508,6 +508,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	bool has_unmovable = false;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -517,7 +518,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	/* THP was split, fall through to pte walk */
 
-	if (pmd_trans_unstable(pmd))
+	if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd))
 		return 0;
 
 	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
@@ -553,6 +554,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			break;
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_put(walk->mm, pmd, start);
 	cond_resched();
 
 	if (has_unmovable)
diff --git a/mm/migrate.c b/mm/migrate.c
index 7a03a61bbcd8..a3bcef1430c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2298,6 +2298,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	if (unlikely(pmd_bad(*pmdp)))
 		return migrate_vma_collect_skip(start, end, walk);
 
+	if (!pte_try_get(mm, pmdp))
+		goto again;
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
 
@@ -2419,6 +2421,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
+	pte_put(mm, pmdp, start);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
@@ -2826,7 +2829,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	/*
-	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
+	 * Use pte_alloc_get() instead of pte_alloc_map().  We can't run
 	 * pte_offset_map() on pmds where a huge pmd might be created
 	 * from a different thread.
 	 *
@@ -2835,7 +2838,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	 *
 	 * Here we only have mmap_read_lock(mm).
 	 */
-	if (pte_alloc(mm, pmdp))
+	if (pte_alloc_get(mm, pmdp))
 		goto abort;
 
 	/* See the comment in pte_alloc_one_map() */
@@ -2843,9 +2846,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	if (unlikely(anon_vma_prepare(vma)))
-		goto abort;
+		goto put;
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
-		goto abort;
+		goto put;
 
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
@@ -2914,15 +2917,19 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	} else {
 		/* No need to invalidate - it was non-present before */
 		set_pte_at(mm, addr, ptep, entry);
+		pte_get(pmdp);
 		update_mmu_cache(vma, addr, ptep);
 	}
 
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmdp, addr);
 	*src = MIGRATE_PFN_MIGRATE;
 	return;
 
 unlock_abort:
 	pte_unmap_unlock(ptep, ptl);
+put:
+	pte_put(mm, pmdp, addr);
 abort:
 	*src &= ~MIGRATE_PFN_MIGRATE;
 }
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..0401b526adee 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -18,6 +18,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/pte_ref.h>
 
 #include <linux/uaccess.h>
 
@@ -104,6 +105,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *ptep;
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -112,7 +114,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd)) {
+	if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd)) {
 		__mincore_unmapped_range(addr, end, vma, vec);
 		goto out;
 	}
@@ -148,6 +150,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		vec++;
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
+	pte_put(walk->mm, pmd, start);
 out:
 	walk->private += nr;
 	cond_resched();
diff --git a/mm/mlock.c b/mm/mlock.c
index e263d62ae2d0..a4ef20ba9627 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -398,6 +398,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			break;
 	}
 	pte_unmap_unlock(pte, ptl);
+	pte_put(vma->vm_mm, pte_to_pmd(pte), start);
 	return start;
 }
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..53b412423ee8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -234,9 +234,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd = pmd_offset(pud, addr);
 	do {
 		unsigned long this_pages;
+		pmd_t pmdval;
 
 		next = pmd_addr_end(addr, end);
 
+retry:
+		pmdval = READ_ONCE(*pmd);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -245,7 +248,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * Hence, it's necessary to atomically read the PMD value
 		 * for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		if (!is_swap_pmd(pmdval) && !pmd_devmap(pmdval) &&
 		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
@@ -257,7 +260,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
@@ -276,8 +279,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
+		if (!pte_try_get(vma->vm_mm, pmd))
+			goto retry;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 					      cp_flags);
+		pte_put(vma->vm_mm, pmd, addr);
 		pages += this_pages;
 next:
 		cond_resched();
diff --git a/mm/mremap.c b/mm/mremap.c
index c0b6c41b7b78..461dcb2de18a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -141,6 +141,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
 	unsigned long len = old_end - old_addr;
+	unsigned long old_start = old_addr;
+	unsigned int nr_put = 0;
+	unsigned int nr_get = 0;
 
 	/*
 	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -181,6 +184,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 			continue;
 
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		nr_put++;
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -197,7 +201,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 		set_pte_at(mm, new_addr, new_pte, pte);
+		nr_get++;
 	}
+	pte_get_many(new_pmd, nr_get);
 
 	arch_leave_lazy_mmu_mode();
 	if (force_flush)
@@ -206,6 +212,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (nr_put)
+		pte_put_many(mm, old_pmd, old_start, nr_put);
 	if (need_rmap_locks)
 		drop_rmap_locks(vma);
 }
@@ -271,6 +279,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 	VM_BUG_ON(!pmd_none(*new_pmd));
 
 	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
+	pte_migrate_pmd(pmd, new_pmd);
 	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
@@ -548,10 +557,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				continue;
 		}
 
-		if (pte_alloc(new_vma->vm_mm, new_pmd))
+		if (pte_alloc_get(new_vma->vm_mm, new_pmd))
 			break;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
 			  new_pmd, new_addr, need_rmap_locks);
+		pte_put(new_vma->vm_mm, new_pmd, new_addr);
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f7b331081791..eb84fa5825c0 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -211,6 +211,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		}
 
 		pvmw->pmd = pmd_offset(pud, pvmw->address);
+retry:
 		/*
 		 * Make sure the pmd value isn't cached in a register by the
 		 * compiler and used as a stale value after we've observed a
@@ -258,6 +259,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
+		if (!pte_try_get(pvmw->vma->vm_mm, pvmw->pmd))
+			goto retry;
 		if (!map_pte(pvmw))
 			goto next_pte;
 this_pte:
@@ -275,6 +278,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					pvmw->ptl = NULL;
 				}
 				pte_unmap(pvmw->pte);
+				pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address);
 				pvmw->pte = NULL;
 				goto restart;
 			}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..da1324021429 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/pte_ref.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -110,6 +111,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	do {
 again:
 		next = pmd_addr_end(addr, end);
+retry:
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -147,10 +149,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 				goto again;
 		}
 
-		if (is_hugepd(__hugepd(pmd_val(*pmd))))
+		if (is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
-		else
-			err = walk_pte_range(pmd, addr, next, walk);
+		} else {
+			if (!walk->no_vma) {
+				if (!pte_try_get(walk->mm, pmd))
+					goto retry;
+				err = walk_pte_range(pmd, addr, next, walk);
+				pte_put(walk->mm, pmd, addr);
+			} else {
+				err = walk_pte_range(pmd, addr, next, walk);
+			}
+		}
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 630704ae26db..dfbf817b7367 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -74,11 +74,6 @@ void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 	spin_unlock(ptl);
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
-{
-	pmd_install_get(mm, pmd, pte);
-}
-
 int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t new = pte_alloc_one(mm);
diff --git a/mm/rmap.c b/mm/rmap.c
index 09c41e1f44d8..eac31f3bca05 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1402,6 +1402,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	unsigned int nr_put = 0;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1549,6 +1550,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
+			nr_put++;
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1562,6 +1564,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				/* We have to invalidate as we cleared the pte */
 				mmu_notifier_invalidate_range(mm, address,
 							address + PAGE_SIZE);
+				nr_put++;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1573,6 +1576,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
+					nr_put++;
 					goto discard;
 				}
 
@@ -1628,6 +1632,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * See Documentation/vm/mmu_notifier.rst
 			 */
 			dec_mm_counter(mm, mm_counter_file(page));
+			nr_put++;
 		}
 discard:
 		/*
@@ -1641,6 +1646,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		put_page(page);
 	}
 
+	if (nr_put)
+		pte_put_many(mm, pvmw.pmd, address, nr_put);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
@@ -1702,6 +1710,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	unsigned int nr_put = 0;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1865,6 +1874,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
+			nr_put++;
 		} else {
 			swp_entry_t entry;
 			pte_t swp_pte;
@@ -1911,6 +1921,9 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		put_page(page);
 	}
 
+	if (nr_put)
+		pte_put_many(mm, pvmw.pmd, address, nr_put);
+
 	mmu_notifier_invalidate_range_end(&range);
 
 	return ret;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e3dcaeecc50f..10ebfc94208a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2021,10 +2021,12 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+		if (pmd_none_or_trans_huge_or_clear_bad(pmd) ||
+		    !pte_try_get(vma->vm_mm, pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
 				      frontswap, fs_pages_to_unuse);
+		pte_put(vma->vm_mm, pmd, addr);
 		if (ret)
 			return ret;
 	} while (pmd++, addr = next, addr != end);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 36e5f6ab976f..7661929b27bf 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -111,6 +111,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	pte_get(dst_pmd);
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -205,6 +206,7 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	pte_get(dst_pmd);
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
 	ret = 0;
@@ -589,12 +591,15 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
+		    unlikely(__pte_alloc_get(dst_mm, dst_pmd))) {
 			err = -ENOMEM;
 			break;
-		}
-		/* If an huge pmd materialized from under us fail */
-		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+		} else if (unlikely(pmd_trans_huge(*dst_pmd)) ||
+			   !pte_try_get(dst_mm, dst_pmd)) {
+			/*
+			 * If an huge pmd materialized from under us fail or the
+			 * pte populated in the dst_pmd was freed.
+			 */
 			err = -EFAULT;
 			break;
 		}
@@ -604,6 +609,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
 				       src_addr, &page, mcopy_mode, wp_copy);
+		pte_put(dst_mm, dst_pmd, dst_addr);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 7/9] mm: add THP support for pte_ref
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (5 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism Qi Zheng
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

The situation will become more complex with THP enabled. In
the handle_pte_fault(), the pmd entry can be populated with
PTE page table as well as THP page. So we can no longer
always get ->pte_refcount successfully, we add the following
function to solve this case:

	pte_alloc_try_get()
	pmd_install_try_get()

When split THP into PTE page table, we initialize
->pte_refcount to HPAGE_PMD_NR, so that we can free this PTE
normally through pte_put().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c      | 14 +++-------
 include/linux/pte_ref.h | 46 +++++++++++++++++++++++++++++--
 mm/Kconfig              |  2 +-
 mm/filemap.c            | 11 ++++----
 mm/khugepaged.c         |  9 ++++++
 mm/madvise.c            |  9 ++----
 mm/memcontrol.c         |  7 ++---
 mm/memory.c             | 43 +++++++++++++++--------------
 mm/mempolicy.c          |  2 +-
 mm/migrate.c            |  7 +----
 mm/mincore.c            |  2 +-
 mm/pgtable-generic.c    |  2 ++
 mm/pte_ref.c            | 73 +++++++++++++++++++++++++++++++++++++------------
 mm/sparse-vmemmap.c     |  2 +-
 mm/swapfile.c           |  3 +-
 mm/userfaultfd.c        | 10 ++++---
 16 files changed, 159 insertions(+), 83 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 85ee730ff6ae..b31915696210 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -583,7 +583,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 		goto out;
 	/*
 	 * The mmap_lock held all the way back in m_start() is what
@@ -1146,7 +1146,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 		return 0;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -1475,12 +1475,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		spin_unlock(ptl);
 		return err;
 	}
-
-	if (pmd_trans_unstable(pmdp))
-		return 0;
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-	if (!pte_try_get(walk->mm, pmdp))
+	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmdp))
 		return 0;
 
 	/*
@@ -1818,11 +1815,8 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		spin_unlock(ptl);
 		return 0;
 	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
 #endif
-	if (!pte_try_get(walk->mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd))
 		return 0;
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index c2389d03bb59..04cca9427270 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -14,8 +14,9 @@
 #include <asm/pgalloc.h>
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+bool pmd_install_try_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd);
+int __pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmd);
 
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
@@ -95,7 +96,8 @@ static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
 	spinlock_t *ptl;
 
 	ptl = pmd_lock(mm, pmdp);
-	if (!pmd_present(*pmdp) || !pte_get_unless_zero(pmdp))
+	if (pmd_leaf(*pmdp) || !pmd_present(*pmdp) ||
+	    !pte_get_unless_zero(pmdp))
 		retval = false;
 	spin_unlock(ptl);
 
@@ -174,6 +176,30 @@ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp)
 #define pte_alloc_get_map_lock(mm, pmd, address, ptlp)		\
 	(pte_alloc_get(mm, pmd) ?				\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
+
+/*
+ * pte_alloc_try_get - allocate a PTE page table if the pmd entry is none, and
+ *		       try to hold a ->pte_refcount
+ * @mm: the mm_struct of the target address space.
+ * @pmdp: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Return 1 if the ->pte_refcount is successfully held, return 0 if the pmd
+ * entry is populated with a THP page by another thread, Otherwise return
+ * -ENOMEM.
+ */
+static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	if (!pte_try_get(mm, pmdp))
+		return __pte_alloc_try_get(mm, pmdp);
+	return 1;
+}
+
+static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	if (!pte_try_get(mm, pmdp))
+		return true;
+	return false;
+}
 #else
 static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
 {
@@ -229,6 +255,22 @@ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp)
 
 #define pte_alloc_get_map_lock(mm, pmd, address, ptlp)		\
 	pte_alloc_map_lock(mm, pmd, address, ptlp)
+
+static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	if (unlikely(pmd_none(*pmdp)))
+		return __pte_alloc_try_get(mm, pmdp);
+	if (unlikely(pmd_devmap_trans_unstable(pmdp)))
+		return 0;
+	return 1;
+}
+
+static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+{
+	if (pmd_trans_unstable(pmdp))
+		return true;
+	return false;
+}
 #endif /* CONFIG_FREE_USER_PTE */
 
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index fb16ecebcb80..18aea3d0ce49 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -896,7 +896,7 @@ config SECRETMEM
 
 config FREE_USER_PTE
 	def_bool y
-	depends on X86_64 && !TRANSPARENT_HUGEPAGE
+	depends on X86_64
 
 source "mm/damon/Kconfig"
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7f2eb64e5c76..8bcab6fc5d17 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3211,12 +3211,13 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		}
 	}
 
-	if (IS_ENABLED(CONFIG_FREE_USER_PTE) || pmd_none(*vmf->pmd))
-		pmd_install_get(mm, vmf->pmd, &vmf->prealloc_pte);
-
-	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd))
+	if (IS_ENABLED(CONFIG_FREE_USER_PTE) || pmd_none(*vmf->pmd)) {
+		if (!pmd_install_try_get(mm, vmf->pmd, &vmf->prealloc_pte))
+			goto out;
+	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+		/* See comment in handle_pte_fault() */
 		goto out;
+	}
 
 	return false;
 out:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6b9c98ddcd09..95d90c896580 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1240,6 +1240,10 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	if (!pte_try_get(mm, pmd)) {
+		result = SCAN_PMD_NULL;
+		goto out;
+	}
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1361,6 +1365,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pmd, address);
 	if (ret) {
 		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_lock released */
@@ -1463,6 +1468,8 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!pmd)
 		goto drop_hpage;
 
+	if (!pte_try_get(mm, pmd))
+		goto drop_hpage;
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
 
 	/* step 1: check all mapped PTEs are to the right huge page */
@@ -1501,6 +1508,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
+	pte_put(mm, pmd, haddr);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (count) {
@@ -1522,6 +1530,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
+	pte_put(mm, pmd, haddr);
 	goto drop_hpage;
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 1494da73281c..1befb4e64f2b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -191,8 +191,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
 
-	if (pmd_none_or_trans_huge_or_clear_bad(pmd) ||
-	    !pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 		return 0;
 
 	for (index = start; index != end; index += PAGE_SIZE) {
@@ -392,10 +391,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 regular_page:
-	if (pmd_trans_unstable(pmd))
-		return 0;
 #endif
-	if (!pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 		return 0;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
@@ -595,7 +592,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
 
-	if (pmd_trans_unstable(pmd) || !pte_try_get(mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(mm, pmd))
 		return 0;
 	nr_put++;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f7d203ce14af..56c580d37e94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5810,7 +5810,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable(pmd) || !pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
@@ -6029,11 +6029,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		spin_unlock(ptl);
 		return 0;
 	}
-
-	if (pmd_trans_unstable(pmd))
-		return 0;
 retry:
-	if (!pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
diff --git a/mm/memory.c b/mm/memory.c
index 8fcef8b67971..99dde124755b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -453,7 +453,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
 
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
-		smp_wmb(); /* See comment in pmd_install_get() */
+		smp_wmb(); /* See comment in __pmd_install() */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
 	}
@@ -1478,8 +1478,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		 * because MADV_DONTNEED holds the mmap_lock in read
 		 * mode.
 		 */
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd) ||
-		    !pte_try_get(tlb->mm, pmd))
+		if (pmd_trans_unstable_or_pte_try_get(tlb->mm, pmd))
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
 		pte_put(tlb->mm, pmd, addr);
@@ -3760,7 +3759,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	/*
-	 * Use pte_alloc_get() instead of pte_alloc_map().  We can't run
+	 * Use pte_alloc_try_get() instead of pte_alloc_get_map().  We can't run
 	 * pte_offset_map() on pmds where a huge pmd might be created
 	 * from a different thread.
 	 *
@@ -3769,12 +3768,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 *
 	 * Here we only have mmap_read_lock(mm).
 	 */
-	if (pte_alloc_get(vma->vm_mm, vmf->pmd))
-		return VM_FAULT_OOM;
-
-	/* See comment in handle_pte_fault() */
-	if (unlikely(pmd_trans_unstable(vmf->pmd)))
-		return 0;
+	ret = pte_alloc_try_get(vma->vm_mm, vmf->pmd);
+	if (ret <= 0)
+		return ret < 0 ? VM_FAULT_OOM : 0;
+	ret = 0;
 
 	/* Use the zero-page for reads */
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
@@ -4068,18 +4065,22 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				return ret;
 		}
 
-		if (vmf->prealloc_pte)
-			pmd_install_get(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
-		else if (unlikely(pte_alloc_get(vma->vm_mm, vmf->pmd)))
-			return VM_FAULT_OOM;
+		if (vmf->prealloc_pte) {
+			if (!pmd_install_try_get(vma->vm_mm, vmf->pmd,
+			    &vmf->prealloc_pte))
+				return 0;
+		} else {
+			ret = pte_alloc_try_get(vma->vm_mm, vmf->pmd);
+			if (ret <= 0)
+				return ret < 0 ? VM_FAULT_OOM : 0;
+		}
+	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+		/* See comment in handle_pte_fault() */
+		return 0;
 	} else if (!pte_try_get(vma->vm_mm, vmf->pmd)) {
 		goto retry;
 	}
 
-	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd))
-		return 0;
-
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				      vmf->address, &vmf->ptl);
 	ret = 0;
@@ -4890,7 +4891,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
 	if (pgd_present(*pgd))		/* Another has populated it */
 		p4d_free(mm, new);
 	else {
-		smp_wmb(); /* See comment in pmd_install_get() */
+		smp_wmb(); /* See comment in __pmd_install() */
 		pgd_populate(mm, pgd, new);
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -4912,7 +4913,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
 	spin_lock(&mm->page_table_lock);
 	if (!p4d_present(*p4d)) {
 		mm_inc_nr_puds(mm);
-		smp_wmb(); /* See comment in pmd_install_get() */
+		smp_wmb(); /* See comment in __pmd_install() */
 		p4d_populate(mm, p4d, new);
 	} else	/* Another has populated it */
 		pud_free(mm, new);
@@ -4936,7 +4937,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 	ptl = pud_lock(mm, pud);
 	if (!pud_present(*pud)) {
 		mm_inc_nr_pmds(mm);
-		smp_wmb(); /* See comment in pmd_install_get() */
+		smp_wmb(); /* See comment in __pmd_install() */
 		pud_populate(mm, pud, new);
 	} else	/* Another has populated it */
 		pmd_free(mm, new);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index acef7df762a9..9d0493f80a75 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -518,7 +518,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	/* THP was split, fall through to pte walk */
 
-	if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd))
 		return 0;
 
 	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index a3bcef1430c9..af5b8900551b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2829,7 +2829,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	/*
-	 * Use pte_alloc_get() instead of pte_alloc_map().  We can't run
+	 * Use pte_alloc_try_get() instead of pte_alloc_get_map().  We can't run
 	 * pte_offset_map() on pmds where a huge pmd might be created
 	 * from a different thread.
 	 *
@@ -2841,12 +2841,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (pte_alloc_get(mm, pmdp))
 		goto abort;
 
-	/* See the comment in pte_alloc_one_map() */
-	if (unlikely(pmd_trans_unstable(pmdp)))
-		goto abort;
-
 	if (unlikely(anon_vma_prepare(vma)))
-		goto put;
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
 		goto put;
 
diff --git a/mm/mincore.c b/mm/mincore.c
index 0401b526adee..a72ec90dd54f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -114,7 +114,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable(pmd) || !pte_try_get(walk->mm, pmd)) {
+	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) {
 		__mincore_unmapped_range(addr, end, vma, vec);
 		goto out;
 	}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..88464c54619f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,7 @@
 #include <linux/pagemap.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/pte_ref.h>
 #include <asm/tlb.h>
 
 /*
@@ -186,6 +187,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 							  struct page, lru);
 	if (pmd_huge_pte(mm, pmdp))
 		list_del(&pgtable->lru);
+	pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
 	return pgtable;
 }
 #endif
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index dfbf817b7367..dff32909c7c4 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -42,6 +42,28 @@ void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
 	pte_free(mm, pmd_pgtable(pmd));
 }
 
+static inline void __pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+{
+	mm_inc_nr_ptes(mm);
+	/*
+	 * Ensure all pte setup (eg. pte page lock and page clearing) are
+	 * visible before the pte is made visible to other CPUs by being
+	 * put into page tables.
+	 *
+	 * The other side of the story is the pointer chasing in the page
+	 * table walking code (when walking the page table without locking;
+	 * ie. most of the time). Fortunately, these data accesses consist
+	 * of a chain of data-dependent loads, meaning most CPUs (alpha
+	 * being the notable exception) will already guarantee loads are
+	 * seen in-order. See the alpha page table accessors for the
+	 * smp_rmb() barriers in page table walking code.
+	 */
+	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+	pmd_populate(mm, pmd, *pte);
+	pte_ref_init(*pte, pmd, 1);
+	*pte = NULL;
+}
+
 void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 {
 	spinlock_t *ptl;
@@ -49,24 +71,7 @@ void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 retry:
 	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_none(*pmd))) {
-		mm_inc_nr_ptes(mm);
-		/*
-		 * Ensure all pte setup (eg. pte page lock and page clearing) are
-		 * visible before the pte is made visible to other CPUs by being
-		 * put into page tables.
-		 *
-		 * The other side of the story is the pointer chasing in the page
-		 * table walking code (when walking the page table without locking;
-		 * ie. most of the time). Fortunately, these data accesses consist
-		 * of a chain of data-dependent loads, meaning most CPUs (alpha
-		 * being the notable exception) will already guarantee loads are
-		 * seen in-order. See the alpha page table accessors for the
-		 * smp_rmb() barriers in page table walking code.
-		 */
-		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
-		pmd_populate(mm, pmd, *pte);
-		pte_ref_init(*pte, pmd, 1);
-		*pte = NULL;
+		__pmd_install(mm, pmd, pte);
 	} else if (!pte_get_unless_zero(pmd)) {
 		spin_unlock(ptl);
 		goto retry;
@@ -74,6 +79,25 @@ void pmd_install_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 	spin_unlock(ptl);
 }
 
+bool pmd_install_try_get(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+{
+	spinlock_t *ptl;
+	bool retval = true;
+
+retry:
+	ptl = pmd_lock(mm, pmd);
+	if (likely(pmd_none(*pmd))) {
+		__pmd_install(mm, pmd, pte);
+	} else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
+		retval = false;
+	} else if (!pte_get_unless_zero(pmd)) {
+		spin_unlock(ptl);
+		goto retry;
+	}
+	spin_unlock(ptl);
+	return retval;
+}
+
 int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t new = pte_alloc_one(mm);
@@ -86,6 +110,19 @@ int __pte_alloc_get(struct mm_struct *mm, pmd_t *pmd)
 	return 0;
 }
 
+int __pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmd)
+{
+	int retval;
+	pgtable_t new = pte_alloc_one(mm);
+	if (!new)
+		return -ENOMEM;
+
+	retval = pmd_install_try_get(mm, pmd, &new);
+	if (new)
+		pte_free(mm, new);
+	return retval;
+}
+
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
 	return __pte_alloc_get(mm, pmd);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 818c82ee3c6e..0ce288b979f1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
 
-	/* Make pte visible before pmd. See comment in pmd_install_get(). */
+	/* Make pte visible before pmd. See comment in __pmd_install(). */
 	smp_wmb();
 	pmd_populate_kernel(&init_mm, pmd, pgtable);
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 10ebfc94208a..6db8381e1e19 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2021,8 +2021,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd) ||
-		    !pte_try_get(vma->vm_mm, pmd))
+		if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
 				      frontswap, fs_pages_to_unuse);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7661929b27bf..cb76c6167541 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -572,6 +572,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
+		int ret = 1;
 
 		BUG_ON(dst_addr >= dst_start + len);
 
@@ -590,12 +591,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			err = -EEXIST;
 			break;
 		}
-		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc_get(dst_mm, dst_pmd))) {
+		if ((IS_ENABLED(CONFIG_FREE_USER_PTE) &&
+		     unlikely((ret = pte_alloc_try_get(dst_mm, dst_pmd)) < 0)) ||
+		    (unlikely(pmd_none(dst_pmdval)) &&
+		     unlikely((ret = __pte_alloc_try_get(dst_mm, dst_pmd)) < 0))) {
 			err = -ENOMEM;
 			break;
-		} else if (unlikely(pmd_trans_huge(*dst_pmd)) ||
-			   !pte_try_get(dst_mm, dst_pmd)) {
+		} else if (!ret || unlikely(pmd_trans_huge(*dst_pmd))) {
 			/*
 			 * If an huge pmd materialized from under us fail or the
 			 * pte populated in the dst_pmd was freed.
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (6 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 7/9] mm: add THP support for pte_ref Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-08-19  3:18 ` [PATCH v2 9/9] mm: use mmu_gather to free PTE page table Qi Zheng
  2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

With rcu_read_lock(), the release of the PTE page table
can be postponed. So we don't need to hold the pmd lock
anymore when we do pte_try_get()/pte_alloc_get(), which
can improve performance and simplify code logic.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c      |  8 ++++----
 include/linux/pte_ref.h | 34 ++++++++++++++++++----------------
 mm/gup.c                |  2 +-
 mm/hmm.c                |  2 +-
 mm/khugepaged.c         |  4 ++--
 mm/ksm.c                |  2 +-
 mm/madvise.c            |  6 +++---
 mm/memcontrol.c         |  4 ++--
 mm/memory.c             | 14 +++++++-------
 mm/mempolicy.c          |  2 +-
 mm/migrate.c            |  2 +-
 mm/mincore.c            |  2 +-
 mm/mprotect.c           |  2 +-
 mm/page_vma_mapped.c    |  2 +-
 mm/pagewalk.c           |  2 +-
 mm/pte_ref.c            | 10 +++++++++-
 mm/swapfile.c           |  2 +-
 17 files changed, 55 insertions(+), 45 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b31915696210..f44caef03f22 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -583,7 +583,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		goto out;
 	/*
 	 * The mmap_lock held all the way back in m_start() is what
@@ -1146,7 +1146,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 
-	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -1477,7 +1477,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmdp))
+	if (pmd_trans_unstable_or_pte_try_get(pmdp))
 		return 0;
 
 	/*
@@ -1816,7 +1816,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		return 0;
 	}
 #endif
-	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 04cca9427270..259e5aec048d 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -90,16 +90,17 @@ static inline bool pte_get_unless_zero(pmd_t *pmdp)
  * Before Operating the PTE page table, we need to hold a ->pte_refcount
  * to protect against the concurrent release of the PTE page table.
  */
-static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+static inline bool pte_try_get(pmd_t *pmdp)
 {
 	bool retval = true;
-	spinlock_t *ptl;
+	pmd_t pmdval;
 
-	ptl = pmd_lock(mm, pmdp);
-	if (pmd_leaf(*pmdp) || !pmd_present(*pmdp) ||
-	    !pte_get_unless_zero(pmdp))
+	rcu_read_lock();
+	pmdval = READ_ONCE(*pmdp);
+	if (pmd_leaf(pmdval) || !pmd_present(pmdval) ||
+	    !pte_get_unless_zero(&pmdval))
 		retval = false;
-	spin_unlock(ptl);
+	rcu_read_unlock();
 
 	return retval;
 }
@@ -159,14 +160,15 @@ static inline void pte_put_vmf(struct vm_fault *vmf)
  */
 static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp)
 {
-	spinlock_t *ptl;
+	pmd_t pmdval;
 
-	ptl = pmd_lock(mm, pmdp);
-	if (pmd_none(*pmdp) || !pte_get_unless_zero(pmdp)) {
-		spin_unlock(ptl);
+	rcu_read_lock();
+	pmdval = READ_ONCE(*pmdp);
+	if (pmd_none(pmdval) || !pte_get_unless_zero(&pmdval)) {
+		rcu_read_unlock();
 		return __pte_alloc_get(mm, pmdp);
 	}
-	spin_unlock(ptl);
+	rcu_read_unlock();
 	return 0;
 }
 
@@ -189,14 +191,14 @@ static inline int pte_alloc_get(struct mm_struct *mm, pmd_t *pmdp)
  */
 static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp)
 {
-	if (!pte_try_get(mm, pmdp))
+	if (!pte_try_get(pmdp))
 		return __pte_alloc_try_get(mm, pmdp);
 	return 1;
 }
 
-static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+static inline bool pmd_trans_unstable_or_pte_try_get(pmd_t *pmdp)
 {
-	if (!pte_try_get(mm, pmdp))
+	if (!pte_try_get(pmdp))
 		return true;
 	return false;
 }
@@ -227,7 +229,7 @@ static inline bool pte_get_unless_zero(pmd_t *pmdp)
 	return true;
 }
 
-static inline bool pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+static inline bool pte_try_get(pmd_t *pmdp)
 {
 	return true;
 }
@@ -265,7 +267,7 @@ static inline int pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmdp)
 	return 1;
 }
 
-static inline bool pmd_trans_unstable_or_pte_try_get(struct mm_struct *mm, pmd_t *pmdp)
+static inline bool pmd_trans_unstable_or_pte_try_get(pmd_t *pmdp)
 {
 	if (pmd_trans_unstable(pmdp))
 		return true;
diff --git a/mm/gup.c b/mm/gup.c
index 30757f3b176c..c987ac45d939 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -500,7 +500,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
 
-	if (!pte_try_get(mm, pmd))
+	if (!pte_try_get(pmd))
 		return no_page_table(vma, flags);
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/hmm.c b/mm/hmm.c
index 29bb379510cc..d0e767c5fbb6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -380,7 +380,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
-	if (!pte_try_get(walk->mm, pmdp))
+	if (!pte_try_get(pmdp))
 		goto again;
 
 	ptep = pte_offset_map(pmdp, addr);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 95d90c896580..f33db38eaafc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1240,7 +1240,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
-	if (!pte_try_get(mm, pmd)) {
+	if (!pte_try_get(pmd)) {
 		result = SCAN_PMD_NULL;
 		goto out;
 	}
@@ -1468,7 +1468,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!pmd)
 		goto drop_hpage;
 
-	if (!pte_try_get(mm, pmd))
+	if (!pte_try_get(pmd))
 		goto drop_hpage;
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index d0d72dd1eaf0..4a15418f1252 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1138,7 +1138,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	if (!pte_try_get(mm, pmd))
+	if (!pte_try_get(pmd))
 		goto out;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
diff --git a/mm/madvise.c b/mm/madvise.c
index 1befb4e64f2b..254811f41850 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -191,7 +191,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
 
-	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 
 	for (index = start; index != end; index += PAGE_SIZE) {
@@ -392,7 +392,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 regular_page:
 #endif
-	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
@@ -592,7 +592,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
 
-	if (pmd_trans_unstable_or_pte_try_get(mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 	nr_put++;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56c580d37e94..956920f96191 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5810,7 +5810,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 
-	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
@@ -6030,7 +6030,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		return 0;
 	}
 retry:
-	if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
diff --git a/mm/memory.c b/mm/memory.c
index 99dde124755b..6a7fe29d593b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1142,7 +1142,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
-		if (!pte_try_get(src_mm, src_pmd))
+		if (!pte_try_get(src_pmd))
 			goto retry;
 		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
 				   addr, next)) {
@@ -1478,7 +1478,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		 * because MADV_DONTNEED holds the mmap_lock in read
 		 * mode.
 		 */
-		if (pmd_trans_unstable_or_pte_try_get(tlb->mm, pmd))
+		if (pmd_trans_unstable_or_pte_try_get(pmd))
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
 		pte_put(tlb->mm, pmd, addr);
@@ -2604,7 +2604,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 				continue;
 			pmd_clear_bad(pmd);
 		}
-		if (!create && !pte_try_get(mm, pmd))
+		if (!create && !pte_try_get(pmd))
 			goto retry;
 		err = apply_to_pte_range(mm, pmd, addr, next,
 					 fn, data, create, mask);
@@ -4077,7 +4077,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
 		/* See comment in handle_pte_fault() */
 		return 0;
-	} else if (!pte_try_get(vma->vm_mm, vmf->pmd)) {
+	} else if (!pte_try_get(vmf->pmd)) {
 		goto retry;
 	}
 
@@ -4320,7 +4320,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 			ret = VM_FAULT_SIGBUS;
 			goto out;
 		} else {
-			if (!pte_try_get(vma->vm_mm, vmf->pmd)) {
+			if (!pte_try_get(vmf->pmd)) {
 				ret = VM_FAULT_SIGBUS;
 				goto out;
 			}
@@ -4579,7 +4579,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		if (pmd_devmap_trans_unstable(vmf->pmd))
 			return 0;
 
-		if (!pte_try_get(vmf->vma->vm_mm, vmf->pmd))
+		if (!pte_try_get(vmf->pmd))
 			goto retry;
 
 		if (IS_ENABLED(CONFIG_FREE_USER_PTE))
@@ -5000,7 +5000,7 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 					(address & PAGE_MASK) + PAGE_SIZE);
 		mmu_notifier_invalidate_range_start(range);
 	}
-	if (!pte_try_get(mm, pmd))
+	if (!pte_try_get(pmd))
 		goto out;
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
 	if (!pte_present(*ptep))
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9d0493f80a75..9a6d1c845a93 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -518,7 +518,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	/* THP was split, fall through to pte walk */
 
-	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd))
+	if (pmd_trans_unstable_or_pte_try_get(pmd))
 		return 0;
 
 	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index af5b8900551b..eb0da7fb7033 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2298,7 +2298,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	if (unlikely(pmd_bad(*pmdp)))
 		return migrate_vma_collect_skip(start, end, walk);
 
-	if (!pte_try_get(mm, pmdp))
+	if (!pte_try_get(pmdp))
 		goto again;
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
diff --git a/mm/mincore.c b/mm/mincore.c
index a72ec90dd54f..e47fe60c6e04 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -114,7 +114,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	if (pmd_trans_unstable_or_pte_try_get(walk->mm, pmd)) {
+	if (pmd_trans_unstable_or_pte_try_get(pmd)) {
 		__mincore_unmapped_range(addr, end, vma, vec);
 		goto out;
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 53b412423ee8..4673604c709e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -279,7 +279,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
-		if (!pte_try_get(vma->vm_mm, pmd))
+		if (!pte_try_get(pmd))
 			goto retry;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 					      cp_flags);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eb84fa5825c0..c49bbff7aa60 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -259,7 +259,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
-		if (!pte_try_get(pvmw->vma->vm_mm, pvmw->pmd))
+		if (!pte_try_get(pvmw->pmd))
 			goto retry;
 		if (!map_pte(pvmw))
 			goto next_pte;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index da1324021429..97cd4e726a2b 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -153,7 +153,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 			err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
 		} else {
 			if (!walk->no_vma) {
-				if (!pte_try_get(walk->mm, pmd))
+				if (!pte_try_get(pmd))
 					goto retry;
 				err = walk_pte_range(pmd, addr, next, walk);
 				pte_put(walk->mm, pmd, addr);
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index dff32909c7c4..ea40b1777056 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -26,6 +26,14 @@ static inline void pte_free_debug(pmd_t pmd)
 }
 #endif
 
+static void pte_free_rcu(struct rcu_head *rcu)
+{
+	struct page *page = container_of(rcu, struct page, rcu_head);
+
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
 {
 	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
@@ -39,7 +47,7 @@ void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
 	pte_free_debug(pmd);
 	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
 	mm_dec_nr_ptes(mm);
-	pte_free(mm, pmd_pgtable(pmd));
+	call_rcu(&pmd_pgtable(pmd)->rcu_head, pte_free_rcu);
 }
 
 static inline void __pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6db8381e1e19..47e95aceedd5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2021,7 +2021,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
-		if (pmd_trans_unstable_or_pte_try_get(vma->vm_mm, pmd))
+		if (pmd_trans_unstable_or_pte_try_get(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
 				      frontswap, fs_pages_to_unuse);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 9/9] mm: use mmu_gather to free PTE page table
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (7 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism Qi Zheng
@ 2021-08-19  3:18 ` Qi Zheng
  2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
  9 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19  3:18 UTC (permalink / raw)
  To: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, Qi Zheng

In unmap_region() and other paths, we can reuse @tlb to
free PTE page table, which can reduce the number of tlb
flush.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/Kconfig        |  2 +-
 include/linux/pte_ref.h | 32 +++++++++++++++++++++++++++-----
 mm/madvise.c            |  4 ++--
 mm/memory.c             |  4 ++--
 mm/mmu_gather.c         | 40 +++++++++++++++++-----------------------
 mm/pte_ref.c            | 12 +++++++++---
 6 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 45962aaf2b2c..fc7453826160 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -232,7 +232,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select MMU_GATHER_RCU_TABLE_FREE		if PARAVIRT
+	select MMU_GATHER_RCU_TABLE_FREE		if PARAVIRT || FREE_USER_PTE
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 259e5aec048d..4b7ea1fe447f 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -29,7 +29,8 @@ int __pte_alloc_try_get(struct mm_struct *mm, pmd_t *pmd);
 
 #ifdef CONFIG_FREE_USER_PTE
 
-void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);
+void free_pte_table(struct mmu_gather *tlb, struct mm_struct *mm, pmd_t *pmdp,
+		    unsigned long addr);
 
 static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
 {
@@ -76,7 +77,6 @@ static inline bool pte_get_unless_zero(pmd_t *pmdp)
 {
 	pgtable_t pte = pmd_pgtable(*pmdp);
 
-	VM_BUG_ON(!PageTable(pte));
 	return atomic_inc_not_zero(&pte->pte_refcount);
 }
 
@@ -105,14 +105,26 @@ static inline bool pte_try_get(pmd_t *pmdp)
 	return retval;
 }
 
-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp,
-				unsigned long addr, unsigned int nr)
+static inline void pte_put_many_tlb(struct mmu_gather *tlb, struct mm_struct *mm,
+				    pmd_t *pmdp, unsigned long addr, unsigned int nr)
 {
 	pgtable_t pte = pmd_pgtable(*pmdp);
 
 	VM_BUG_ON(!PageTable(pte));
 	if (atomic_sub_and_test(nr, &pte->pte_refcount))
-		free_pte_table(mm, pmdp, addr & PMD_MASK);
+		free_pte_table(tlb, mm, pmdp, addr & PMD_MASK);
+}
+
+static inline void pte_put_tlb(struct mmu_gather *tlb, struct mm_struct *mm,
+			       pmd_t *pmdp, unsigned long addr)
+{
+	pte_put_many_tlb(tlb, mm, pmdp, addr, 1);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp,
+				unsigned long addr, unsigned int nr)
+{
+	pte_put_many_tlb(NULL, mm, pmdp, addr, nr);
 }
 
 /*
@@ -234,6 +246,16 @@ static inline bool pte_try_get(pmd_t *pmdp)
 	return true;
 }
 
+static inline void pte_put_many_tlb(struct mmu_gather *tlb, struct mm_struct *mm,
+				    pmd_t *pmdp, unsigned long addr, unsigned int nr)
+{
+}
+
+static inline void pte_put_tlb(struct mmu_gather *tlb, struct mm_struct *mm,
+			       pmd_t *pmdp, unsigned long addr)
+{
+}
+
 static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmdp,
 				unsigned long addr, unsigned int value)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 254811f41850..6616c0567dcb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -474,7 +474,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
-	pte_put(vma->vm_mm, pmd, start);
+	pte_put_tlb(tlb, vma->vm_mm, pmd, start);
 	if (pageout)
 		reclaim_pages(&page_list);
 	cond_resched();
@@ -705,7 +705,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	if (nr_put)
-		pte_put_many(mm, pmd, start, nr_put);
+		pte_put_many_tlb(tlb, mm, pmd, start, nr_put);
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 6a7fe29d593b..28639a75ce02 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,7 +1437,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	}
 
 	if (nr_put)
-		pte_put_many(mm, pmd, start, nr_put);
+		pte_put_many_tlb(tlb, mm, pmd, start, nr_put);
 
 	return addr;
 }
@@ -1481,7 +1481,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		if (pmd_trans_unstable_or_pte_try_get(pmd))
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-		pte_put(tlb->mm, pmd, addr);
+		pte_put_tlb(tlb, tlb->mm, pmd, addr);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 1b9837419bf9..1bd9fa889421 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -134,42 +134,42 @@ static void __tlb_remove_table_free(struct mmu_table_batch *batch)
  *
  */
 
-static void tlb_remove_table_smp_sync(void *arg)
+static void tlb_remove_table_rcu(struct rcu_head *head)
 {
-	/* Simply deliver the interrupt */
+	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
 }
 
-static void tlb_remove_table_sync_one(void)
+static void tlb_remove_table_free(struct mmu_table_batch *batch)
 {
-	/*
-	 * This isn't an RCU grace period and hence the page-tables cannot be
-	 * assumed to be actually RCU-freed.
-	 *
-	 * It is however sufficient for software page-table walkers that rely on
-	 * IRQ disabling.
-	 */
-	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	call_rcu(&batch->rcu, tlb_remove_table_rcu);
 }
 
-static void tlb_remove_table_rcu(struct rcu_head *head)
+static void tlb_remove_table_one_rcu(struct rcu_head *head)
 {
-	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
+	struct page *page = container_of(head, struct page, rcu_head);
+
+	__tlb_remove_table(page);
 }
 
-static void tlb_remove_table_free(struct mmu_table_batch *batch)
+static void tlb_remove_table_one(void *table)
 {
-	call_rcu(&batch->rcu, tlb_remove_table_rcu);
+	pgtable_t page = (pgtable_t)table;
+
+	call_rcu(&page->rcu_head, tlb_remove_table_one_rcu);
 }
 
 #else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
-static void tlb_remove_table_sync_one(void) { }
-
 static void tlb_remove_table_free(struct mmu_table_batch *batch)
 {
 	__tlb_remove_table_free(batch);
 }
 
+static void tlb_remove_table_one(void *table)
+{
+	__tlb_remove_table(table);
+}
+
 #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
 /*
@@ -187,12 +187,6 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 	}
 }
 
-static void tlb_remove_table_one(void *table)
-{
-	tlb_remove_table_sync_one();
-	__tlb_remove_table(table);
-}
-
 static void tlb_table_flush(struct mmu_gather *tlb)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index ea40b1777056..676923f3c7c8 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -10,6 +10,7 @@
 #include <linux/pte_ref.h>
 #include <linux/hugetlb.h>
 #include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #ifdef CONFIG_DEBUG_VM
 static void pte_free_debug(pmd_t pmd)
@@ -34,7 +35,8 @@ static void pte_free_rcu(struct rcu_head *rcu)
 	__free_page(page);
 }
 
-void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
+void free_pte_table(struct mmu_gather *tlb, struct mm_struct *mm,
+		    pmd_t *pmdp, unsigned long addr)
 {
 	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
 	spinlock_t *ptl;
@@ -45,9 +47,13 @@ void free_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr)
 	spin_unlock(ptl);
 
 	pte_free_debug(pmd);
-	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	if (!tlb) {
+		flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+		call_rcu(&pmd_pgtable(pmd)->rcu_head, pte_free_rcu);
+	} else {
+		pte_free_tlb(tlb, pmd_pgtable(pmd), addr);
+	}
 	mm_dec_nr_ptes(mm);
-	call_rcu(&pmd_pgtable(pmd)->rcu_head, pte_free_rcu);
 }
 
 static inline void __pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
-- 
2.11.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-08-19  3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
@ 2021-08-19  7:01   ` David Hildenbrand
  2021-08-19 10:18     ` [External] " Qi Zheng
  2021-09-01 13:53   ` Jason Gunthorpe
  1 sibling, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-08-19  7:01 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun

On 19.08.21 05:18, Qi Zheng wrote:
> Some malloc libraries(e.g. jemalloc or tcmalloc) usually
> allocate the amount of VAs by mmap() and do not unmap
> those VAs. They will use madvise(MADV_DONTNEED) to free
> physical memory if they want. But the page tables do not
> be freed by madvise(), so it can produce many page tables
> when the process touches an enormous virtual address space.
> 
> The following figures are a memory usage snapshot of one
> process which actually happened on our server:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> As we can see, the PTE page tables size is 110g, while the
> RES is 590g. In theory, the process only need 1.2g PTE page
> tables to map those physical memory. The reason why PTE page
> tables occupy a lot of memory is that madvise(MADV_DONTNEED)
> only empty the PTE and free physical memory but doesn't free
> the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory
> about 108g(best case). And the larger the difference between
> the size of VIRT and RES, the more memory we save.
> 
> In this patch series, we add a pte_refcount field to the
> struct page of page table to track how many users of PTE page
> table. Similar to the mechanism of page refcount, the user of
> PTE page table should hold a refcount to it before accessing.
> The PTE page table page will be freed when the last refcount
> is dropped.
> 
> While we access ->pte_refcount of a PTE page table, any of the
> following ensures the pmd entry corresponding to the PTE page
> table stability:
> 
> 	- mmap_lock
> 	- anon_lock
> 	- i_mmap_lock
> 	- parallel threads are excluded by other means which
> 	  can make ->pmd stable(e.g. gup case)
> 
> This patch does not support THP temporarily, it will be
> supported in the next patch.

Can you clarify (and document here) who exactly takes a reference on the 
page table? Do I understand correctly that

a) each !pte_none() entry inside a page table take a reference to the 
page it's containted in.
b) each page table walker temporarily grabs a page table reference
c) The PMD tables the PTE is referenced in (->currently only ever a 
single one) does *not* take a reference.

So if there are no PTE entries left and nobody walks the page tables, 
you can remove it? You should really extend the 
description/documentation to make it clearer how exactly it's supposed 
to work.


It feels kind of strange to not introduce the CONFIG_FREE_USER_PTE 
Kconfig option in this patch. At least it took me a while to identify it 
in the previous patch.

Maybe you should introduce the empty stubs and use them in a separate 
patch, and then have this patch just introduce CONFIG_FREE_USER_PTE 
along with the actual refcounting magic inside the !stub implementation.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [External] Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-08-19  7:01   ` David Hildenbrand
@ 2021-08-19 10:18     ` Qi Zheng
  0 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-19 10:18 UTC (permalink / raw)
  To: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun



On 2021/8/19 PM3:01, David Hildenbrand wrote:
>>
>> In this patch series, we add a pte_refcount field to the
>> struct page of page table to track how many users of PTE page
>> table. Similar to the mechanism of page refcount, the user of
>> PTE page table should hold a refcount to it before accessing.
>> The PTE page table page will be freed when the last refcount
>> is dropped.
>>
>> While we access ->pte_refcount of a PTE page table, any of the
>> following ensures the pmd entry corresponding to the PTE page
>> table stability:
>>
>>     - mmap_lock
>>     - anon_lock
>>     - i_mmap_lock
>>     - parallel threads are excluded by other means which
>>       can make ->pmd stable(e.g. gup case)
>>
>> This patch does not support THP temporarily, it will be
>> supported in the next patch.
> 
> Can you clarify (and document here) who exactly takes a reference on the 
> page table? Do I understand correctly that
> 
> a) each !pte_none() entry inside a page table take a reference to the 
> page it's containted in.
> b) each page table walker temporarily grabs a page table reference
> c) The PMD tables the PTE is referenced in (->currently only ever a 
> single one) does *not* take a reference.

Yes, both of the !pte_none() entry and the page table walker can be
regarded as users of the PTE page table, so they need to hold a
->pte_refcount during their life cycle. And the pte_refcount field
of struct page is only for PTE page table, so the PMD page tables does
*not* take a ->pte_refcount.

> 
> So if there are no PTE entries left and nobody walks the page tables, 
> you can remove it? You should really extend the 

Yes, if there are no PTE entries left and nobody walks the page tables,
which means there is no user, then we can remove it when we drop the
last ->pte_refcount.

> description/documentation to make it clearer how exactly it's supposed 
> to work
I'm sorry that there is no clear description of the usage of
pte_refcount, i will make a documentation to describe it.

> 
> 
> It feels kind of strange to not introduce the CONFIG_FREE_USER_PTE 
> Kconfig option in this patch. At least it took me a while to identify it 
> in the previous patch.

The introduction of the CONFIG_FREE_USER_PTE and related APIs are all
place in the previous patch ([PATCH v2 5/9] mm: pte_refcount
infrastructure). And in this and next patch, we use these
infrastructures to free user PTE page table pages.

> 
> Maybe you should introduce the empty stubs and use them in a separate 
> patch, and then have this patch just introduce CONFIG_FREE_USER_PTE 
> along with the actual refcounting magic inside the !stub implementation.
> 
Hmm, let me think about this suggestion.

Thanks,

Qi


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/9] mm: introduce pmd_install() helper
  2021-08-19  3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
@ 2021-08-24 16:26   ` David Hildenbrand
  2021-08-25 16:20     ` Qi Zheng
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-08-24 16:26 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun

On 19.08.21 05:18, Qi Zheng wrote:
> Currently we have three times the same few lines repeated in the
> code. Deduplicate them by newly introduced pmd_install() helper.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   include/linux/mm.h |  1 +
>   mm/filemap.c       | 11 ++---------
>   mm/memory.c        | 34 ++++++++++++++++------------------
>   3 files changed, 19 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ce8fc0fd6d6e..57e48217bd71 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2471,6 +2471,7 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>   	return ptl;
>   }
>   
> +extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
>   extern void __init pagecache_init(void);
>   extern void __init free_area_init_memoryless_node(int nid);
>   extern void free_initmem(void);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 53913fced7ae..9f773059c6dc 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3210,15 +3210,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
>   	    }
>   	}
>   
> -	if (pmd_none(*vmf->pmd)) {
> -		vmf->ptl = pmd_lock(mm, vmf->pmd);
> -		if (likely(pmd_none(*vmf->pmd))) {
> -			mm_inc_nr_ptes(mm);
> -			pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
> -			vmf->prealloc_pte = NULL;
> -		}
> -		spin_unlock(vmf->ptl);
> -	}
> +	if (pmd_none(*vmf->pmd))
> +		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
>   
>   	/* See comment in handle_pte_fault() */
>   	if (pmd_devmap_trans_unstable(vmf->pmd)) {
> diff --git a/mm/memory.c b/mm/memory.c
> index 39e7a1495c3c..ef7b1762e996 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -433,9 +433,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   	}
>   }
>   
> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
> +{
> +	spinlock_t *ptl = pmd_lock(mm, pmd);
> +
> +	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
> +		mm_inc_nr_ptes(mm);
> +		pmd_populate(mm, pmd, *pte);
> +		*pte = NULL;
> +	}
> +	spin_unlock(ptl);
> +}
> +
>   int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>   {
> -	spinlock_t *ptl;
>   	pgtable_t new = pte_alloc_one(mm);
>   	if (!new)
>   		return -ENOMEM;
> @@ -455,13 +466,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>   	 */
>   	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>   
> -	ptl = pmd_lock(mm, pmd);
> -	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
> -		mm_inc_nr_ptes(mm);
> -		pmd_populate(mm, pmd, new);
> -		new = NULL;
> -	}
> -	spin_unlock(ptl);
> +	pmd_install(mm, pmd, &new);
>   	if (new)
>   		pte_free(mm, new);
>   	return 0;
> @@ -4027,17 +4032,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   				return ret;
>   		}
>   
> -		if (vmf->prealloc_pte) {
> -			vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -			if (likely(pmd_none(*vmf->pmd))) {
> -				mm_inc_nr_ptes(vma->vm_mm);
> -				pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> -				vmf->prealloc_pte = NULL;
> -			}
> -			spin_unlock(vmf->ptl);
> -		} else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
> +		if (vmf->prealloc_pte)
> +			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
> +		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
>   			return VM_FAULT_OOM;
> -		}
>   	}
>   
>   	/* See comment in handle_pte_fault() */
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

That's mostly unrelated to the remaining part of the series and can be 
picked up early.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/9] mm: introduce pmd_install() helper
  2021-08-24 16:26   ` David Hildenbrand
@ 2021-08-25 16:20     ` Qi Zheng
  2021-08-25 16:32       ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Qi Zheng @ 2021-08-25 16:20 UTC (permalink / raw)
  To: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun



On 2021/8/25 AM12:26, David Hildenbrand wrote:
> On 19.08.21 05:18, Qi Zheng wrote:
>> Currently we have three times the same few lines repeated in the
>> code. Deduplicate them by newly introduced pmd_install() helper.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/mm.h |  1 +
>>   mm/filemap.c       | 11 ++---------
>>   mm/memory.c        | 34 ++++++++++++++++------------------
>>   3 files changed, 19 insertions(+), 27 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ce8fc0fd6d6e..57e48217bd71 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2471,6 +2471,7 @@ static inline spinlock_t *pud_lock(struct 
>> mm_struct *mm, pud_t *pud)
>>       return ptl;
>>   }
>> +extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t 
>> *pte);
>>   extern void __init pagecache_init(void);
>>   extern void __init free_area_init_memoryless_node(int nid);
>>   extern void free_initmem(void);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 53913fced7ae..9f773059c6dc 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3210,15 +3210,8 @@ static bool filemap_map_pmd(struct vm_fault 
>> *vmf, struct page *page)
>>           }
>>       }
>> -    if (pmd_none(*vmf->pmd)) {
>> -        vmf->ptl = pmd_lock(mm, vmf->pmd);
>> -        if (likely(pmd_none(*vmf->pmd))) {
>> -            mm_inc_nr_ptes(mm);
>> -            pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
>> -            vmf->prealloc_pte = NULL;
>> -        }
>> -        spin_unlock(vmf->ptl);
>> -    }
>> +    if (pmd_none(*vmf->pmd))
>> +        pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
>>       /* See comment in handle_pte_fault() */
>>       if (pmd_devmap_trans_unstable(vmf->pmd)) {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 39e7a1495c3c..ef7b1762e996 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -433,9 +433,20 @@ void free_pgtables(struct mmu_gather *tlb, struct 
>> vm_area_struct *vma,
>>       }
>>   }
>> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>> +{
>> +    spinlock_t *ptl = pmd_lock(mm, pmd);
>> +
>> +    if (likely(pmd_none(*pmd))) {    /* Has another populated it ? */
>> +        mm_inc_nr_ptes(mm);
>> +        pmd_populate(mm, pmd, *pte);
>> +        *pte = NULL;
>> +    }
>> +    spin_unlock(ptl);
>> +}
>> +
>>   int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>>   {
>> -    spinlock_t *ptl;
>>       pgtable_t new = pte_alloc_one(mm);
>>       if (!new)
>>           return -ENOMEM;
>> @@ -455,13 +466,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>>        */
>>       smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>> -    ptl = pmd_lock(mm, pmd);
>> -    if (likely(pmd_none(*pmd))) {    /* Has another populated it ? */
>> -        mm_inc_nr_ptes(mm);
>> -        pmd_populate(mm, pmd, new);
>> -        new = NULL;
>> -    }
>> -    spin_unlock(ptl);
>> +    pmd_install(mm, pmd, &new);
>>       if (new)
>>           pte_free(mm, new);
>>       return 0;
>> @@ -4027,17 +4032,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>                   return ret;
>>           }
>> -        if (vmf->prealloc_pte) {
>> -            vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>> -            if (likely(pmd_none(*vmf->pmd))) {
>> -                mm_inc_nr_ptes(vma->vm_mm);
>> -                pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
>> -                vmf->prealloc_pte = NULL;
>> -            }
>> -            spin_unlock(vmf->ptl);
>> -        } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
>> +        if (vmf->prealloc_pte)
>> +            pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
>> +        else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
>>               return VM_FAULT_OOM;
>> -        }
>>       }
>>       /* See comment in handle_pte_fault() */
>>
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks for your review, I will add this to the patch v3.

> 
> That's mostly unrelated to the remaining part of the series and can be 
> picked up early.

The implementation of subsequent patches depends on pmd_install().
So I am worried that if this patch is submitted as a separate patch,
subsequent patches will not be updated until this patch is merged.
What do you think?

> 



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/9] mm: introduce pmd_install() helper
  2021-08-25 16:20     ` Qi Zheng
@ 2021-08-25 16:32       ` David Hildenbrand
  2021-08-26  3:04         ` Qi Zheng
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-08-25 16:32 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun

On 25.08.21 18:20, Qi Zheng wrote:
> 
> 
> On 2021/8/25 AM12:26, David Hildenbrand wrote:
>> On 19.08.21 05:18, Qi Zheng wrote:
>>> Currently we have three times the same few lines repeated in the
>>> code. Deduplicate them by newly introduced pmd_install() helper.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>    include/linux/mm.h |  1 +
>>>    mm/filemap.c       | 11 ++---------
>>>    mm/memory.c        | 34 ++++++++++++++++------------------
>>>    3 files changed, 19 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index ce8fc0fd6d6e..57e48217bd71 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2471,6 +2471,7 @@ static inline spinlock_t *pud_lock(struct
>>> mm_struct *mm, pud_t *pud)
>>>        return ptl;
>>>    }
>>> +extern void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t
>>> *pte);
>>>    extern void __init pagecache_init(void);
>>>    extern void __init free_area_init_memoryless_node(int nid);
>>>    extern void free_initmem(void);
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 53913fced7ae..9f773059c6dc 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -3210,15 +3210,8 @@ static bool filemap_map_pmd(struct vm_fault
>>> *vmf, struct page *page)
>>>            }
>>>        }
>>> -    if (pmd_none(*vmf->pmd)) {
>>> -        vmf->ptl = pmd_lock(mm, vmf->pmd);
>>> -        if (likely(pmd_none(*vmf->pmd))) {
>>> -            mm_inc_nr_ptes(mm);
>>> -            pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
>>> -            vmf->prealloc_pte = NULL;
>>> -        }
>>> -        spin_unlock(vmf->ptl);
>>> -    }
>>> +    if (pmd_none(*vmf->pmd))
>>> +        pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
>>>        /* See comment in handle_pte_fault() */
>>>        if (pmd_devmap_trans_unstable(vmf->pmd)) {
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 39e7a1495c3c..ef7b1762e996 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -433,9 +433,20 @@ void free_pgtables(struct mmu_gather *tlb, struct
>>> vm_area_struct *vma,
>>>        }
>>>    }
>>> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>>> +{
>>> +    spinlock_t *ptl = pmd_lock(mm, pmd);
>>> +
>>> +    if (likely(pmd_none(*pmd))) {    /* Has another populated it ? */
>>> +        mm_inc_nr_ptes(mm);
>>> +        pmd_populate(mm, pmd, *pte);
>>> +        *pte = NULL;
>>> +    }
>>> +    spin_unlock(ptl);
>>> +}
>>> +
>>>    int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>>>    {
>>> -    spinlock_t *ptl;
>>>        pgtable_t new = pte_alloc_one(mm);
>>>        if (!new)
>>>            return -ENOMEM;
>>> @@ -455,13 +466,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>>>         */
>>>        smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>>> -    ptl = pmd_lock(mm, pmd);
>>> -    if (likely(pmd_none(*pmd))) {    /* Has another populated it ? */
>>> -        mm_inc_nr_ptes(mm);
>>> -        pmd_populate(mm, pmd, new);
>>> -        new = NULL;
>>> -    }
>>> -    spin_unlock(ptl);
>>> +    pmd_install(mm, pmd, &new);
>>>        if (new)
>>>            pte_free(mm, new);
>>>        return 0;
>>> @@ -4027,17 +4032,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>                    return ret;
>>>            }
>>> -        if (vmf->prealloc_pte) {
>>> -            vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>> -            if (likely(pmd_none(*vmf->pmd))) {
>>> -                mm_inc_nr_ptes(vma->vm_mm);
>>> -                pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
>>> -                vmf->prealloc_pte = NULL;
>>> -            }
>>> -            spin_unlock(vmf->ptl);
>>> -        } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
>>> +        if (vmf->prealloc_pte)
>>> +            pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
>>> +        else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
>>>                return VM_FAULT_OOM;
>>> -        }
>>>        }
>>>        /* See comment in handle_pte_fault() */
>>>
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Thanks for your review, I will add this to the patch v3.
> 
>>
>> That's mostly unrelated to the remaining part of the series and can be
>> picked up early.
> 
> The implementation of subsequent patches depends on pmd_install().
> So I am worried that if this patch is submitted as a separate patch,
> subsequent patches will not be updated until this patch is merged.
> What do you think?

Usually I tend to send cleanups out independently, and then just base 
the other series on top of the other series.

I'll have some more comments in reply to v2. It's fairly hard to review 
because you do a lot of complicated stuff in only a handful of patches 
:) I'll try to think of something reasonable on how to split this up to 
make it easier to digest.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/9] mm: introduce pmd_install() helper
  2021-08-25 16:32       ` David Hildenbrand
@ 2021-08-26  3:04         ` Qi Zheng
  0 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-08-26  3:04 UTC (permalink / raw)
  To: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun



On 2021/8/26 AM12:32, David Hildenbrand wrote:
> On 25.08.21 18:20, Qi Zheng wrote:
>>
>>
>> On 2021/8/25 AM12:26, David Hildenbrand wrote:
>>> On 19.08.21 05:18, Qi Zheng wrote:
>>>> Currently we have three times the same few lines repeated in the
>>>
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>
>> Thanks for your review, I will add this to the patch v3.
>>
>>>
>>> That's mostly unrelated to the remaining part of the series and can be
>>> picked up early.
>>
>> The implementation of subsequent patches depends on pmd_install().
>> So I am worried that if this patch is submitted as a separate patch,
>> subsequent patches will not be updated until this patch is merged.
>> What do you think?
> 
> Usually I tend to send cleanups out independently, and then just base 
> the other series on top of the other series.

LGTM, I will submit [PATCH v2 1/9] and [PATCH v2 2/9] as a separate
patch series.

> 
> I'll have some more comments in reply to v2. It's fairly hard to revie > because you do a lot of complicated stuff in only a handful of patches
> :) I'll try to think of something reasonable on how to split this up to 
> make it easier to digest.
> 

Thank you very much, and I look forward to your suggestions. At the same
time, I'll also find ways to organize the code more clearly and
concisely, and add documentation to explain the new APIs.

Thanks,
Qi

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
                   ` (8 preceding siblings ...)
  2021-08-19  3:18 ` [PATCH v2 9/9] mm: use mmu_gather to free PTE page table Qi Zheng
@ 2021-09-01 12:32 ` David Hildenbrand
  2021-09-01 16:07   ` Jason Gunthorpe
                     ` (2 more replies)
  9 siblings, 3 replies; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 12:32 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun

On 19.08.21 05:18, Qi Zheng wrote:
> Hi,
> 
> This patch series aims to free user PTE page table pages when all PTE entries
> are empty.
> 
> The beginning of this story is that some malloc libraries(e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those VAs.
> They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables do not be freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.
> 
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
> 
> 	VIRT:  55t
> 	RES:   590g
> 	VmPTE: 110g
> 
> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
> theory, the process only need 1.2g PTE page tables to map those physical
> memory. The reason why PTE page tables occupy a lot of memory is that
> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
> doesn't free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory about 108g(best
> case). And the larger the difference between the size of VIRT and RES, the
> more memory we save.
> 
> In this patch series, we add a pte_refcount field to the struct page of page
> table to track how many users of PTE page table. Similar to the mechanism of
> page refcount, the user of PTE page table should hold a refcount to it before
> accessing. The PTE page table page will be freed when the last refcount is
> dropped.
> 
> Testing:
> 
> The following code snippet can show the effect of optimization:
> 
> 	mmap 50G
> 	while (1) {
> 		for (; i < 1024 * 25; i++) {
> 			touch 2M memory
> 			madvise MADV_DONTNEED 2M
> 		}
> 	}
> 
> As we can see, the memory usage of VmPTE is reduced:
> 
> 			before		                after
> VIRT		       50.0 GB			      50.0 GB
> RES		        3.1 MB			       3.6 MB
> VmPTE		     102640 kB			       248 kB
> 
> I also have tested the stability by LTP[1] for several weeks. I have not seen
> any crash so far.
> 
> The performance of page fault can be affected because of the allocation/freeing
> of PTE page table pages. The following is the test result by using a micro
> benchmark[2]:
> 
> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
> 
> threads         before (pf/min)                     after (pf/min)
>      1                32,085,255                         31,880,833 (-0.64%)
>      8               101,674,967                        100,588,311 (-1.17%)
>     16               113,207,000                        112,801,832 (-0.36%)
> 
> (The "pfn/min" means how many page faults in one minute.)
> 
> The performance of page fault is ~1% slower than before.
> 
> This series is based on next-20210812.
> 
> Comments and suggestions are welcome.
> 
> Thanks,
> Qi.
> 


Some high-level feedback after studying the code:

1. Try introducing the new dummy primitives ("API") first, and then 
convert each subsystem individually; especially, maybe convert the whole 
pagefault handling in a single patch, because it's far from trivial. 
This will make this series much easier to digest.

Then, have a patch that adds actual logic to the dummy primitives via a 
config option.

2. Minimize the API.

a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to 
pte_alloc_get().

b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.

Handle it independently for now, even if it implies duplicate runtime 
checks.

if (pmd_trans_unstable() || !pte_try_get()) ...

We can always optimize later, once we can come up with something cleaner.

3. Merge #6, and #7, after factoring out all changes to other subsystems 
to use the API

4. Merge #8 into #6. There is a lot of unnecessary code churn back and 
forth, and IMHO the whole approach might not make sense without RCU due 
to the additional locking overhead.

Or at least, try to not modify the API you introduced in patch #6 or #7 
in #8 again. Converting all call sites back and forth just makes review 
quite hard.


I am preparing some some cleanups that will make get_locked_pte() and 
similar a little nicer to handle. I'll send them out this or next week.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-08-19  3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
  2021-08-19  7:01   ` David Hildenbrand
@ 2021-09-01 13:53   ` Jason Gunthorpe
  2021-09-01 13:57     ` David Hildenbrand
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 13:53 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, david, linux-doc, linux-kernel, linux-mm,
	songmuchun

On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:

> diff --git a/mm/gup.c b/mm/gup.c
> index 2630ed1bb4f4..30757f3b176c 100644
> +++ b/mm/gup.c
> @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	if (unlikely(pmd_bad(*pmd)))
>  		return no_page_table(vma, flags);
>  
> +	if (!pte_try_get(mm, pmd))
> +		return no_page_table(vma, flags);
> +
>  	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);

This is not good on a performance path, the pte_try_get() is
locking/locking the same lock that pte_offset_map_lock() is getting.

This would be much better if the map_lock infra could manage the
refcount itself.

I'm also not really keen on adding ptl level locking to all the
currently no-lock paths. If we are doing that then the no-lock paths
should rely on the ptl for alot more of their operations and avoid the
complicatred no-lock data access we have. eg 'pte_try_get()' should
also copy the pte_t under the lock.

Also, I don't really understand how this scheme works with
get_user_pages_fast.

Currently the zap triggers a TLB invalidation which synchronizes with
GUP fast, however this only makes the ptes non-present. The purpose is
to synchronize with the struct page refcount, not a pte refcount.

With this series the non-present PTEs are freed but how does this
synchronize with gup fast to avoid a use-after-free on the pte struct
page?

I agree with David, this series needs significant splitting to be
readable and a lot more explanation in the commit messages how all the
locking is working. Eg introducing the freeing should be a single
short patch at at end with a full explanation of the locking in all
the major scenarios.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 13:53   ` Jason Gunthorpe
@ 2021-09-01 13:57     ` David Hildenbrand
  2021-09-01 15:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 13:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Qi Zheng
  Cc: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun

On 01.09.21 15:53, Jason Gunthorpe wrote:
> On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:
> 
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 2630ed1bb4f4..30757f3b176c 100644
>> +++ b/mm/gup.c
>> @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>   	if (unlikely(pmd_bad(*pmd)))
>>   		return no_page_table(vma, flags);
>>   
>> +	if (!pte_try_get(mm, pmd))
>> +		return no_page_table(vma, flags);
>> +
>>   	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> 
> This is not good on a performance path, the pte_try_get() is
> locking/locking the same lock that pte_offset_map_lock() is getting.

Yes, and we really need patch #8, anything else is just confusing reviewers.

> 
> This would be much better if the map_lock infra could manage the
> refcount itself.
> 
> I'm also not really keen on adding ptl level locking to all the
> currently no-lock paths. If we are doing that then the no-lock paths
> should rely on the ptl for alot more of their operations and avoid the
> complicatred no-lock data access we have. eg 'pte_try_get()' should
> also copy the pte_t under the lock.
> 
> Also, I don't really understand how this scheme works with
> get_user_pages_fast.

With the RCU change it in #8 it should work just fine, because RCU 
synchronize has to wait either until all other CPUs have left the RCU 
read section, or re-enabled interrupts.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 13:57     ` David Hildenbrand
@ 2021-09-01 15:32       ` Jason Gunthorpe
  2021-09-01 16:13         ` David Hildenbrand
  2021-09-02  6:53         ` Qi Zheng
  0 siblings, 2 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 15:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 01, 2021 at 03:57:09PM +0200, David Hildenbrand wrote:
> On 01.09.21 15:53, Jason Gunthorpe wrote:
> > On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:
> > 
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 2630ed1bb4f4..30757f3b176c 100644
> > > +++ b/mm/gup.c
> > > @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > >   	if (unlikely(pmd_bad(*pmd)))
> > >   		return no_page_table(vma, flags);
> > > +	if (!pte_try_get(mm, pmd))
> > > +		return no_page_table(vma, flags);
> > > +
> > >   	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> > 
> > This is not good on a performance path, the pte_try_get() is
> > locking/locking the same lock that pte_offset_map_lock() is getting.
> 
> Yes, and we really need patch #8, anything else is just confusing reviewers.

It is a bit better with patch 8, but it is still not optimal, we don't
need to do the atomic work at all if the entire ptep is accessed while
locked. So the above is stil not what I would expect here, even with
RCU.

eg I would expect that this kind of change would work first with the
existing paired acessors, ie

	pte = pte_offset_map(pmd, address);
	pte_unmap(pte);

Should handle the refcount under the covers, and same kind of idea for
the _locked/_unlocked varient.

Only places that don't already use that pairing should get modified.

To do this we have to extend the API so that pte_offset_map() can
fail, or very cleverly return some kind of global non-present pte page
(I wonder if the zero page would work?)

> > Also, I don't really understand how this scheme works with
> > get_user_pages_fast.
> 
> With the RCU change it in #8 it should work just fine, because RCU
> synchronize has to wait either until all other CPUs have left the RCU read
> section, or re-enabled interrupts.

So at this point in the series fast gup is broken, that does mean the
series presentation really needs to be reworked. The better
presentation is to add the API changes, with a
no-functional-difference implementation, push the new API in well
split patches to all the consumption sites, then change the API to
have the new semantics.

RCU and refcount to free the page levels seems like a reasonable
approach, but I have to say I haven't thought it through fully - are
all the contexts that have the pte deref safe to do call_rcu?

Jason



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
@ 2021-09-01 16:07   ` Jason Gunthorpe
  2021-09-01 16:10     ` David Hildenbrand
  2021-09-02  3:37   ` Qi Zheng
  2021-09-15 14:52   ` Qi Zheng
  2 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 16:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 01, 2021 at 02:32:08PM +0200, David Hildenbrand wrote:
 
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.

I suspect the good API here is really more like:

  ptep = pte_try_map(pmdp, &pmd_value)
  if (!ptep) {
     // pmd_value is guarenteed to not be a PTE table pointer.
     if (pmd_XXX(pmd_value))
  }

Ie the core code will do whatever stuff, including the THP data race
avoidance, to either return the next level page table or the value of
a pmd that is not a enxt level page table. Callers are much clearer in
this way.

Eg this is a fairly representative sample user:

static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
			   struct mm_walk *walk)
{
	if (pmd_trans_unstable(pmd))
		goto out;
	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

And it is obviously pretty easy to integrate any refcount into
pte_try_map and pte_unmap as in my other email.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-01 16:07   ` Jason Gunthorpe
@ 2021-09-01 16:10     ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On 01.09.21 18:07, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 02:32:08PM +0200, David Hildenbrand wrote:
>   
>> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
> 
> I suspect the good API here is really more like:

That was my exactly my first idea and I tried to rework the code for 
roughly 2 days and failed.

Especially in pagefault logic, we temporarily unmap/unlock to map/lock 
again later and don't want the page table to just vanish.

I think I met similar cases when allocating a page table and not wanting 
it to vanish and not wanting to map/lock it. But I don't recall all the 
corner cases: it didn't work for me.

> 
>    ptep = pte_try_map(pmdp, &pmd_value)
>    if (!ptep) {
>       // pmd_value is guarenteed to not be a PTE table pointer.
>       if (pmd_XXX(pmd_value))
>    }
> 
> Ie the core code will do whatever stuff, including the THP data race
> avoidance, to either return the next level page table or the value of
> a pmd that is not a enxt level page table. Callers are much clearer in
> this way.
> 
> Eg this is a fairly representative sample user:
> 
> static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> 			   struct mm_walk *walk)
> {
> 	if (pmd_trans_unstable(pmd))
> 		goto out;
> 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> 
> And it is obviously pretty easy to integrate any refcount into
> pte_try_map and pte_unmap as in my other email.

It didn't work when I tried.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 15:32       ` Jason Gunthorpe
@ 2021-09-01 16:13         ` David Hildenbrand
  2021-09-01 16:16           ` Jason Gunthorpe
  2021-09-02  6:53         ` Qi Zheng
  1 sibling, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On 01.09.21 17:32, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 03:57:09PM +0200, David Hildenbrand wrote:
>> On 01.09.21 15:53, Jason Gunthorpe wrote:
>>> On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:
>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 2630ed1bb4f4..30757f3b176c 100644
>>>> +++ b/mm/gup.c
>>>> @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>>    	if (unlikely(pmd_bad(*pmd)))
>>>>    		return no_page_table(vma, flags);
>>>> +	if (!pte_try_get(mm, pmd))
>>>> +		return no_page_table(vma, flags);
>>>> +
>>>>    	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>>>
>>> This is not good on a performance path, the pte_try_get() is
>>> locking/locking the same lock that pte_offset_map_lock() is getting.
>>
>> Yes, and we really need patch #8, anything else is just confusing reviewers.
> 
> It is a bit better with patch 8, but it is still not optimal, we don't
> need to do the atomic work at all if the entire ptep is accessed while
> locked. So the above is stil not what I would expect here, even with
> RCU.
> 
> eg I would expect that this kind of change would work first with the
> existing paired acessors, ie
> 
> 	pte = pte_offset_map(pmd, address);
> 	pte_unmap(pte);
> 
> Should handle the refcount under the covers, and same kind of idea for
> the _locked/_unlocked varient.

See my other mail.

> 
> Only places that don't already use that pairing should get modified.
> 
> To do this we have to extend the API so that pte_offset_map() can
> fail, or very cleverly return some kind of global non-present pte page
> (I wonder if the zero page would work?)

I explored both ideas (returning NULL, return a specially prepared page) 
and it didn't work in some cases where we unmap+remap etc.

> 
>>> Also, I don't really understand how this scheme works with
>>> get_user_pages_fast.
>>
>> With the RCU change it in #8 it should work just fine, because RCU
>> synchronize has to wait either until all other CPUs have left the RCU read
>> section, or re-enabled interrupts.
> 
> So at this point in the series fast gup is broken, that does mean the
> series presentation really needs to be reworked. The better
> presentation is to add the API changes, with a
> no-functional-difference implementation, push the new API in well
> split patches to all the consumption sites, then change the API to
> have the new semantics.

Exactly my thoughts.

> 
> RCU and refcount to free the page levels seems like a reasonable
> approach, but I have to say I haven't thought it through fully - are
> all the contexts that have the pte deref safe to do call_rcu?


Very good question. I'd assume so.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 16:13         ` David Hildenbrand
@ 2021-09-01 16:16           ` Jason Gunthorpe
  2021-09-01 16:19             ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 16:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 01, 2021 at 06:13:07PM +0200, David Hildenbrand wrote:
> On 01.09.21 17:32, Jason Gunthorpe wrote:
> > On Wed, Sep 01, 2021 at 03:57:09PM +0200, David Hildenbrand wrote:
> > > On 01.09.21 15:53, Jason Gunthorpe wrote:
> > > > On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:
> > > > 
> > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > index 2630ed1bb4f4..30757f3b176c 100644
> > > > > +++ b/mm/gup.c
> > > > > @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > > > >    	if (unlikely(pmd_bad(*pmd)))
> > > > >    		return no_page_table(vma, flags);
> > > > > +	if (!pte_try_get(mm, pmd))
> > > > > +		return no_page_table(vma, flags);
> > > > > +
> > > > >    	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> > > > 
> > > > This is not good on a performance path, the pte_try_get() is
> > > > locking/locking the same lock that pte_offset_map_lock() is getting.
> > > 
> > > Yes, and we really need patch #8, anything else is just confusing reviewers.
> > 
> > It is a bit better with patch 8, but it is still not optimal, we don't
> > need to do the atomic work at all if the entire ptep is accessed while
> > locked. So the above is stil not what I would expect here, even with
> > RCU.
> > 
> > eg I would expect that this kind of change would work first with the
> > existing paired acessors, ie
> > 
> > 	pte = pte_offset_map(pmd, address);
> > 	pte_unmap(pte);
> > 
> > Should handle the refcount under the covers, and same kind of idea for
> > the _locked/_unlocked varient.
> 
> See my other mail.

Do you have a reference?

> > Only places that don't already use that pairing should get modified.
> > 
> > To do this we have to extend the API so that pte_offset_map() can
> > fail, or very cleverly return some kind of global non-present pte page
> > (I wonder if the zero page would work?)
> 
> I explored both ideas (returning NULL, return a specially prepared page) and
> it didn't work in some cases where we unmap+remap etc.

I wouldn't think it works everywhere, bit it works in a lot of places,
and it is a heck of a lot better than what is proposed here. I'd
rather see the places that can use it be moved, and the few places
that can't be opencoded.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 16:16           ` Jason Gunthorpe
@ 2021-09-01 16:19             ` David Hildenbrand
  2021-09-01 17:10               ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 16:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On 01.09.21 18:16, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 06:13:07PM +0200, David Hildenbrand wrote:
>> On 01.09.21 17:32, Jason Gunthorpe wrote:
>>> On Wed, Sep 01, 2021 at 03:57:09PM +0200, David Hildenbrand wrote:
>>>> On 01.09.21 15:53, Jason Gunthorpe wrote:
>>>>> On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:
>>>>>
>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>> index 2630ed1bb4f4..30757f3b176c 100644
>>>>>> +++ b/mm/gup.c
>>>>>> @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>>>>     	if (unlikely(pmd_bad(*pmd)))
>>>>>>     		return no_page_table(vma, flags);
>>>>>> +	if (!pte_try_get(mm, pmd))
>>>>>> +		return no_page_table(vma, flags);
>>>>>> +
>>>>>>     	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>>>>>
>>>>> This is not good on a performance path, the pte_try_get() is
>>>>> locking/locking the same lock that pte_offset_map_lock() is getting.
>>>>
>>>> Yes, and we really need patch #8, anything else is just confusing reviewers.
>>>
>>> It is a bit better with patch 8, but it is still not optimal, we don't
>>> need to do the atomic work at all if the entire ptep is accessed while
>>> locked. So the above is stil not what I would expect here, even with
>>> RCU.
>>>
>>> eg I would expect that this kind of change would work first with the
>>> existing paired acessors, ie
>>>
>>> 	pte = pte_offset_map(pmd, address);
>>> 	pte_unmap(pte);
>>>
>>> Should handle the refcount under the covers, and same kind of idea for
>>> the _locked/_unlocked varient.
>>
>> See my other mail.
> 
> Do you have a reference?

Reply to the other mail you just send.

> 
>>> Only places that don't already use that pairing should get modified.
>>>
>>> To do this we have to extend the API so that pte_offset_map() can
>>> fail, or very cleverly return some kind of global non-present pte page
>>> (I wonder if the zero page would work?)
>>
>> I explored both ideas (returning NULL, return a specially prepared page) and
>> it didn't work in some cases where we unmap+remap etc.
> 
> I wouldn't think it works everywhere, bit it works in a lot of places,
> and it is a heck of a lot better than what is proposed here. I'd
> rather see the places that can use it be moved, and the few places
> that can't be opencoded.

Well, I used ptep_get_map_lock() and friends. But hacking directly into 
ptep_map_lock() and friends wasn't possible due to all the corner cases.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 16:19             ` David Hildenbrand
@ 2021-09-01 17:10               ` Jason Gunthorpe
  2021-09-01 17:49                 ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 01, 2021 at 06:19:03PM +0200, David Hildenbrand wrote:

> > I wouldn't think it works everywhere, bit it works in a lot of places,
> > and it is a heck of a lot better than what is proposed here. I'd
> > rather see the places that can use it be moved, and the few places
> > that can't be opencoded.
> 
> Well, I used ptep_get_map_lock() and friends. But hacking directly into
> ptep_map_lock() and friends wasn't possible due to all the corner cases.

Sure, I'm not surprised you can't get every single case, but that just
suggest we need two API families, today's to support the special cases
and a different one for the other regular simple cases. 

A new function family pte_try_map/_locked() and paired unmap that can
internally do the recounting and THP trickery and convert the easy
callsites.

Very rough counting suggest at least half of the pte_offset_map_lock()
call sites can trivially use the simpler API.

The other cases can stay as is and get open coded refcounts, or maybe
someone will have a better idea once they are more clearly identified.

But I don't think we should take a performance hit of additional
atomics in cases like GUP where this is trivially delt with by using a
better API.

I'd start with a series to pull pmd_trans_unstable() into some
pte_try_map() and it looks like about 25 call sites can be trivially
converted.

Several more can be converted but need a bit of work.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 17:10               ` Jason Gunthorpe
@ 2021-09-01 17:49                 ` David Hildenbrand
  2021-09-01 17:55                   ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 17:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On 01.09.21 19:10, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 06:19:03PM +0200, David Hildenbrand wrote:
> 
>>> I wouldn't think it works everywhere, bit it works in a lot of places,
>>> and it is a heck of a lot better than what is proposed here. I'd
>>> rather see the places that can use it be moved, and the few places
>>> that can't be opencoded.
>>
>> Well, I used ptep_get_map_lock() and friends. But hacking directly into
>> ptep_map_lock() and friends wasn't possible due to all the corner cases.
> 
> Sure, I'm not surprised you can't get every single case, but that just
> suggest we need two API families, today's to support the special cases
> and a different one for the other regular simple cases.
> 
> A new function family pte_try_map/_locked() and paired unmap that can
> internally do the recounting and THP trickery and convert the easy
> callsites.
> 
> Very rough counting suggest at least half of the pte_offset_map_lock()
> call sites can trivially use the simpler API.
> 
> The other cases can stay as is and get open coded refcounts, or maybe
> someone will have a better idea once they are more clearly identified.
> 
> But I don't think we should take a performance hit of additional
> atomics in cases like GUP where this is trivially delt with by using a
> better API.

Right, but as I said in the cover letter, we can happily optimize once 
we have the basic infrastructure in place and properly reviewed. Getting 
rid of some unnecessary atomics by introducing additional fancy helpers 
falls under that category.

The performance hit shouldn't exist if this feature is not compiled in.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 17:49                 ` David Hildenbrand
@ 2021-09-01 17:55                   ` Jason Gunthorpe
  2021-09-01 17:58                     ` David Hildenbrand
  2021-09-02  7:04                     ` Qi Zheng
  0 siblings, 2 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 17:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 01, 2021 at 07:49:23PM +0200, David Hildenbrand wrote:
> On 01.09.21 19:10, Jason Gunthorpe wrote:
> > On Wed, Sep 01, 2021 at 06:19:03PM +0200, David Hildenbrand wrote:
> > 
> > > > I wouldn't think it works everywhere, bit it works in a lot of places,
> > > > and it is a heck of a lot better than what is proposed here. I'd
> > > > rather see the places that can use it be moved, and the few places
> > > > that can't be opencoded.
> > > 
> > > Well, I used ptep_get_map_lock() and friends. But hacking directly into
> > > ptep_map_lock() and friends wasn't possible due to all the corner cases.
> > 
> > Sure, I'm not surprised you can't get every single case, but that just
> > suggest we need two API families, today's to support the special cases
> > and a different one for the other regular simple cases.
> > 
> > A new function family pte_try_map/_locked() and paired unmap that can
> > internally do the recounting and THP trickery and convert the easy
> > callsites.
> > 
> > Very rough counting suggest at least half of the pte_offset_map_lock()
> > call sites can trivially use the simpler API.
> > 
> > The other cases can stay as is and get open coded refcounts, or maybe
> > someone will have a better idea once they are more clearly identified.
> > 
> > But I don't think we should take a performance hit of additional
> > atomics in cases like GUP where this is trivially delt with by using a
> > better API.
> 
> Right, but as I said in the cover letter, we can happily optimize once we
> have the basic infrastructure in place and properly reviewed. Getting rid of
> some unnecessary atomics by introducing additional fancy helpers falls under
> that category.

I'm not sure I agree given how big and wide this patch series is. It
would be easier to review if it was touching less places. The helpers
are not fancy, it is a logical re-arrangement of existing code that
shrinks the LOC of this series and makes it more reviewable.

Or stated another way, a niche feature like this try much harder not
to add more complexity everywhere.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 17:55                   ` Jason Gunthorpe
@ 2021-09-01 17:58                     ` David Hildenbrand
  2021-09-01 18:09                       ` Jason Gunthorpe
  2021-09-02  7:04                     ` Qi Zheng
  1 sibling, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 17:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On 01.09.21 19:55, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 07:49:23PM +0200, David Hildenbrand wrote:
>> On 01.09.21 19:10, Jason Gunthorpe wrote:
>>> On Wed, Sep 01, 2021 at 06:19:03PM +0200, David Hildenbrand wrote:
>>>
>>>>> I wouldn't think it works everywhere, bit it works in a lot of places,
>>>>> and it is a heck of a lot better than what is proposed here. I'd
>>>>> rather see the places that can use it be moved, and the few places
>>>>> that can't be opencoded.
>>>>
>>>> Well, I used ptep_get_map_lock() and friends. But hacking directly into
>>>> ptep_map_lock() and friends wasn't possible due to all the corner cases.
>>>
>>> Sure, I'm not surprised you can't get every single case, but that just
>>> suggest we need two API families, today's to support the special cases
>>> and a different one for the other regular simple cases.
>>>
>>> A new function family pte_try_map/_locked() and paired unmap that can
>>> internally do the recounting and THP trickery and convert the easy
>>> callsites.
>>>
>>> Very rough counting suggest at least half of the pte_offset_map_lock()
>>> call sites can trivially use the simpler API.
>>>
>>> The other cases can stay as is and get open coded refcounts, or maybe
>>> someone will have a better idea once they are more clearly identified.
>>>
>>> But I don't think we should take a performance hit of additional
>>> atomics in cases like GUP where this is trivially delt with by using a
>>> better API.
>>
>> Right, but as I said in the cover letter, we can happily optimize once we
>> have the basic infrastructure in place and properly reviewed. Getting rid of
>> some unnecessary atomics by introducing additional fancy helpers falls under
>> that category.
> 
> I'm not sure I agree given how big and wide this patch series is. It
> would be easier to review if it was touching less places. The helpers
> are not fancy, it is a logical re-arrangement of existing code that
> shrinks the LOC of this series and makes it more reviewable.

You'll most likely have to touch each and every place either way, for 
example when suddenly returning "null" instead of a pte. It's just a 
matter of making this easier to review and the changes as minimal and as 
clear as possible.

> 
> Or stated another way, a niche feature like this try much harder not
> to add more complexity everywhere.

I fully agree.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 17:58                     ` David Hildenbrand
@ 2021-09-01 18:09                       ` Jason Gunthorpe
  2021-09-01 18:10                         ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-01 18:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 01, 2021 at 07:58:47PM +0200, David Hildenbrand wrote:
> You'll most likely have to touch each and every place either way, for
> example when suddenly returning "null" instead of a pte. It's just a matter
> of making this easier to review and the changes as minimal and as clear as
> possible.

I imagine the leading series to add the simplified API would include
the null return already - the THP race avoidance requires it anyhow.

So you end up with a simpler self contained series that is a stand
alone improvement followed by a much smaller series here that doesn't
got back and re-touch the first series's changes.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 18:09                       ` Jason Gunthorpe
@ 2021-09-01 18:10                         ` David Hildenbrand
  0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2021-09-01 18:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On 01.09.21 20:09, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 07:58:47PM +0200, David Hildenbrand wrote:
>> You'll most likely have to touch each and every place either way, for
>> example when suddenly returning "null" instead of a pte. It's just a matter
>> of making this easier to review and the changes as minimal and as clear as
>> possible.
> 
> I imagine the leading series to add the simplified API would include
> the null return already - the THP race avoidance requires it anyhow.
> 

Okay, so we're on the same page.

> So you end up with a simpler self contained series that is a stand
> alone improvement followed by a much smaller series here that doesn't
> got back and re-touch the first series's changes.

Exactly.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
  2021-09-01 16:07   ` Jason Gunthorpe
@ 2021-09-02  3:37   ` Qi Zheng
  2021-09-15 14:52   ` Qi Zheng
  2 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-09-02  3:37 UTC (permalink / raw)
  To: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun



On 2021/9/1 PM8:32, David Hildenbrand wrote:
> On 19.08.21 05:18, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to free user PTE page table pages when all PTE 
>> entries
>> are empty.
>>
>> The beginning of this story is that some malloc libraries(e.g. 
>> jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not 
>> unmap those VAs.
>> They will use madvise(MADV_DONTNEED) to free physical memory if they 
>> want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which 
>> actually
>> happened on our server:
>>
>>     VIRT:  55t
>>     RES:   590g
>>     VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is 
>> 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE 
>> page
>> tables to save memory. In the above cases, we can save memory about 
>> 108g(best
>> case). And the larger the difference between the size of VIRT and RES, 
>> the
>> more memory we save.
>>
>> In this patch series, we add a pte_refcount field to the struct page 
>> of page
>> table to track how many users of PTE page table. Similar to the 
>> mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it 
>> before
>> accessing. The PTE page table page will be freed when the last 
>> refcount is
>> dropped.
>>
>> Testing:
>>
>> The following code snippet can show the effect of optimization:
>>
>>     mmap 50G
>>     while (1) {
>>         for (; i < 1024 * 25; i++) {
>>             touch 2M memory
>>             madvise MADV_DONTNEED 2M
>>         }
>>     }
>>
>> As we can see, the memory usage of VmPTE is reduced:
>>
>>             before                        after
>> VIRT               50.0 GB                  50.0 GB
>> RES                3.1 MB                   3.6 MB
>> VmPTE             102640 kB                   248 kB
>>
>> I also have tested the stability by LTP[1] for several weeks. I have 
>> not seen
>> any crash so far.
>>
>> The performance of page fault can be affected because of the 
>> allocation/freeing
>> of PTE page table pages. The following is the test result by using a 
>> micro
>> benchmark[2]:
>>
>> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
>>
>> threads         before (pf/min)                     after (pf/min)
>>      1                32,085,255                         31,880,833 
>> (-0.64%)
>>      8               101,674,967                        100,588,311 
>> (-1.17%)
>>     16               113,207,000                        112,801,832 
>> (-0.36%)
>>
>> (The "pfn/min" means how many page faults in one minute.)
>>
>> The performance of page fault is ~1% slower than before.
>>
>> This series is based on next-20210812.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
> 
> 
> Some high-level feedback after studying the code:
> 
> 1. Try introducing the new dummy primitives ("API") first, and then 
> convert each subsystem individually; especially, maybe convert the whole 
> pagefault handling in a single patch, because it's far from trivial. 
> This will make this series much easier to digest.
> 
> Then, have a patch that adds actual logic to the dummy primitives via a 
> config option.
> 
> 2. Minimize the API.
> 
> a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to 
> pte_alloc_get().
> 
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
> 
> Handle it independently for now, even if it implies duplicate runtime 
> checks.
> 
> if (pmd_trans_unstable() || !pte_try_get()) ...
> 
> We can always optimize later, once we can come up with something cleaner.
> 
> 3. Merge #6, and #7, after factoring out all changes to other subsystems 
> to use the API
> 
> 4. Merge #8 into #6. There is a lot of unnecessary code churn back and 
> forth, and IMHO the whole approach might not make sense without RCU due 
> to the additional locking overhead.
> 
> Or at least, try to not modify the API you introduced in patch #6 or #7 
> in #8 again. Converting all call sites back and forth just makes review 
> quite hard.
> 

Very detailed feedback! Thank you very much for your time and energy,
I will seriously adopt and implement these modification suggestions.

> 
> I am preparing some some cleanups that will make get_locked_pte() and 
> similar a little nicer to handle. I'll send them out this or next week.
> 

Yup, we just simply convert the pte_alloc_map_lock() in
__get_locked_pte() to pte_alloc_get_map_lock(), and then
call the paired pte_put() in the caller of get_locked_pte().
Like the following pattern:

	insert_page
	--> get_locked_pte
	    --> __get_locked_pte
		--> pte_alloc_get_map_lock

	    "do some things"

	    pte_put

This is really ugly and hard to review.

I look forward to your cleanups, besides, can you outline your approach?

Thanks again,
Qi

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 15:32       ` Jason Gunthorpe
  2021-09-01 16:13         ` David Hildenbrand
@ 2021-09-02  6:53         ` Qi Zheng
  1 sibling, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-09-02  6:53 UTC (permalink / raw)
  To: Jason Gunthorpe, David Hildenbrand
  Cc: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun



On 2021/9/1 PM11:32, Jason Gunthorpe wrote:
> 
>>> Also, I don't really understand how this scheme works with
>>> get_user_pages_fast.
>>
>> With the RCU change it in #8 it should work just fine, because RCU
>> synchronize has to wait either until all other CPUs have left the RCU read
>> section, or re-enabled interrupts.
> 
> So at this point in the series fast gup is broken, that does mean the
> series presentation really needs to be reworked. The better
> presentation is to add the API changes, with a
> no-functional-difference implementation, push the new API in well
> split patches to all the consumption sites, then change the API to
> have the new semantics.
> 
> RCU and refcount to free the page levels seems like a reasonable
> approach, but I have to say I haven't thought it through fully - are
> all the contexts that have the pte deref safe to do call_rcu?

See Documentation/RCU/rcubarrier.rst:

"Since call_rcu() never blocks, this code can safely be used from within
IRQ context."

So I think call_rcu() can be safely run in any context.

Thinks,
Qi

> 
> Jason
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 6/9] mm: free user PTE page table pages
  2021-09-01 17:55                   ` Jason Gunthorpe
  2021-09-01 17:58                     ` David Hildenbrand
@ 2021-09-02  7:04                     ` Qi Zheng
  1 sibling, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-09-02  7:04 UTC (permalink / raw)
  To: Jason Gunthorpe, David Hildenbrand
  Cc: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun



On 2021/9/2 AM1:55, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 07:49:23PM +0200, David Hildenbrand wrote:
>> On 01.09.21 19:10, Jason Gunthorpe wrote:
>>> On Wed, Sep 01, 2021 at 06:19:03PM +0200, David Hildenbrand wrote:
>>>
>>>>> I wouldn't think it works everywhere, bit it works in a lot of places,
>>>>> and it is a heck of a lot better than what is proposed here. I'd
>>>>> rather see the places that can use it be moved, and the few places
>>>>> that can't be opencoded.
>>>>
>>>> Well, I used ptep_get_map_lock() and friends. But hacking directly into
>>>> ptep_map_lock() and friends wasn't possible due to all the corner cases.
>>>
>>> Sure, I'm not surprised you can't get every single case, but that just
>>> suggest we need two API families, today's to support the special cases
>>> and a different one for the other regular simple cases.
>>>
>>> A new function family pte_try_map/_locked() and paired unmap that can
>>> internally do the recounting and THP trickery and convert the easy
>>> callsites.
>>>
>>> Very rough counting suggest at least half of the pte_offset_map_lock()
>>> call sites can trivially use the simpler API.
>>>
>>> The other cases can stay as is and get open coded refcounts, or maybe
>>> someone will have a better idea once they are more clearly identified.
>>>
>>> But I don't think we should take a performance hit of additional
>>> atomics in cases like GUP where this is trivially delt with by using a
>>> better API.
>>
>> Right, but as I said in the cover letter, we can happily optimize once we
>> have the basic infrastructure in place and properly reviewed. Getting rid of
>> some unnecessary atomics by introducing additional fancy helpers falls under
>> that category.
> 
> I'm not sure I agree given how big and wide this patch series is. It
> would be easier to review if it was touching less places. The helpers
> are not fancy, it is a logical re-arrangement of existing code that
> shrinks the LOC of this series and makes it more reviewable.
> 
> Or stated another way, a niche feature like this try much harder not
> to add more complexity everywhere.

Totally agree, I will rework this patch series based on you and David's
suggestions.

Thank you very much,
Qi

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
  2021-09-01 16:07   ` Jason Gunthorpe
  2021-09-02  3:37   ` Qi Zheng
@ 2021-09-15 14:52   ` Qi Zheng
  2021-09-15 14:59     ` Jason Gunthorpe
  2 siblings, 1 reply; 39+ messages in thread
From: Qi Zheng @ 2021-09-15 14:52 UTC (permalink / raw)
  To: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun



On 9/1/21 8:32 PM, David Hildenbrand wrote:

Hi David,

>>
> 
> 
> Some high-level feedback after studying the code:
> 
> 1. Try introducing the new dummy primitives ("API") first, and then 
> convert each subsystem individually; especially, maybe convert the whole 
> pagefault handling in a single patch, because it's far from trivial. 
> This will make this series much easier to digest.

I am going to split this patch series as follows:

1. Introduce the new dummy APIs, which is an empty implementation.
    But I will explain its semantics.
2. Merge #6, #7 and #8, and call these dummy APIs in any necessary
    location, and split some special cases into single patches, such as
    pagefault and gup, etc. So that we can explain in more detail the
    concurrency in these cases. For example, we don't need to hold any
    pte_refcount in the fast path in gup on the x86_64 platform. Because
    the PTE page can't be freed after the local CPU interrupt is closed
    in the fast path in gup.
3. Introduce CONFIG_FREE_USER_PTE and implement these empty dummy APIs.
4. Add a description document.

And I try to add a function that combines pte_offset_map() and
pte_try_get(). Maybe the func name is pte_try_map() recommended by
Jason, or keep the pte_offset_map() unchanged?

> 
> Then, have a patch that adds actual logic to the dummy primitives via a 
> config option.
> 
> 2. Minimize the API.
> 
> a) pte_alloc_get{,_map,_map_lock}() is really ugly. Maybe restrict it to 
> pte_alloc_get()
I also think pte_alloc_get{,_map,_map_lock}() is ugly, but I can't
figure out a more suitable name. Maybe we can keep the
pte_alloc{,_map,_map_lock}() without any modification? But I am
worried that the caller will forget to call the paired pte_put().

> 
> b) pmd_trans_unstable_or_pte_try_get() and friends are really ugly.
> 
> Handle it independently for now, even if it implies duplicate runtime 
> checks.
> 
> if (pmd_trans_unstable() || !pte_try_get()) ...
> 
> We can always optimize later, once we can come up with something cleaner.
> 
> 3. Merge #6, and #7, after factoring out all changes to other subsystems 
> to use the API
> 
> 4. Merge #8 into #6. There is a lot of unnecessary code churn back and 
> forth, and IMHO the whole approach might not make sense without RCU due 
> to the additional locking overhead.
> 
> Or at least, try to not modify the API you introduced in patch #6 or #7 
> in #8 again. Converting all call sites back and forth just makes review 
> quite hard.
> 
> 
> I am preparing some some cleanups that will make get_locked_pte() and 
> similar a little nicer to handle. I'll send them out this or next week.
> 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-15 14:52   ` Qi Zheng
@ 2021-09-15 14:59     ` Jason Gunthorpe
  2021-09-16  5:32       ` Qi Zheng
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2021-09-15 14:59 UTC (permalink / raw)
  To: Qi Zheng
  Cc: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun

On Wed, Sep 15, 2021 at 10:52:40PM +0800, Qi Zheng wrote:
> I am going to split this patch series as follows:
> 
> 1. Introduce the new dummy APIs, which is an empty implementation.
>    But I will explain its semantics.
> 2. Merge #6, #7 and #8, and call these dummy APIs in any necessary
>    location, and split some special cases into single patches, such as
>    pagefault and gup, etc. So that we can explain in more detail the
>    concurrency in these cases. For example, we don't need to hold any
>    pte_refcount in the fast path in gup on the x86_64 platform. Because
>    the PTE page can't be freed after the local CPU interrupt is closed
>    in the fast path in gup.
> 3. Introduce CONFIG_FREE_USER_PTE and implement these empty dummy APIs.
> 4. Add a description document.
> 
> And I try to add a function that combines pte_offset_map() and
> pte_try_get(). Maybe the func name is pte_try_map() recommended by
> Jason, or keep the pte_offset_map() unchanged?

It is part of the transformation, add a
pte_try_map()/pte_undo_try_map() and replace all the pte_offset_map()
callsites that can use the new API with it. The idea was that try_map
would incorporate the pmd_trans_unstable/etc mess so searching for
trans_unstable is a good place to start finding candidates. Some are
simple, some are tricky.

When you get to step 3 you just change pte_try_map() and the callsites
don't need changing.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-15 14:59     ` Jason Gunthorpe
@ 2021-09-16  5:32       ` Qi Zheng
  2021-09-16  8:30         ` David Hildenbrand
  0 siblings, 1 reply; 39+ messages in thread
From: Qi Zheng @ 2021-09-16  5:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, akpm, tglx, hannes, mhocko, vdavydov.dev,
	kirill.shutemov, mika.penttila, linux-doc, linux-kernel,
	linux-mm, songmuchun



On 9/15/21 10:59 PM, Jason Gunthorpe wrote:
> On Wed, Sep 15, 2021 at 10:52:40PM +0800, Qi Zheng wrote:
>> I am going to split this patch series as follows:
>>
>> 1. Introduce the new dummy APIs, which is an empty implementation.
>>     But I will explain its semantics.
>> 2. Merge #6, #7 and #8, and call these dummy APIs in any necessary
>>     location, and split some special cases into single patches, such as
>>     pagefault and gup, etc. So that we can explain in more detail the
>>     concurrency in these cases. For example, we don't need to hold any
>>     pte_refcount in the fast path in gup on the x86_64 platform. Because
>>     the PTE page can't be freed after the local CPU interrupt is closed
>>     in the fast path in gup.
>> 3. Introduce CONFIG_FREE_USER_PTE and implement these empty dummy APIs.
>> 4. Add a description document.
>>
>> And I try to add a function that combines pte_offset_map() and
>> pte_try_get(). Maybe the func name is pte_try_map() recommended by
>> Jason, or keep the pte_offset_map() unchanged?
> 
> It is part of the transformation, add a
> pte_try_map()/pte_undo_try_map() and replace all the pte_offset_map()
> callsites that can use the new API with it. The idea was that try_map
> would incorporate the pmd_trans_unstable/etc mess so searching for
> trans_unstable is a good place to start finding candidates. Some are
> simple, some are tricky.

Yes, I will search pte_offset_map()/pmd_trans_unstable/etc, and then 
analyze the specific situation.

> 
> When you get to step 3 you just change pte_try_map() and the callsites
> don't need changing.
> 
> Jason
> 

Thanks,
Qi



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-16  5:32       ` Qi Zheng
@ 2021-09-16  8:30         ` David Hildenbrand
  2021-09-16  8:41           ` Qi Zheng
  0 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2021-09-16  8:30 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun

On 16.09.21 07:32, Qi Zheng wrote:
> 
> 
> On 9/15/21 10:59 PM, Jason Gunthorpe wrote:
>> On Wed, Sep 15, 2021 at 10:52:40PM +0800, Qi Zheng wrote:
>>> I am going to split this patch series as follows:
>>>
>>> 1. Introduce the new dummy APIs, which is an empty implementation.
>>>      But I will explain its semantics.
>>> 2. Merge #6, #7 and #8, and call these dummy APIs in any necessary
>>>      location, and split some special cases into single patches, such as
>>>      pagefault and gup, etc. So that we can explain in more detail the
>>>      concurrency in these cases. For example, we don't need to hold any
>>>      pte_refcount in the fast path in gup on the x86_64 platform. Because
>>>      the PTE page can't be freed after the local CPU interrupt is closed
>>>      in the fast path in gup.
>>> 3. Introduce CONFIG_FREE_USER_PTE and implement these empty dummy APIs.
>>> 4. Add a description document.
>>>
>>> And I try to add a function that combines pte_offset_map() and
>>> pte_try_get(). Maybe the func name is pte_try_map() recommended by
>>> Jason, or keep the pte_offset_map() unchanged?
>>
>> It is part of the transformation, add a
>> pte_try_map()/pte_undo_try_map() and replace all the pte_offset_map()
>> callsites that can use the new API with it. The idea was that try_map
>> would incorporate the pmd_trans_unstable/etc mess so searching for
>> trans_unstable is a good place to start finding candidates. Some are
>> simple, some are tricky.
> 
> Yes, I will search pte_offset_map()/pmd_trans_unstable/etc, and then
> analyze the specific situation.

Maybe propose the new API first, before doing the actual implementation. 
Might safe you from doing some additional back-and-forth
work eventually.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/9] Free user PTE page table pages
  2021-09-16  8:30         ` David Hildenbrand
@ 2021-09-16  8:41           ` Qi Zheng
  0 siblings, 0 replies; 39+ messages in thread
From: Qi Zheng @ 2021-09-16  8:41 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, hannes, mhocko, vdavydov.dev, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun



On 9/16/21 4:30 PM, David Hildenbrand wrote:
> On 16.09.21 07:32, Qi Zheng wrote:
>>
>>
>> On 9/15/21 10:59 PM, Jason Gunthorpe wrote:
>>> On Wed, Sep 15, 2021 at 10:52:40PM +0800, Qi Zheng wrote:
>>>> I am going to split this patch series as follows:
>>>>
>>>> 1. Introduce the new dummy APIs, which is an empty implementation.
>>>>      But I will explain its semantics.
>>>> 2. Merge #6, #7 and #8, and call these dummy APIs in any necessary
>>>>      location, and split some special cases into single patches, 
>>>> such as
>>>>      pagefault and gup, etc. So that we can explain in more detail the
>>>>      concurrency in these cases. For example, we don't need to hold any
>>>>      pte_refcount in the fast path in gup on the x86_64 platform. 
>>>> Because
>>>>      the PTE page can't be freed after the local CPU interrupt is 
>>>> closed
>>>>      in the fast path in gup.
>>>> 3. Introduce CONFIG_FREE_USER_PTE and implement these empty dummy APIs.
>>>> 4. Add a description document.
>>>>
>>>> And I try to add a function that combines pte_offset_map() and
>>>> pte_try_get(). Maybe the func name is pte_try_map() recommended by
>>>> Jason, or keep the pte_offset_map() unchanged?
>>>
>>> It is part of the transformation, add a
>>> pte_try_map()/pte_undo_try_map() and replace all the pte_offset_map()
>>> callsites that can use the new API with it. The idea was that try_map
>>> would incorporate the pmd_trans_unstable/etc mess so searching for
>>> trans_unstable is a good place to start finding candidates. Some are
>>> simple, some are tricky.
>>
>> Yes, I will search pte_offset_map()/pmd_trans_unstable/etc, and then
>> analyze the specific situation.
> 
> Maybe propose the new API first, before doing the actual implementation. 
> Might safe you from doing some additional back-and-forth
> work eventually.
> 

OK, I prepare to propose all the dummy APIs in the step 1.

Thanks,
Qi

> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-09-16  8:42 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-19  3:18 [PATCH v2 0/9] Free user PTE page table pages Qi Zheng
2021-08-19  3:18 ` [PATCH v2 1/9] mm: introduce pmd_install() helper Qi Zheng
2021-08-24 16:26   ` David Hildenbrand
2021-08-25 16:20     ` Qi Zheng
2021-08-25 16:32       ` David Hildenbrand
2021-08-26  3:04         ` Qi Zheng
2021-08-19  3:18 ` [PATCH v2 2/9] mm: remove redundant smp_wmb() Qi Zheng
2021-08-19  3:18 ` [PATCH v2 3/9] mm: rework the parameter of lock_page_or_retry() Qi Zheng
2021-08-19  3:18 ` [PATCH v2 4/9] mm: move pte_alloc{,_map,_map_lock}() to a separate file Qi Zheng
2021-08-19  3:18 ` [PATCH v2 5/9] mm: pte_refcount infrastructure Qi Zheng
2021-08-19  3:18 ` [PATCH v2 6/9] mm: free user PTE page table pages Qi Zheng
2021-08-19  7:01   ` David Hildenbrand
2021-08-19 10:18     ` [External] " Qi Zheng
2021-09-01 13:53   ` Jason Gunthorpe
2021-09-01 13:57     ` David Hildenbrand
2021-09-01 15:32       ` Jason Gunthorpe
2021-09-01 16:13         ` David Hildenbrand
2021-09-01 16:16           ` Jason Gunthorpe
2021-09-01 16:19             ` David Hildenbrand
2021-09-01 17:10               ` Jason Gunthorpe
2021-09-01 17:49                 ` David Hildenbrand
2021-09-01 17:55                   ` Jason Gunthorpe
2021-09-01 17:58                     ` David Hildenbrand
2021-09-01 18:09                       ` Jason Gunthorpe
2021-09-01 18:10                         ` David Hildenbrand
2021-09-02  7:04                     ` Qi Zheng
2021-09-02  6:53         ` Qi Zheng
2021-08-19  3:18 ` [PATCH v2 7/9] mm: add THP support for pte_ref Qi Zheng
2021-08-19  3:18 ` [PATCH v2 8/9] mm: free PTE page table by using rcu mechanism Qi Zheng
2021-08-19  3:18 ` [PATCH v2 9/9] mm: use mmu_gather to free PTE page table Qi Zheng
2021-09-01 12:32 ` [PATCH v2 0/9] Free user PTE page table pages David Hildenbrand
2021-09-01 16:07   ` Jason Gunthorpe
2021-09-01 16:10     ` David Hildenbrand
2021-09-02  3:37   ` Qi Zheng
2021-09-15 14:52   ` Qi Zheng
2021-09-15 14:59     ` Jason Gunthorpe
2021-09-16  5:32       ` Qi Zheng
2021-09-16  8:30         ` David Hildenbrand
2021-09-16  8:41           ` Qi Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).