LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd
@ 2021-09-02 20:17 Peter Xu
  2021-09-02 20:17 ` [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Peter Xu @ 2021-09-02 20:17 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm
  Cc: Andrea Arcangeli, Yang Shi, Matthew Wilcox, peterx,
	Jerome Glisse, Mike Rapoport, Kirill A . Shutemov, Miaohe Lin,
	David Hildenbrand, Alistair Popple

[Based on tag v5.14, but it should still apply to -mm too.  If not, I can
 repost anytime]

Hugh,

So I found one thing that I feel like a bug of commit 22061a1ffabdb9c3, but I'm
not sure.  If that's the case, patch 5 of this series may be the fix for it.

The problem is unmap_mapping_page() in current tree is calling
unmap_mapping_range_tree() with a details pointer, while by default when detail
pointer is specified, it means "we want to skip zapping swap entries".

I didn't mention this in v1 simply because I thought it was fine, e.g., swap
entry won't be kept in shmem ptes so skipped is okay (it is never okay with
shmem uffd-wp but uffd-wp code is not landed yet).  However I just remembered
there could also be e.g. shmem migration entries if I'm not wrong.  From that
pov, skipping swap entries for unmap_mapping_page() seems wrong.  Would you
please help check?

It'll be great if you can comment on patch 1 too, to see whether that's okay
and what I could have been missing there (e.g., if you have other concern on
breaking stuff, I'll be happy to test).

v2:
- Patch "mm: Clear vmf->pte after pte_unmap_same() returns"
  - Remove one comment [David]
- Collect r-b for patch 2/3
- Rewrite the last two patches to drop ZAP_FLAG_CHECK_MAPPING, dropping
  Alistair's r-b on patch 5 because it changed [David, Matthew]

===== v1 cover letter =====

I picked up these patches from uffd-wp v5 series here:

https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

IMHO all of them are very nice cleanups to existing code already, they're all
small and self-contained.  They'll be needed by uffd-wp coming series.  I would
appreciate if they can be accepted earlier, so as to not carry them over always
with the uffd-wp series.

I removed some CC from the uffd-wp v5 series to reduce the noise, and added a
few more into it.

Reviews are greatly welcomed, thanks.

Peter Xu (5):
  mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm: Drop first_index/last_index in zap_details
  mm: Add zap_skip_check_mapping() helper
  mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags

 include/linux/mm.h | 33 ++++++++++++++++++--
 mm/memory.c        | 76 +++++++++++++++++++---------------------------
 mm/shmem.c         |  1 -
 mm/userfaultfd.c   |  3 +-
 4 files changed, 62 insertions(+), 51 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
@ 2021-09-02 20:17 ` Peter Xu
  2021-09-03  7:42   ` David Hildenbrand
  2021-09-02 20:17 ` [PATCH v2 2/5] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2021-09-02 20:17 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm
  Cc: Andrea Arcangeli, Yang Shi, Matthew Wilcox, peterx,
	Jerome Glisse, Mike Rapoport, Kirill A . Shutemov, Miaohe Lin,
	David Hildenbrand, Alistair Popple, Axel Rasmussen

It was conditionally done previously, as there's one shmem special case that we
use SetPageDirty() instead.  However that's not necessary and it should be
easier and cleaner to do it unconditionally in mfill_atomic_install_pte().

The most recent discussion about this is here, where Hugh explained the history
of SetPageDirty() and why it's possible that it's not required at all:

https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/

Currently mfill_atomic_install_pte() has three callers:

        1. shmem_mfill_atomic_pte
        2. mcopy_atomic_pte
        3. mcontinue_atomic_pte

After the change: case (1) should have its SetPageDirty replaced by the dirty
bit on pte (so we unify them together, finally), case (2) should have no
functional change at all as it has page_in_cache==false, case (3) may add a
dirty bit to the pte.  However since case (3) is UFFDIO_CONTINUE for shmem,
it's merely 100% sure the page is dirty after all, so should not make a real
difference either.

This should make it much easier to follow on which case will set dirty for
uffd, as we'll simply set it all now for all uffd related ioctls.  Meanwhile,
no special handling of SetPageDirty() if there's no need.

Cc: Hugh Dickins <hughd@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/shmem.c       | 1 -
 mm/userfaultfd.c | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index dacda7463d54..3f91c8ce4d02 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2437,7 +2437,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-	SetPageDirty(page);
 	unlock_page(page);
 	return 0;
 out_delete_from_cache:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0e2132834bc7..b30a3724c701 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
-	if (writable || !page_in_cache)
-		_dst_pte = pte_mkdirty(_dst_pte);
 	if (writable) {
 		if (wp_copy)
 			_dst_pte = pte_mkuffd_wp(_dst_pte);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 2/5] mm: Clear vmf->pte after pte_unmap_same() returns
  2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
  2021-09-02 20:17 ` [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
@ 2021-09-02 20:17 ` Peter Xu
  2021-09-08  1:12   ` Liam Howlett
  2021-09-02 20:17 ` [PATCH v2 3/5] mm: Drop first_index/last_index in zap_details Peter Xu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2021-09-02 20:17 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm
  Cc: Andrea Arcangeli, Yang Shi, Matthew Wilcox, peterx,
	Jerome Glisse, Mike Rapoport, Kirill A . Shutemov, Miaohe Lin,
	David Hildenbrand, Alistair Popple

pte_unmap_same() will always unmap the pte pointer.  After the unmap, vmf->pte
will not be valid any more, we should clear it.

It was safe only because no one is accessing vmf->pte after pte_unmap_same()
returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
where vmf->pte will in most cases be overwritten very soon.

Directly pass in vmf into pte_unmap_same() and then we can also avoid the long
parameter list too, which should be a nice cleanup.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..7b095f07c4ef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2724,19 +2724,19 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
 	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
+		spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
+		same = pte_same(*vmf->pte, vmf->orig_pte);
 		spin_unlock(ptl);
 	}
 #endif
-	pte_unmap(page_table);
+	pte_unmap(vmf->pte);
+	vmf->pte = NULL;
 	return same;
 }
 
@@ -3487,7 +3487,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+	if (!pte_unmap_same(vmf))
 		goto out;
 
 	entry = pte_to_swp_entry(vmf->orig_pte);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 3/5] mm: Drop first_index/last_index in zap_details
  2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
  2021-09-02 20:17 ` [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
  2021-09-02 20:17 ` [PATCH v2 2/5] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
@ 2021-09-02 20:17 ` Peter Xu
  2021-09-02 20:18 ` [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper Peter Xu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2021-09-02 20:17 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm
  Cc: Andrea Arcangeli, Yang Shi, Matthew Wilcox, peterx,
	Jerome Glisse, Mike Rapoport, Kirill A . Shutemov, Miaohe Lin,
	David Hildenbrand, Alistair Popple

The first_index/last_index parameters in zap_details are actually only used in
unmap_mapping_range_tree().  At the meantime, this function is only called by
unmap_mapping_pages() once.  Instead of passing these two variables through the
whole stack of page zapping code, remove them from zap_details and let them
simply be parameters of unmap_mapping_range_tree(), which is inlined.

Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  2 --
 mm/memory.c        | 29 ++++++++++++++++-------------
 2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..69259229f090 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1721,8 +1721,6 @@ extern void user_shm_unlock(size_t, struct ucounts *);
  */
 struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
-	pgoff_t	first_index;			/* Lowest page->index to unmap */
-	pgoff_t last_index;			/* Highest page->index to unmap */
 	struct page *single_page;		/* Locked page to be unmapped */
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 7b095f07c4ef..6bba3b9fef7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3321,20 +3321,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
 }
 
 static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+					    pgoff_t first_index,
+					    pgoff_t last_index,
 					    struct zap_details *details)
 {
 	struct vm_area_struct *vma;
 	pgoff_t vba, vea, zba, zea;
 
-	vma_interval_tree_foreach(vma, root,
-			details->first_index, details->last_index) {
-
+	vma_interval_tree_foreach(vma, root, first_index, last_index) {
 		vba = vma->vm_pgoff;
 		vea = vba + vma_pages(vma) - 1;
-		zba = details->first_index;
+		zba = first_index;
 		if (zba < vba)
 			zba = vba;
-		zea = details->last_index;
+		zea = last_index;
 		if (zea > vea)
 			zea = vea;
 
@@ -3360,18 +3360,21 @@ void unmap_mapping_page(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 	struct zap_details details = { };
+	pgoff_t	first_index, last_index;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageTail(page));
 
+	first_index = page->index;
+	last_index = page->index + thp_nr_pages(page) - 1;
+
 	details.check_mapping = mapping;
-	details.first_index = page->index;
-	details.last_index = page->index + thp_nr_pages(page) - 1;
 	details.single_page = page;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
-		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+					 last_index, &details);
 	i_mmap_unlock_write(mapping);
 }
 
@@ -3390,17 +3393,17 @@ void unmap_mapping_page(struct page *page)
 void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 		pgoff_t nr, bool even_cows)
 {
+	pgoff_t	first_index = start, last_index = start + nr - 1;
 	struct zap_details details = { };
 
 	details.check_mapping = even_cows ? NULL : mapping;
-	details.first_index = start;
-	details.last_index = start + nr - 1;
-	if (details.last_index < details.first_index)
-		details.last_index = ULONG_MAX;
+	if (last_index < first_index)
+		last_index = ULONG_MAX;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
-		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+					 last_index, &details);
 	i_mmap_unlock_write(mapping);
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper
  2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
                   ` (2 preceding siblings ...)
  2021-09-02 20:17 ` [PATCH v2 3/5] mm: Drop first_index/last_index in zap_details Peter Xu
@ 2021-09-02 20:18 ` Peter Xu
  2021-09-03  0:58   ` Alistair Popple
  2021-09-02 20:18 ` [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags Peter Xu
  2021-09-08  0:35 ` [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
  5 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2021-09-02 20:18 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Jerome Glisse, peterx, Andrea Arcangeli,
	Miaohe Lin, Yang Shi, Matthew Wilcox, Alistair Popple,
	Mike Rapoport, David Hildenbrand

Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
because "check_mapping" looks like a bool but in fact it stores the mapping
itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
cleared we skip the check, which works like the old way.

Move the duplicated comments to the helper too.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 15 ++++++++++++++-
 mm/memory.c        | 29 ++++++-----------------------
 2 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 69259229f090..81e402a5fbc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1720,10 +1720,23 @@ extern void user_shm_unlock(size_t, struct ucounts *);
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
 struct zap_details {
-	struct address_space *check_mapping;	/* Check page->mapping if set */
+	struct address_space *zap_mapping;	/* Check page->mapping if set */
 	struct page *single_page;		/* Locked page to be unmapped */
 };
 
+/*
+ * We set details->zap_mappings when we want to unmap shared but keep private
+ * pages. Return true if skip zapping this page, false otherwise.
+ */
+static inline bool
+zap_skip_check_mapping(struct zap_details *details, struct page *page)
+{
+	if (!details || !page)
+		return false;
+
+	return details->zap_mapping != page_rmapping(page);
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 6bba3b9fef7c..e5ee8399d270 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1333,16 +1333,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			struct page *page;
 
 			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(details) && page) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page_rmapping(page))
-					continue;
-			}
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1375,17 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
-			if (unlikely(details && details->check_mapping)) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping !=
-				    page_rmapping(page))
-					continue;
-			}
-
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
 
@@ -3368,7 +3351,7 @@ void unmap_mapping_page(struct page *page)
 	first_index = page->index;
 	last_index = page->index + thp_nr_pages(page) - 1;
 
-	details.check_mapping = mapping;
+	details.zap_mapping = mapping;
 	details.single_page = page;
 
 	i_mmap_lock_write(mapping);
@@ -3396,7 +3379,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 	pgoff_t	first_index = start, last_index = start + nr - 1;
 	struct zap_details details = { };
 
-	details.check_mapping = even_cows ? NULL : mapping;
+	details.zap_mapping = even_cows ? NULL : mapping;
 	if (last_index < first_index)
 		last_index = ULONG_MAX;
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags
  2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
                   ` (3 preceding siblings ...)
  2021-09-02 20:18 ` [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper Peter Xu
@ 2021-09-02 20:18 ` Peter Xu
  2021-09-03  7:25   ` David Hildenbrand
  2021-09-08  0:35 ` [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
  5 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2021-09-02 20:18 UTC (permalink / raw)
  To: linux-mm, Hugh Dickins, Andrew Morton, linux-kernel
  Cc: Miaohe Lin, Matthew Wilcox, David Hildenbrand, Yang Shi,
	Kirill A . Shutemov, peterx, Jerome Glisse, Alistair Popple,
	Andrea Arcangeli, Mike Rapoport

Firstly, the comment in zap_pte_range() is misleading because it checks against
details rather than check_mappings, so it's against what the code did.

Meanwhile, it's confusing too on not explaining why passing in the details
pointer would mean to skip all swap entries.  New user of zap_details could
very possibly miss this fact if they don't read deep until zap_pte_range()
because there's no comment at zap_details talking about it at all, so swap
entries could be errornously skipped without being noticed.

Actually very recently we introduced unmap_mapping_page() in 22061a1ffabd, I
think that should also look into swap entries.  Add a comment there.  IOW, this
patch will be a functional change to unmap_mapping_page() but hopefully in the
right way to do it.

This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
but introduce ZAP_FLAG_SKIP_SWAP flag, which means the opposite of previous
"details" parameter: the caller should explicitly set this to skip swap
entries, otherwise swap entries will always be considered (which should still
be the major case here).

Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h | 16 ++++++++++++++++
 mm/memory.c        |  6 +++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 81e402a5fbc9..a7bcdb2ec956 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1716,12 +1716,18 @@ static inline bool can_do_mlock(void) { return false; }
 extern int user_shm_lock(size_t, struct ucounts *);
 extern void user_shm_unlock(size_t, struct ucounts *);
 
+typedef unsigned int __bitwise zap_flags_t;
+
+/* Whether to skip zapping swap entries */
+#define  ZAP_FLAG_SKIP_SWAP  ((__force zap_flags_t) BIT(0))
+
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
 struct zap_details {
 	struct address_space *zap_mapping;	/* Check page->mapping if set */
 	struct page *single_page;		/* Locked page to be unmapped */
+	zap_flags_t zap_flags;			/* Extra flags for zapping */
 };
 
 /*
@@ -1737,6 +1743,16 @@ zap_skip_check_mapping(struct zap_details *details, struct page *page)
 	return details->zap_mapping != page_rmapping(page);
 }
 
+/* Return true if skip swap entries, false otherwise */
+static inline bool
+zap_skip_swap(struct zap_details *details)
+{
+	if (!details)
+		return false;
+
+	return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index e5ee8399d270..4cb269ca8249 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1379,8 +1379,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		if (unlikely(zap_skip_swap(details)))
 			continue;
 
 		if (!non_swap_entry(entry))
@@ -3351,6 +3350,7 @@ void unmap_mapping_page(struct page *page)
 	first_index = page->index;
 	last_index = page->index + thp_nr_pages(page) - 1;
 
+	/* Keep ZAP_FLAG_SKIP_SWAP cleared because we're truncating */
 	details.zap_mapping = mapping;
 	details.single_page = page;
 
@@ -3377,7 +3377,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
 		pgoff_t nr, bool even_cows)
 {
 	pgoff_t	first_index = start, last_index = start + nr - 1;
-	struct zap_details details = { };
+	struct zap_details details = { .zap_flags = ZAP_FLAG_SKIP_SWAP };
 
 	details.zap_mapping = even_cows ? NULL : mapping;
 	if (last_index < first_index)
-- 
2.31.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper
  2021-09-02 20:18 ` [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper Peter Xu
@ 2021-09-03  0:58   ` Alistair Popple
  2021-09-03  1:39     ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2021-09-03  0:58 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Peter Xu
  Cc: Kirill A . Shutemov, Jerome Glisse, peterx, Andrea Arcangeli,
	Miaohe Lin, Yang Shi, Matthew Wilcox, Mike Rapoport,
	David Hildenbrand

On Friday, 3 September 2021 6:18:19 AM AEST Peter Xu wrote:
> Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
> because "check_mapping" looks like a bool but in fact it stores the mapping
> itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
> cleared we skip the check, which works like the old way.
>
> Move the duplicated comments to the helper too.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm.h | 15 ++++++++++++++-
>  mm/memory.c        | 29 ++++++-----------------------
>  2 files changed, 20 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 69259229f090..81e402a5fbc9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1720,10 +1720,23 @@ extern void user_shm_unlock(size_t, struct ucounts *);
>   * Parameter block passed down to zap_pte_range in exceptional cases.
>   */
>  struct zap_details {
> -	struct address_space *check_mapping;	/* Check page->mapping if set */
> +	struct address_space *zap_mapping;	/* Check page->mapping if set */
>  	struct page *single_page;		/* Locked page to be unmapped */
>  };
>  
> +/*
> + * We set details->zap_mappings when we want to unmap shared but keep private
> + * pages. Return true if skip zapping this page, false otherwise.
> + */
> +static inline bool
> +zap_skip_check_mapping(struct zap_details *details, struct page *page)
> +{
> +	if (!details || !page)
> +		return false;
> +
> +	return details->zap_mapping != page_rmapping(page);

Shouldn't this check still be
details->zap_mapping && details->zap_mapping != page_rmapping(page)?

Previously we wouldn't skip zapping pages if even_cows == true (ie.
details->check_mapping == NULL). With this change the check when
even_cows == true becomes NULL != page_rmapping(page). Doesn't this mean we
will now skip zapping any pages with a mapping when even_cows == true?

> +}
> +
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t pte);
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index 6bba3b9fef7c..e5ee8399d270 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1333,16 +1333,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			struct page *page;
>  
>  			page = vm_normal_page(vma, addr, ptent);
> -			if (unlikely(details) && page) {
> -				/*
> -				 * unmap_shared_mapping_pages() wants to
> -				 * invalidate cache without truncating:
> -				 * unmap shared but keep private pages.
> -				 */
> -				if (details->check_mapping &&
> -				    details->check_mapping != page_rmapping(page))
> -					continue;
> -			}
> +			if (unlikely(zap_skip_check_mapping(details, page)))
> +				continue;
>
>  			ptent = ptep_get_and_clear_full(mm, addr, pte,
>  							tlb->fullmm);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
> @@ -1375,17 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		    is_device_exclusive_entry(entry)) {
>  			struct page *page = pfn_swap_entry_to_page(entry);
>  
> -			if (unlikely(details && details->check_mapping)) {
> -				/*
> -				 * unmap_shared_mapping_pages() wants to
> -				 * invalidate cache without truncating:
> -				 * unmap shared but keep private pages.
> -				 */
> -				if (details->check_mapping !=
> -				    page_rmapping(page))
> -					continue;
> -			}
> -
> +			if (unlikely(zap_skip_check_mapping(details, page)))
> +				continue;
>  			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			rss[mm_counter(page)]--;
>  
> @@ -3368,7 +3351,7 @@ void unmap_mapping_page(struct page *page)
>  	first_index = page->index;
>  	last_index = page->index + thp_nr_pages(page) - 1;
>  
> -	details.check_mapping = mapping;
> +	details.zap_mapping = mapping;
>  	details.single_page = page;
>  
>  	i_mmap_lock_write(mapping);
> @@ -3396,7 +3379,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
>  	pgoff_t	first_index = start, last_index = start + nr - 1;
>  	struct zap_details details = { };
>  
> -	details.check_mapping = even_cows ? NULL : mapping;
> +	details.zap_mapping = even_cows ? NULL : mapping;
>  	if (last_index < first_index)
>  		last_index = ULONG_MAX;
>  
> 





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper
  2021-09-03  0:58   ` Alistair Popple
@ 2021-09-03  1:39     ` Peter Xu
  2021-09-03  1:50       ` Alistair Popple
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2021-09-03  1:39 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm,
	Kirill A . Shutemov, Jerome Glisse, Andrea Arcangeli, Miaohe Lin,
	Yang Shi, Matthew Wilcox, Mike Rapoport, David Hildenbrand

On Fri, Sep 03, 2021 at 10:58:53AM +1000, Alistair Popple wrote:
> On Friday, 3 September 2021 6:18:19 AM AEST Peter Xu wrote:
> > Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
> > because "check_mapping" looks like a bool but in fact it stores the mapping
> > itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
> > cleared we skip the check, which works like the old way.
> >
> > Move the duplicated comments to the helper too.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/mm.h | 15 ++++++++++++++-
> >  mm/memory.c        | 29 ++++++-----------------------
> >  2 files changed, 20 insertions(+), 24 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 69259229f090..81e402a5fbc9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1720,10 +1720,23 @@ extern void user_shm_unlock(size_t, struct ucounts *);
> >   * Parameter block passed down to zap_pte_range in exceptional cases.
> >   */
> >  struct zap_details {
> > -	struct address_space *check_mapping;	/* Check page->mapping if set */
> > +	struct address_space *zap_mapping;	/* Check page->mapping if set */
> >  	struct page *single_page;		/* Locked page to be unmapped */
> >  };
> >  
> > +/*
> > + * We set details->zap_mappings when we want to unmap shared but keep private
> > + * pages. Return true if skip zapping this page, false otherwise.
> > + */
> > +static inline bool
> > +zap_skip_check_mapping(struct zap_details *details, struct page *page)
> > +{
> > +	if (!details || !page)
> > +		return false;
> > +
> > +	return details->zap_mapping != page_rmapping(page);
> 
> Shouldn't this check still be
> details->zap_mapping && details->zap_mapping != page_rmapping(page)?
> 
> Previously we wouldn't skip zapping pages if even_cows == true (ie.
> details->check_mapping == NULL). With this change the check when
> even_cows == true becomes NULL != page_rmapping(page). Doesn't this mean we
> will now skip zapping any pages with a mapping when even_cows == true?

Yes I think so.  Thanks for pointing that out, Alistair, I'll fix in v3.

But frankly I really think we should simply have that flag I used to introduce.
It'll make everything much clearer.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper
  2021-09-03  1:39     ` Peter Xu
@ 2021-09-03  1:50       ` Alistair Popple
  2021-09-03  7:07         ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Alistair Popple @ 2021-09-03  1:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm,
	Kirill A . Shutemov, Jerome Glisse, Andrea Arcangeli, Miaohe Lin,
	Yang Shi, Matthew Wilcox, Mike Rapoport, David Hildenbrand

On Friday, 3 September 2021 11:39:32 AM AEST Peter Xu wrote:
> On Fri, Sep 03, 2021 at 10:58:53AM +1000, Alistair Popple wrote:
> > On Friday, 3 September 2021 6:18:19 AM AEST Peter Xu wrote:
> > > Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
> > > because "check_mapping" looks like a bool but in fact it stores the mapping
> > > itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
> > > cleared we skip the check, which works like the old way.
> > >
> > > Move the duplicated comments to the helper too.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  include/linux/mm.h | 15 ++++++++++++++-
> > >  mm/memory.c        | 29 ++++++-----------------------
> > >  2 files changed, 20 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 69259229f090..81e402a5fbc9 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1720,10 +1720,23 @@ extern void user_shm_unlock(size_t, struct ucounts *);
> > >   * Parameter block passed down to zap_pte_range in exceptional cases.
> > >   */
> > >  struct zap_details {
> > > -	struct address_space *check_mapping;	/* Check page->mapping if set */
> > > +	struct address_space *zap_mapping;	/* Check page->mapping if set */
> > >  	struct page *single_page;		/* Locked page to be unmapped */
> > >  };
> > >  
> > > +/*
> > > + * We set details->zap_mappings when we want to unmap shared but keep private
> > > + * pages. Return true if skip zapping this page, false otherwise.
> > > + */
> > > +static inline bool
> > > +zap_skip_check_mapping(struct zap_details *details, struct page *page)
> > > +{
> > > +	if (!details || !page)
> > > +		return false;
> > > +
> > > +	return details->zap_mapping != page_rmapping(page);
> > 
> > Shouldn't this check still be
> > details->zap_mapping && details->zap_mapping != page_rmapping(page)?
> > 
> > Previously we wouldn't skip zapping pages if even_cows == true (ie.
> > details->check_mapping == NULL). With this change the check when
> > even_cows == true becomes NULL != page_rmapping(page). Doesn't this mean we
> > will now skip zapping any pages with a mapping when even_cows == true?
> 
> Yes I think so.  Thanks for pointing that out, Alistair, I'll fix in v3.
> 
> But frankly I really think we should simply have that flag I used to introduce.
> It'll make everything much clearer.

Yeah, I think a flag would also be fine.

 - Alistair




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper
  2021-09-03  1:50       ` Alistair Popple
@ 2021-09-03  7:07         ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-09-03  7:07 UTC (permalink / raw)
  To: Alistair Popple, Peter Xu
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm,
	Kirill A . Shutemov, Jerome Glisse, Andrea Arcangeli, Miaohe Lin,
	Yang Shi, Matthew Wilcox, Mike Rapoport

On 03.09.21 03:50, Alistair Popple wrote:
> On Friday, 3 September 2021 11:39:32 AM AEST Peter Xu wrote:
>> On Fri, Sep 03, 2021 at 10:58:53AM +1000, Alistair Popple wrote:
>>> On Friday, 3 September 2021 6:18:19 AM AEST Peter Xu wrote:
>>>> Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
>>>> because "check_mapping" looks like a bool but in fact it stores the mapping
>>>> itself.  When it's set, we check the mapping (it must be non-NULL).  When it's
>>>> cleared we skip the check, which works like the old way.
>>>>
>>>> Move the duplicated comments to the helper too.
>>>>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>> ---
>>>>   include/linux/mm.h | 15 ++++++++++++++-
>>>>   mm/memory.c        | 29 ++++++-----------------------
>>>>   2 files changed, 20 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 69259229f090..81e402a5fbc9 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1720,10 +1720,23 @@ extern void user_shm_unlock(size_t, struct ucounts *);
>>>>    * Parameter block passed down to zap_pte_range in exceptional cases.
>>>>    */
>>>>   struct zap_details {
>>>> -	struct address_space *check_mapping;	/* Check page->mapping if set */
>>>> +	struct address_space *zap_mapping;	/* Check page->mapping if set */
>>>>   	struct page *single_page;		/* Locked page to be unmapped */
>>>>   };
>>>>   
>>>> +/*
>>>> + * We set details->zap_mappings when we want to unmap shared but keep private
>>>> + * pages. Return true if skip zapping this page, false otherwise.
>>>> + */
>>>> +static inline bool
>>>> +zap_skip_check_mapping(struct zap_details *details, struct page *page)
>>>> +{
>>>> +	if (!details || !page)
>>>> +		return false;
>>>> +
>>>> +	return details->zap_mapping != page_rmapping(page);
>>>
>>> Shouldn't this check still be
>>> details->zap_mapping && details->zap_mapping != page_rmapping(page)?
>>>
>>> Previously we wouldn't skip zapping pages if even_cows == true (ie.
>>> details->check_mapping == NULL). With this change the check when
>>> even_cows == true becomes NULL != page_rmapping(page). Doesn't this mean we
>>> will now skip zapping any pages with a mapping when even_cows == true?
>>
>> Yes I think so.  Thanks for pointing that out, Alistair, I'll fix in v3.
>>
>> But frankly I really think we should simply have that flag I used to introduce.
>> It'll make everything much clearer.
> 
> Yeah, I think a flag would also be fine.

I still don't see the need for a flag quite frankly. Just factor out the 
checks we already have ... no change in behavior.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags
  2021-09-02 20:18 ` [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags Peter Xu
@ 2021-09-03  7:25   ` David Hildenbrand
  2021-09-03  7:31     ` David Hildenbrand
  2021-09-08  0:43     ` Peter Xu
  0 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-09-03  7:25 UTC (permalink / raw)
  To: Peter Xu, linux-mm, Hugh Dickins, Andrew Morton, linux-kernel
  Cc: Miaohe Lin, Matthew Wilcox, Yang Shi, Kirill A . Shutemov,
	Jerome Glisse, Alistair Popple, Andrea Arcangeli, Mike Rapoport

On 02.09.21 22:18, Peter Xu wrote:
> Firstly, the comment in zap_pte_range() is misleading because it checks against
> details rather than check_mappings, so it's against what the code did.
> 
> Meanwhile, it's confusing too on not explaining why passing in the details

s/on//

> pointer would mean to skip all swap entries.  New user of zap_details could
> very possibly miss this fact if they don't read deep until zap_pte_range()
> because there's no comment at zap_details talking about it at all, so swap
> entries could be errornously skipped without being noticed.

s/errornously/erroneously/

> 
> Actually very recently we introduced unmap_mapping_page() in 22061a1ffabd, I
> think that should also look into swap entries.  Add a comment there.  IOW, this
> patch will be a functional change to unmap_mapping_page() but hopefully in the
> right way to do it.
> 
> This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
> but introduce ZAP_FLAG_SKIP_SWAP flag, which means the opposite of previous
> "details" parameter: the caller should explicitly set this to skip swap
> entries, otherwise swap entries will always be considered (which should still
> be the major case here).
> 
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/linux/mm.h | 16 ++++++++++++++++
>   mm/memory.c        |  6 +++---
>   2 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 81e402a5fbc9..a7bcdb2ec956 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1716,12 +1716,18 @@ static inline bool can_do_mlock(void) { return false; }
>   extern int user_shm_lock(size_t, struct ucounts *);
>   extern void user_shm_unlock(size_t, struct ucounts *);
>   
> +typedef unsigned int __bitwise zap_flags_t;
> +
> +/* Whether to skip zapping swap entries */
> +#define  ZAP_FLAG_SKIP_SWAP  ((__force zap_flags_t) BIT(0))

Interestingly, this will also skip fake some swap entries (e.g., 
migration entries but not private/exclusive entries). Maybe extend that 
documentation a bit.

... but, looking into zap_pmd_range(), we don't care about "details" 
when calling zap_huge_pmd(), which will zap pmd migration entries IIUC 
... so it's really unclear to me what the flag (and current behavior) 
really is and what should be documented. Should we maybe really only 
care about "real" swap entries?

Most probably I'm just missing something important.

> +
>   /*
>    * Parameter block passed down to zap_pte_range in exceptional cases.
>    */
>   struct zap_details {
>   	struct address_space *zap_mapping;	/* Check page->mapping if set */
>   	struct page *single_page;		/* Locked page to be unmapped */
> +	zap_flags_t zap_flags;			/* Extra flags for zapping */
>   };
>   
>   /*
> @@ -1737,6 +1743,16 @@ zap_skip_check_mapping(struct zap_details *details, struct page *page)
>   	return details->zap_mapping != page_rmapping(page);
>   }
>   
> +/* Return true if skip swap entries, false otherwise */
> +static inline bool
> +zap_skip_swap(struct zap_details *details)
> +{
> +	if (!details)
> +		return false;
> +
> +	return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> +}
> +
>   struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>   			     pte_t pte);
>   struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index e5ee8399d270..4cb269ca8249 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1379,8 +1379,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>   			continue;
>   		}
>   
> -		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> +		if (unlikely(zap_skip_swap(details)))
>   			continue;
>   
>   		if (!non_swap_entry(entry))
> @@ -3351,6 +3350,7 @@ void unmap_mapping_page(struct page *page)
>   	first_index = page->index;
>   	last_index = page->index + thp_nr_pages(page) - 1;
>   
> +	/* Keep ZAP_FLAG_SKIP_SWAP cleared because we're truncating */
>   	details.zap_mapping = mapping;
>   	details.single_page = page;
>   
> @@ -3377,7 +3377,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
>   		pgoff_t nr, bool even_cows)
>   {
>   	pgoff_t	first_index = start, last_index = start + nr - 1;
> -	struct zap_details details = { };
> +	struct zap_details details = { .zap_flags = ZAP_FLAG_SKIP_SWAP };
>   
>   	details.zap_mapping = even_cows ? NULL : mapping;
>   	if (last_index < first_index)
> 

I think what would really help is to add a high-level description what 
unmap_mapping_page() vs. unmap_mapping_pages() really does, and what the 
expectations/use cases are. The names are just way too similar ...

I wonder if it would make sense to split this into two parts

a) Introduce ZAP_FLAG_SKIP_SWAP and use it accordingly for existing cases
b) Stop setting it for unmap_mapping_page()

So we'd have the change in behavior isolated. But not sure if it's worth 
the trouble, especially if we want to backport the fix ...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags
  2021-09-03  7:25   ` David Hildenbrand
@ 2021-09-03  7:31     ` David Hildenbrand
  2021-09-08  0:43     ` Peter Xu
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-09-03  7:31 UTC (permalink / raw)
  To: Peter Xu, linux-mm, Hugh Dickins, Andrew Morton, linux-kernel
  Cc: Miaohe Lin, Matthew Wilcox, Yang Shi, Kirill A . Shutemov,
	Jerome Glisse, Alistair Popple, Andrea Arcangeli, Mike Rapoport

>> @@ -3377,7 +3377,7 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
>>    		pgoff_t nr, bool even_cows)
>>    {
>>    	pgoff_t	first_index = start, last_index = start + nr - 1;
>> -	struct zap_details details = { };
>> +	struct zap_details details = { .zap_flags = ZAP_FLAG_SKIP_SWAP };
>>    
>>    	details.zap_mapping = even_cows ? NULL : mapping;
>>    	if (last_index < first_index)
>>
> 
> I think what would really help is to add a high-level description what
> unmap_mapping_page() vs. unmap_mapping_pages() really does, and what the
> expectations/use cases are. The names are just way too similar ...

aaaaand staring only at this patch I missed that we have nice 
descriptions already :)
-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-02 20:17 ` [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
@ 2021-09-03  7:42   ` David Hildenbrand
  2021-09-03 20:00     ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2021-09-03  7:42 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, Andrew Morton, Hugh Dickins, linux-mm
  Cc: Andrea Arcangeli, Yang Shi, Matthew Wilcox, Jerome Glisse,
	Mike Rapoport, Kirill A . Shutemov, Miaohe Lin, Alistair Popple,
	Axel Rasmussen

On 02.09.21 22:17, Peter Xu wrote:
> It was conditionally done previously, as there's one shmem special case that we
> use SetPageDirty() instead.  However that's not necessary and it should be
> easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
> 
> The most recent discussion about this is here, where Hugh explained the history
> of SetPageDirty() and why it's possible that it's not required at all:
> 
> https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
> 
> Currently mfill_atomic_install_pte() has three callers:
> 
>          1. shmem_mfill_atomic_pte
>          2. mcopy_atomic_pte
>          3. mcontinue_atomic_pte
> 
> After the change: case (1) should have its SetPageDirty replaced by the dirty
> bit on pte (so we unify them together, finally), case (2) should have no
> functional change at all as it has page_in_cache==false, case (3) may add a
> dirty bit to the pte.  However since case (3) is UFFDIO_CONTINUE for shmem,
> it's merely 100% sure the page is dirty after all, so should not make a real
> difference either.

Would it be worth adding VM_BUG_ON() to make sure that "100%" is really 
the case?

> 
> This should make it much easier to follow on which case will set dirty for
> uffd, as we'll simply set it all now for all uffd related ioctls.  Meanwhile,
> no special handling of SetPageDirty() if there's no need.

To me this all sounds sane, but I'm certainly not an expert on that 
code, so ...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-03  7:42   ` David Hildenbrand
@ 2021-09-03 20:00     ` Peter Xu
  2021-09-03 20:02       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2021-09-03 20:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm,
	Andrea Arcangeli, Yang Shi, Matthew Wilcox, Jerome Glisse,
	Mike Rapoport, Kirill A . Shutemov, Miaohe Lin, Alistair Popple,
	Axel Rasmussen

On Fri, Sep 03, 2021 at 09:42:34AM +0200, David Hildenbrand wrote:
> On 02.09.21 22:17, Peter Xu wrote:
> > It was conditionally done previously, as there's one shmem special case that we
> > use SetPageDirty() instead.  However that's not necessary and it should be
> > easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
> > 
> > The most recent discussion about this is here, where Hugh explained the history
> > of SetPageDirty() and why it's possible that it's not required at all:
> > 
> > https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
> > 
> > Currently mfill_atomic_install_pte() has three callers:
> > 
> >          1. shmem_mfill_atomic_pte
> >          2. mcopy_atomic_pte
> >          3. mcontinue_atomic_pte
> > 
> > After the change: case (1) should have its SetPageDirty replaced by the dirty
> > bit on pte (so we unify them together, finally), case (2) should have no
> > functional change at all as it has page_in_cache==false, case (3) may add a
> > dirty bit to the pte.  However since case (3) is UFFDIO_CONTINUE for shmem,
> > it's merely 100% sure the page is dirty after all, so should not make a real
> > difference either.
> 
> Would it be worth adding VM_BUG_ON() to make sure that "100%" is really the
> case?

I won't be able to make it 100% sure (and that's where I put it "merely").  The
example discussed between Axel and me in the other thread could be an outlier
(when two processes, uffd target, and uffd minor resolver, map the region as
RO), it's just that neither do I think that's a great matter, nor do I think it
would be worth a BUG_ON(), not to mention we use BUG_ON so carefully.

> 
> > 
> > This should make it much easier to follow on which case will set dirty for
> > uffd, as we'll simply set it all now for all uffd related ioctls.  Meanwhile,
> > no special handling of SetPageDirty() if there's no need.
> 
> To me this all sounds sane, but I'm certainly not an expert on that code, so
> ...

No problem.  I hope this patch didn't bring much headache to a lot of people.
It's just that I do think this is the right thing to do so I will insist until
someone says no to me.  Already appreciate a lot for all the comments and r-bs!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
  2021-09-03 20:00     ` Peter Xu
@ 2021-09-03 20:02       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-09-03 20:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm,
	Andrea Arcangeli, Yang Shi, Matthew Wilcox, Jerome Glisse,
	Mike Rapoport, Kirill A . Shutemov, Miaohe Lin, Alistair Popple,
	Axel Rasmussen

On 03.09.21 22:00, Peter Xu wrote:
> On Fri, Sep 03, 2021 at 09:42:34AM +0200, David Hildenbrand wrote:
>> On 02.09.21 22:17, Peter Xu wrote:
>>> It was conditionally done previously, as there's one shmem special case that we
>>> use SetPageDirty() instead.  However that's not necessary and it should be
>>> easier and cleaner to do it unconditionally in mfill_atomic_install_pte().
>>>
>>> The most recent discussion about this is here, where Hugh explained the history
>>> of SetPageDirty() and why it's possible that it's not required at all:
>>>
>>> https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/
>>>
>>> Currently mfill_atomic_install_pte() has three callers:
>>>
>>>           1. shmem_mfill_atomic_pte
>>>           2. mcopy_atomic_pte
>>>           3. mcontinue_atomic_pte
>>>
>>> After the change: case (1) should have its SetPageDirty replaced by the dirty
>>> bit on pte (so we unify them together, finally), case (2) should have no
>>> functional change at all as it has page_in_cache==false, case (3) may add a
>>> dirty bit to the pte.  However since case (3) is UFFDIO_CONTINUE for shmem,
>>> it's merely 100% sure the page is dirty after all, so should not make a real
>>> difference either.
>>
>> Would it be worth adding VM_BUG_ON() to make sure that "100%" is really the
>> case?
> 
> I won't be able to make it 100% sure (and that's where I put it "merely").  The
> example discussed between Axel and me in the other thread could be an outlier
> (when two processes, uffd target, and uffd minor resolver, map the region as
> RO), it's just that neither do I think that's a great matter, nor do I think it
> would be worth a BUG_ON(), not to mention we use BUG_ON so carefully.

Agreed then, if we really expect there are corner cases and that the 
corner cases are fine!

(VM_BUG_ON() could have helped to catch these while testing)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd
  2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
                   ` (4 preceding siblings ...)
  2021-09-02 20:18 ` [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags Peter Xu
@ 2021-09-08  0:35 ` Peter Xu
  5 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2021-09-08  0:35 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm
  Cc: Andrea Arcangeli, Yang Shi, Matthew Wilcox, Jerome Glisse,
	Mike Rapoport, Kirill A . Shutemov, Miaohe Lin,
	David Hildenbrand, Alistair Popple

On Thu, Sep 02, 2021 at 04:17:16PM -0400, Peter Xu wrote:
> [Based on tag v5.14, but it should still apply to -mm too.  If not, I can
>  repost anytime]
> 
> Hugh,
> 
> So I found one thing that I feel like a bug of commit 22061a1ffabdb9c3, but I'm
> not sure.  If that's the case, patch 5 of this series may be the fix for it.
> 
> The problem is unmap_mapping_page() in current tree is calling
> unmap_mapping_range_tree() with a details pointer, while by default when detail
> pointer is specified, it means "we want to skip zapping swap entries".
> 
> I didn't mention this in v1 simply because I thought it was fine, e.g., swap
> entry won't be kept in shmem ptes so skipped is okay (it is never okay with
> shmem uffd-wp but uffd-wp code is not landed yet).  However I just remembered
> there could also be e.g. shmem migration entries if I'm not wrong.  From that
> pov, skipping swap entries for unmap_mapping_page() seems wrong.  Would you
> please help check?

I figured this seems to have no problem: firstly commit 22061a1ffabdb9c3 didn't
really change the behavior of the code because previously it was using
unmap_mapping_range(), which will always pass in a "details" pointer anyway.

Meanwhile there won't be migration entry for this page because the new helper
unmap_mapping_page() has page lock held, while migration requires that too
during moving the pages.

Anyway, sorry for the noise. I'll respin but drop these paragraphs, also in the
last patch's commit message.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags
  2021-09-03  7:25   ` David Hildenbrand
  2021-09-03  7:31     ` David Hildenbrand
@ 2021-09-08  0:43     ` Peter Xu
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Xu @ 2021-09-08  0:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Hugh Dickins, Andrew Morton, linux-kernel, Miaohe Lin,
	Matthew Wilcox, Yang Shi, Kirill A . Shutemov, Jerome Glisse,
	Alistair Popple, Andrea Arcangeli, Mike Rapoport

On Fri, Sep 03, 2021 at 09:25:51AM +0200, David Hildenbrand wrote:
> On 02.09.21 22:18, Peter Xu wrote:
> > Firstly, the comment in zap_pte_range() is misleading because it checks against
> > details rather than check_mappings, so it's against what the code did.
> > 
> > Meanwhile, it's confusing too on not explaining why passing in the details
> 
> s/on//
> 
> > pointer would mean to skip all swap entries.  New user of zap_details could
> > very possibly miss this fact if they don't read deep until zap_pte_range()
> > because there's no comment at zap_details talking about it at all, so swap
> > entries could be errornously skipped without being noticed.
> 
> s/errornously/erroneously/

Will fix, thanks.

> 
> > 
> > Actually very recently we introduced unmap_mapping_page() in 22061a1ffabd, I
> > think that should also look into swap entries.  Add a comment there.  IOW, this
> > patch will be a functional change to unmap_mapping_page() but hopefully in the
> > right way to do it.

I'll remove this paragraph, as explained elsewhere.

> > 
> > This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
> > but introduce ZAP_FLAG_SKIP_SWAP flag, which means the opposite of previous
> > "details" parameter: the caller should explicitly set this to skip swap
> > entries, otherwise swap entries will always be considered (which should still
> > be the major case here).
> > 
> > Cc: Kirill A. Shutemov <kirill@shutemov.name>
> > Cc: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   include/linux/mm.h | 16 ++++++++++++++++
> >   mm/memory.c        |  6 +++---
> >   2 files changed, 19 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 81e402a5fbc9..a7bcdb2ec956 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1716,12 +1716,18 @@ static inline bool can_do_mlock(void) { return false; }
> >   extern int user_shm_lock(size_t, struct ucounts *);
> >   extern void user_shm_unlock(size_t, struct ucounts *);
> > +typedef unsigned int __bitwise zap_flags_t;
> > +
> > +/* Whether to skip zapping swap entries */
> > +#define  ZAP_FLAG_SKIP_SWAP  ((__force zap_flags_t) BIT(0))
> 
> Interestingly, this will also skip fake some swap entries (e.g., migration
> entries but not private/exclusive entries). Maybe extend that documentation
> a bit.
> 
> ... but, looking into zap_pmd_range(), we don't care about "details" when
> calling zap_huge_pmd(), which will zap pmd migration entries IIUC ... so
> it's really unclear to me what the flag (and current behavior) really is and
> what should be documented. Should we maybe really only care about "real"
> swap entries?

Indeed, I tried to look into it today and see why we wanted to skip swap
entries but I failed to figure it out easily - it goes back to the 1st git
commit of Linux.

Maybe there'll be some experienced developer who knows the history, but before
that I decided to just keep the behavior.

The final goal of mine is to make the code clean and also prepares for the
uffd-wp that will allow free-style use of "details" pointer, rather than have
an implicit hint on "skip swap entry" then it'll be enough for this patch.

> 
> Most probably I'm just missing something important.
> 
> > +
> >   /*
> >    * Parameter block passed down to zap_pte_range in exceptional cases.
> >    */
> >   struct zap_details {
> >   	struct address_space *zap_mapping;	/* Check page->mapping if set */
> >   	struct page *single_page;		/* Locked page to be unmapped */
> > +	zap_flags_t zap_flags;			/* Extra flags for zapping */
> >   };
> >   /*
> > @@ -1737,6 +1743,16 @@ zap_skip_check_mapping(struct zap_details *details, struct page *page)
> >   	return details->zap_mapping != page_rmapping(page);
> >   }
> > +/* Return true if skip swap entries, false otherwise */
> > +static inline bool
> > +zap_skip_swap(struct zap_details *details)
> > +{
> > +	if (!details)
> > +		return false;
> > +
> > +	return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > +}
> > +
> >   struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> >   			     pte_t pte);
> >   struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e5ee8399d270..4cb269ca8249 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1379,8 +1379,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >   			continue;
> >   		}
> > -		/* If details->check_mapping, we leave swap entries. */
> > -		if (unlikely(details))
> > +		if (unlikely(zap_skip_swap(details)))
> >   			continue;
> >   		if (!non_swap_entry(entry))
> > @@ -3351,6 +3350,7 @@ void unmap_mapping_page(struct page *page)
> >   	first_index = page->index;
> >   	last_index = page->index + thp_nr_pages(page) - 1;
> > +	/* Keep ZAP_FLAG_SKIP_SWAP cleared because we're truncating */

As to keep the behavior, I shouldn't fiddle around with this, so I'll also
attach ZAP_FLAG_SKIP_SWAP to unmap_mapping_page() too in v3.

> >   	details.zap_mapping = mapping;
> >   	details.single_page = page;

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/5] mm: Clear vmf->pte after pte_unmap_same() returns
  2021-09-02 20:17 ` [PATCH v2 2/5] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
@ 2021-09-08  1:12   ` Liam Howlett
  0 siblings, 0 replies; 18+ messages in thread
From: Liam Howlett @ 2021-09-08  1:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, linux-mm,
	Andrea Arcangeli, Yang Shi, Matthew Wilcox, Jerome Glisse,
	Mike Rapoport, Kirill A . Shutemov, Miaohe Lin,
	David Hildenbrand, Alistair Popple

* Peter Xu <peterx@redhat.com> [210902 16:17]:
> pte_unmap_same() will always unmap the pte pointer.  After the unmap, vmf->pte
> will not be valid any more, we should clear it.
> 
> It was safe only because no one is accessing vmf->pte after pte_unmap_same()
> returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
> where vmf->pte will in most cases be overwritten very soon.
> 
> Directly pass in vmf into pte_unmap_same() and then we can also avoid the long
> parameter list too, which should be a nice cleanup.
> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 25fc46e87214..7b095f07c4ef 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2724,19 +2724,19 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
>   * proceeding (but do_wp_page is only called after already making such a check;
>   * and do_anonymous_page can safely check later on).
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> -				pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
>  	int same = 1;
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
>  	if (sizeof(pte_t) > sizeof(unsigned long)) {
> -		spinlock_t *ptl = pte_lockptr(mm, pmd);
> +		spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>  		spin_lock(ptl);
> -		same = pte_same(*page_table, orig_pte);
> +		same = pte_same(*vmf->pte, vmf->orig_pte);
>  		spin_unlock(ptl);
>  	}
>  #endif
> -	pte_unmap(page_table);
> +	pte_unmap(vmf->pte);
> +	vmf->pte = NULL;
>  	return same;
>  }
>  
> @@ -3487,7 +3487,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
>  
> -	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> +	if (!pte_unmap_same(vmf))
>  		goto out;
>  
>  	entry = pte_to_swp_entry(vmf->orig_pte);
> -- 
> 2.31.1
> 

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-09-08  1:12 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 20:17 [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu
2021-09-02 20:17 ` [PATCH v2 1/5] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte Peter Xu
2021-09-03  7:42   ` David Hildenbrand
2021-09-03 20:00     ` Peter Xu
2021-09-03 20:02       ` David Hildenbrand
2021-09-02 20:17 ` [PATCH v2 2/5] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
2021-09-08  1:12   ` Liam Howlett
2021-09-02 20:17 ` [PATCH v2 3/5] mm: Drop first_index/last_index in zap_details Peter Xu
2021-09-02 20:18 ` [PATCH v2 4/5] mm: Add zap_skip_check_mapping() helper Peter Xu
2021-09-03  0:58   ` Alistair Popple
2021-09-03  1:39     ` Peter Xu
2021-09-03  1:50       ` Alistair Popple
2021-09-03  7:07         ` David Hildenbrand
2021-09-02 20:18 ` [PATCH v2 5/5] mm: Add ZAP_FLAG_SKIP_SWAP and zap_flags Peter Xu
2021-09-03  7:25   ` David Hildenbrand
2021-09-03  7:31     ` David Hildenbrand
2021-09-08  0:43     ` Peter Xu
2021-09-08  0:35 ` [PATCH v2 0/5] mm: A few cleanup patches around zap, shmem and uffd Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).