LKML Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast()
       [not found] <0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
@ 2020-11-10 23:44 ` Jason Gunthorpe
  2020-11-11 18:27 ` [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Linus Torvalds
       [not found] ` <2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
  2 siblings, 0 replies; 4+ messages in thread
From: Jason Gunthorpe @ 2020-11-10 23:44 UTC (permalink / raw)
  To: linux-kernel, Peter Xu, Linus Torvalds
  Cc: Ahmed S. Darwish, Andrea Arcangeli, Andrew Morton,
	Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn,
	John Hubbard, Kirill Shutemov, Kirill Tkhai, Linux-MM,
	Michal Hocko, Oleg Nesterov

The next patch in this series makes the lockless flow a little more
complex, so move the entire block into a new function and remove a level
of indention. Tidy a bit of cruft:

 - addr is always the same as start, so use start

 - Use the modern check_add_overflow() for computing end = start + len

 - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to
   avoid shift overflow, make the variables unsigned long to avoid coding
   casts in both places. nr_pinned was missing its cast

 - The handling of ret and nr_pinned can be streamlined a bit

No functional change.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 99 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 54 insertions(+), 45 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609c3..c7e24301860abb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2677,13 +2677,43 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
-static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+static unsigned long lockless_pages_from_mm(unsigned long start,
+					    unsigned long end,
+					    unsigned int gup_flags,
+					    struct page **pages)
+{
+	unsigned long flags;
+	int nr_pinned = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
+	    !gup_fast_permitted(start, end))
+		return 0;
+
+	/*
+	 * Disable interrupts. The nested form is used, in order to allow full,
+	 * general purpose use of this routine.
+	 *
+	 * With interrupts disabled, we block page table pages from being freed
+	 * from under us. See struct mmu_table_batch comments in
+	 * include/asm-generic/tlb.h for more details.
+	 *
+	 * We do not adopt an rcu_read_lock() here as we also want to block IPIs
+	 * that come from THPs splitting.
+	 */
+	local_irq_save(flags);
+	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
+	local_irq_restore(flags);
+	return nr_pinned;
+}
+
+static int internal_get_user_pages_fast(unsigned long start,
+					unsigned long nr_pages,
 					unsigned int gup_flags,
 					struct page **pages)
 {
-	unsigned long addr, len, end;
-	unsigned long flags;
-	int nr_pinned = 0, ret = 0;
+	unsigned long len, end;
+	unsigned long nr_pinned;
+	int ret;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
@@ -2697,54 +2727,33 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 		might_lock_read(&current->mm->mmap_lock);
 
 	start = untagged_addr(start) & PAGE_MASK;
-	addr = start;
-	len = (unsigned long) nr_pages << PAGE_SHIFT;
-	end = start + len;
-
-	if (end <= start)
+	len = nr_pages << PAGE_SHIFT;
+	if (check_add_overflow(start, len, &end))
 		return 0;
 	if (unlikely(!access_ok((void __user *)start, len)))
 		return -EFAULT;
 
-	/*
-	 * Disable interrupts. The nested form is used, in order to allow
-	 * full, general purpose use of this routine.
-	 *
-	 * With interrupts disabled, we block page table pages from being
-	 * freed from under us. See struct mmu_table_batch comments in
-	 * include/asm-generic/tlb.h for more details.
-	 *
-	 * We do not adopt an rcu_read_lock(.) here as we also want to
-	 * block IPIs that come from THPs splitting.
-	 */
-	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) {
-		unsigned long fast_flags = gup_flags;
-
-		local_irq_save(flags);
-		gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned);
-		local_irq_restore(flags);
-		ret = nr_pinned;
-	}
-
-	if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) {
-		/* Try to get the remaining pages with get_user_pages */
-		start += nr_pinned << PAGE_SHIFT;
-		pages += nr_pinned;
+	nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages);
+	if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY)
+		return nr_pinned;
 
-		ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
-					      gup_flags, pages);
-
-		/* Have to be a bit careful with return values */
-		if (nr_pinned > 0) {
-			if (ret < 0)
-				ret = nr_pinned;
-			else
-				ret += nr_pinned;
-		}
+	/* Slow path: try to get the remaining pages with get_user_pages */
+	start += nr_pinned << PAGE_SHIFT;
+	pages += nr_pinned;
+	ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags,
+				      pages);
+	if (ret < 0) {
+		/*
+		 * The caller has to unpin the pages we already pinned so
+		 * returning -errno is not an option
+		 */
+		if (nr_pinned)
+			return nr_pinned;
+		return ret;
 	}
-
-	return ret;
+	return ret + nr_pinned;
 }
+
 /**
  * get_user_pages_fast_only() - pin user pages in memory
  * @start:      starting user address
-- 
2.29.2


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range()
       [not found] <0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
  2020-11-10 23:44 ` [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe
@ 2020-11-11 18:27 ` Linus Torvalds
       [not found] ` <2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
  2 siblings, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2020-11-11 18:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linux Kernel Mailing List, Peter Xu, Ahmed S. Darwish,
	Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V,
	Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn,
	John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky,
	Linux-MM, Michal Hocko, Oleg Nesterov

On Tue, Nov 10, 2020 at 3:44 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> As discussed and suggested by Linus use a seqcount to close the small race
> between gup_fast and copy_page_range().
>
> Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
> in this case and it doesn't trigger any lockdeps.
>
> I was able to test it using two threads, one forking and the other using
> ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep
> made the window large enough to reliably hit to test the logic.

Looks all good to me.

             Linus

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork
       [not found] ` <2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
@ 2020-11-11 19:59   ` Peter Xu
  2020-11-12  7:41   ` Ahmed S. Darwish
  1 sibling, 0 replies; 4+ messages in thread
From: Peter Xu @ 2020-11-11 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Linus Torvalds, Ahmed S. Darwish, Andrea Arcangeli,
	Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins,
	Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

On Tue, Nov 10, 2020 at 07:44:09PM -0400, Jason Gunthorpe wrote:
> Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
> fork() for ptes") pages under a FOLL_PIN will not be write protected
> during COW for fork. This means that pages returned from
> pin_user_pages(FOLL_WRITE) should not become write protected while the pin
> is active.
> 
> However, there is a small race where get_user_pages_fast(FOLL_PIN) can
> establish a FOLL_PIN at the same time copy_present_page() is write
> protecting it:
> 
>         CPU 0                             CPU 1
>    get_user_pages_fast()
>     internal_get_user_pages_fast()
>                                        copy_page_range()
>                                          pte_alloc_map_lock()
>                                            copy_present_page()
>                                              atomic_read(has_pinned) == 0
> 					     page_maybe_dma_pinned() == false
>      atomic_set(has_pinned, 1);
>      gup_pgd_range()
>       gup_pte_range()
>        pte_t pte = gup_get_pte(ptep)
>        pte_access_permitted(pte)
>        try_grab_compound_head()
>                                              pte = pte_wrprotect(pte)
> 	                                     set_pte_at();
>                                          pte_unmap_unlock()
>       // GUP now returns with a write protected page
> 
> The first attempt to resolve this by using the write protect caused
> problems (and was missing a barrrier), see commit f3c64eda3e50 ("mm: avoid
> early COW write protect games during fork()")
> 
> Instead wrap copy_p4d_range() with the write side of a seqcount and check
> the read side around gup_pgd_range(). If there is a collision then
> get_user_pages_fast() fails and falls back to slow GUP.
> 
> Slow GUP is safe against this race because copy_page_range() is only
> called while holding the exclusive side of the mmap_lock on the src
> mm_struct.
> 
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork
       [not found] ` <2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
  2020-11-11 19:59   ` [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork Peter Xu
@ 2020-11-12  7:41   ` Ahmed S. Darwish
  1 sibling, 0 replies; 4+ messages in thread
From: Ahmed S. Darwish @ 2020-11-12  7:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli,
	Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins,
	Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

On Tue, Nov 10, 2020 at 07:44:09PM -0400, Jason Gunthorpe wrote:
...
>
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

Thanks for the v3 and v4 updates.

For the seqcount_t parts:

  Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de>

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, back to index

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
2020-11-10 23:44 ` [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe
2020-11-11 18:27 ` [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Linus Torvalds
     [not found] ` <2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com>
2020-11-11 19:59   ` [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork Peter Xu
2020-11-12  7:41   ` Ahmed S. Darwish

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lkml.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lkml.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lkml.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lkml.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lkml.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lkml.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lkml.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lkml.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lkml.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lkml.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lkml.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git