LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: "Ahmed S. Darwish" <a.darwish@linutronix.de>,
	linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>, Hugh Dickins <hughd@google.com>,
	Jan Kara <jack@suse.cz>, Jann Horn <jannh@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Linux-MM <linux-mm@kvack.org>, Michal Hocko <mhocko@suse.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: Re: [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork
Date: Sat, 31 Oct 2020 11:26:05 -0400
Message-ID: <20201031152605.GD6357@xz-x1> (raw)
In-Reply-To: <20201030235121.GQ2620339@nvidia.com>

On Fri, Oct 30, 2020 at 08:51:21PM -0300, Jason Gunthorpe wrote:
> > Another pure question: I'm just curious how you find all the statically
> > definied mm_structs, and to make sure all of them are covered (just in case
> > un-initialized seqcount could fail strangely).
> 
> I searched for all MMAP_LOCK_INITIALIZER() places and assumed that
> Michel got them all when he added it :)

Hmm, I should have noticed that before I ask.. :)

> 
> > Actually I'm thinking whether we should have one place to keep all the init
> > vars for all the statically definied mm_structs, so we don't need to find them
> > everytime, but only change that one place.
> 
> I was thinking that as well, most of the places are all the same

Yes, we can work on top.

> 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c48f8df6e50268..294c2c3c4fe00d 100644
> > > +++ b/mm/memory.c
> > > @@ -1171,6 +1171,12 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > >  		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > >  					0, src_vma, src_mm, addr, end);
> > >  		mmu_notifier_invalidate_range_start(&range);
> > > +		/*
> > > +		 * The read side doesn't spin, it goes to the mmap_lock, so the
> > > +		 * raw version is used to avoid disabling preemption here
> > > +		 */
> > > +		mmap_assert_write_locked(src_mm);
> > > +		raw_write_seqcount_t_begin(&src_mm->write_protect_seq);
> > 
> > Would raw_write_seqcount_begin() be better here?
> 
> Hum..
> 
> I felt no because it had the preempt stuff added into it, however it
> would work - __seqcount_lock_preemptible() == false for the seqcount_t
> case (see below)
> 
> Looking more closely, maybe the right API to pick is
> write_seqcount_t_begin() and write_seqcount_t_end() ??
> 
> However, no idea what the intention of the '*_seqcount_t_*' family is
> - it only seems to be used to implement the seqlock..
> 
> Lets add Amhed, perhaps he can give some guidance (see next section)?

IMHO we shouldn't directly use these helpers since they seem to only be used by
lock-associated versions of seqcount types.  But yeah, Amhed would be the best
one to answer...

> 
> > My understanding is that we used raw_write_seqcount_t_begin() because we're
> > with spin lock so assuming we disabled preemption already.
> 
> Here we rely on the exclusive mmap_lock, not a spinlock. This ensures
> only one write side is running concurrently as required by seqcount.

So imho here we have these things to consider during one thread updating the
seqcount_t:

  0. Concurrent read is perfectly welcomed, for sure.

  1. Concurrent writes on seqcount_t: mm sem protects it.

  2. Preempted write (if possible, maybe on RT?): I think it's also protected
     by mm sem, so looks ok too to me.

  3. Preempted/interrupted read on seqcount_t.  Seems to be the one discussed
     below.  Looks safe to me now with below explanation.  However...

> 
> The concern about preemption disable is that it wasn't held for fork()
> before, and we don't need it.. I understand preemption disable regions
> must be short or the RT people will not be happy, holding one across
> all of copy_page_range() sounds bad.
> 
> Ahmed explained in commit 8117ab508f the reason the seqcount_t write
> side has preemption disabled is because it can livelock RT kernels if
> the read side is spinning after preempting the write side. eg look at
> how __read_seqcount_begin() is implemented:
> 
> 	while ((seq = __seqcount_sequence(s)) & 1)			\
> 		cpu_relax();						\
> 
> However, in this patch, we don't spin on the read side.

... Shall we document this explicitly (if this patch still needs a repost)?
Seems not straightforward since that seems not the usual way to use seqcount,
not sure whether I'm the only one that feels this way, though.

> 
> If the read side collides with a writer it immediately goes to the
> mmap_lock, which is sleeping, and so it will sort itself out properly,
> even if it was preempted.
> 
> > An even further pure question on __seqcount_preemptible() (feel free to ignore
> > this question!): I saw that __seqcount_preemptible() seems to have been
> > constantly defined as "return false".  Not sure what happened there..
> 
> The new code has a range of seqcount_t types see
> Documentation/locking/seqlock.rst 'Sequence counters with associated
> locks'
> 
> It uses _Generic to do a bit of meta-programming and creates a compile
> time table of lock properties:
> 
> SEQCOUNT_LOCKNAME(raw_spinlock, raw_spinlock_t,  false,    s->lock,        raw_spin, raw_spin_lock(s->lock))
> SEQCOUNT_LOCKNAME(spinlock,     spinlock_t,      __SEQ_RT, s->lock,        spin,     spin_lock(s->lock))
> SEQCOUNT_LOCKNAME(rwlock,       rwlock_t,        __SEQ_RT, s->lock,        read,     read_lock(s->lock))
> SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     s->lock,        mutex,    mutex_lock(s->lock))
> SEQCOUNT_LOCKNAME(ww_mutex,     struct ww_mutex, true,     &s->lock->base, ww_mutex, ww_mutex_lock(s->lock, NULL))
> 
> As well as as default set of properties for normal seqcount_t. The
> __seqcount_preemptible() is selected by the _Generic for seqcount_t:
> 
> #define __seqprop(s, prop) _Generic(*(s),				\
> 	seqcount_t:		__seqprop_##prop((void *)(s)),		\
> 
> And it says preemption must be disabled before using the lock:
> 
> static inline void __seqprop_assert(const seqcount_t *s)
> {
> 	lockdep_assert_preemption_disabled();
> }
> 
> And thus no need to have an automatic disable preemption:
> 
> static inline bool __seqprop_preemptible(const seqcount_t *s)
> {
> 	return false;
> }
> 
> Other lock subtypes are different, eg the codegen for mutex will use
> lockdep_assert_held(s->lock) for _assert and true for _preemptible()
> 
> So if we map the 'write begin' entry points:
> 
>  write_seqcount_begin - Enforces preemption off
>  raw_write_seqcount_begin - Auto disable preemption if required (false)
>  raw_write_seqcount_t_begin - No preemption stuff
>  write_seqcount_t_begin - No preemption stuff

Thanks for listing these details.

As a summary, I think I'm convinced maybe we can have this work without disable
preemtion.  It's just that some more comment might be even better.

The other thing is, considering this use of seqcount seems to be quite special
as explained below, I'm just not sure whether this would confuse lockdep or
kcsan, etc., if we decide to use write_seqcount_t_begin().

Thanks,

-- 
Peter Xu


  parent reply index

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com>
2020-10-30 14:46 ` [PATCH v2 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe
2020-10-30 16:29   ` Jan Kara
2020-10-30 21:31   ` John Hubbard
2020-10-30 22:36   ` Peter Xu
2020-11-02 22:19 ` [PATCH v2 0/2] Add a seqcount between gup_fast and copy_page_range() Ahmed S. Darwish
2020-11-02 22:39   ` Linus Torvalds
2020-11-02 23:18     ` Ahmed S. Darwish
     [not found] ` <2-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com>
2020-10-30 16:51   ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Jan Kara
     [not found]     ` <20201030170226.GF2620339@nvidia.com>
2020-11-02  8:31       ` Jan Kara
2020-10-30 22:52   ` Peter Xu
     [not found]     ` <20201030235121.GQ2620339@nvidia.com>
2020-10-31 15:26       ` Peter Xu [this message]
2020-11-03  0:33         ` Ahmed S. Darwish
2020-11-03  0:17       ` Ahmed S. Darwish
     [not found]         ` <20201103002532.GL2620339@nvidia.com>
2020-11-03  0:41           ` Ahmed S. Darwish
2020-11-03  2:20             ` John Hubbard
2020-11-03  6:52               ` Ahmed S. Darwish
2020-11-03 17:40                 ` Linus Torvalds
2020-11-04  1:32                   ` Ahmed S. Darwish
2020-11-04  2:01                     ` John Hubbard
2020-11-04  3:17                       ` Ahmed S. Darwish
2020-11-04 18:38                         ` Linus Torvalds
2020-11-04 19:54                           ` Ahmed S. Darwish
2020-12-09 18:38                           ` [tip: locking/core] seqlock: Prefix internal seqcount_t-only macros with a "do_" tip-bot2 for Ahmed S. Darwish
2020-12-09 18:38                           ` [tip: locking/core] seqlock: kernel-doc: Specify when preemption is automatically altered tip-bot2 for Ahmed S. Darwish
2020-11-10 11:53                         ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Zijlstra
2020-12-03 10:35                           ` [tip: locking/core] seqlock: Rename __seqprop() users tip-bot2 for Peter Zijlstra
2020-11-03 17:03         ` [PATCH v2 2/2] mm: prevent gup_fast from racing with COW during fork Peter Xu
2020-11-02 23:58   ` Ahmed S. Darwish

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201031152605.GD6357@xz-x1 \
    --to=peterx@redhat.com \
    --cc=a.darwish@linutronix.de \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lkml.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lkml.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lkml.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lkml.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lkml.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lkml.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lkml.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lkml.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lkml.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lkml.kernel.org/lkml/9 lkml/git/9.git
	git clone --mirror https://lkml.kernel.org/lkml/10 lkml/git/10.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lkml.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git