LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Niklas Schnelle <schnelle@linux.ibm.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Paul Mackerras <paulus@samba.org>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-s390@vger.kernel.org, linux-arch@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v3] PCI: Move pci_dev_is/assign_added() to pci.h
Date: Thu, 26 Aug 2021 14:36:27 +0200	[thread overview]
Message-ID: <87d15d5eead35c9eaa667958d057cf4a81a8bf13.camel@linux.ibm.com> (raw)
In-Reply-To: <20210825190444.GA3593752@bjorn-Precision-5520>

On Wed, 2021-08-25 at 14:04 -0500, Bjorn Helgaas wrote:
> On Mon, Aug 23, 2021 at 12:53:39PM +0200, Niklas Schnelle wrote:
> > On Fri, 2021-08-20 at 17:37 -0500, Bjorn Helgaas wrote:
> > > On Tue, Jul 20, 2021 at 05:01:45PM +0200, Niklas Schnelle wrote:
> > > > The helper function pci_dev_is_added() from drivers/pci/pci.h is used in
> > > > PCI arch code of both s390 and powerpc leading to awkward relative
> > > > includes. Move it to the global include/linux/pci.h and get rid of these
> > > > includes just for that one function.
> > > 
> > > I agree the includes are awkward.
> > > 
> > > But the arch code *using* pci_dev_is_added() seems awkward, too.
> > 
> > See below for my interpretation why s390 has some driver like
> > functionality in its arch code which isn't necessarily awkward.
> > 
> > Independent from that I have found pci_dev_is_added() as the only way
> > deal with the case that one might be looking at a struct pci_dev
> > reference that has been removed via pci_stop_and_remove_bus_device() or
> > has never been fully scanned. This is quite useful when handling error
> > events which on s390 are part of the adapter event mechanism shared
> > with channel I/O devices.
> > 
> > > AFAICS, in powerpc, pci_dev_is_added() is only used by
> > > pnv_pci_ioda_fixup_iov() and pseries_pci_fixup_iov_resources().  Those
> > > are only called from pcibios_add_device(), which is only called from
> > > pci_device_add().
> > > 
> > > Is it even possible for pci_dev_is_added() to be true in that path?
> 
> If the pci_dev_is_added() in powerpc is unreachable, we can remove it
> and at least reduce this to an s390-only problem.

Ok. I might be missing something but I agree it does look these are
called from within pcibios_add_device() only so pci_dev_is_added() can
never be true. This looks pretty clear as pci_dev_assign_added() is
called after pcibios_add_device() in pci_bus_add_device().

> 
> > > s390 uses pci_dev_is_added() in recover_store()
> > 
> > I'm actually looking into this as I'm working on an s390 implementation
> > of the PCI recovery flow described in Documentation/PCI/pci-error-
> > recovery.rst that would also call pci_dev_is_added() because when we
> > get a platform notification of a PCI reset done by firmware it may be
> > that the struct pci_dev is going away i.e. we still have a ref count
> > but it is not added to the PCI bus anymore. And pci_dev_is_added() is
> > the only way I've found to check for this state.
> > 
> > > , but I don't know what
> > > that is (looks like a sysfs file, but it's not documented) or why s390
> > > is the only arch that does this.
> > 
> > Good point about this not being documented, I'll look into adding docs.
> > 
> > This is a sysfs attribute that basically removes the pci_dev and re-
> > adds it. This has the complication that since the attribute sits at
> > /sys/bus/pci/devices/<dev>/recover it deletes its own parent directory
> > which requires extra caution and means concurrent accesses block on
> > pci_lock_rescan_remove() instead of a kernfs lock.
> > Long story short when concurrently triggering the attribute one thread
> > proceeds into the pci_lock_rescan_remove() section and does the
> > removal, while others would block on pci_lock_rescan_remove(). Now when
> > the threads unblock the removal is done. In this case there is a new
> > struct pci_dev found in the rescan but the previously blocked threads
> > still have references to the old struct pci_dev which was removed and
> > as far as I could tell can only be distinguished by checking
> > pci_dev_is_added().
> 
> Is this locking issue different from concurrently writing to
> /sys/.../remove on other architectures?

In principle it is very similar except that we re-scan and thus the
removed pdev may co-exist with a new pdev for the same actual device if
there are other references to the pdev.

There is however also a significant difference in locking that fixes a
possible deadlock that I just confirmed also affects /sys/../remove
where it is hidden by a lockdep ignore, see lockdep splash below. 

For /sys/../recover this was fixed by my commit dd712f0a53ca
("s390/pci: Fix possible deadlock in  recover_store()") which also
introduced the need for pci_dev_is_added().

The change in the above commit is very similar to what is done in
drivers/scsi/scsi_sysfs.c:sdev_store_delete() which also has a self
deletion and in commit 0ee223b2e1f6 ("scsi: core: Avoid that SCSI
device removal through sysfs triggers a deadlock") fixed a very very
similar lock order problem.

Like the SCSI code I use sysfs_break_active_protection() which allows a
concurrent access to /sys/../recover to advance into recover_store().
It may then block on pci_lock_rescan_remove() instead of the kernfs
lock it would block on with just delete_remove_file_self(). Thus we
have to handle unblocking from pci_lock_rescan_remove() while holding
the stale struct pci_dev of the removed device. Together with the re-
scan this means I have to be able to determine after unblocking if I'm
looking at the old pdev.

I just tested that when removing the lockdep ignore from remove_store()
and do /sys/../power and /sys/../remove I do hit the same lock order
inversion issue there for remove. Note that unlike in the commit
introducing the ignore this is a completely flat hiearchy:

[  234.196509] ======================================================
[  234.196511] WARNING: possible circular locking dependency detected
[  234.196512] 5.14.0-rc7-08025-gf6d7568b37df-dirty #18 Not tainted
[  234.196514] ------------------------------------------------------
[  234.196516] zsh/1214 is trying to acquire lock:
[  234.196518] 000000013f296e30 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196529]
               but task is already holding lock:
[  234.196530] 000000009b6c3a10 (kn->active#363){++++}-{0:0}, at: sysfs_remove_file_self+0x42/0x78
[  234.196538]
               which lock already depends on the new lock.

[  234.196539]
               the existing dependency chain (in reverse order) is:
[  234.196541]
               -> #1 (kn->active#363){++++}-{0:0}:
[  234.196545]        validate_chain+0x9ca/0xde8
[  234.196548]        __lock_acquire+0x64c/0xc40
[  234.196550]        lock_acquire.part.0+0xec/0x258
[  234.196552]        lock_acquire+0xb0/0x200
[  234.196554]        kernfs_drain+0x17a/0x1c8
[  234.196556]        __kernfs_remove+0x1f4/0x230
[  234.196558]        kernfs_remove_by_name_ns+0x5c/0xa8
[  234.196560]        remove_files+0x46/0x90
[  234.196563]        sysfs_remove_group+0x5a/0xb8
[  234.196565]        sysfs_remove_groups+0x46/0x68
[  234.196567]        device_remove_attrs+0x90/0xb8
[  234.196598]        device_del+0x192/0x3f8
[  234.196600]        pci_remove_bus_device+0x8a/0x128
[  234.196602]        pci_stop_and_remove_bus_device_locked+0x3a/0x48
[  234.196604]        zpci_bus_remove_device+0x68/0xa8
[  234.196607]        zpci_deconfigure_device+0x3a/0xe0
[  234.196610]        power_write_file+0x7c/0x130
[  234.196615]        kernfs_fop_write_iter+0x13e/0x1e0
[  234.196617]        new_sync_write+0x100/0x190
[  234.196620]        vfs_write+0x21e/0x2d0
[  234.196622]        ksys_write+0x6c/0xf8
[  234.196624]        __do_syscall+0x1c2/0x1f0
[  234.196628]        system_call+0x78/0xa0
[  234.196632]
               -> #0 (pci_rescan_remove_lock){+.+.}-{3:3}:
[  234.196636]        check_noncircular+0x168/0x188
[  234.196637]        check_prev_add+0xe0/0xed8
[  234.196639]        validate_chain+0x9ca/0xde8
[  234.196641]        __lock_acquire+0x64c/0xc40
[  234.196642]        lock_acquire.part.0+0xec/0x258
[  234.196644]        lock_acquire+0xb0/0x200
[  234.196646]        __mutex_lock+0xa2/0x8d8
[  234.196648]        mutex_lock_nested+0x32/0x40
[  234.196649]        pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196652]        remove_store+0x7a/0x88
[  234.196654]        kernfs_fop_write_iter+0x13e/0x1e0
[  234.196656]        new_sync_write+0x100/0x190
[  234.196657]        vfs_write+0x21e/0x2d0
[  234.196659]        ksys_write+0x6c/0xf8
[  234.196661]        __do_syscall+0x1c2/0x1f0
[  234.196663]        system_call+0x78/0xa0
[  234.196665]
               other info that might help us debug this:

[  234.196666]  Possible unsafe locking scenario:

[  234.196668]        CPU0                    CPU1
[  234.196669]        ----                    ----
[  234.196670]   lock(kn->active#363);
[  234.196672]                                lock(pci_rescan_remove_lock);
[  234.196674]                                lock(kn->active#363);
[  234.196677]   lock(pci_rescan_remove_lock);
[  234.196679]
                *** DEADLOCK ***

[  234.196680] 3 locks held by zsh/1214:
[  234.196682]  #0: 0000000099d44498 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x6c/0xf8
[  234.196688]  #1: 00000000b214ee90 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x102/0x1e0
[  234.196693]  #2: 000000009b6c3a10 (kn->active#363){++++}-{0:0}, at: sysfs_remove_file_self+0x42/0x78
[  234.196699]
               stack backtrace:
[  234.196701] CPU: 6 PID: 1214 Comm: zsh Not tainted 5.14.0-rc7-08025-gf6d7568b37df-dirty #18
[  234.196703] Hardware name: IBM 8561 T01 703 (LPAR)
[  234.196705] Call Trace:
[  234.196706]  [<000000013ebba5d0>] show_stack+0x90/0xf8
[  234.196711]  [<000000013ebcc28e>] dump_stack_lvl+0x8e/0xc8
[  234.196714]  [<000000013dfbe908>] check_noncircular+0x168/0x188
[  234.196716]  [<000000013dfbf9c8>] check_prev_add+0xe0/0xed8
[  234.196718]  [<000000013dfc118a>] validate_chain+0x9ca/0xde8
[  234.196720]  [<000000013dfc4184>] __lock_acquire+0x64c/0xc40
[  234.196721]  [<000000013dfc2d8c>] lock_acquire.part.0+0xec/0x258
[  234.196724]  [<000000013dfc2fa8>] lock_acquire+0xb0/0x200
[  234.196725]  [<000000013ebdac7a>] __mutex_lock+0xa2/0x8d8
[  234.196727]  [<000000013ebdb4e2>] mutex_lock_nested+0x32/0x40
[  234.196729]  [<000000013e7daaf6>] pci_stop_and_remove_bus_device_locked+0x26/0x48
[  234.196732]  [<000000013e7e753a>] remove_store+0x7a/0x88
[  234.196734]  [<000000013e35b4be>] kernfs_fop_write_iter+0x13e/0x1e0
[  234.196736]  [<000000013e264098>] new_sync_write+0x100/0x190
[  234.196738]  [<000000013e266f06>] vfs_write+0x21e/0x2d0
[  234.196740]  [<000000013e267224>] ksys_write+0x6c/0xf8
[  234.196742]  [<000000013ebcf9da>] __do_syscall+0x1c2/0x1f0
[  234.196744]  [<000000013ebe2238>] system_call+0x78/0xa0


> 
> > > Maybe we should make powerpc and s390 less special?
> > 
> > On s390, as I see it, the reason for this is that all of the PCI
> > functionality is directly defined in the Architecture as special CPU
> > instructions which are kind of hypercalls but also an ISA extension.
> > 
> > These instructions range from the basic PCI memory accesses (no real
> > MMIO) to enumeration of the devices and on to reporting of hot-plug and
> > and resets/recovery events. Importantly we do not have any kind of
> > direct access to a real or virtual PCI controller and the architecture
> > has no concept of a comparable entity.
> > 
> > So in my opinion while there is some of the functionality of a PCI
> > controller in arch/s390/pci the cut off between controller
> > functionality and arch support isn't clear at all and exposing PCI
> > support as CPU instructions doesn't map well to the controller concept.
> > 
> > That said, in principle I'm open to moving some of that into
> > drivers/pci/controller/ if you think that would improve things and we
> > can find a good argument what should go where. One possible cut off
> > would be to have arch/s390/pci/ provide wrappers to the PCI
> > instructions but move all their uses to  e.g.
> > drivers/pci/controller/s390/. This would of course be a major
> > refactoring and none of that code would be useful on any other
> > architecture but it would move a lot the accesses to PCI common code
> > functionality out of the arch code.
> 
> Looks like hotplug is already in drivers/pci/hotplug/s390_pci_hpc.c.
> 
> Might be worth considering putting the other PCI core-ish code in
> drivers/pci as well, though it doesn't feel urgent to me.  Maybe a
> good internship or mentoring project.

I agree

> 
> I'm not sure this juggling around is worth it basically to just clean
> up the include path.  The downside to me is exposing
> pci_dev_is_added() to outside the PCI core, because I don't want
> to encourage any other users.

Hmm, for me the big question how to then handle the need for
pci_dev_is_added() in my upcoming recovery code, that would be a new
user outside the PCI core unless I move arch/s390/pci/pci_event.c to
drivers/pci.

That said I'm not sure I understand yet what makes pci_dev_is_added()
unsuitable for outside the PCI core. I do understand that struct
pci_dev::priv_flags is supposed to be private to PCI drivers but on the
other hand the functionality of checking whether a PCI device has been
added to a PCI bus seems rather basic and useful whenever working with
a PCI device.

> 
> > > > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > > > ---
> > > > Since v1 (and bad v2):
> > > > - Fixed accidental removal of PCI_DPC_RECOVERED, PCI_DPC_RECOVERING
> > > >   defines and also move these to include/linux/pci.h
> > > > 
> > > >  arch/powerpc/platforms/powernv/pci-sriov.c |  3 ---
> > > >  arch/powerpc/platforms/pseries/setup.c     |  1 -
> > > >  arch/s390/pci/pci_sysfs.c                  |  2 --
> > > >  drivers/pci/hotplug/acpiphp_glue.c         |  1 -
> > > >  drivers/pci/pci.h                          | 15 ---------------
> > > >  include/linux/pci.h                        | 15 +++++++++++++++
> > > >  6 files changed, 15 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > index 28aac933a439..2e0ca5451e85 100644
> > > > --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> > > > @@ -9,9 +9,6 @@
> > > >  
> > > >  #include "pci.h"
> > > >  
> > > > -/* for pci_dev_is_added() */
> > > > -#include "../../../../drivers/pci/pci.h"
> > > > 
> > .. snip ..
> > 


  reply	other threads:[~2021-08-26 12:36 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-20 15:01 Niklas Schnelle
2021-08-20 22:37 ` Bjorn Helgaas
2021-08-23 10:53   ` Niklas Schnelle
2021-08-25 19:04     ` Bjorn Helgaas
2021-08-26 12:36       ` Niklas Schnelle [this message]
2021-10-08 22:13 ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87d15d5eead35c9eaa667958d057cf4a81a8bf13.camel@linux.ibm.com \
    --to=schnelle@linux.ibm.com \
    --cc=benh@kernel.crashing.org \
    --cc=bhelgaas@google.com \
    --cc=helgaas@kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=paulus@samba.org \
    --subject='Re: [PATCH v3] PCI: Move pci_dev_is/assign_added() to pci.h' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).