From: Keith Busch <keith.busch@linux.intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Andrew Lutomirski <amluto@gmail.com>,
	Jesse Vincent <jesse@fsck.com>, Sagi Grimberg <sagi@grimberg.me>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nvme@lists.infradead.org, Jens Axboe <axboe@fb.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: Another NVMe failure, this time with AER info
Date: Fri, 11 May 2018 11:26:11 -0600
Message-ID: <20180511172610.GB7344@localhost.localdomain>
In-Reply-To: <20180511165752.GG190385@bhelgaas-glaptop.roam.corp.google.com>

On Fri, May 11, 2018 at 11:57:52AM -0500, Bjorn Helgaas wrote:
> We reported several corrected errors before the nvme timeout:
> 
>   [12750.281158] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>   [12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable controller
>   [12750.305196] nvme 0000:01:00.0: enabling device (0000 -> 0002)
>   [12750.305465] nvme nvme0: Removing after probe failure status: -19
>   [12750.313188] nvme nvme0: I/O 456 QID 2 timeout, disable controller
>   [12750.329152] nvme nvme0: I/O 457 QID 2 timeout, disable controller
> 
> The corrected errors are supposedly recovered in hardware without
> software intervention, and AER logs them for informational purposes.
> 
> But it seems very likely that these corrected errors are related to
> the nvme timeout: the first corrected errors were logged at
> 12720.894411, nvme_io_timeout defaults to 30 seconds, and the nvme
> timeout was at 12750.281158.
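
(The gap between the two quoted timestamps can be checked directly. A
minimal sketch that uses only the numbers quoted above; the awk one-liner
is illustrative and was not part of the original report.)

```shell
# Seconds between the first corrected error (12720.894411) and the
# "controller is down" message (12750.281158), both taken from the
# quoted text above.
awk 'BEGIN { printf "%.6f\n", 12750.281158 - 12720.894411 }'
```

The difference comes out a little under the 30-second default mentioned
in the quote.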

The nvme_timeout handling is broken at the moment, but I'm not sure any
of the fixes being considered will help here if we're really getting
MMIO errors (that's what it looks like).
 
> I don't have any good ideas.  As a shot in the dark, you could try
> running these commands before doing a suspend:
> 
>   # setpci -s01:00.0 0x98.W
>   # setpci -s00:1c.0 0x68.W
>   # setpci -s01:00.0 0x198.L
>   # setpci -s00:1c.0 0x208.L
> 
>   # setpci -s01:00.0 0x198.L=0x00000000
>   # setpci -s01:00.0 0x98.W=0x0000
>   # setpci -s00:1c.0 0x208.L=0x00000000
>   # setpci -s00:1c.0 0x68.W=0x0000
> 
>   # lspci -vv -s00:1c.0
>   # lspci -vv -s01:00.0
> 
> The idea is to turn off ASPM L1.2 and LTR, just because that's new and
> we've had issues with it before.  If you try this, please collect the
> output of the commands above in addition to the dmesg log, in case my
> math is bad.

I trust you know the offsets here, but with hard-coded addresses it's
hard to tell what this is doing. Just to be safe and for clarity, I
recommend the 'CAP_*+<offset>' form with a mask.

For example, disabling ASPM L1.2 can look like:

 # setpci -s <B:D.f> CAP_PM+8.l=0:4

And disabling LTR:

 # setpci -s <B:D.f> CAP_EXP+28.w=0:400
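
For reference, setpci's 'value:mask' form changes only the bits that are
set in the mask and leaves the rest of the register alone, so '=0:400'
clears bit 10 (0x400) without touching anything else. A minimal sketch of
that read-modify-write in plain shell arithmetic; apply_mask is a
hypothetical helper, and the starting register values are made up for
illustration rather than read from hardware:

```shell
#!/bin/sh
# Sketch of the "value:mask" write semantics: bits set in MASK take the
# corresponding bits of VALUE, all other bits keep their old contents.
# apply_mask is a hypothetical helper, not part of pciutils.
apply_mask() {
    old=$1
    value=$2
    mask=$3
    printf '0x%08x\n' $(( ($old & ~$mask) | ($value & $mask) ))
}

# Illustrative register contents only.
apply_mask 0x0000000f 0 0x4      # "=0:4"   clears bit 2, keeps bits 0, 1, 3
apply_mask 0x00000400 0 0x400    # "=0:400" clears bit 10
```

Reading the register back with setpci (the same command without '=...')
before and after the write is an easy way to confirm that only the
intended bit changed.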
