LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip
@ 2021-08-30 12:37 Jonas Dreßler
  2021-08-30 12:37 ` [PATCH 1/2] mwifiex: Use non-posted PCI register writes Jonas Dreßler
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Jonas Dreßler @ 2021-08-30 12:37 UTC (permalink / raw)
  To: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski
  Cc: Jonas Dreßler, Tsuchiya Yuto, linux-wireless, netdev,
	linux-kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

Two patches which should fix all the random resets of the Marvell 88W8897 PCIe chip.
The first one works around a bug in the firmware that causes it to crash, the second
one makes us try harder to wake up the firmware before we consider it non-responsive.
More explanations in the commit messages.

Jonas Dreßler (2):
  mwifiex: Use non-posted PCI register writes
  mwifiex: Try waking the firmware until we get an interrupt

 drivers/net/wireless/marvell/mwifiex/pcie.c | 35 +++++++++++++++++----
 1 file changed, 29 insertions(+), 6 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-08-30 12:37 [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Jonas Dreßler
@ 2021-08-30 12:37 ` Jonas Dreßler
  2021-08-30 12:49   ` Andy Shevchenko
  2021-09-01 19:40   ` Brian Norris
  2021-08-30 12:37 ` [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt Jonas Dreßler
  2021-09-25 17:32 ` [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Pali Rohár
  2 siblings, 2 replies; 32+ messages in thread
From: Jonas Dreßler @ 2021-08-30 12:37 UTC (permalink / raw)
  To: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski
  Cc: Jonas Dreßler, Tsuchiya Yuto, linux-wireless, netdev,
	linux-kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

On the 88W8897 card it's very important the TX ring write pointer is
updated correctly to its new value before setting the TX ready
interrupt, otherwise the firmware appears to crash (probably because
it's trying to DMA-read from the wrong place).

Since PCI uses "posted writes" when writing to a register, it's not
guaranteed that a write will happen immediately. That means the pointer
might be outdated when setting the TX ready interrupt, leading to
firmware crashes especially when ASPM L1 and L1 substates are enabled
(because of the higher link latency, the write will probably take
longer).

So fix those firmware crashes by always forcing non-posted writes. We do
that by simply reading back the register after writing it, just as a lot
of other drivers do.

There are two reproducers that are fixed with this patch:

1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
substates are platform dependent), the firmware crashes and eventually a
command timeout appears in the logs. That crash is fixed by using a
non-posted write in mwifiex_pcie_send_data().

2) When sending lots of commands to the card, waking it up from sleep in
very quick intervals, the firmware eventually crashes. That crash
appears to be fixed by some other non-posted write included here.

Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
---
 drivers/net/wireless/marvell/mwifiex/pcie.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
index c6ccce426b49..bfd6e135ed99 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -237,6 +237,12 @@ static int mwifiex_write_reg(struct mwifiex_adapter *adapter, int reg, u32 data)
 
 	iowrite32(data, card->pci_mmap1 + reg);
 
+	/* Do a read-back, which makes the write non-posted, ensuring the
+	 * completion before returning.
+	 * The firmware of the 88W8897 card is buggy and this avoids crashes.
+	 */
+	ioread32(card->pci_mmap1 + reg);
+
 	return 0;
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt
  2021-08-30 12:37 [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Jonas Dreßler
  2021-08-30 12:37 ` [PATCH 1/2] mwifiex: Use non-posted PCI register writes Jonas Dreßler
@ 2021-08-30 12:37 ` Jonas Dreßler
  2021-08-30 12:51   ` Andy Shevchenko
  2021-09-25 17:32 ` [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Pali Rohár
  2 siblings, 1 reply; 32+ messages in thread
From: Jonas Dreßler @ 2021-08-30 12:37 UTC (permalink / raw)
  To: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski
  Cc: Jonas Dreßler, Tsuchiya Yuto, linux-wireless, netdev,
	linux-kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

It seems that the firmware of the 88W8897 card sometimes ignores or
misses when we try to wake it up by reading the firmware status
register. This leads to the firmware wakeup timeout expiring and the
driver resetting the card because we assume the firmware has hung up or
crashed (unfortunately that's not unlikely with this card).

Turns out that most of the time the firmware actually didn't hang up,
but simply "missed" our wakeup request and doesn't send us an AWAKE
event.

Trying again to read the firmware status register after a short timeout
usually makes the firmware wake we up as expected, so add a small retry
loop to mwifiex_pm_wakeup_card() that looks at the interrupt status to
check whether the card woke up.

The number of tries and timeout lengths for this were determined
experimentally: The firmware usually takes about 500 us to wake up
after we attempt to read the status register. In some cases where the
firmware is very busy (for example while doing a bluetooth scan) it
might even miss our requests for multiple milliseconds, which is why
after 15 tries the waiting time gets increased to 10 ms. The maximum
number of tries it took to wake the firmware when testing this was
around 20, so a maximum number of 50 tries should give us plenty of
safety margin.

A good reproducer for this issue is letting the firmware sleep and wake
up in very short intervals, for example by pinging an device on the
network every 0.1 seconds.

Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
---
 drivers/net/wireless/marvell/mwifiex/pcie.c | 29 ++++++++++++++++-----
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
index bfd6e135ed99..14742bdb96ef 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -658,6 +658,7 @@ static int mwifiex_pm_wakeup_card(struct mwifiex_adapter *adapter)
 {
 	struct pcie_service_card *card = adapter->card;
 	const struct mwifiex_pcie_card_reg *reg = card->pcie.reg;
+	int n_tries = 0;
 
 	mwifiex_dbg(adapter, EVENT,
 		    "event: Wakeup device...\n");
@@ -665,12 +666,28 @@ static int mwifiex_pm_wakeup_card(struct mwifiex_adapter *adapter)
 	if (reg->sleep_cookie)
 		mwifiex_pcie_dev_wakeup_delay(adapter);
 
-	/* Accessing fw_status register will wakeup device */
-	if (mwifiex_write_reg(adapter, reg->fw_status, FIRMWARE_READY_PCIE)) {
-		mwifiex_dbg(adapter, ERROR,
-			    "Writing fw_status register failed\n");
-		return -1;
-	}
+	/* Access the fw_status register to wake up the device.
+	 * Since the 88W8897 firmware sometimes appears to ignore or miss
+	 * that wakeup request, we continue trying until we receive an
+	 * interrupt from the card.
+	 */
+	do {
+		if (mwifiex_write_reg(adapter, reg->fw_status, FIRMWARE_READY_PCIE)) {
+			mwifiex_dbg(adapter, ERROR,
+				    "Writing fw_status register failed\n");
+			return -1;
+		}
+
+		n_tries++;
+
+		if (n_tries <= 15)
+			usleep_range(400, 700);
+		else
+			msleep(10);
+	} while (n_tries <= 50 && READ_ONCE(adapter->int_status) == 0);
+
+	mwifiex_dbg(adapter, EVENT,
+		    "event: Tried %d times until firmware woke up\n", n_tries);
 
 	if (reg->sleep_cookie) {
 		mwifiex_pcie_dev_wakeup_delay(adapter);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-08-30 12:37 ` [PATCH 1/2] mwifiex: Use non-posted PCI register writes Jonas Dreßler
@ 2021-08-30 12:49   ` Andy Shevchenko
  2021-09-01 14:01     ` Jonas Dreßler
  2021-09-01 19:40   ` Brian Norris
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Shevchenko @ 2021-08-30 12:49 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>
> On the 88W8897 card it's very important the TX ring write pointer is
> updated correctly to its new value before setting the TX ready
> interrupt, otherwise the firmware appears to crash (probably because
> it's trying to DMA-read from the wrong place).
>
> Since PCI uses "posted writes" when writing to a register, it's not
> guaranteed that a write will happen immediately. That means the pointer
> might be outdated when setting the TX ready interrupt, leading to
> firmware crashes especially when ASPM L1 and L1 substates are enabled
> (because of the higher link latency, the write will probably take
> longer).
>
> So fix those firmware crashes by always forcing non-posted writes. We do
> that by simply reading back the register after writing it, just as a lot
> of other drivers do.
>
> There are two reproducers that are fixed with this patch:
>
> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
> substates are platform dependent), the firmware crashes and eventually a
> command timeout appears in the logs. That crash is fixed by using a
> non-posted write in mwifiex_pcie_send_data().
>
> 2) When sending lots of commands to the card, waking it up from sleep in
> very quick intervals, the firmware eventually crashes. That crash
> appears to be fixed by some other non-posted write included here.

Thanks for all this work!

Nevertheless, do we have any commits that may be a good candidate to
be in the Fixes tag here?

> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
> ---
>  drivers/net/wireless/marvell/mwifiex/pcie.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
> index c6ccce426b49..bfd6e135ed99 100644
> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
> @@ -237,6 +237,12 @@ static int mwifiex_write_reg(struct mwifiex_adapter *adapter, int reg, u32 data)
>
>         iowrite32(data, card->pci_mmap1 + reg);
>
> +       /* Do a read-back, which makes the write non-posted, ensuring the
> +        * completion before returning.

> +        * The firmware of the 88W8897 card is buggy and this avoids crashes.

Any firmware version reference? Would be nice to have just for the
sake of record.

> +        */
> +       ioread32(card->pci_mmap1 + reg);
> +
>         return 0;
>  }


-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt
  2021-08-30 12:37 ` [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt Jonas Dreßler
@ 2021-08-30 12:51   ` Andy Shevchenko
  2021-08-30 12:55     ` Andy Shevchenko
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Shevchenko @ 2021-08-30 12:51 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On Mon, Aug 30, 2021 at 3:39 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>
> It seems that the firmware of the 88W8897 card sometimes ignores or
> misses when we try to wake it up by reading the firmware status
> register. This leads to the firmware wakeup timeout expiring and the
> driver resetting the card because we assume the firmware has hung up or
> crashed (unfortunately that's not unlikely with this card).
>
> Turns out that most of the time the firmware actually didn't hang up,
> but simply "missed" our wakeup request and doesn't send us an AWAKE

didn't

> event.
>
> Trying again to read the firmware status register after a short timeout
> usually makes the firmware wake we up as expected, so add a small retry

wake up

> loop to mwifiex_pm_wakeup_card() that looks at the interrupt status to
> check whether the card woke up.
>
> The number of tries and timeout lengths for this were determined
> experimentally: The firmware usually takes about 500 us to wake up
> after we attempt to read the status register. In some cases where the
> firmware is very busy (for example while doing a bluetooth scan) it
> might even miss our requests for multiple milliseconds, which is why
> after 15 tries the waiting time gets increased to 10 ms. The maximum
> number of tries it took to wake the firmware when testing this was
> around 20, so a maximum number of 50 tries should give us plenty of
> safety margin.
>
> A good reproducer for this issue is letting the firmware sleep and wake
> up in very short intervals, for example by pinging an device on the

a device

> network every 0.1 seconds.

...

> +       /* Access the fw_status register to wake up the device.
> +        * Since the 88W8897 firmware sometimes appears to ignore or miss
> +        * that wakeup request, we continue trying until we receive an
> +        * interrupt from the card.
> +        */
> +       do {
> +               if (mwifiex_write_reg(adapter, reg->fw_status, FIRMWARE_READY_PCIE)) {
> +                       mwifiex_dbg(adapter, ERROR,
> +                                   "Writing fw_status register failed\n");
> +                       return -1;
> +               }
> +
> +               n_tries++;
> +
> +               if (n_tries <= 15)
> +                       usleep_range(400, 700);
> +               else
> +                       msleep(10);
> +       } while (n_tries <= 50 && READ_ONCE(adapter->int_status) == 0);

NIH read_poll_timeout() from iopoll.h.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt
  2021-08-30 12:51   ` Andy Shevchenko
@ 2021-08-30 12:55     ` Andy Shevchenko
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Shevchenko @ 2021-08-30 12:55 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On Mon, Aug 30, 2021 at 3:51 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Mon, Aug 30, 2021 at 3:39 PM Jonas Dreßler <verdre@v0yd.nl> wrote:

...

> > +       do {
> > +               if (mwifiex_write_reg(adapter, reg->fw_status, FIRMWARE_READY_PCIE)) {
> > +                       mwifiex_dbg(adapter, ERROR,
> > +                                   "Writing fw_status register failed\n");

> > +                       return -1;

Please, use proper code. -EIO or so.

> > +               }
> > +
> > +               n_tries++;
> > +
> > +               if (n_tries <= 15)
> > +                       usleep_range(400, 700);
> > +               else
> > +                       msleep(10);
> > +       } while (n_tries <= 50 && READ_ONCE(adapter->int_status) == 0);

Can you use definitions for 15 and 50?

> NIH read_poll_timeout() from iopoll.h.

On the second thought it might be not optimal to use it (requires
anyway an additional function and doesn't provide an abortion on
error). Just see above.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-08-30 12:49   ` Andy Shevchenko
@ 2021-09-01 14:01     ` Jonas Dreßler
  2021-09-01 15:47       ` Andy Shevchenko
  2021-09-01 15:51       ` Pali Rohár
  0 siblings, 2 replies; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-01 14:01 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On 8/30/21 2:49 PM, Andy Shevchenko wrote:
 > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
 >>
 >> On the 88W8897 card it's very important the TX ring write pointer is
 >> updated correctly to its new value before setting the TX ready
 >> interrupt, otherwise the firmware appears to crash (probably because
 >> it's trying to DMA-read from the wrong place).
 >>
 >> Since PCI uses "posted writes" when writing to a register, it's not
 >> guaranteed that a write will happen immediately. That means the pointer
 >> might be outdated when setting the TX ready interrupt, leading to
 >> firmware crashes especially when ASPM L1 and L1 substates are enabled
 >> (because of the higher link latency, the write will probably take
 >> longer).
 >>
 >> So fix those firmware crashes by always forcing non-posted writes. We do
 >> that by simply reading back the register after writing it, just as a lot
 >> of other drivers do.
 >>
 >> There are two reproducers that are fixed with this patch:
 >>
 >> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
 >> substates are platform dependent), the firmware crashes and eventually a
 >> command timeout appears in the logs. That crash is fixed by using a
 >> non-posted write in mwifiex_pcie_send_data().
 >>
 >> 2) When sending lots of commands to the card, waking it up from sleep in
 >> very quick intervals, the firmware eventually crashes. That crash
 >> appears to be fixed by some other non-posted write included here.
 >
 > Thanks for all this work!
 >
 > Nevertheless, do we have any commits that may be a good candidate to
 > be in the Fixes tag here?
 >

I don't think there's any commit we could point to, given that the bug 
is probably somewhere in the firmware code.

 >> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>
 >> ---
 >>   drivers/net/wireless/marvell/mwifiex/pcie.c | 6 ++++++
 >>   1 file changed, 6 insertions(+)
 >>
 >> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c 
b/drivers/net/wireless/marvell/mwifiex/pcie.c
 >> index c6ccce426b49..bfd6e135ed99 100644
 >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
 >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
 >> @@ -237,6 +237,12 @@ static int mwifiex_write_reg(struct 
mwifiex_adapter *adapter, int reg, u32 data)
 >>
 >>          iowrite32(data, card->pci_mmap1 + reg);
 >>
 >> +       /* Do a read-back, which makes the write non-posted, 
ensuring the
 >> +        * completion before returning.
 >
 >> +        * The firmware of the 88W8897 card is buggy and this avoids 
crashes.
 >
 > Any firmware version reference? Would be nice to have just for the
 > sake of record.
 >

Pretty sure the crash is present in every firmware that has been 
released, I've tried most of them. FTR, the current firmware version is 
15.68.19.p21.

 >> +        */
 >> +       ioread32(card->pci_mmap1 + reg);
 >> +
 >>          return 0;
 >>   }
 >
 >

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 14:01     ` Jonas Dreßler
@ 2021-09-01 15:47       ` Andy Shevchenko
  2021-09-01 15:51       ` Pali Rohár
  1 sibling, 0 replies; 32+ messages in thread
From: Andy Shevchenko @ 2021-09-01 15:47 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On Wed, Sep 1, 2021 at 5:02 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> On 8/30/21 2:49 PM, Andy Shevchenko wrote:
>  > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:

...

>  > Thanks for all this work!
>  >
>  > Nevertheless, do we have any commits that may be a good candidate to
>  > be in the Fixes tag here?

> I don't think there's any commit we could point to, given that the bug
> is probably somewhere in the firmware code.

Here (in the commit message)...

>  >> +       /* Do a read-back, which makes the write non-posted,
> ensuring the
>  >> +        * completion before returning.
>  >
>  >> +        * The firmware of the 88W8897 card is buggy and this avoids
> crashes.
>  >
>  > Any firmware version reference? Would be nice to have just for the
>  > sake of record.
>
> Pretty sure the crash is present in every firmware that has been
> released, I've tried most of them. FTR, the current firmware version is
> 15.68.19.p21.

...and here it would be nice to state this version, so in the future
we will have a right landmark.

>  >> +        */

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 14:01     ` Jonas Dreßler
  2021-09-01 15:47       ` Andy Shevchenko
@ 2021-09-01 15:51       ` Pali Rohár
  2021-09-01 16:51         ` Heiner Kallweit
  1 sibling, 1 reply; 32+ messages in thread
From: Pali Rohár @ 2021-09-01 15:51 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas

On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:
> On 8/30/21 2:49 PM, Andy Shevchenko wrote:
> > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> >>
> >> On the 88W8897 card it's very important the TX ring write pointer is
> >> updated correctly to its new value before setting the TX ready
> >> interrupt, otherwise the firmware appears to crash (probably because
> >> it's trying to DMA-read from the wrong place).
> >>
> >> Since PCI uses "posted writes" when writing to a register, it's not
> >> guaranteed that a write will happen immediately. That means the pointer
> >> might be outdated when setting the TX ready interrupt, leading to
> >> firmware crashes especially when ASPM L1 and L1 substates are enabled
> >> (because of the higher link latency, the write will probably take
> >> longer).
> >>
> >> So fix those firmware crashes by always forcing non-posted writes. We do
> >> that by simply reading back the register after writing it, just as a lot
> >> of other drivers do.
> >>
> >> There are two reproducers that are fixed with this patch:
> >>
> >> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
> >> substates are platform dependent), the firmware crashes and eventually a
> >> command timeout appears in the logs. That crash is fixed by using a
> >> non-posted write in mwifiex_pcie_send_data().
> >>
> >> 2) When sending lots of commands to the card, waking it up from sleep in
> >> very quick intervals, the firmware eventually crashes. That crash
> >> appears to be fixed by some other non-posted write included here.
> >
> > Thanks for all this work!
> >
> > Nevertheless, do we have any commits that may be a good candidate to
> > be in the Fixes tag here?
> >
> 
> I don't think there's any commit we could point to, given that the bug is
> probably somewhere in the firmware code.

Then please add Cc: stable@vger.kernel.org tag into commit message. Such
bugfix is a good candidate for backporting into stable releases.

> >> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 15:51       ` Pali Rohár
@ 2021-09-01 16:51         ` Heiner Kallweit
  2021-09-01 17:07           ` Johannes Berg
  0 siblings, 1 reply; 32+ messages in thread
From: Heiner Kallweit @ 2021-09-01 16:51 UTC (permalink / raw)
  To: Pali Rohár, Jonas Dreßler
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas

On 01.09.2021 17:51, Pali Rohár wrote:
> On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:
>> On 8/30/21 2:49 PM, Andy Shevchenko wrote:
>>> On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>>>>
>>>> On the 88W8897 card it's very important the TX ring write pointer is
>>>> updated correctly to its new value before setting the TX ready
>>>> interrupt, otherwise the firmware appears to crash (probably because
>>>> it's trying to DMA-read from the wrong place).
>>>>

This sounds somehow like the typical case where you write DMA descriptors
and then ring the doorbell. This normally requires a dma_wmb().
Maybe something like that is missing here?

Reading back all register writes may cause a certain performance impact,
and if it can be avoided we should try to avoid it.

>>>> Since PCI uses "posted writes" when writing to a register, it's not
>>>> guaranteed that a write will happen immediately. That means the pointer
>>>> might be outdated when setting the TX ready interrupt, leading to
>>>> firmware crashes especially when ASPM L1 and L1 substates are enabled
>>>> (because of the higher link latency, the write will probably take
>>>> longer).
>>>>
>>>> So fix those firmware crashes by always forcing non-posted writes. We do
>>>> that by simply reading back the register after writing it, just as a lot
>>>> of other drivers do.
>>>>
>>>> There are two reproducers that are fixed with this patch:
>>>>
>>>> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
>>>> substates are platform dependent), the firmware crashes and eventually a
>>>> command timeout appears in the logs. That crash is fixed by using a
>>>> non-posted write in mwifiex_pcie_send_data().
>>>>
>>>> 2) When sending lots of commands to the card, waking it up from sleep in
>>>> very quick intervals, the firmware eventually crashes. That crash
>>>> appears to be fixed by some other non-posted write included here.
>>>
>>> Thanks for all this work!
>>>
>>> Nevertheless, do we have any commits that may be a good candidate to
>>> be in the Fixes tag here?
>>>
>>
>> I don't think there's any commit we could point to, given that the bug is
>> probably somewhere in the firmware code.
> 
> Then please add Cc: stable@vger.kernel.org tag into commit message. Such
> bugfix is a good candidate for backporting into stable releases.
> 
>>>> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 16:51         ` Heiner Kallweit
@ 2021-09-01 17:07           ` Johannes Berg
  2021-09-01 19:07             ` Heiner Kallweit
  2021-09-01 22:41             ` Bjorn Helgaas
  0 siblings, 2 replies; 32+ messages in thread
From: Johannes Berg @ 2021-09-01 17:07 UTC (permalink / raw)
  To: Heiner Kallweit, Pali Rohár, Jonas Dreßler
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas

On Wed, 2021-09-01 at 18:51 +0200, Heiner Kallweit wrote:
> On 01.09.2021 17:51, Pali Rohár wrote:
> > On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:
> > > On 8/30/21 2:49 PM, Andy Shevchenko wrote:
> > > > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > > > > 
> > > > > On the 88W8897 card it's very important the TX ring write pointer is
> > > > > updated correctly to its new value before setting the TX ready
> > > > > interrupt, otherwise the firmware appears to crash (probably because
> > > > > it's trying to DMA-read from the wrong place).
> > > > > 
> 
> This sounds somehow like the typical case where you write DMA descriptors
> and then ring the doorbell. This normally requires a dma_wmb().
> Maybe something like that is missing here?

But it looks like this "TX ring write pointer" is actually the register?

However, I would agree that doing it in mwifiex_write_reg() is possibly
too big a hammer - could be done only for reg->tx_wrptr, not all the
registers?

Actually, can two writes actually cross on PCI?

johannes


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 17:07           ` Johannes Berg
@ 2021-09-01 19:07             ` Heiner Kallweit
  2021-09-01 22:41             ` Bjorn Helgaas
  1 sibling, 0 replies; 32+ messages in thread
From: Heiner Kallweit @ 2021-09-01 19:07 UTC (permalink / raw)
  To: Johannes Berg, Pali Rohár, Jonas Dreßler
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas

On 01.09.2021 19:07, Johannes Berg wrote:
> On Wed, 2021-09-01 at 18:51 +0200, Heiner Kallweit wrote:
>> On 01.09.2021 17:51, Pali Rohár wrote:
>>> On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:
>>>> On 8/30/21 2:49 PM, Andy Shevchenko wrote:
>>>>> On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>>>>>>
>>>>>> On the 88W8897 card it's very important the TX ring write pointer is
>>>>>> updated correctly to its new value before setting the TX ready
>>>>>> interrupt, otherwise the firmware appears to crash (probably because
>>>>>> it's trying to DMA-read from the wrong place).
>>>>>>
>>
>> This sounds somehow like the typical case where you write DMA descriptors
>> and then ring the doorbell. This normally requires a dma_wmb().
>> Maybe something like that is missing here?
> 
> But it looks like this "TX ring write pointer" is actually the register?
> 
> However, I would agree that doing it in mwifiex_write_reg() is possibly
> too big a hammer - could be done only for reg->tx_wrptr, not all the
> registers?
> 
> Actually, can two writes actually cross on PCI?
> 
> johannes
> 

In case we're talking about the following piece of code both register
writes are IOMEM writes that are ordered. Maybe the writes arrive properly
ordered but some chip-internal delays cause the issue? Then the read-back
would be something like an ordinary udelay()?
Instead of always reading back register writes, is it sufficient to read
an arbitrary register after mwifiex_write_reg(adapter, reg->tx_wrptr ?


		/* Write the TX ring write pointer in to reg->tx_wrptr */
		if (mwifiex_write_reg(adapter, reg->tx_wrptr,
				      card->txbd_wrptr | rx_val)) {
			mwifiex_dbg(adapter, ERROR,
				    "SEND DATA: failed to write reg->tx_wrptr\n");
			ret = -1;
			goto done_unmap;
		}
		if ((mwifiex_pcie_txbd_not_full(card)) &&
		    tx_param->next_pkt_len) {
			/* have more packets and TxBD still can hold more */
			mwifiex_dbg(adapter, DATA,
				    "SEND DATA: delay dnld-rdy interrupt.\n");
			adapter->data_sent = false;
		} else {
			/* Send the TX ready interrupt */
			if (mwifiex_write_reg(adapter, PCIE_CPU_INT_EVENT,
					      CPU_INTR_DNLD_RDY)) {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-08-30 12:37 ` [PATCH 1/2] mwifiex: Use non-posted PCI register writes Jonas Dreßler
  2021-08-30 12:49   ` Andy Shevchenko
@ 2021-09-01 19:40   ` Brian Norris
  2021-09-01 20:40     ` Andy Shevchenko
  1 sibling, 1 reply; 32+ messages in thread
From: Brian Norris @ 2021-09-01 19:40 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto, linux-wireless,
	<netdev@vger.kernel.org>,
	Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

On Mon, Aug 30, 2021 at 5:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
>
> On the 88W8897 card it's very important the TX ring write pointer is
> updated correctly to its new value before setting the TX ready
> interrupt, otherwise the firmware appears to crash (probably because
> it's trying to DMA-read from the wrong place).
>
> Since PCI uses "posted writes" when writing to a register, it's not
> guaranteed that a write will happen immediately. That means the pointer
> might be outdated when setting the TX ready interrupt, leading to
> firmware crashes especially when ASPM L1 and L1 substates are enabled
> (because of the higher link latency, the write will probably take
> longer).
>
> So fix those firmware crashes by always forcing non-posted writes. We do
> that by simply reading back the register after writing it, just as a lot
> of other drivers do.
>
> There are two reproducers that are fixed with this patch:
>
> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
> substates are platform dependent), the firmware crashes and eventually a
> command timeout appears in the logs. That crash is fixed by using a
> non-posted write in mwifiex_pcie_send_data().
>
> 2) When sending lots of commands to the card, waking it up from sleep in
> very quick intervals, the firmware eventually crashes. That crash
> appears to be fixed by some other non-posted write included here.
>
> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>

This might be good for many cases, but please read this commit:

https://git.kernel.org/linus/062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3
mwifiex: pcie: use posted write to wake up firmware

It's very much intentional that this is a posted write in some cases.

Without ensuring this doesn't regress, NAK from me.

Brian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 19:40   ` Brian Norris
@ 2021-09-01 20:40     ` Andy Shevchenko
  2021-09-01 21:04       ` Brian Norris
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Shevchenko @ 2021-09-01 20:40 UTC (permalink / raw)
  To: Brian Norris
  Cc: Jonas Dreßler, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, <netdev@vger.kernel.org>,
	Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

On Wed, Sep 1, 2021 at 11:25 PM Brian Norris <briannorris@chromium.org> wrote:
> On Mon, Aug 30, 2021 at 5:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:

...

> This might be good for many cases, but please read this commit:
>
> https://git.kernel.org/linus/062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3
> mwifiex: pcie: use posted write to wake up firmware
>
> It's very much intentional that this is a posted write in some cases.
>
> Without ensuring this doesn't regress, NAK from me.

Can you ensure that from Chrome / Google perspective, please?

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 20:40     ` Andy Shevchenko
@ 2021-09-01 21:04       ` Brian Norris
  2021-09-01 21:07         ` Brian Norris
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Norris @ 2021-09-01 21:04 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Jonas Dreßler, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, <netdev@vger.kernel.org>,
	Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

On Wed, Sep 1, 2021 at 1:41 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
>
> On Wed, Sep 1, 2021 at 11:25 PM Brian Norris <briannorris@chromium.org> wrote:
> > On Mon, Aug 30, 2021 at 5:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
>
> ...
>
> > This might be good for many cases, but please read this commit:
> >
> > https://git.kernel.org/linus/062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3
> > mwifiex: pcie: use posted write to wake up firmware
> >
> > It's very much intentional that this is a posted write in some cases.
> >
> > Without ensuring this doesn't regress, NAK from me.
>
> Can you ensure that from Chrome / Google perspective, please?

I mean, the context link should be pretty obvious. But sure: with this patch:

(a) all my least-favorite audio issues to debug return (audio is
especially latency-sensitive on the Rockchip RK3399 systems I have)
(b) latency spikes to ~6ms:
# trace-cmd record -p function_graph -O funcgraph-abstime -l
mwifiex_pm_wakeup_card
# trace-cmd report
   kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      #
6219.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      #
6267.625 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      #
6184.250 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      #
6553.166 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      #
6212.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      #
6230.292 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      #
6451.667 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      #
6204.042 us |  mwifiex_pm_wakeup_card();

whereas it used to look more like:

   kworker/u13:1-173   [005]   212.230542: funcgraph_entry:
7.000 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:
9.334 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      +
11.375 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:
5.542 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      +
13.125 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      +
11.959 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:
8.459 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      +
15.166 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      +
11.083 us  |  mwifiex_pm_wakeup_card();

I think it would be enough to only modify mwifiex_pm_wakeup_card() to
retain the posted version, but I'm not sure how that fits in with the
rest of the author's work. I suppose I can give a later version a run
if needed.

Brian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 21:04       ` Brian Norris
@ 2021-09-01 21:07         ` Brian Norris
  2021-09-18  7:37           ` Jonas Dreßler
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Norris @ 2021-09-01 21:07 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Jonas Dreßler, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, <netdev@vger.kernel.org>,
	Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

Apologies for the brain-dead mailer. I forget that I should only reply
via web when I _want_ text wrapping:

On Wed, Sep 01, 2021 at 02:04:04PM -0700, Brian Norris wrote:
> (b) latency spikes to ~6ms:
> # trace-cmd record -p function_graph -O funcgraph-abstime -l
> mwifiex_pm_wakeup_card
> # trace-cmd report
>    kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      #
> 6219.500 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      #
> 6267.625 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      #
> 6184.250 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      #
> 6553.166 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      #
> 6212.500 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      #
> 6230.292 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      #
> 6451.667 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      #
> 6204.042 us |  mwifiex_pm_wakeup_card();
> 
> whereas it used to look more like:
> 
>    kworker/u13:1-173   [005]   212.230542: funcgraph_entry:
> 7.000 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:
> 9.334 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      +
> 11.375 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:
> 5.542 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      +
> 13.125 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      +
> 11.959 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:
> 8.459 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      +
> 15.166 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      +
> 11.083 us  |  mwifiex_pm_wakeup_card();

That should read:

# trace-cmd record -p function_graph -O funcgraph-abstime -l mwifiex_pm_wakeup_card
# trace-cmd report
   kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      # 6219.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      # 6267.625 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      # 6184.250 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      # 6553.166 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      # 6212.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      # 6230.292 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      # 6451.667 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      # 6204.042 us |  mwifiex_pm_wakeup_card();

vs.

   kworker/u13:1-173   [005]   212.230542: funcgraph_entry:        7.000 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:        9.334 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      + 11.375 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:        5.542 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      + 13.125 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      + 11.959 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:        8.459 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      + 15.166 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      + 11.083 us  |  mwifiex_pm_wakeup_card();

Brian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 17:07           ` Johannes Berg
  2021-09-01 19:07             ` Heiner Kallweit
@ 2021-09-01 22:41             ` Bjorn Helgaas
  2021-09-02 14:05               ` Bjorn Helgaas
  1 sibling, 1 reply; 32+ messages in thread
From: Bjorn Helgaas @ 2021-09-01 22:41 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Heiner Kallweit, Pali Rohár, Jonas Dreßler,
	Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas

On Wed, Sep 01, 2021 at 07:07:58PM +0200, Johannes Berg wrote:
> On Wed, 2021-09-01 at 18:51 +0200, Heiner Kallweit wrote:
> > On 01.09.2021 17:51, Pali Rohár wrote:
> > > On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:
> > > > On 8/30/21 2:49 PM, Andy Shevchenko wrote:
> > > > > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > > > > > 
> > > > > > On the 88W8897 card it's very important the TX ring write pointer is
> > > > > > updated correctly to its new value before setting the TX ready
> > > > > > interrupt, otherwise the firmware appears to crash (probably because
> > > > > > it's trying to DMA-read from the wrong place).
> > > > > > 
> > 
> > This sounds somehow like the typical case where you write DMA descriptors
> > and then ring the doorbell. This normally requires a dma_wmb().
> > Maybe something like that is missing here?
> 
> But it looks like this "TX ring write pointer" is actually the register?
> 
> However, I would agree that doing it in mwifiex_write_reg() is possibly
> too big a hammer - could be done only for reg->tx_wrptr, not all the
> registers?
> 
> Actually, can two writes actually cross on PCI?

Per PCIe r5.0, sec 2.4.1,

  A2a  A Posted Request must not pass another Posted Request unless A2b
       applies.

  A2b  A Posted Request with RO Set is permitted to pass another
       Posted Request.  A Posted Request with IDO Set is permitted to
       pass another Posted Request if the two Requester IDs are
       different or if both Requests contain a PASID TLP Prefix and
       the two PASID values are different.

A few drivers enable RO (Relaxed Ordering) for their devices, which
means the *device* is permitted to set the RO bit in transactions it
initiates.

BUt IIUC we're talking about MMIO writes initiated by a CPU, and they
won't have the RO bit set unless the Root Port has Relaxed Ordering
enabled, and Linux generally does not enable that.  So A2a should
apply, and writes should be ordered on PCI.

There are a few wrinkles that I worry about:

  d1e714db8129 ("mtip32xx: Fix ERO and NoSnoop values in PCIe upstream
  on AMD systems") [1] turns off RO for some AMD Root Ports, which
  makes me think BIOS might be enabling RO in these Root Ports.

  c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5
  Completion erratum") [2] turns off RO for all Root Ports leading to
  Chelsio T5 devices, which again makes me think there's firmware that
  enables RO in Root Ports.  Follow-up [3].

  77ffc1465cec ("tegra: add PCI Express support") [4] (see
  tegra_pcie_relax_enable()) enables RO for Tegra Root Ports due to
  some hardware issue.  I don't whether these Root Ports every
  actually *set* RO in the PCIe transactions they generate.  Follow-up
  [5].

These concern me because I don't think we have a way for drivers to
specify whether their writes should use strong ordering or relaxed
ordering, and I think they depend on strong ordering.  If Root Ports
have RO enabled, I think we are at risk, so I suspect Linux should
actively *disable* RO for Root Ports.

[1] https://git.kernel.org/linus/d1e714db8129
[2] https://git.kernel.org/linus/c56d4450eb68
[3] https://lore.kernel.org/r/20210901222353.GA251391@bjorn-Precision-5520
[4] https://git.kernel.org/linus/77ffc1465cec
[5] https://lore.kernel.org/r/20210901204045.GA236987@bjorn-Precision-5520

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 22:41             ` Bjorn Helgaas
@ 2021-09-02 14:05               ` Bjorn Helgaas
  0 siblings, 0 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2021-09-02 14:05 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Heiner Kallweit, Pali Rohár, Jonas Dreßler,
	Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	open list:TI WILINK WIRELES...,
	netdev, Linux Kernel Mailing List, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Keith Busch

[+cc Keith]

On Wed, Sep 01, 2021 at 05:41:56PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 01, 2021 at 07:07:58PM +0200, Johannes Berg wrote:
> > On Wed, 2021-09-01 at 18:51 +0200, Heiner Kallweit wrote:
> > > On 01.09.2021 17:51, Pali Rohár wrote:
> > > > On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:
> > > > > On 8/30/21 2:49 PM, Andy Shevchenko wrote:
> > > > > > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > > > > > > 
> > > > > > > On the 88W8897 card it's very important the TX ring write pointer is
> > > > > > > updated correctly to its new value before setting the TX ready
> > > > > > > interrupt, otherwise the firmware appears to crash (probably because
> > > > > > > it's trying to DMA-read from the wrong place).
> > > > > > > 
> > > 
> > > This sounds somehow like the typical case where you write DMA descriptors
> > > and then ring the doorbell. This normally requires a dma_wmb().
> > > Maybe something like that is missing here?
> > 
> > But it looks like this "TX ring write pointer" is actually the register?
> > 
> > However, I would agree that doing it in mwifiex_write_reg() is possibly
> > too big a hammer - could be done only for reg->tx_wrptr, not all the
> > registers?
> > 
> > Actually, can two writes actually cross on PCI?
> 
> Per PCIe r5.0, sec 2.4.1,
> 
>   A2a  A Posted Request must not pass another Posted Request unless A2b
>        applies.
> 
>   A2b  A Posted Request with RO Set is permitted to pass another
>        Posted Request.  A Posted Request with IDO Set is permitted to
>        pass another Posted Request if the two Requester IDs are
>        different or if both Requests contain a PASID TLP Prefix and
>        the two PASID values are different.
> 
> A few drivers enable RO (Relaxed Ordering) for their devices, which
> means the *device* is permitted to set the RO bit in transactions it
> initiates.
> 
> BUt IIUC we're talking about MMIO writes initiated by a CPU, and they
> won't have the RO bit set unless the Root Port has Relaxed Ordering
> enabled, and Linux generally does not enable that.  So A2a should
> apply, and writes should be ordered on PCI.
> 
> There are a few wrinkles that I worry about:
> 
>   d1e714db8129 ("mtip32xx: Fix ERO and NoSnoop values in PCIe upstream
>   on AMD systems") [1] turns off RO for some AMD Root Ports, which
>   makes me think BIOS might be enabling RO in these Root Ports.
> 
>   c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5
>   Completion erratum") [2] turns off RO for all Root Ports leading to
>   Chelsio T5 devices, which again makes me think there's firmware that
>   enables RO in Root Ports.  Follow-up [3].
> 
>   77ffc1465cec ("tegra: add PCI Express support") [4] (see
>   tegra_pcie_relax_enable()) enables RO for Tegra Root Ports due to
>   some hardware issue.  I don't whether these Root Ports every
>   actually *set* RO in the PCIe transactions they generate.  Follow-up
>   [5].
> 
> These concern me because I don't think we have a way for drivers to
> specify whether their writes should use strong ordering or relaxed
> ordering, and I think they depend on strong ordering.

Sorry, I was completely wrong about this part.  Keith Busch reminded
me that writel_relaxed() is an obvious way for drivers to specify that
they don't need strong ordering.

I apologize for the distraction.

> If Root Ports have RO enabled, I think we are at risk, so I suspect
> Linux should actively *disable* RO for Root Ports.
> 
> [1] https://git.kernel.org/linus/d1e714db8129
> [2] https://git.kernel.org/linus/c56d4450eb68
> [3] https://lore.kernel.org/r/20210901222353.GA251391@bjorn-Precision-5520
> [4] https://git.kernel.org/linus/77ffc1465cec
> [5] https://lore.kernel.org/r/20210901204045.GA236987@bjorn-Precision-5520

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-01 21:07         ` Brian Norris
@ 2021-09-18  7:37           ` Jonas Dreßler
  2021-09-20 17:48             ` Brian Norris
  0 siblings, 1 reply; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-18  7:37 UTC (permalink / raw)
  To: Brian Norris, Andy Shevchenko
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto, linux-wireless,
	netdev, Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

On 9/1/21 11:07 PM, Brian Norris wrote:
> Apologies for the brain-dead mailer. I forget that I should only reply
> via web when I _want_ text wrapping:
> 
> On Wed, Sep 01, 2021 at 02:04:04PM -0700, Brian Norris wrote:
>> (b) latency spikes to ~6ms:
>> # trace-cmd record -p function_graph -O funcgraph-abstime -l
>> mwifiex_pm_wakeup_card
>> # trace-cmd report
>>     kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      #
>> 6219.500 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      #
>> 6267.625 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      #
>> 6184.250 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      #
>> 6553.166 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      #
>> 6212.500 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      #
>> 6230.292 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      #
>> 6451.667 us |  mwifiex_pm_wakeup_card();
>>     kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      #
>> 6204.042 us |  mwifiex_pm_wakeup_card();
>>
>> whereas it used to look more like:
>>
>>     kworker/u13:1-173   [005]   212.230542: funcgraph_entry:
>> 7.000 us   |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:
>> 9.334 us   |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      +
>> 11.375 us  |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:
>> 5.542 us   |  mwifiex_pm_wakeup_card();
>>     kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      +
>> 13.125 us  |  mwifiex_pm_wakeup_card();
>>     kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      +
>> 11.959 us  |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:
>> 8.459 us   |  mwifiex_pm_wakeup_card();
>>     kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      +
>> 15.166 us  |  mwifiex_pm_wakeup_card();
>>     kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      +
>> 11.083 us  |  mwifiex_pm_wakeup_card();
> 
> That should read:
> 
> # trace-cmd record -p function_graph -O funcgraph-abstime -l mwifiex_pm_wakeup_card
> # trace-cmd report
>     kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      # 6219.500 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      # 6267.625 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      # 6184.250 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      # 6553.166 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      # 6212.500 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      # 6230.292 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      # 6451.667 us |  mwifiex_pm_wakeup_card();
>     kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      # 6204.042 us |  mwifiex_pm_wakeup_card();
> 
> vs.
> 
>     kworker/u13:1-173   [005]   212.230542: funcgraph_entry:        7.000 us   |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:        9.334 us   |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      + 11.375 us  |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:        5.542 us   |  mwifiex_pm_wakeup_card();
>     kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      + 13.125 us  |  mwifiex_pm_wakeup_card();
>     kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      + 11.959 us  |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:        8.459 us   |  mwifiex_pm_wakeup_card();
>     kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      + 15.166 us  |  mwifiex_pm_wakeup_card();
>     kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      + 11.083 us  |  mwifiex_pm_wakeup_card();
> 
> Brian
> 

Thanks for the pointer to that commit Brian, it turns out this is 
actually the change that causes the "Firmware wakeup failed" issues that 
I'm trying to fix with the second patch here.

Also my approach is a lot messier than just reverting 
062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 and also appears to be blocking 
even longer...

Does anyone have an idea what could be the reason for the posted write 
not going through, or could that also be a potential firmware bug in the 
chip?

Jonas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-18  7:37           ` Jonas Dreßler
@ 2021-09-20 17:48             ` Brian Norris
  2021-09-22 12:50               ` Jonas Dreßler
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Norris @ 2021-09-20 17:48 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, <netdev@vger.kernel.org>,
	Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas, Pali Rohár

On Sat, Sep 18, 2021 at 12:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
> Thanks for the pointer to that commit Brian, it turns out this is
> actually the change that causes the "Firmware wakeup failed" issues that
> I'm trying to fix with the second patch here.

Huh. That's interesting, although I guess it makes some sense given
your theory of "dropped writes". FWIW, this strategy (post a single
write, then wait for wakeup) is the same used by some other
chips/drivers too (e.g., ath10k/pci), although in those cases card
wakeup is much much faster. But if the bus was dropping writes
somehow, those strategies would fail too.

> Also my approach is a lot messier than just reverting
> 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 and also appears to be blocking
> even longer...

For the record, in case you're talking about my data ("blocking even
longer"): I was only testing patch 1. Patch 2 isn't really relevant to
my particular systems (Rockchip RK3399 + Marvell 8997/PCIe), because
(a) I'm pretty sure my system isn't "dropping" any reads or writes
(b) all my delay is in the read-back; the Rockchip PCIe bus is waiting
indefinitely for the card to wake up, instead of timing out and
reporting all-1's like many x86 systems appear to do (I've tested
this).

So, the 6ms delay is entirely sitting in the ioread32(), not a delay loop.

I haven't yet tried your version 2 (which avoids the blocking read to
wake up; good!), but it sounds like in theory it could solve your
problem while avoiding 6ms delays for me. I intend to test your v2
this week.

> Does anyone have an idea what could be the reason for the posted write
> not going through, or could that also be a potential firmware bug in the
> chip?

I have no clue about that. That does sound downright horrible, but so
are many things when dealing with this family of hardware/firmware.
I'm not sure how to prove out whether this is a host bus problem, or
an endpoint/firmware problem, other than perhaps trying the same
module/firmware on another system, if that's possible.

Anyway, to reiterate: I'm not fundamentally opposed to v2 (pending a
test run here), even if it is a bit ugly and perhaps not 100%
understood.

Brian

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-20 17:48             ` Brian Norris
@ 2021-09-22 12:50               ` Jonas Dreßler
  2021-09-23 15:28                 ` Jonas Dreßler
  0 siblings, 1 reply; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-22 12:50 UTC (permalink / raw)
  To: Brian Norris
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, netdev, Linux Kernel, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On 9/20/21 7:48 PM, Brian Norris wrote:
> On Sat, Sep 18, 2021 at 12:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
>> Thanks for the pointer to that commit Brian, it turns out this is
>> actually the change that causes the "Firmware wakeup failed" issues that
>> I'm trying to fix with the second patch here.
> 
> Huh. That's interesting, although I guess it makes some sense given
> your theory of "dropped writes". FWIW, this strategy (post a single
> write, then wait for wakeup) is the same used by some other
> chips/drivers too (e.g., ath10k/pci), although in those cases card
> wakeup is much much faster. But if the bus was dropping writes
> somehow, those strategies would fail too.
> 
>> Also my approach is a lot messier than just reverting
>> 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 and also appears to be blocking
>> even longer...
> 
> For the record, in case you're talking about my data ("blocking even
> longer"): I was only testing patch 1. Patch 2 isn't really relevant to
> my particular systems (Rockchip RK3399 + Marvell 8997/PCIe), because
> (a) I'm pretty sure my system isn't "dropping" any reads or writes
> (b) all my delay is in the read-back; the Rockchip PCIe bus is waiting
> indefinitely for the card to wake up, instead of timing out and
> reporting all-1's like many x86 systems appear to do (I've tested
> this).
> 
> So, the 6ms delay is entirely sitting in the ioread32(), not a delay loop.
> 
> I haven't yet tried your version 2 (which avoids the blocking read to
> wake up; good!), but it sounds like in theory it could solve your
> problem while avoiding 6ms delays for me. I intend to test your v2
> this week.
> 

With "blocking even longer" I meant that (on my system) the delay-loop 
blocks even longer than waking up the card via mwifiex_read_reg() (both 
are in the orders of milliseconds). And given that in certain cases the 
card wakeup (or a write getting through to the card, I have no idea) can 
take extremely long, I'd feel more confident going with the 
mwifiex_read_reg() method to wake up the card.

Anyway, you know what's even weirder with all this: I've been testing 
the first commit of patch v2 (so just the single read-back instead of 
the big hammer) together with 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 
reverted for a good week now and haven't seen any wakeup failure yet. 
Otoh I'm fairly sure the big hammer with reading back every write wasn't 
enough to fix the wakeup failures, otherwise I wouldn't even have 
started working on the second commit.

So that would mean there's a difference between writing and then reading 
back vs only reading to wake up the card: Only the latter fixes the 
wakeup failures.

>> Does anyone have an idea what could be the reason for the posted write
>> not going through, or could that also be a potential firmware bug in the
>> chip?
> 
> I have no clue about that. That does sound downright horrible, but so
> are many things when dealing with this family of hardware/firmware.
> I'm not sure how to prove out whether this is a host bus problem, or
> an endpoint/firmware problem, other than perhaps trying the same
> module/firmware on another system, if that's possible.
> 
> Anyway, to reiterate: I'm not fundamentally opposed to v2 (pending a
> test run here), even if it is a bit ugly and perhaps not 100%
> understood.
> 

I'm not 100% sure about all this yet, I think I'm gonna try to confirm 
my older findings once again now and then we'll see. FTR, would you be 
fine with using the mwifiex_read_reg() method to wake up the card and 
somehow quirking your system to use write_reg()?

> Brian
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-22 12:50               ` Jonas Dreßler
@ 2021-09-23 15:28                 ` Jonas Dreßler
  2021-09-23 19:37                   ` Andy Shevchenko
  2021-09-23 19:41                   ` Andy Shevchenko
  0 siblings, 2 replies; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-23 15:28 UTC (permalink / raw)
  To: Brian Norris
  Cc: Andy Shevchenko, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, netdev, Linux Kernel, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On 9/22/21 2:50 PM, Jonas Dreßler wrote:
> On 9/20/21 7:48 PM, Brian Norris wrote:
>> On Sat, Sep 18, 2021 at 12:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
>>> Thanks for the pointer to that commit Brian, it turns out this is
>>> actually the change that causes the "Firmware wakeup failed" issues that
>>> I'm trying to fix with the second patch here.
>>
>> Huh. That's interesting, although I guess it makes some sense given
>> your theory of "dropped writes". FWIW, this strategy (post a single
>> write, then wait for wakeup) is the same used by some other
>> chips/drivers too (e.g., ath10k/pci), although in those cases card
>> wakeup is much much faster. But if the bus was dropping writes
>> somehow, those strategies would fail too.
>>
>>> Also my approach is a lot messier than just reverting
>>> 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 and also appears to be blocking
>>> even longer...
>>
>> For the record, in case you're talking about my data ("blocking even
>> longer"): I was only testing patch 1. Patch 2 isn't really relevant to
>> my particular systems (Rockchip RK3399 + Marvell 8997/PCIe), because
>> (a) I'm pretty sure my system isn't "dropping" any reads or writes
>> (b) all my delay is in the read-back; the Rockchip PCIe bus is waiting
>> indefinitely for the card to wake up, instead of timing out and
>> reporting all-1's like many x86 systems appear to do (I've tested
>> this).
>>
>> So, the 6ms delay is entirely sitting in the ioread32(), not a delay 
>> loop.
>>
>> I haven't yet tried your version 2 (which avoids the blocking read to
>> wake up; good!), but it sounds like in theory it could solve your
>> problem while avoiding 6ms delays for me. I intend to test your v2
>> this week.
>>
> 
> With "blocking even longer" I meant that (on my system) the delay-loop 
> blocks even longer than waking up the card via mwifiex_read_reg() (both 
> are in the orders of milliseconds). And given that in certain cases the 
> card wakeup (or a write getting through to the card, I have no idea) can 
> take extremely long, I'd feel more confident going with the 
> mwifiex_read_reg() method to wake up the card.
> 
> Anyway, you know what's even weirder with all this: I've been testing 
> the first commit of patch v2 (so just the single read-back instead of 
> the big hammer) together with 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 
> reverted for a good week now and haven't seen any wakeup failure yet. 
> Otoh I'm fairly sure the big hammer with reading back every write wasn't 
> enough to fix the wakeup failures, otherwise I wouldn't even have 
> started working on the second commit.
> 
> So that would mean there's a difference between writing and then reading 
> back vs only reading to wake up the card: Only the latter fixes the 
> wakeup failures.
> 
>>> Does anyone have an idea what could be the reason for the posted write
>>> not going through, or could that also be a potential firmware bug in the
>>> chip?
>>
>> I have no clue about that. That does sound downright horrible, but so
>> are many things when dealing with this family of hardware/firmware.
>> I'm not sure how to prove out whether this is a host bus problem, or
>> an endpoint/firmware problem, other than perhaps trying the same
>> module/firmware on another system, if that's possible.
>>
>> Anyway, to reiterate: I'm not fundamentally opposed to v2 (pending a
>> test run here), even if it is a bit ugly and perhaps not 100%
>> understood.
>>
> 
> I'm not 100% sure about all this yet, I think I'm gonna try to confirm 
> my older findings once again now and then we'll see. FTR, would you be 
> fine with using the mwifiex_read_reg() method to wake up the card and 
> somehow quirking your system to use write_reg()?
> 
>> Brian
>>
> 

Okay, so I finally managed to find my exact reproducer for the bug again:

1) Make sure wifi powersaving is enabled (iw dev wlp1s0 set power_save on)
2) Connect to any wifi network (makes firmware go into wifi powersaving 
mode, not deep sleep)
3) Make sure bluetooth is turned off (to ensure the firmware actually 
enters powersave mode and doesn't keep the radio active doing bluetooth 
stuff)
4) To confirm that wifi powersaving is entered ping a device on the LAN, 
pings should be a few ms higher than without powersaving
5) Run "while true; do iwconfig; sleep 0.0001; done", this wakes and 
suspends the firmware extremely often
6) Wait until things explode, for me it consistently takes <5 minutes

Using this reproducer I was able to clear things up a bit:

- There still are wakeup failures when using (only) mwifiex_read_reg() 
to wake the card, so there's no weird difference between waking up using 
read vs write+read-back

- Just calling mwifiex_write_reg() once and then blocking until the card 
wakes up using my delay-loop doesn't fix the issue, it's actually 
writing multiple times that fixes the issue

These observations sound a lot like writes (and even reads) are actually 
being dropped, don't they?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-23 15:28                 ` Jonas Dreßler
@ 2021-09-23 19:37                   ` Andy Shevchenko
  2021-09-23 19:41                   ` Andy Shevchenko
  1 sibling, 0 replies; 32+ messages in thread
From: Andy Shevchenko @ 2021-09-23 19:37 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Brian Norris, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, netdev, Linux Kernel, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>
> On 9/22/21 2:50 PM, Jonas Dreßler wrote:
> > On 9/20/21 7:48 PM, Brian Norris wrote:
> >> On Sat, Sep 18, 2021 at 12:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
> >>> Thanks for the pointer to that commit Brian, it turns out this is
> >>> actually the change that causes the "Firmware wakeup failed" issues that
> >>> I'm trying to fix with the second patch here.
> >>
> >> Huh. That's interesting, although I guess it makes some sense given
> >> your theory of "dropped writes". FWIW, this strategy (post a single
> >> write, then wait for wakeup) is the same used by some other
> >> chips/drivers too (e.g., ath10k/pci), although in those cases card
> >> wakeup is much much faster. But if the bus was dropping writes
> >> somehow, those strategies would fail too.
> >>
> >>> Also my approach is a lot messier than just reverting
> >>> 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 and also appears to be blocking
> >>> even longer...
> >>
> >> For the record, in case you're talking about my data ("blocking even
> >> longer"): I was only testing patch 1. Patch 2 isn't really relevant to
> >> my particular systems (Rockchip RK3399 + Marvell 8997/PCIe), because
> >> (a) I'm pretty sure my system isn't "dropping" any reads or writes
> >> (b) all my delay is in the read-back; the Rockchip PCIe bus is waiting
> >> indefinitely for the card to wake up, instead of timing out and
> >> reporting all-1's like many x86 systems appear to do (I've tested
> >> this).
> >>
> >> So, the 6ms delay is entirely sitting in the ioread32(), not a delay
> >> loop.
> >>
> >> I haven't yet tried your version 2 (which avoids the blocking read to
> >> wake up; good!), but it sounds like in theory it could solve your
> >> problem while avoiding 6ms delays for me. I intend to test your v2
> >> this week.
> >>
> >
> > With "blocking even longer" I meant that (on my system) the delay-loop
> > blocks even longer than waking up the card via mwifiex_read_reg() (both
> > are in the orders of milliseconds). And given that in certain cases the
> > card wakeup (or a write getting through to the card, I have no idea) can
> > take extremely long, I'd feel more confident going with the
> > mwifiex_read_reg() method to wake up the card.
> >
> > Anyway, you know what's even weirder with all this: I've been testing
> > the first commit of patch v2 (so just the single read-back instead of
> > the big hammer) together with 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3
> > reverted for a good week now and haven't seen any wakeup failure yet.
> > Otoh I'm fairly sure the big hammer with reading back every write wasn't
> > enough to fix the wakeup failures, otherwise I wouldn't even have
> > started working on the second commit.
> >
> > So that would mean there's a difference between writing and then reading
> > back vs only reading to wake up the card: Only the latter fixes the
> > wakeup failures.
> >
> >>> Does anyone have an idea what could be the reason for the posted write
> >>> not going through, or could that also be a potential firmware bug in the
> >>> chip?
> >>
> >> I have no clue about that. That does sound downright horrible, but so
> >> are many things when dealing with this family of hardware/firmware.
> >> I'm not sure how to prove out whether this is a host bus problem, or
> >> an endpoint/firmware problem, other than perhaps trying the same
> >> module/firmware on another system, if that's possible.
> >>
> >> Anyway, to reiterate: I'm not fundamentally opposed to v2 (pending a
> >> test run here), even if it is a bit ugly and perhaps not 100%
> >> understood.
> >>
> >
> > I'm not 100% sure about all this yet, I think I'm gonna try to confirm
> > my older findings once again now and then we'll see. FTR, would you be
> > fine with using the mwifiex_read_reg() method to wake up the card and
> > somehow quirking your system to use write_reg()?
> >
> >> Brian
> >>
> >
>
> Okay, so I finally managed to find my exact reproducer for the bug again:
>
> 1) Make sure wifi powersaving is enabled (iw dev wlp1s0 set power_save on)
> 2) Connect to any wifi network (makes firmware go into wifi powersaving
> mode, not deep sleep)
> 3) Make sure bluetooth is turned off (to ensure the firmware actually
> enters powersave mode and doesn't keep the radio active doing bluetooth
> stuff)
> 4) To confirm that wifi powersaving is entered ping a device on the LAN,
> pings should be a few ms higher than without powersaving
> 5) Run "while true; do iwconfig; sleep 0.0001; done", this wakes and
> suspends the firmware extremely often
> 6) Wait until things explode, for me it consistently takes <5 minutes
>
> Using this reproducer I was able to clear things up a bit:
>
> - There still are wakeup failures when using (only) mwifiex_read_reg()
> to wake the card, so there's no weird difference between waking up using
> read vs write+read-back
>
> - Just calling mwifiex_write_reg() once and then blocking until the card
> wakes up using my delay-loop doesn't fix the issue, it's actually
> writing multiple times that fixes the issue
>
> These observations sound a lot like writes (and even reads) are actually
> being dropped, don't they?



-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-23 15:28                 ` Jonas Dreßler
  2021-09-23 19:37                   ` Andy Shevchenko
@ 2021-09-23 19:41                   ` Andy Shevchenko
  2021-09-23 20:22                     ` Pali Rohár
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Shevchenko @ 2021-09-23 19:41 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Brian Norris, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, netdev, Linux Kernel, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas, Pali Rohár

On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> On 9/22/21 2:50 PM, Jonas Dreßler wrote:

...

> - Just calling mwifiex_write_reg() once and then blocking until the card
> wakes up using my delay-loop doesn't fix the issue, it's actually
> writing multiple times that fixes the issue
>
> These observations sound a lot like writes (and even reads) are actually
> being dropped, don't they?

It sounds like you're writing into a not ready (fully powered on) device.

To check this, try to put a busy loop for reading and check the value
till it gets 0.

Something like

  unsigned int count = 1000;

  do {
    if (mwifiex_read_reg(...) == 0)
      break;
  } while (--count);


-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-23 19:41                   ` Andy Shevchenko
@ 2021-09-23 20:22                     ` Pali Rohár
  2021-09-30 15:38                       ` Jonas Dreßler
  0 siblings, 1 reply; 32+ messages in thread
From: Pali Rohár @ 2021-09-23 20:22 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Jonas Dreßler, Brian Norris, Amitkumar Karwar,
	Ganapathi Bhat, Xinming Hu, Kalle Valo, David S. Miller,
	Jakub Kicinski, Tsuchiya Yuto, linux-wireless, netdev,
	Linux Kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas

On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
> On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > On 9/22/21 2:50 PM, Jonas Dreßler wrote:
> 
> ...
> 
> > - Just calling mwifiex_write_reg() once and then blocking until the card
> > wakes up using my delay-loop doesn't fix the issue, it's actually
> > writing multiple times that fixes the issue
> >
> > These observations sound a lot like writes (and even reads) are actually
> > being dropped, don't they?
> 
> It sounds like you're writing into a not ready (fully powered on) device.

This reminds me a discussion with Bjorn about CRS response returned
after firmware crash / reset when device is not ready yet:
https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/

Could not be this similar issue? You could check it via reading
PCI_VENDOR_ID register from config space. And if it is not valid value
then card is not really ready yet.

> To check this, try to put a busy loop for reading and check the value
> till it gets 0.
> 
> Something like
> 
>   unsigned int count = 1000;
> 
>   do {
>     if (mwifiex_read_reg(...) == 0)
>       break;
>   } while (--count);
> 
> 
> -- 
> With Best Regards,
> Andy Shevchenko

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip
  2021-08-30 12:37 [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Jonas Dreßler
  2021-08-30 12:37 ` [PATCH 1/2] mwifiex: Use non-posted PCI register writes Jonas Dreßler
  2021-08-30 12:37 ` [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt Jonas Dreßler
@ 2021-09-25 17:32 ` Pali Rohár
  2 siblings, 0 replies; 32+ messages in thread
From: Pali Rohár @ 2021-09-25 17:32 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Amitkumar Karwar, Ganapathi Bhat, Xinming Hu, Kalle Valo,
	David S. Miller, Jakub Kicinski, Tsuchiya Yuto, linux-wireless,
	netdev, linux-kernel, linux-pci, Maximilian Luz, Andy Shevchenko,
	Bjorn Helgaas

FYI: There is a new NXP driver for these ex-Marvell wifi chips:
https://source.codeaurora.org/external/nxpwifi/mwifiex/tree/mxm_wifiex/wlan_src/mlinux?h=caf/master

And NXP has there also some modified mwifiex driver:
https://source.codeaurora.org/external/nxpwifi/mwifiex/tree/mwifiex_8997/mlinux?h=caf/master&id=070397228c46e0a6afce4f627b032d3818cc1356

Maybe somebody could be interested in looking into those NXP driver
modifications... Maybe NXP fixed some issues here.

Also look at another new bug report related to mwifiex firmware:
https://lore.kernel.org/linux-wireless/edeb34bc-7c85-7f1d-79e4-e3e21df86334@gk2.net/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-23 20:22                     ` Pali Rohár
@ 2021-09-30 15:38                       ` Jonas Dreßler
  2021-09-30 15:42                         ` Pali Rohár
  0 siblings, 1 reply; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-30 15:38 UTC (permalink / raw)
  To: Pali Rohár, Andy Shevchenko
  Cc: Brian Norris, Amitkumar Karwar, Ganapathi Bhat, Xinming Hu,
	Kalle Valo, David S. Miller, Jakub Kicinski, Tsuchiya Yuto,
	linux-wireless, netdev, Linux Kernel, linux-pci, Maximilian Luz,
	Andy Shevchenko, Bjorn Helgaas

On 9/23/21 10:22 PM, Pali Rohár wrote:
> On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
>> On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>>> On 9/22/21 2:50 PM, Jonas Dreßler wrote:
>>
>> ...
>>
>>> - Just calling mwifiex_write_reg() once and then blocking until the card
>>> wakes up using my delay-loop doesn't fix the issue, it's actually
>>> writing multiple times that fixes the issue
>>>
>>> These observations sound a lot like writes (and even reads) are actually
>>> being dropped, don't they?
>>
>> It sounds like you're writing into a not ready (fully powered on) device.
> 
> This reminds me a discussion with Bjorn about CRS response returned
> after firmware crash / reset when device is not ready yet:
> https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/
> 
> Could not be this similar issue? You could check it via reading
> PCI_VENDOR_ID register from config space. And if it is not valid value
> then card is not really ready yet.
> 
>> To check this, try to put a busy loop for reading and check the value
>> till it gets 0.
>>
>> Something like
>>
>>    unsigned int count = 1000;
>>
>>    do {
>>      if (mwifiex_read_reg(...) == 0)
>>        break;
>>    } while (--count);
>>
>>
>> -- 
>> With Best Regards,
>> Andy Shevchenko

I've tried both reading PCI_VENDOR_ID and the firmware status using a 
busy loop now, but sadly none of them worked. It looks like the card 
always replies with the correct values even though it sometimes won't 
wake up after that.

I do have one new observation though, although I've no clue what could 
be happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we 
can "predict" the wakeup failure because exactly one (usually around the 
20th) of those 1000 reads will fail. Maybe the firmware actually tries 
to wake up, encounters an error somewhere in its wakeup routines and 
then goes down a special failure code path. That code path keeps the 
cards CPU so busy that at some point a PCI_VENDOR_ID request times out?

Or well, maybe the card actually wakes up fine, but we don't receive the 
interrupt on our end, so many possibilities...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-30 15:38                       ` Jonas Dreßler
@ 2021-09-30 15:42                         ` Pali Rohár
  2021-09-30 16:14                           ` Jonas Dreßler
  0 siblings, 1 reply; 32+ messages in thread
From: Pali Rohár @ 2021-09-30 15:42 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Andy Shevchenko, Brian Norris, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu, Kalle Valo, David S. Miller, Jakub Kicinski,
	Tsuchiya Yuto, linux-wireless, netdev, Linux Kernel, linux-pci,
	Maximilian Luz, Andy Shevchenko, Bjorn Helgaas

On Thursday 30 September 2021 17:38:43 Jonas Dreßler wrote:
> On 9/23/21 10:22 PM, Pali Rohár wrote:
> > On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
> > > On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > > > On 9/22/21 2:50 PM, Jonas Dreßler wrote:
> > > 
> > > ...
> > > 
> > > > - Just calling mwifiex_write_reg() once and then blocking until the card
> > > > wakes up using my delay-loop doesn't fix the issue, it's actually
> > > > writing multiple times that fixes the issue
> > > > 
> > > > These observations sound a lot like writes (and even reads) are actually
> > > > being dropped, don't they?
> > > 
> > > It sounds like you're writing into a not ready (fully powered on) device.
> > 
> > This reminds me a discussion with Bjorn about CRS response returned
> > after firmware crash / reset when device is not ready yet:
> > https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/
> > 
> > Could not be this similar issue? You could check it via reading
> > PCI_VENDOR_ID register from config space. And if it is not valid value
> > then card is not really ready yet.
> > 
> > > To check this, try to put a busy loop for reading and check the value
> > > till it gets 0.
> > > 
> > > Something like
> > > 
> > >    unsigned int count = 1000;
> > > 
> > >    do {
> > >      if (mwifiex_read_reg(...) == 0)
> > >        break;
> > >    } while (--count);
> > > 
> > > 
> > > -- 
> > > With Best Regards,
> > > Andy Shevchenko
> 
> I've tried both reading PCI_VENDOR_ID and the firmware status using a busy
> loop now, but sadly none of them worked. It looks like the card always
> replies with the correct values even though it sometimes won't wake up after
> that.
> 
> I do have one new observation though, although I've no clue what could be
> happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we can
> "predict" the wakeup failure because exactly one (usually around the 20th)
> of those 1000 reads will fail.

What does "fail" means here?

> Maybe the firmware actually tries to wake up,
> encounters an error somewhere in its wakeup routines and then goes down a
> special failure code path. That code path keeps the cards CPU so busy that
> at some point a PCI_VENDOR_ID request times out?
> 
> Or well, maybe the card actually wakes up fine, but we don't receive the
> interrupt on our end, so many possibilities...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-30 15:42                         ` Pali Rohár
@ 2021-09-30 16:14                           ` Jonas Dreßler
  2021-09-30 16:19                             ` Pali Rohár
  0 siblings, 1 reply; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-30 16:14 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Andy Shevchenko, Brian Norris, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu, Kalle Valo, David S. Miller, Jakub Kicinski,
	Tsuchiya Yuto, linux-wireless, netdev, Linux Kernel, linux-pci,
	Maximilian Luz, Andy Shevchenko, Bjorn Helgaas

On 9/30/21 5:42 PM, Pali Rohár wrote:
> On Thursday 30 September 2021 17:38:43 Jonas Dreßler wrote:
>> On 9/23/21 10:22 PM, Pali Rohár wrote:
>>> On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
>>>> On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>>>>> On 9/22/21 2:50 PM, Jonas Dreßler wrote:
>>>>
>>>> ...
>>>>
>>>>> - Just calling mwifiex_write_reg() once and then blocking until the card
>>>>> wakes up using my delay-loop doesn't fix the issue, it's actually
>>>>> writing multiple times that fixes the issue
>>>>>
>>>>> These observations sound a lot like writes (and even reads) are actually
>>>>> being dropped, don't they?
>>>>
>>>> It sounds like you're writing into a not ready (fully powered on) device.
>>>
>>> This reminds me a discussion with Bjorn about CRS response returned
>>> after firmware crash / reset when device is not ready yet:
>>> https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/
>>>
>>> Could not be this similar issue? You could check it via reading
>>> PCI_VENDOR_ID register from config space. And if it is not valid value
>>> then card is not really ready yet.
>>>
>>>> To check this, try to put a busy loop for reading and check the value
>>>> till it gets 0.
>>>>
>>>> Something like
>>>>
>>>>     unsigned int count = 1000;
>>>>
>>>>     do {
>>>>       if (mwifiex_read_reg(...) == 0)
>>>>         break;
>>>>     } while (--count);
>>>>
>>>>
>>>> -- 
>>>> With Best Regards,
>>>> Andy Shevchenko
>>
>> I've tried both reading PCI_VENDOR_ID and the firmware status using a busy
>> loop now, but sadly none of them worked. It looks like the card always
>> replies with the correct values even though it sometimes won't wake up after
>> that.
>>
>> I do have one new observation though, although I've no clue what could be
>> happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we can
>> "predict" the wakeup failure because exactly one (usually around the 20th)
>> of those 1000 reads will fail.
> 
> What does "fail" means here?

ioread32() returns all ones, that's interpreted as failure by 
mwifiex_read_reg().

> 
>> Maybe the firmware actually tries to wake up,
>> encounters an error somewhere in its wakeup routines and then goes down a
>> special failure code path. That code path keeps the cards CPU so busy that
>> at some point a PCI_VENDOR_ID request times out?
>>
>> Or well, maybe the card actually wakes up fine, but we don't receive the
>> interrupt on our end, so many possibilities...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-30 16:14                           ` Jonas Dreßler
@ 2021-09-30 16:19                             ` Pali Rohár
  2021-09-30 16:22                               ` Jonas Dreßler
  0 siblings, 1 reply; 32+ messages in thread
From: Pali Rohár @ 2021-09-30 16:19 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Andy Shevchenko, Brian Norris, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu, Kalle Valo, David S. Miller, Jakub Kicinski,
	Tsuchiya Yuto, linux-wireless, netdev, Linux Kernel, linux-pci,
	Maximilian Luz, Andy Shevchenko, Bjorn Helgaas

On Thursday 30 September 2021 18:14:04 Jonas Dreßler wrote:
> On 9/30/21 5:42 PM, Pali Rohár wrote:
> > On Thursday 30 September 2021 17:38:43 Jonas Dreßler wrote:
> > > On 9/23/21 10:22 PM, Pali Rohár wrote:
> > > > On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
> > > > > On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > > > > > On 9/22/21 2:50 PM, Jonas Dreßler wrote:
> > > > > 
> > > > > ...
> > > > > 
> > > > > > - Just calling mwifiex_write_reg() once and then blocking until the card
> > > > > > wakes up using my delay-loop doesn't fix the issue, it's actually
> > > > > > writing multiple times that fixes the issue
> > > > > > 
> > > > > > These observations sound a lot like writes (and even reads) are actually
> > > > > > being dropped, don't they?
> > > > > 
> > > > > It sounds like you're writing into a not ready (fully powered on) device.
> > > > 
> > > > This reminds me a discussion with Bjorn about CRS response returned
> > > > after firmware crash / reset when device is not ready yet:
> > > > https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/
> > > > 
> > > > Could not be this similar issue? You could check it via reading
> > > > PCI_VENDOR_ID register from config space. And if it is not valid value
> > > > then card is not really ready yet.
> > > > 
> > > > > To check this, try to put a busy loop for reading and check the value
> > > > > till it gets 0.
> > > > > 
> > > > > Something like
> > > > > 
> > > > >     unsigned int count = 1000;
> > > > > 
> > > > >     do {
> > > > >       if (mwifiex_read_reg(...) == 0)
> > > > >         break;
> > > > >     } while (--count);
> > > > > 
> > > > > 
> > > > > -- 
> > > > > With Best Regards,
> > > > > Andy Shevchenko
> > > 
> > > I've tried both reading PCI_VENDOR_ID and the firmware status using a busy
> > > loop now, but sadly none of them worked. It looks like the card always
> > > replies with the correct values even though it sometimes won't wake up after
> > > that.
> > > 
> > > I do have one new observation though, although I've no clue what could be
> > > happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we can
> > > "predict" the wakeup failure because exactly one (usually around the 20th)
> > > of those 1000 reads will fail.
> > 
> > What does "fail" means here?
> 
> ioread32() returns all ones, that's interpreted as failure by
> mwifiex_read_reg().

Ok. And can you check if PCI Bridge above this card has enabled CRSSVE
bit (CRSVisible in RootCtl+RootCap in lspci output)? To determinate if
Bridge could convert CRS response to all-ones as failed transaction.

> > 
> > > Maybe the firmware actually tries to wake up,
> > > encounters an error somewhere in its wakeup routines and then goes down a
> > > special failure code path. That code path keeps the cards CPU so busy that
> > > at some point a PCI_VENDOR_ID request times out?
> > > 
> > > Or well, maybe the card actually wakes up fine, but we don't receive the
> > > interrupt on our end, so many possibilities...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-30 16:19                             ` Pali Rohár
@ 2021-09-30 16:22                               ` Jonas Dreßler
  2021-09-30 16:39                                 ` Pali Rohár
  0 siblings, 1 reply; 32+ messages in thread
From: Jonas Dreßler @ 2021-09-30 16:22 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Andy Shevchenko, Brian Norris, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu, Kalle Valo, David S. Miller, Jakub Kicinski,
	Tsuchiya Yuto, linux-wireless, netdev, Linux Kernel, linux-pci,
	Maximilian Luz, Andy Shevchenko, Bjorn Helgaas

On 9/30/21 6:19 PM, Pali Rohár wrote:
> On Thursday 30 September 2021 18:14:04 Jonas Dreßler wrote:
>> On 9/30/21 5:42 PM, Pali Rohár wrote:
>>> On Thursday 30 September 2021 17:38:43 Jonas Dreßler wrote:
>>>> On 9/23/21 10:22 PM, Pali Rohár wrote:
>>>>> On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
>>>>>> On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
>>>>>>> On 9/22/21 2:50 PM, Jonas Dreßler wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>> - Just calling mwifiex_write_reg() once and then blocking until the card
>>>>>>> wakes up using my delay-loop doesn't fix the issue, it's actually
>>>>>>> writing multiple times that fixes the issue
>>>>>>>
>>>>>>> These observations sound a lot like writes (and even reads) are actually
>>>>>>> being dropped, don't they?
>>>>>>
>>>>>> It sounds like you're writing into a not ready (fully powered on) device.
>>>>>
>>>>> This reminds me a discussion with Bjorn about CRS response returned
>>>>> after firmware crash / reset when device is not ready yet:
>>>>> https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/
>>>>>
>>>>> Could not be this similar issue? You could check it via reading
>>>>> PCI_VENDOR_ID register from config space. And if it is not valid value
>>>>> then card is not really ready yet.
>>>>>
>>>>>> To check this, try to put a busy loop for reading and check the value
>>>>>> till it gets 0.
>>>>>>
>>>>>> Something like
>>>>>>
>>>>>>      unsigned int count = 1000;
>>>>>>
>>>>>>      do {
>>>>>>        if (mwifiex_read_reg(...) == 0)
>>>>>>          break;
>>>>>>      } while (--count);
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> With Best Regards,
>>>>>> Andy Shevchenko
>>>>
>>>> I've tried both reading PCI_VENDOR_ID and the firmware status using a busy
>>>> loop now, but sadly none of them worked. It looks like the card always
>>>> replies with the correct values even though it sometimes won't wake up after
>>>> that.
>>>>
>>>> I do have one new observation though, although I've no clue what could be
>>>> happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we can
>>>> "predict" the wakeup failure because exactly one (usually around the 20th)
>>>> of those 1000 reads will fail.
>>>
>>> What does "fail" means here?
>>
>> ioread32() returns all ones, that's interpreted as failure by
>> mwifiex_read_reg().
> 
> Ok. And can you check if PCI Bridge above this card has enabled CRSSVE
> bit (CRSVisible in RootCtl+RootCap in lspci output)? To determinate if
> Bridge could convert CRS response to all-ones as failed transaction.
> 

Seems like that bit is disabled:
 > RootCap: CRSVisible-
 > RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-


>>>
>>>> Maybe the firmware actually tries to wake up,
>>>> encounters an error somewhere in its wakeup routines and then goes down a
>>>> special failure code path. That code path keeps the cards CPU so busy that
>>>> at some point a PCI_VENDOR_ID request times out?
>>>>
>>>> Or well, maybe the card actually wakes up fine, but we don't receive the
>>>> interrupt on our end, so many possibilities...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] mwifiex: Use non-posted PCI register writes
  2021-09-30 16:22                               ` Jonas Dreßler
@ 2021-09-30 16:39                                 ` Pali Rohár
  0 siblings, 0 replies; 32+ messages in thread
From: Pali Rohár @ 2021-09-30 16:39 UTC (permalink / raw)
  To: Jonas Dreßler
  Cc: Andy Shevchenko, Brian Norris, Amitkumar Karwar, Ganapathi Bhat,
	Xinming Hu, Kalle Valo, David S. Miller, Jakub Kicinski,
	Tsuchiya Yuto, linux-wireless, netdev, Linux Kernel, linux-pci,
	Maximilian Luz, Andy Shevchenko, Bjorn Helgaas

On Thursday 30 September 2021 18:22:42 Jonas Dreßler wrote:
> On 9/30/21 6:19 PM, Pali Rohár wrote:
> > On Thursday 30 September 2021 18:14:04 Jonas Dreßler wrote:
> > > On 9/30/21 5:42 PM, Pali Rohár wrote:
> > > > On Thursday 30 September 2021 17:38:43 Jonas Dreßler wrote:
> > > > > On 9/23/21 10:22 PM, Pali Rohár wrote:
> > > > > > On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:
> > > > > > > On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:
> > > > > > > > On 9/22/21 2:50 PM, Jonas Dreßler wrote:
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > > - Just calling mwifiex_write_reg() once and then blocking until the card
> > > > > > > > wakes up using my delay-loop doesn't fix the issue, it's actually
> > > > > > > > writing multiple times that fixes the issue
> > > > > > > > 
> > > > > > > > These observations sound a lot like writes (and even reads) are actually
> > > > > > > > being dropped, don't they?
> > > > > > > 
> > > > > > > It sounds like you're writing into a not ready (fully powered on) device.
> > > > > > 
> > > > > > This reminds me a discussion with Bjorn about CRS response returned
> > > > > > after firmware crash / reset when device is not ready yet:
> > > > > > https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/
> > > > > > 
> > > > > > Could not be this similar issue? You could check it via reading
> > > > > > PCI_VENDOR_ID register from config space. And if it is not valid value
> > > > > > then card is not really ready yet.
> > > > > > 
> > > > > > > To check this, try to put a busy loop for reading and check the value
> > > > > > > till it gets 0.
> > > > > > > 
> > > > > > > Something like
> > > > > > > 
> > > > > > >      unsigned int count = 1000;
> > > > > > > 
> > > > > > >      do {
> > > > > > >        if (mwifiex_read_reg(...) == 0)
> > > > > > >          break;
> > > > > > >      } while (--count);
> > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > With Best Regards,
> > > > > > > Andy Shevchenko
> > > > > 
> > > > > I've tried both reading PCI_VENDOR_ID and the firmware status using a busy
> > > > > loop now, but sadly none of them worked. It looks like the card always
> > > > > replies with the correct values even though it sometimes won't wake up after
> > > > > that.
> > > > > 
> > > > > I do have one new observation though, although I've no clue what could be
> > > > > happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we can
> > > > > "predict" the wakeup failure because exactly one (usually around the 20th)
> > > > > of those 1000 reads will fail.
> > > > 
> > > > What does "fail" means here?
> > > 
> > > ioread32() returns all ones, that's interpreted as failure by
> > > mwifiex_read_reg().
> > 
> > Ok. And can you check if PCI Bridge above this card has enabled CRSSVE
> > bit (CRSVisible in RootCtl+RootCap in lspci output)? To determinate if
> > Bridge could convert CRS response to all-ones as failed transaction.
> > 
> 
> Seems like that bit is disabled:
> > RootCap: CRSVisible-
> > RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-

So it means that CRSSVE is unsupported by upper bridge. In case card
returns CRS response to system (via bridge) that is not ready yet,
bridge re-issue read request, and after some failures it returns to
system all-ones to indicate failed transaction. But all-ones can be
returned also by bridge when card does not send any response.

So from this test we do not know what happened. It would be nice to know
it, but such test requires to connect this card into system which
supports CRSSVE, in which case CRS response it passed directly to OS as
value 0xffff0001. Look at the link above where I discussed with Bjorn
about buggy wifi cards which resets internally, for more details.

But in this setup when CRSSVE is not supported, I think there is no
other option than just adding sleep prior accessing card...

For debugging such issues I got the only advice to use PCIe analyzer and
look at what is really happening on the bus. But required equipment for
this debugging is not cheap...

> > > > 
> > > > > Maybe the firmware actually tries to wake up,
> > > > > encounters an error somewhere in its wakeup routines and then goes down a
> > > > > special failure code path. That code path keeps the cards CPU so busy that
> > > > > at some point a PCI_VENDOR_ID request times out?
> > > > > 
> > > > > Or well, maybe the card actually wakes up fine, but we don't receive the
> > > > > interrupt on our end, so many possibilities...

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-09-30 16:39 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-30 12:37 [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Jonas Dreßler
2021-08-30 12:37 ` [PATCH 1/2] mwifiex: Use non-posted PCI register writes Jonas Dreßler
2021-08-30 12:49   ` Andy Shevchenko
2021-09-01 14:01     ` Jonas Dreßler
2021-09-01 15:47       ` Andy Shevchenko
2021-09-01 15:51       ` Pali Rohár
2021-09-01 16:51         ` Heiner Kallweit
2021-09-01 17:07           ` Johannes Berg
2021-09-01 19:07             ` Heiner Kallweit
2021-09-01 22:41             ` Bjorn Helgaas
2021-09-02 14:05               ` Bjorn Helgaas
2021-09-01 19:40   ` Brian Norris
2021-09-01 20:40     ` Andy Shevchenko
2021-09-01 21:04       ` Brian Norris
2021-09-01 21:07         ` Brian Norris
2021-09-18  7:37           ` Jonas Dreßler
2021-09-20 17:48             ` Brian Norris
2021-09-22 12:50               ` Jonas Dreßler
2021-09-23 15:28                 ` Jonas Dreßler
2021-09-23 19:37                   ` Andy Shevchenko
2021-09-23 19:41                   ` Andy Shevchenko
2021-09-23 20:22                     ` Pali Rohár
2021-09-30 15:38                       ` Jonas Dreßler
2021-09-30 15:42                         ` Pali Rohár
2021-09-30 16:14                           ` Jonas Dreßler
2021-09-30 16:19                             ` Pali Rohár
2021-09-30 16:22                               ` Jonas Dreßler
2021-09-30 16:39                                 ` Pali Rohár
2021-08-30 12:37 ` [PATCH 2/2] mwifiex: Try waking the firmware until we get an interrupt Jonas Dreßler
2021-08-30 12:51   ` Andy Shevchenko
2021-08-30 12:55     ` Andy Shevchenko
2021-09-25 17:32 ` [PATCH 0/2] mwifiex: Work around firmware bugs on 88W8897 chip Pali Rohár

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).