LKML Archive on lore.kernel.org
help / color / mirror / Atom feed
* [PATCH v13 0/6] Address error and recovery for AER and DPC
@ 2018-04-09 14:41 Oza Pawandeep
  2018-04-09 14:41 ` [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch set brings in error handling support for DPC

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC get
triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.

DPC should behave identical to AER as far as error handling is concerned.
DPC should remove the devices and not to do recovery for hotplug enabled system.

Changes since v13:
    Bjorn's and Keith's Comments addressed.
    > Made DPC and AER error handling identical <aligned err.c>
    > hanldled cases for hotplug enabled system differently.
Changes since v11:
    Bjorn's comments addressed.
    > rename pcie-err.c to err.c
    > removed EXPORT_SYMBOL
    > made generic find_serivce function in port driver.
    > removed mutex patch as no need to have mutex in pcie_do_recovery
    > brough in DPC_FATAL in aer.h
    > so now all the error codes (AER and DPC) are unified in aer.h
Changes since v10:
    Christoph Hellwig's, David Laight's and Randy Dunlap's
    comments addressed.
        > renamed pci_do_recovery to pcie_do_recovery
        > removed inner braces in conditional statements.
        > restrctured the code in pci_wait_for_link
        > EXPORT_SYMBOL_GPL
Changes since v9:
    Sinan's comments addressed.
        > bool active = true; unnecessary variable removed.
Changes since v8:
    Fixed Kbuild errors.
Changes since v7:
    Rebased the code on pci master
        > https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
Changes since v6:
    Sinan's and Stefan's comments implemented.
        > reordered patch 6 and 7
        > cleaned up
Changes since v5:
    Sinan's and Keith's comments incorporated.
        > made separate patch for mutex
        > unified error repotting codes into driver/pci/pci.h
        > got rid of wait link active/inactive and
          made generic function in driver/pci/pci.c
Changes since v4:
    Bjorn's comments incorporated.
        > Renamed only do_recovery.
        > moved the things more locally to drivers/pci/pci.h
Changes since v3:
    Bjorn's comments incorporated.
        > Made separate patch renaming generic pci_err.c
        > Introduce pci_err.h to contain all the error types and recovery
        > removed all the dependencies on pci.h
Changes since v2:
    Based on feedback from Keith:
    "
    When DPC is triggered due to receipt of an uncorrectable error Message,
    the Requester ID from the Message is recorded in the DPC Error
    Source ID register and that Message is discarded and not forwarded Upstream.
    "
    Removed the patch where AER checks if DPC service is active
Changes since v1:
    Kbuild errors fixed:
        > pci_find_dpc_dev made static
        > ras_event.h updated
        > pci_find_aer_service call with CONFIG check
        > pci_find_dpc_service call with CONFIG check

Oza Pawandeep (6):
  PCI/AER: Rename error recovery to generic PCI naming
  PCI/AER: Factor out error reporting from AER
  PCI/PORTDRV: Implement generic find service
  PCI/DPC: Unify and plumb error handling into DPC
  PCI: Unify wait for link active into generic PCI
  PCI/DPC: Do not do recovery for hotplug enabled system

 drivers/pci/hotplug/pciehp_hpc.c   |  20 +--
 drivers/pci/pci.c                  |  29 ++++
 drivers/pci/pci.h                  |   5 +
 drivers/pci/pcie/Makefile          |   2 +-
 drivers/pci/pcie/aer/aerdrv.h      |  30 ----
 drivers/pci/pcie/aer/aerdrv_core.c | 317 +---------------------------------
 drivers/pci/pcie/err.c             | 340 +++++++++++++++++++++++++++++++++++++
 drivers/pci/pcie/pcie-dpc.c        | 107 +++++++++---
 drivers/pci/pcie/portdrv.h         |   2 +
 drivers/pci/pcie/portdrv_core.c    |  45 +++++
 include/linux/aer.h                |   2 +
 11 files changed, 509 insertions(+), 390 deletions(-)
 create mode 100644 drivers/pci/pcie/err.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
@ 2018-04-09 14:41 ` Oza Pawandeep
  2018-04-09 23:14   ` Keith Busch
  2018-04-09 14:41 ` [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch renames error recovery to generic name with pcie prefix

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fcd8191..abc514e 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -342,6 +342,9 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 
 void pci_enable_acs(struct pci_dev *dev);
 
+/* PCI error reporting and recovery */
+void pcie_do_recovery(struct pci_dev *dev, int severity);
+
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea5..aeb83a0 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -478,7 +478,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 }
 
 /**
- * do_recovery - handle nonfatal/fatal error recovery process
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
  * @dev: pointer to a pci_dev data structure of agent detecting an error
  * @severity: error severity type
  *
@@ -486,7 +486,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
  * error detected message to all downstream drivers within a hierarchy in
  * question and return the returned code.
  */
-static void do_recovery(struct pci_dev *dev, int severity)
+static void pcie_do_recovery(struct pci_dev *dev, int severity)
 {
 	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
 	enum pci_channel_state state;
@@ -566,7 +566,7 @@ static void handle_error_source(struct pcie_device *aerdev,
 			pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 					info->status);
 	} else
-		do_recovery(dev, info->severity);
+		pcie_do_recovery(dev, info->severity);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -631,7 +631,7 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity != AER_CORRECTABLE)
-			do_recovery(pdev, entry.severity);
+			pcie_do_recovery(pdev, entry.severity);
 		pci_dev_put(pdev);
 	}
 }
-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
  2018-04-09 14:41 ` [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
@ 2018-04-09 14:41 ` Oza Pawandeep
  2018-04-09 23:15   ` Keith Busch
  2018-04-10 11:36   ` kbuild test robot
  2018-04-09 14:41 ` [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch factors out error reporting callbacks, which are currently
tightly coupled with AER.

DPC should be able to register callbacks and attempt recovery when DPC
trigger event occurs.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 223e4c3..f0b1a78 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -6,7 +6,7 @@
 # Build PCI Express ASPM if needed
 obj-$(CONFIG_PCIEASPM)		+= aspm.o
 
-pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o
+pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o err.o
 pcieportdrv-$(CONFIG_ACPI)	+= portdrv_acpi.o
 
 obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 5449e5c..bc9db53 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -76,36 +76,6 @@ struct aer_rpc {
 					 */
 };
 
-struct aer_broadcast_data {
-	enum pci_channel_state state;
-	enum pci_ers_result result;
-};
-
-static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
-		enum pci_ers_result new)
-{
-	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
-		return PCI_ERS_RESULT_NO_AER_DRIVER;
-
-	if (new == PCI_ERS_RESULT_NONE)
-		return orig;
-
-	switch (orig) {
-	case PCI_ERS_RESULT_CAN_RECOVER:
-	case PCI_ERS_RESULT_RECOVERED:
-		orig = new;
-		break;
-	case PCI_ERS_RESULT_DISCONNECT:
-		if (new == PCI_ERS_RESULT_NEED_RESET)
-			orig = PCI_ERS_RESULT_NEED_RESET;
-		break;
-	default:
-		break;
-	}
-
-	return orig;
-}
-
 extern struct bus_type pcie_port_bus_type;
 void aer_isr(struct work_struct *work);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index aeb83a0..4acec3b 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/kfifo.h>
 #include "aerdrv.h"
+#include "../../pci.h"
 
 #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
@@ -230,191 +231,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-static int report_error_detected(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	dev->error_state = result_data->state;
-
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->error_detected) {
-		if (result_data->state == pci_channel_io_frozen &&
-			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
-			/*
-			 * In case of fatal recovery, if one of down-
-			 * stream device has no driver. We might be
-			 * unable to recover because a later insmod
-			 * of a driver for this device is unaware of
-			 * its hw state.
-			 */
-			pci_printk(KERN_DEBUG, dev, "device has %s\n",
-				   dev->driver ?
-				   "no AER-aware driver" : "no driver");
-		}
-
-		/*
-		 * If there's any device in the subtree that does not
-		 * have an error_detected callback, returning
-		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
-		 * the subsequent mmio_enabled/slot_reset/resume
-		 * callbacks of "any" device in the subtree. All the
-		 * devices in the subtree are left in the error state
-		 * without recovery.
-		 */
-
-		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
-			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
-		else
-			vote = PCI_ERS_RESULT_NONE;
-	} else {
-		err_handler = dev->driver->err_handler;
-		vote = err_handler->error_detected(dev, result_data->state);
-		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
-	}
-
-	result_data->result = merge_result(result_data->result, vote);
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_mmio_enabled(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->mmio_enabled)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->mmio_enabled(dev);
-	result_data->result = merge_result(result_data->result, vote);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_slot_reset(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->slot_reset)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->slot_reset(dev);
-	result_data->result = merge_result(result_data->result, vote);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_resume(struct pci_dev *dev, void *data)
-{
-	const struct pci_error_handlers *err_handler;
-
-	device_lock(&dev->dev);
-	dev->error_state = pci_channel_io_normal;
-
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->resume)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	err_handler->resume(dev);
-	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-/**
- * broadcast_error_message - handle message broadcast to downstream drivers
- * @dev: pointer to from where in a hierarchy message is broadcasted down
- * @state: error state
- * @error_mesg: message to print
- * @cb: callback to be broadcasted
- *
- * Invoked during error recovery process. Once being invoked, the content
- * of error severity will be broadcasted to all downstream drivers in a
- * hierarchy in question.
- */
-static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
-	enum pci_channel_state state,
-	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
-{
-	struct aer_broadcast_data result_data;
-
-	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
-	result_data.state = state;
-	if (cb == report_error_detected)
-		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
-	else
-		result_data.result = PCI_ERS_RESULT_RECOVERED;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		if (cb == report_error_detected)
-			dev->error_state = state;
-		pci_walk_bus(dev->subordinate, cb, &result_data);
-		if (cb == report_resume) {
-			pci_cleanup_aer_uncorrect_error_status(dev);
-			dev->error_state = pci_channel_io_normal;
-		}
-	} else {
-		/*
-		 * If the error is reported by an end point, we think this
-		 * error is related to the upstream link of the end point.
-		 */
-		if (state == pci_channel_io_normal)
-			/*
-			 * the error is non fatal so the bus is ok, just invoke
-			 * the callback for the function that logged the error.
-			 */
-			cb(dev, &result_data);
-		else
-			pci_walk_bus(dev->bus, cb, &result_data);
-	}
-
-	return result_data.result;
-}
-
-/**
- * default_reset_link - default reset function
- * @dev: pointer to pci_dev data structure
- *
- * Invoked when performing link reset on a Downstream Port or a
- * Root Port with no aer driver.
- */
-static pci_ers_result_t default_reset_link(struct pci_dev *dev)
-{
-	pci_reset_bridge_secondary_bus(dev);
-	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
-	return PCI_ERS_RESULT_RECOVERED;
-}
-
 static int find_aer_service_iter(struct device *device, void *data)
 {
 	struct pcie_port_service_driver *service_driver, **drv;
@@ -432,7 +248,7 @@ static int find_aer_service_iter(struct device *device, void *data)
 	return 0;
 }
 
-static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
+struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
 {
 	struct pcie_port_service_driver *drv = NULL;
 
@@ -441,107 +257,6 @@ static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
 	return drv;
 }
 
-static pci_ers_result_t reset_link(struct pci_dev *dev)
-{
-	struct pci_dev *udev;
-	pci_ers_result_t status;
-	struct pcie_port_service_driver *driver;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/* Reset this port for all subordinates */
-		udev = dev;
-	} else {
-		/* Reset the upstream component (likely downstream port) */
-		udev = dev->bus->self;
-	}
-
-	/* Use the aer driver of the component firstly */
-	driver = find_aer_service(udev);
-
-	if (driver && driver->reset_link) {
-		status = driver->reset_link(udev);
-	} else if (udev->has_secondary_link) {
-		status = default_reset_link(udev);
-	} else {
-		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
-			pci_name(udev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	if (status != PCI_ERS_RESULT_RECOVERED) {
-		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
-			pci_name(udev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	return status;
-}
-
-/**
- * pcie_do_recovery - handle nonfatal/fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- * @severity: error severity type
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-static void pcie_do_recovery(struct pci_dev *dev, int severity)
-{
-	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
-	enum pci_channel_state state;
-
-	if (severity == AER_FATAL)
-		state = pci_channel_io_frozen;
-	else
-		state = pci_channel_io_normal;
-
-	status = broadcast_error_message(dev,
-			state,
-			"error_detected",
-			report_error_detected);
-
-	if (severity == AER_FATAL) {
-		result = reset_link(dev);
-		if (result != PCI_ERS_RESULT_RECOVERED)
-			goto failed;
-	}
-
-	if (status == PCI_ERS_RESULT_CAN_RECOVER)
-		status = broadcast_error_message(dev,
-				state,
-				"mmio_enabled",
-				report_mmio_enabled);
-
-	if (status == PCI_ERS_RESULT_NEED_RESET) {
-		/*
-		 * TODO: Should call platform-specific
-		 * functions to reset slot before calling
-		 * drivers' slot_reset callbacks?
-		 */
-		status = broadcast_error_message(dev,
-				state,
-				"slot_reset",
-				report_slot_reset);
-	}
-
-	if (status != PCI_ERS_RESULT_RECOVERED)
-		goto failed;
-
-	broadcast_error_message(dev,
-				state,
-				"resume",
-				report_resume);
-
-	pci_info(dev, "AER: Device recovery successful\n");
-	return;
-
-failed:
-	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-	/* TODO: Should kernel panic here? */
-	pci_info(dev, "AER: Device recovery failed\n");
-}
-
 /**
  * handle_error_source - handle logging error into an event log
  * @aerdev: pointer to pcie_device data structure of the root port
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
new file mode 100644
index 0000000..c48eb0a
--- /dev/null
+++ b/drivers/pci/pcie/err.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file implements the error recovery as a core part of PCIe error
+ * reporting. When a PCIe error is delivered, an error message will be
+ * collected and printed to console, then, an error recovery procedure
+ * will be executed by following the PCI error recovery rules.
+ *
+ * Copyright (C) 2006 Intel Corp.
+ *	Tom Long Nguyen (tom.l.nguyen@intel.com)
+ *	Zhang Yanmin (yanmin.zhang@intel.com)
+ *
+ */
+
+#include <linux/pci.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/aer.h>
+#include <linux/pcieport_if.h>
+#include "portdrv.h"
+
+struct aer_broadcast_data {
+	enum pci_channel_state state;
+	enum pci_ers_result result;
+};
+
+static pci_ers_result_t merge_result(enum pci_ers_result orig,
+				  enum pci_ers_result new)
+{
+	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
+		return PCI_ERS_RESULT_NO_AER_DRIVER;
+
+	if (new == PCI_ERS_RESULT_NONE)
+		return orig;
+
+	switch (orig) {
+	case PCI_ERS_RESULT_CAN_RECOVER:
+	case PCI_ERS_RESULT_RECOVERED:
+		orig = new;
+		break;
+	case PCI_ERS_RESULT_DISCONNECT:
+		if (new == PCI_ERS_RESULT_NEED_RESET)
+			orig = PCI_ERS_RESULT_NEED_RESET;
+		break;
+	default:
+		break;
+	}
+
+	return orig;
+}
+
+static int report_mmio_enabled(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->mmio_enabled)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	vote = err_handler->mmio_enabled(dev);
+	result_data->result = merge_result(result_data->result, vote);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_slot_reset(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->slot_reset)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	vote = err_handler->slot_reset(dev);
+	result_data->result = merge_result(result_data->result, vote);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_resume(struct pci_dev *dev, void *data)
+{
+	const struct pci_error_handlers *err_handler;
+
+	device_lock(&dev->dev);
+	dev->error_state = pci_channel_io_normal;
+
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->resume)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	err_handler->resume(dev);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_error_detected(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	dev->error_state = result_data->state;
+
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->error_detected) {
+		if (result_data->state == pci_channel_io_frozen &&
+			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
+			/*
+			 * In case of fatal recovery, if one of down-
+			 * stream device has no driver. We might be
+			 * unable to recover because a later insmod
+			 * of a driver for this device is unaware of
+			 * its hw state.
+			 */
+			pci_printk(KERN_DEBUG, dev, "device has %s\n",
+				   dev->driver ?
+				   "no error-aware driver" : "no driver");
+		}
+
+		/*
+		 * If there's any device in the subtree that does not
+		 * have an error_detected callback, returning
+		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
+		 * the subsequent mmio_enabled/slot_reset/resume
+		 * callbacks of "any" device in the subtree. All the
+		 * devices in the subtree are left in the error state
+		 * without recovery.
+		 */
+
+		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
+		else
+			vote = PCI_ERS_RESULT_NONE;
+	} else {
+		err_handler = dev->driver->err_handler;
+		vote = err_handler->error_detected(dev, result_data->state);
+	}
+
+	result_data->result = merge_result(result_data->result, vote);
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+/**
+ * default_reset_link - default reset function
+ * @dev: pointer to pci_dev data structure
+ *
+ * Invoked when performing link reset on a Downstream Port or a
+ * Root Port with no aer driver.
+ */
+static pci_ers_result_t default_reset_link(struct pci_dev *dev)
+{
+	pci_reset_bridge_secondary_bus(dev);
+	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+static pci_ers_result_t reset_link(struct pci_dev *dev)
+{
+	struct pci_dev *udev;
+	pci_ers_result_t status;
+	struct pcie_port_service_driver *driver = NULL;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/* Reset this port for all subordinates */
+		udev = dev;
+	} else {
+		/* Reset the upstream component (likely downstream port) */
+		udev = dev->bus->self;
+	}
+
+	/* Use the aer driver of the component firstly */
+	driver = find_aer_service(udev);
+
+	if (driver && driver->reset_link) {
+		status = driver->reset_link(udev);
+	} else if (udev->has_secondary_link) {
+		status = default_reset_link(udev);
+	} else {
+		pci_printk(KERN_DEBUG, dev,
+			"no link-reset support at upstream device %s\n",
+			pci_name(udev));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	if (status != PCI_ERS_RESULT_RECOVERED) {
+		pci_printk(KERN_DEBUG, dev,
+			"link reset at upstream device %s failed\n",
+			pci_name(udev));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	return status;
+}
+
+/**
+ * broadcast_error_message - handle message broadcast to downstream drivers
+ * @dev: pointer to where in a hierarchy message is broadcasted down
+ * @state: error state
+ * @error_mesg: message to print
+ * @cb: callback to be broadcast
+ *
+ * Invoked during error recovery process. Once being invoked, the content
+ * of error severity will be broadcast to all downstream drivers in a
+ * hierarchy in question.
+ */
+static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
+	enum pci_channel_state state,
+	char *error_mesg,
+	int (*cb)(struct pci_dev *, void *))
+{
+	struct aer_broadcast_data result_data;
+
+	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
+	result_data.state = state;
+	if (cb == report_error_detected)
+		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
+	else
+		result_data.result = PCI_ERS_RESULT_RECOVERED;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/*
+		 * If the error is reported by a bridge, we think this error
+		 * is related to the downstream link of the bridge, so we
+		 * do error recovery on all subordinates of the bridge instead
+		 * of the bridge and clear the error status of the bridge.
+		 */
+		if (cb == report_error_detected)
+			dev->error_state = state;
+		pci_walk_bus(dev->subordinate, cb, &result_data);
+		if (cb == report_resume) {
+			pci_cleanup_aer_uncorrect_error_status(dev);
+			dev->error_state = pci_channel_io_normal;
+		}
+	} else {
+		/*
+		 * If the error is reported by an end point, we think this
+		 * error is related to the upstream link of the end point.
+		 */
+		pci_walk_bus(dev->bus, cb, &result_data);
+	}
+
+	return result_data.result;
+}
+
+/**
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
+ * @dev: pointer to a pci_dev data structure of agent detecting an error
+ * @severity: error severity type
+ *
+ * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
+ * error detected message to all downstream drivers within a hierarchy in
+ * question and return the returned code.
+ */
+void pcie_do_recovery(struct pci_dev *dev, int severity)
+{
+	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
+	enum pci_channel_state state;
+
+	if (severity == AER_FATAL)
+		state = pci_channel_io_frozen;
+	else
+		state = pci_channel_io_normal;
+
+	status = broadcast_error_message(dev,
+			state,
+			"error_detected",
+			report_error_detected);
+
+	if (severity == AER_FATAL) {
+		result = reset_link(dev);
+		if (result != PCI_ERS_RESULT_RECOVERED)
+			goto failed;
+	}
+
+	if (status == PCI_ERS_RESULT_CAN_RECOVER)
+		status = broadcast_error_message(dev,
+				state,
+				"mmio_enabled",
+				report_mmio_enabled);
+
+	if (status == PCI_ERS_RESULT_NEED_RESET) {
+		/*
+		 * TODO: Should call platform-specific
+		 * functions to reset slot before calling
+		 * drivers' slot_reset callbacks?
+		 */
+		status = broadcast_error_message(dev,
+				state,
+				"slot_reset",
+				report_slot_reset);
+	}
+
+	if (status != PCI_ERS_RESULT_RECOVERED)
+		goto failed;
+
+	broadcast_error_message(dev,
+				state,
+				"resume",
+				report_resume);
+
+	dev_info(&dev->dev, "Device recovery successful\n");
+	return;
+
+failed:
+	/* TODO: Should kernel panic here? */
+	dev_info(&dev->dev, "Device recovery failed\n");
+}
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index a854bc5..9a8d0dd 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -79,4 +79,5 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask)
 static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
 #endif /* !CONFIG_ACPI */
 
+struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev);
 #endif /* _PORTDRV_H_ */
-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
  2018-04-09 14:41 ` [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
  2018-04-09 14:41 ` [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
@ 2018-04-09 14:41 ` Oza Pawandeep
  2018-04-09 23:15   ` Keith Busch
  2018-04-09 14:41 ` [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch implements generic pcie_port_find_service() routine.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 4acec3b..aeb8236 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -231,32 +231,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-static int find_aer_service_iter(struct device *device, void *data)
-{
-	struct pcie_port_service_driver *service_driver, **drv;
-
-	drv = (struct pcie_port_service_driver **) data;
-
-	if (device->bus == &pcie_port_bus_type && device->driver) {
-		service_driver = to_service_driver(device->driver);
-		if (service_driver->service == PCIE_PORT_SERVICE_AER) {
-			*drv = service_driver;
-			return 1;
-		}
-	}
-
-	return 0;
-}
-
-struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
-{
-	struct pcie_port_service_driver *drv = NULL;
-
-	device_for_each_child(&dev->dev, &drv, find_aer_service_iter);
-
-	return drv;
-}
-
 /**
  * handle_error_source - handle logging error into an event log
  * @aerdev: pointer to pcie_device data structure of the root port
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c48eb0a..98aeec4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -194,7 +194,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 	}
 
 	/* Use the aer driver of the component firstly */
-	driver = find_aer_service(udev);
+	driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
 
 	if (driver && driver->reset_link) {
 		status = driver->reset_link(udev);
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 9a8d0dd..419bdf3 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -79,5 +79,6 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask)
 static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
 #endif /* !CONFIG_ACPI */
 
-struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev);
+struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
+							u32 service);
 #endif /* _PORTDRV_H_ */
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index ef3bad4..94de1fa 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -23,6 +23,11 @@
 
 bool pciehp_msi_disabled;
 
+struct portdrv_service_data {
+	struct pcie_port_service_driver *drv;
+	u32 service;
+};
+
 static int __init pciehp_setup(char *str)
 {
 	if (!strncmp(str, "nomsi", 5))
@@ -414,6 +419,46 @@ static int remove_iter(struct device *dev, void *data)
 	return 0;
 }
 
+static int find_service_iter(struct device *device, void *data)
+{
+	struct pcie_port_service_driver *service_driver;
+	struct portdrv_service_data *pdrvs;
+	u32 service;
+
+	pdrvs = (struct portdrv_service_data *) data;
+	service = pdrvs->service;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		service_driver = to_service_driver(device->driver);
+		if (service_driver->service == service) {
+			pdrvs->drv = service_driver;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+/**
+ * pcie_port_find_service - find the service driver
+ * @dev: PCI Express port the service devices associated with
+ * @service: Service to find
+ *
+ * Find PCI Express port service driver associated with given service
+ */
+struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
+							u32 service)
+{
+	struct pcie_port_service_driver *drv;
+	struct portdrv_service_data pdrvs;
+
+	pdrvs.drv = NULL;
+	pdrvs.service = service;
+	device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
+
+	drv = pdrvs.drv;
+	return drv;
+}
+
 /**
  * pcie_port_device_remove - unregister PCI Express port service devices
  * @dev: PCI Express port the service devices to unregister are associated with
-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (2 preceding siblings ...)
  2018-04-09 14:41 ` [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
@ 2018-04-09 14:41 ` Oza Pawandeep
  2018-04-09 23:29   ` Keith Busch
  2018-04-09 14:41 ` [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Current DPC driver does not do recovery, e.g. calling end-point's driver's
callbacks, which sanitize the sw.

DPC driver implements link_reset callback, and calls pci_do_recovery().

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 98aeec4..d02e029 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -19,6 +19,7 @@
 #include <linux/aer.h>
 #include <linux/pcieport_if.h>
 #include "portdrv.h"
+#include "./../pci.h"
 
 struct aer_broadcast_data {
 	enum pci_channel_state state;
@@ -179,11 +180,12 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-static pci_ers_result_t reset_link(struct pci_dev *dev)
+static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
 {
 	struct pci_dev *udev;
 	pci_ers_result_t status;
 	struct pcie_port_service_driver *driver = NULL;
+	u32 service;
 
 	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
 		/* Reset this port for all subordinates */
@@ -193,8 +195,12 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 		udev = dev->bus->self;
 	}
 
-	/* Use the aer driver of the component firstly */
-	driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
+	if (severity == DPC_FATAL)
+		service = PCIE_PORT_SERVICE_DPC;
+	else
+		service = PCIE_PORT_SERVICE_AER;
+
+	driver = pcie_port_find_service(udev, service);
 
 	if (driver && driver->reset_link) {
 		status = driver->reset_link(udev);
@@ -281,7 +287,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
 	enum pci_channel_state state;
 
-	if (severity == AER_FATAL)
+	if ((severity == AER_FATAL) ||
+	    (severity == DPC_FATAL))
 		state = pci_channel_io_frozen;
 	else
 		state = pci_channel_io_normal;
@@ -291,8 +298,9 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 			"error_detected",
 			report_error_detected);
 
-	if (severity == AER_FATAL) {
-		result = reset_link(dev);
+	if ((severity == AER_FATAL) ||
+	    (severity == DPC_FATAL)) {
+		result = reset_link(dev, severity);
 		if (result != PCI_ERS_RESULT_RECOVERED)
 			goto failed;
 	}
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 38e40c6..517c8b4 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -13,6 +13,7 @@
 #include <linux/pcieport_if.h>
 #include "../pci.h"
 #include "aer/aerdrv.h"
+#include "portdrv.h"
 
 struct dpc_dev {
 	struct pcie_device	*dev;
@@ -45,6 +46,33 @@ static const char * const rp_pio_error_string[] = {
 	"Memory Request Completion Timeout",		 /* Bit Position 18 */
 };
 
+static int find_dpc_dev_iter(struct device *device, void *data)
+{
+	struct pcie_port_service_driver *service_driver;
+	struct device **dev;
+
+	dev = (struct device **) data;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		service_driver = to_service_driver(device->driver);
+		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
+			*dev = device;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
+{
+	struct device *dev = NULL;
+
+	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
+
+	return dev;
+}
+
 static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 {
 	unsigned long timeout = jiffies + HZ;
@@ -82,12 +110,25 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 		dev_warn(dev, "Link state not disabled for DPC event\n");
 }
 
-static void dpc_work(struct work_struct *work)
+/**
+ * dpc_reset_link - reset link DPC routine
+ * @pdev: pointer to Root Port's pci_dev data structure
+ *
+ * Invoked by Port Bus driver when performing link reset at Root Port.
+ */
+static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 {
-	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
-	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
 	struct pci_bus *parent = pdev->subordinate;
-	u16 cap = dpc->cap_pos, ctl;
+	struct pci_dev *dev, *temp;
+	struct dpc_dev *dpc;
+	struct pcie_device *pciedev;
+	struct device *devdpc;
+	u16 cap, ctl;
+
+	devdpc = pci_find_dpc_dev(pdev);
+	pciedev = to_pcie_device(devdpc);
+	dpc = get_service_data(pciedev);
+	cap = dpc->cap_pos;
 
 	pci_lock_rescan_remove();
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
@@ -104,21 +145,31 @@ static void dpc_work(struct work_struct *work)
 
 	dpc_wait_link_inactive(dpc);
 	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
-		return;
+		return PCI_ERS_RESULT_DISCONNECT;
 	if (dpc->rp_extensions && dpc->rp_pio_status) {
 		pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
 				       dpc->rp_pio_status);
 		dpc->rp_pio_status = 0;
 	}
 
-	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
+	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
 		PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
 			      ctl | PCI_EXP_DPC_CTL_INT_EN);
+
+	return PCI_ERS_RESULT_RECOVERED;
 }
 
+static void dpc_work(struct work_struct *work)
+{
+	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
+	struct pci_dev *pdev = dpc->dev->port;
+
+	/* From DPC point of view error is always FATAL. */
+	pcie_do_recovery(pdev, DPC_FATAL);
+}
 static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
 {
 	struct device *dev = &dpc->dev->device;
@@ -297,6 +348,7 @@ static struct pcie_port_service_driver dpcdriver = {
 	.service	= PCIE_PORT_SERVICE_DPC,
 	.probe		= dpc_probe,
 	.remove		= dpc_remove,
+	.reset_link     = dpc_reset_link,
 };
 
 static int __init dpc_service_init(void)
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 8f87bbe..9cfd0b8 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -15,6 +15,8 @@
 #define AER_FATAL			1
 #define AER_CORRECTABLE			2
 
+#define DPC_FATAL			4
+
 struct pci_dev;
 
 struct aer_header_log_regs {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (3 preceding siblings ...)
  2018-04-09 14:41 ` [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
@ 2018-04-09 14:41 ` Oza Pawandeep
  2018-04-09 23:25   ` Keith Busch
  2018-04-09 14:41 ` [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system Oza Pawandeep
  2018-04-16  3:16 ` [PATCH v13 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
  6 siblings, 1 reply; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Clients such as HP, DPC are using pcie_wait_link_active(), which waits
till the link becomes active or inactive.

Made generic function and moved it to drivers/pci/pci.c

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8..e0c2b8e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -231,25 +231,11 @@ bool pciehp_check_link_active(struct controller *ctrl)
 	return ret;
 }
 
-static void __pcie_wait_link_active(struct controller *ctrl, bool active)
-{
-	int timeout = 1000;
-
-	if (pciehp_check_link_active(ctrl) == active)
-		return;
-	while (timeout > 0) {
-		msleep(10);
-		timeout -= 10;
-		if (pciehp_check_link_active(ctrl) == active)
-			return;
-	}
-	ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
-			active ? "set" : "cleared");
-}
-
 static void pcie_wait_link_active(struct controller *ctrl)
 {
-	__pcie_wait_link_active(ctrl, true);
+	struct pci_dev *pdev = ctrl_dev(ctrl);
+
+	pcie_wait_for_link(pdev, true);
 }
 
 static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd1..bbea997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4176,6 +4176,35 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 	return 0;
 }
 
+/**
+ * pcie_wait_for_link - Wait for link till it's active/inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive ?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+{
+	int timeout = 1000;
+	bool ret;
+	u16 lnk_status;
+
+	for (;;) {
+		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
+		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
+		if (ret == active)
+			return true;
+		if (timeout <= 0)
+			break;
+		timeout -= 10;
+	}
+
+	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
+		 active ? "set" : "cleared");
+
+	return false;
+}
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index abc514e..5c44fbc 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -345,6 +345,8 @@ void pci_enable_acs(struct pci_dev *dev);
 /* PCI error reporting and recovery */
 void pcie_do_recovery(struct pci_dev *dev, int severity);
 
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
+
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 517c8b4..8e1553b 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -95,19 +95,9 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 
 static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 {
-	unsigned long timeout = jiffies + HZ;
 	struct pci_dev *pdev = dpc->dev->port;
-	struct device *dev = &dpc->dev->device;
-	u16 lnk_status;
 
-	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
-	while (lnk_status & PCI_EXP_LNKSTA_DLLLA &&
-					!time_after(jiffies, timeout)) {
-		msleep(10);
-		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
-	}
-	if (lnk_status & PCI_EXP_LNKSTA_DLLLA)
-		dev_warn(dev, "Link state not disabled for DPC event\n");
+	pcie_wait_for_link(pdev, false);
 }
 
 /**
-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (4 preceding siblings ...)
  2018-04-09 14:41 ` [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
@ 2018-04-09 14:41 ` Oza Pawandeep
  2018-04-10 21:03   ` Bjorn Helgaas
  2018-04-16  3:16 ` [PATCH v13 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
  6 siblings, 1 reply; 38+ messages in thread
From: Oza Pawandeep @ 2018-04-09 14:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

DPC and AER should attempt recovery in the same way, except the
cases where system is with hotplug enabled.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 8e1553b..6d9a841 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -108,8 +108,6 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
  */
 static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 {
-	struct pci_bus *parent = pdev->subordinate;
-	struct pci_dev *dev, *temp;
 	struct dpc_dev *dpc;
 	struct pcie_device *pciedev;
 	struct device *devdpc;
@@ -120,19 +118,6 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 	dpc = get_service_data(pciedev);
 	cap = dpc->cap_pos;
 
-	pci_lock_rescan_remove();
-	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
-					 bus_list) {
-		pci_dev_get(dev);
-		pci_dev_set_disconnected(dev, NULL);
-		if (pci_has_subordinate(dev))
-			pci_walk_bus(dev->subordinate,
-				     pci_dev_set_disconnected, NULL);
-		pci_stop_and_remove_bus_device(dev);
-		pci_dev_put(dev);
-	}
-	pci_unlock_rescan_remove();
-
 	dpc_wait_link_inactive(dpc);
 	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
 		return PCI_ERS_RESULT_DISCONNECT;
@@ -152,13 +137,37 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
+static void dpc_reset_link_remove_dev(struct pci_dev *pdev)
+{
+	struct pci_bus *parent = pdev->subordinate;
+	struct pci_dev *dev, *temp;
+
+	pci_lock_rescan_remove();
+	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
+					 bus_list) {
+		pci_dev_get(dev);
+		pci_dev_set_disconnected(dev, NULL);
+		if (pci_has_subordinate(dev))
+			pci_walk_bus(dev->subordinate,
+				     pci_dev_set_disconnected, NULL);
+		pci_stop_and_remove_bus_device(dev);
+		pci_dev_put(dev);
+	}
+	pci_unlock_rescan_remove();
+
+	dpc_reset_link(pdev);
+}
+
 static void dpc_work(struct work_struct *work)
 {
 	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
 	struct pci_dev *pdev = dpc->dev->port;
 
 	/* From DPC point of view error is always FATAL. */
-	pcie_do_recovery(pdev, DPC_FATAL);
+	if (!pdev->is_hotplug_bridge)
+		pcie_do_recovery(pdev, DPC_FATAL);
+	else
+		dpc_reset_link_remove_dev(pdev);
 }
 static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
 {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming
  2018-04-09 14:41 ` [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
@ 2018-04-09 23:14   ` Keith Busch
  0 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2018-04-09 23:14 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:49AM -0400, Oza Pawandeep wrote:
> This patch renames error recovery to generic name with pcie prefix
> 
> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

Looks fine.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER
  2018-04-09 14:41 ` [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
@ 2018-04-09 23:15   ` Keith Busch
  2018-04-10 11:36   ` kbuild test robot
  1 sibling, 0 replies; 38+ messages in thread
From: Keith Busch @ 2018-04-09 23:15 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:50AM -0400, Oza Pawandeep wrote:
> This patch factors out error reporting callbacks, which are currently
> tightly coupled with AER.
> 
> DPC should be able to register callbacks and attempt recovery when DPC
> trigger event occurs.
> 
> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

Looks fine.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service
  2018-04-09 14:41 ` [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
@ 2018-04-09 23:15   ` Keith Busch
  0 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2018-04-09 23:15 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:51AM -0400, Oza Pawandeep wrote:
> This patch implements generic pcie_port_find_service() routine.
> 
> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

Looks good.

Reviewed-by: Keith Busch <keith.busch@intel.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI
  2018-04-09 14:41 ` [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
@ 2018-04-09 23:25   ` Keith Busch
  2018-04-12  8:40     ` poza
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2018-04-09 23:25 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:53AM -0400, Oza Pawandeep wrote:
> +/**
> + * pcie_wait_for_link - Wait for link till it's active/inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive ?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +{
> +	int timeout = 1000;
> +	bool ret;
> +	u16 lnk_status;
> +
> +	for (;;) {
> +		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> +		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
> +		if (ret == active)
> +			return true;
> +		if (timeout <= 0)
> +			break;
> +		timeout -= 10;
> +	}

This is missing an msleep(10) at each iteration.

> +
> +	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
> +		 active ? "set" : "cleared");
> +
> +	return false;
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC
  2018-04-09 14:41 ` [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
@ 2018-04-09 23:29   ` Keith Busch
  2018-04-09 23:51     ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2018-04-09 23:29 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:52AM -0400, Oza Pawandeep wrote:
> +static int find_dpc_dev_iter(struct device *device, void *data)
> +{
> +	struct pcie_port_service_driver *service_driver;
> +	struct device **dev;
> +
> +	dev = (struct device **) data;
> +
> +	if (device->bus == &pcie_port_bus_type && device->driver) {
> +		service_driver = to_service_driver(device->driver);
> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
> +			*dev = device;
> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
> +{
> +	struct device *dev = NULL;
> +
> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
> +
> +	return dev;
> +}

The only caller of this doesn't seem to care to use struct device. This
should probably just extract struct dpc_dev directly from in here.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC
  2018-04-09 23:29   ` Keith Busch
@ 2018-04-09 23:51     ` Sinan Kaya
  2018-04-10  0:05       ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-09 23:51 UTC (permalink / raw)
  To: Keith Busch, Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Timur Tabi

On 4/9/2018 7:29 PM, Keith Busch wrote:
> On Mon, Apr 09, 2018 at 10:41:52AM -0400, Oza Pawandeep wrote:
>> +static int find_dpc_dev_iter(struct device *device, void *data)
>> +{
>> +	struct pcie_port_service_driver *service_driver;
>> +	struct device **dev;
>> +
>> +	dev = (struct device **) data;
>> +
>> +	if (device->bus == &pcie_port_bus_type && device->driver) {
>> +		service_driver = to_service_driver(device->driver);
>> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
>> +			*dev = device;
>> +			return 1;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
>> +{
>> +	struct device *dev = NULL;
>> +
>> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
>> +
>> +	return dev;
>> +}
> 
> The only caller of this doesn't seem to care to use struct device. This
> should probably just extract struct dpc_dev directly from in here.
> 

Bjorn wants to kill the port service driver infrastructure but that is a much
bigger task. 

How do we obtain the DPC object from the parent object directly? Each port
service driver object is a children.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC
  2018-04-09 23:51     ` Sinan Kaya
@ 2018-04-10  0:05       ` Sinan Kaya
  0 siblings, 0 replies; 38+ messages in thread
From: Sinan Kaya @ 2018-04-10  0:05 UTC (permalink / raw)
  To: Keith Busch, Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Timur Tabi

On 4/9/2018 7:51 PM, Sinan Kaya wrote:
> On 4/9/2018 7:29 PM, Keith Busch wrote:
>> On Mon, Apr 09, 2018 at 10:41:52AM -0400, Oza Pawandeep wrote:
>>> +static int find_dpc_dev_iter(struct device *device, void *data)
>>> +{
>>> +	struct pcie_port_service_driver *service_driver;
>>> +	struct device **dev;
>>> +
>>> +	dev = (struct device **) data;
>>> +
>>> +	if (device->bus == &pcie_port_bus_type && device->driver) {
>>> +		service_driver = to_service_driver(device->driver);
>>> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
>>> +			*dev = device;
>>> +			return 1;
>>> +		}
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
>>> +{
>>> +	struct device *dev = NULL;
>>> +
>>> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
>>> +
>>> +	return dev;
>>> +}
>>
>> The only caller of this doesn't seem to care to use struct device. This
>> should probably just extract struct dpc_dev directly from in here.
>>
> 
> Bjorn wants to kill the port service driver infrastructure but that is a much
> bigger task. 
> 
> How do we obtain the DPC object from the parent object directly? Each port
> service driver object is a children.
> 

How about implementing pcie_port_find_service_dev() as a follow up patch
to "PCI/PORTDRV: Implement generic find service" similar to what was done for
pcie_port_find_service()?

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER
  2018-04-09 14:41 ` [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
  2018-04-09 23:15   ` Keith Busch
@ 2018-04-10 11:36   ` kbuild test robot
  1 sibling, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2018-04-10 11:36 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: kbuild-all, Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi,
	Oza Pawandeep

[-- Attachment #1: Type: text/plain, Size: 2342 bytes --]

Hi Oza,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.16]
[cannot apply to pci/next linus/master next-20180410]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Oza-Pawandeep/Address-error-and-recovery-for-AER-and-DPC/20180410-045644
config: x86_64-randconfig-s4-04101607 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: the linux-review/Oza-Pawandeep/Address-error-and-recovery-for-AER-and-DPC/20180410-045644 HEAD db413d3e52b272ef8e6f03ff125e65468abd31e6 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   drivers/pci/pcie/err.o: In function `reset_link':
>> drivers/pci/pcie/err.c:197: undefined reference to `find_aer_service'

vim +197 drivers/pci/pcie/err.c

   181	
   182	static pci_ers_result_t reset_link(struct pci_dev *dev)
   183	{
   184		struct pci_dev *udev;
   185		pci_ers_result_t status;
   186		struct pcie_port_service_driver *driver = NULL;
   187	
   188		if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
   189			/* Reset this port for all subordinates */
   190			udev = dev;
   191		} else {
   192			/* Reset the upstream component (likely downstream port) */
   193			udev = dev->bus->self;
   194		}
   195	
   196		/* Use the aer driver of the component firstly */
 > 197		driver = find_aer_service(udev);
   198	
   199		if (driver && driver->reset_link) {
   200			status = driver->reset_link(udev);
   201		} else if (udev->has_secondary_link) {
   202			status = default_reset_link(udev);
   203		} else {
   204			pci_printk(KERN_DEBUG, dev,
   205				"no link-reset support at upstream device %s\n",
   206				pci_name(udev));
   207			return PCI_ERS_RESULT_DISCONNECT;
   208		}
   209	
   210		if (status != PCI_ERS_RESULT_RECOVERED) {
   211			pci_printk(KERN_DEBUG, dev,
   212				"link reset at upstream device %s failed\n",
   213				pci_name(udev));
   214			return PCI_ERS_RESULT_DISCONNECT;
   215		}
   216	
   217		return status;
   218	}
   219	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34614 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-09 14:41 ` [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system Oza Pawandeep
@ 2018-04-10 21:03   ` Bjorn Helgaas
  2018-04-12  1:41     ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-04-10 21:03 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:54AM -0400, Oza Pawandeep wrote:
> DPC and AER should attempt recovery in the same way, except the
> cases where system is with hotplug enabled.

What's the connection with hotplug?  I see from the patch that for
hotplug bridges you remove the tree below the bridge, and otherwise
you just reset the secondary link (I think).

The changelog should explain why we need the difference.

I'm a little skeptical to begin with, because I'm not sure why we
should handle a DPC event differently just because a bridge has the
*capability* of hotplug.  Even if a hotplug bridge reports a DPC
event, that doesn't necessarily mean a hotplug has occurred.

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index 8e1553b..6d9a841 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -108,8 +108,6 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
>   */
>  static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  {
> -	struct pci_bus *parent = pdev->subordinate;
> -	struct pci_dev *dev, *temp;
>  	struct dpc_dev *dpc;
>  	struct pcie_device *pciedev;
>  	struct device *devdpc;
> @@ -120,19 +118,6 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  	dpc = get_service_data(pciedev);
>  	cap = dpc->cap_pos;
>  
> -	pci_lock_rescan_remove();
> -	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
> -					 bus_list) {
> -		pci_dev_get(dev);
> -		pci_dev_set_disconnected(dev, NULL);
> -		if (pci_has_subordinate(dev))
> -			pci_walk_bus(dev->subordinate,
> -				     pci_dev_set_disconnected, NULL);
> -		pci_stop_and_remove_bus_device(dev);
> -		pci_dev_put(dev);
> -	}
> -	pci_unlock_rescan_remove();
> -
>  	dpc_wait_link_inactive(dpc);
>  	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
>  		return PCI_ERS_RESULT_DISCONNECT;
> @@ -152,13 +137,37 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  	return PCI_ERS_RESULT_RECOVERED;
>  }
>  
> +static void dpc_reset_link_remove_dev(struct pci_dev *pdev)
> +{
> +	struct pci_bus *parent = pdev->subordinate;
> +	struct pci_dev *dev, *temp;
> +
> +	pci_lock_rescan_remove();
> +	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
> +					 bus_list) {
> +		pci_dev_get(dev);
> +		pci_dev_set_disconnected(dev, NULL);
> +		if (pci_has_subordinate(dev))
> +			pci_walk_bus(dev->subordinate,
> +				     pci_dev_set_disconnected, NULL);
> +		pci_stop_and_remove_bus_device(dev);
> +		pci_dev_put(dev);
> +	}
> +	pci_unlock_rescan_remove();
> +
> +	dpc_reset_link(pdev);
> +}
> +
>  static void dpc_work(struct work_struct *work)
>  {
>  	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>  	struct pci_dev *pdev = dpc->dev->port;
>  
>  	/* From DPC point of view error is always FATAL. */
> -	pcie_do_recovery(pdev, DPC_FATAL);
> +	if (!pdev->is_hotplug_bridge)
> +		pcie_do_recovery(pdev, DPC_FATAL);
> +	else
> +		dpc_reset_link_remove_dev(pdev);
>  }
>  static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
>  {
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-10 21:03   ` Bjorn Helgaas
@ 2018-04-12  1:41     ` Sinan Kaya
  2018-04-12 14:06       ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-12  1:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 4/10/2018 5:03 PM, Bjorn Helgaas wrote:
>> DPC and AER should attempt recovery in the same way, except the
>> cases where system is with hotplug enabled.
> What's the connection with hotplug?  I see from the patch that for
> hotplug bridges you remove the tree below the bridge, and otherwise
> you just reset the secondary link (I think).
> 
> The changelog should explain why we need the difference.
> 
> I'm a little skeptical to begin with, because I'm not sure why we
> should handle a DPC event differently just because a bridge has the
> *capability* of hotplug.  Even if a hotplug bridge reports a DPC
> event, that doesn't necessarily mean a hotplug has occurred.
> 

Let's do a recap on what we have discussed about this until now.

There are two conflicting error recovery mechanisms for PCIe. 

If a system supports both hotplug and DPC, endpoint can be removed and inserted safely.
DPC driver shuts down the driver on link down. When link comes back up,
hotplug driver takes over and initiates an enumeration process. 
Keith mentioned the stop and re-enumerate design was chosen because 
someone could remove a drive and insert an unrelated drive back to the
system.  We can't really save and restore state as we do in the AER path. 

Now, let's assume a system without hotplug capability.
Second mechanism is to go through DPC/AER path and do an automatic link
down recovery via DPC retrain/secondary bus reset including register save and
restore.  Second mechanism is more suitable for handling "surprise link down"
event. The goal is to retrain the link and continue driver operation. 

The goal of this patch to separate these two cases from each other as
the DPC driver needs to work on both contexts. Current DPC code doesn't handle
the second use case.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI
  2018-04-09 23:25   ` Keith Busch
@ 2018-04-12  8:40     ` poza
  0 siblings, 0 replies; 38+ messages in thread
From: poza @ 2018-04-12  8:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-04-10 04:55, Keith Busch wrote:
> On Mon, Apr 09, 2018 at 10:41:53AM -0400, Oza Pawandeep wrote:
>> +/**
>> + * pcie_wait_for_link - Wait for link till it's active/inactive
>> + * @pdev: Bridge device
>> + * @active: waiting for active or inactive ?
>> + *
>> + * Use this to wait till link becomes active or inactive.
>> + */
>> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
>> +{
>> +	int timeout = 1000;
>> +	bool ret;
>> +	u16 lnk_status;
>> +
>> +	for (;;) {
>> +		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
>> +		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
>> +		if (ret == active)
>> +			return true;
>> +		if (timeout <= 0)
>> +			break;
>> +		timeout -= 10;
>> +	}
> 
> This is missing an msleep(10) at each iteration.

will take care.

> 
>> +
>> +	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
>> +		 active ? "set" : "cleared");
>> +
>> +	return false;
>> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12  1:41     ` Sinan Kaya
@ 2018-04-12 14:06       ` Bjorn Helgaas
  2018-04-12 14:34         ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-04-12 14:06 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi,
	Alex Williamson

[+cc Alex because of his interest in device reset]

For context, here's the part of the patch we're discussing:

> >>  static void dpc_work(struct work_struct *work)
> >>  {
> >>         struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
> >>         struct pci_dev *pdev = dpc->dev->port;
> >> 
> >>         /* From DPC point of view error is always FATAL. */
> >> -       pcie_do_recovery(pdev, DPC_FATAL);
> >> +       if (!pdev->is_hotplug_bridge)
> >> +               pcie_do_recovery(pdev, DPC_FATAL);
> >> +       else
> >> +               dpc_reset_link_remove_dev(pdev);
> >>  }

This is at the point where DPC has been triggered in the hardware and
the DPC driver is starting recovery, and I'm wondering why we need to
handle the "!pdev->is_hotplug_bridge" case differently.

On Wed, Apr 11, 2018 at 09:41:56PM -0400, Sinan Kaya wrote:
> On 4/10/2018 5:03 PM, Bjorn Helgaas wrote:
> >> DPC and AER should attempt recovery in the same way, except the
> >> cases where system is with hotplug enabled.
> > What's the connection with hotplug?  I see from the patch that for
> > hotplug bridges you remove the tree below the bridge, and otherwise
> > you just reset the secondary link (I think).
> > 
> > The changelog should explain why we need the difference.
> > 
> > I'm a little skeptical to begin with, because I'm not sure why we
> > should handle a DPC event differently just because a bridge has the
> > *capability* of hotplug.  Even if a hotplug bridge reports a DPC
> > event, that doesn't necessarily mean a hotplug has occurred.
> 
> Let's do a recap on what we have discussed about this until now.
> 
> There are two conflicting error recovery mechanisms for PCIe. 
> 
> If a system supports both hotplug and DPC, endpoint can be removed
> and inserted safely.  DPC driver shuts down the driver on link down.
> When link comes back up, hotplug driver takes over and initiates an
> enumeration process.  Keith mentioned the stop and re-enumerate
> design was chosen because someone could remove a drive and insert an
> unrelated drive back to the system.  We can't really save and
> restore state as we do in the AER path. 
> 
> Now, let's assume a system without hotplug capability.  Second
> mechanism is to go through DPC/AER path and do an automatic link
> down recovery via DPC retrain/secondary bus reset including register
> save and restore.  Second mechanism is more suitable for handling
> "surprise link down" event. The goal is to retrain the link and
> continue driver operation. 
> 
> The goal of this patch to separate these two cases from each other
> as the DPC driver needs to work on both contexts. Current DPC code
> doesn't handle the second use case.

I think the scenario you are describing is two systems that are
identical except that in the first, the endpoint is below a hotplug
bridge, while in the second, it's below a non-hotplug bridge.  There's
no physical hotplug (no drive removed or inserted), and DPC is
triggered in both systems.

I suggest that DPC should be handled identically in both systems:

  - The PCI core should have the same view of the endpoint: it should
    be removed and re-added in both cases (or in neither case).

  - The endpoint itself should not be able to tell the difference: it
    should see a link down event, followed by a link retrain, followed
    by the same sequence of config accesses, etc.

  - The endpoint driver should not be able to tell the difference,
    i.e., we should be calling the same pci_error_handlers callbacks
    in both cases.

It's true that in the non-hotplug system, pciehp probably won't start
re-enumeration, so we might need an alternate path to trigger that.

But that's not what we're doing in this patch.  In this patch we're
adding a much bigger difference: for hotplug bridges, we stop and
remove the hierarchy below the bridge; for non-hotplug bridges, we do
the AER-style flow of calling pci_error_handlers callbacks.

Bjorn

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 14:06       ` Bjorn Helgaas
@ 2018-04-12 14:34         ` Sinan Kaya
  2018-04-12 14:39           ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-12 14:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi,
	Alex Williamson

On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> [+cc Alex because of his interest in device reset]
> 
> For context, here's the part of the patch we're discussing:
> 
>>>>  static void dpc_work(struct work_struct *work)
>>>>  {
>>>>         struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>>>>         struct pci_dev *pdev = dpc->dev->port;
>>>>
>>>>         /* From DPC point of view error is always FATAL. */
>>>> -       pcie_do_recovery(pdev, DPC_FATAL);
>>>> +       if (!pdev->is_hotplug_bridge)
>>>> +               pcie_do_recovery(pdev, DPC_FATAL);
>>>> +       else
>>>> +               dpc_reset_link_remove_dev(pdev);
>>>>  }
> 
> This is at the point where DPC has been triggered in the hardware and
> the DPC driver is starting recovery, and I'm wondering why we need to
> handle the "!pdev->is_hotplug_bridge" case differently.
> 
> On Wed, Apr 11, 2018 at 09:41:56PM -0400, Sinan Kaya wrote:
>> On 4/10/2018 5:03 PM, Bjorn Helgaas wrote:
>>>> DPC and AER should attempt recovery in the same way, except the
>>>> cases where system is with hotplug enabled.
>>> What's the connection with hotplug?  I see from the patch that for
>>> hotplug bridges you remove the tree below the bridge, and otherwise
>>> you just reset the secondary link (I think).
>>>
>>> The changelog should explain why we need the difference.
>>>
>>> I'm a little skeptical to begin with, because I'm not sure why we
>>> should handle a DPC event differently just because a bridge has the
>>> *capability* of hotplug.  Even if a hotplug bridge reports a DPC
>>> event, that doesn't necessarily mean a hotplug has occurred.
>>
>> Let's do a recap on what we have discussed about this until now.
>>
>> There are two conflicting error recovery mechanisms for PCIe. 
>>
>> If a system supports both hotplug and DPC, endpoint can be removed
>> and inserted safely.  DPC driver shuts down the driver on link down.
>> When link comes back up, hotplug driver takes over and initiates an
>> enumeration process.  Keith mentioned the stop and re-enumerate
>> design was chosen because someone could remove a drive and insert an
>> unrelated drive back to the system.  We can't really save and
>> restore state as we do in the AER path. 
>>
>> Now, let's assume a system without hotplug capability.  Second
>> mechanism is to go through DPC/AER path and do an automatic link
>> down recovery via DPC retrain/secondary bus reset including register
>> save and restore.  Second mechanism is more suitable for handling
>> "surprise link down" event. The goal is to retrain the link and
>> continue driver operation. 
>>
>> The goal of this patch to separate these two cases from each other
>> as the DPC driver needs to work on both contexts. Current DPC code
>> doesn't handle the second use case.
> 
> I think the scenario you are describing is two systems that are
> identical except that in the first, the endpoint is below a hotplug
> bridge, while in the second, it's below a non-hotplug bridge.  There's
> no physical hotplug (no drive removed or inserted), and DPC is
> triggered in both systems.
> 
> I suggest that DPC should be handled identically in both systems:
> 
>   - The PCI core should have the same view of the endpoint: it should
>     be removed and re-added in both cases (or in neither case).
> 
>   - The endpoint itself should not be able to tell the difference: it
>     should see a link down event, followed by a link retrain, followed
>     by the same sequence of config accesses, etc.
> 
>   - The endpoint driver should not be able to tell the difference,
>     i.e., we should be calling the same pci_error_handlers callbacks
>     in both cases.
> 
> It's true that in the non-hotplug system, pciehp probably won't start
> re-enumeration, so we might need an alternate path to trigger that.
> 
> But that's not what we're doing in this patch.  In this patch we're
> adding a much bigger difference: for hotplug bridges, we stop and
> remove the hierarchy below the bridge; for non-hotplug bridges, we do
> the AER-style flow of calling pci_error_handlers callbacks.

Our approach on V12 was to go to AER style recovery for all DPC events
regardless of hotplug support or not. 

Keith was not comfortable with this approach. That's why, we special cased
hotplug.

If we drop 6/6 on this patch on v13, we achieve this. We still have to
take care of Keith's inputs on individual patches.

we have been struggling with the direction for a while.

Keith, what do you think?

> 
> Bjorn
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 14:34         ` Sinan Kaya
@ 2018-04-12 14:39           ` Keith Busch
  2018-04-12 15:02             ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2018-04-12 14:39 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
> On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> > 
> > I think the scenario you are describing is two systems that are
> > identical except that in the first, the endpoint is below a hotplug
> > bridge, while in the second, it's below a non-hotplug bridge.  There's
> > no physical hotplug (no drive removed or inserted), and DPC is
> > triggered in both systems.
> > 
> > I suggest that DPC should be handled identically in both systems:
> > 
> >   - The PCI core should have the same view of the endpoint: it should
> >     be removed and re-added in both cases (or in neither case).
> > 
> >   - The endpoint itself should not be able to tell the difference: it
> >     should see a link down event, followed by a link retrain, followed
> >     by the same sequence of config accesses, etc.
> > 
> >   - The endpoint driver should not be able to tell the difference,
> >     i.e., we should be calling the same pci_error_handlers callbacks
> >     in both cases.
> > 
> > It's true that in the non-hotplug system, pciehp probably won't start
> > re-enumeration, so we might need an alternate path to trigger that.
> > 
> > But that's not what we're doing in this patch.  In this patch we're
> > adding a much bigger difference: for hotplug bridges, we stop and
> > remove the hierarchy below the bridge; for non-hotplug bridges, we do
> > the AER-style flow of calling pci_error_handlers callbacks.
> 
> Our approach on V12 was to go to AER style recovery for all DPC events
> regardless of hotplug support or not. 
> 
> Keith was not comfortable with this approach. That's why, we special cased
> hotplug.
> 
> If we drop 6/6 on this patch on v13, we achieve this. We still have to
> take care of Keith's inputs on individual patches.
> 
> we have been struggling with the direction for a while.
> 
> Keith, what do you think?

My only concern was for existing production environments that use DPC
for handling surprise removal, and I don't wish to break the existing
uses.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 14:39           ` Keith Busch
@ 2018-04-12 15:02             ` Keith Busch
  2018-04-12 16:27               ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2018-04-12 15:02 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Thu, Apr 12, 2018 at 08:39:54AM -0600, Keith Busch wrote:
> On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
> > On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> > > 
> > > I think the scenario you are describing is two systems that are
> > > identical except that in the first, the endpoint is below a hotplug
> > > bridge, while in the second, it's below a non-hotplug bridge.  There's
> > > no physical hotplug (no drive removed or inserted), and DPC is
> > > triggered in both systems.
> > > 
> > > I suggest that DPC should be handled identically in both systems:
> > > 
> > >   - The PCI core should have the same view of the endpoint: it should
> > >     be removed and re-added in both cases (or in neither case).
> > > 
> > >   - The endpoint itself should not be able to tell the difference: it
> > >     should see a link down event, followed by a link retrain, followed
> > >     by the same sequence of config accesses, etc.
> > > 
> > >   - The endpoint driver should not be able to tell the difference,
> > >     i.e., we should be calling the same pci_error_handlers callbacks
> > >     in both cases.
> > > 
> > > It's true that in the non-hotplug system, pciehp probably won't start
> > > re-enumeration, so we might need an alternate path to trigger that.
> > > 
> > > But that's not what we're doing in this patch.  In this patch we're
> > > adding a much bigger difference: for hotplug bridges, we stop and
> > > remove the hierarchy below the bridge; for non-hotplug bridges, we do
> > > the AER-style flow of calling pci_error_handlers callbacks.
> > 
> > Our approach on V12 was to go to AER style recovery for all DPC events
> > regardless of hotplug support or not. 
> > 
> > Keith was not comfortable with this approach. That's why, we special cased
> > hotplug.
> > 
> > If we drop 6/6 on this patch on v13, we achieve this. We still have to
> > take care of Keith's inputs on individual patches.
> > 
> > we have been struggling with the direction for a while.
> > 
> > Keith, what do you think?
> 
> My only concern was for existing production environments that use DPC
> for handling surprise removal, and I don't wish to break the existing
> uses.

Also, I thought the plan was to keep hotplug and non-hotplug the same,
except for the very end: if not a hotplug bridge, initiate the rescan
automatically after releasing from containment, otherwise let pciehp
handle it when the link reactivates.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 15:02             ` Keith Busch
@ 2018-04-12 16:27               ` Sinan Kaya
  2018-04-12 17:09                 ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-12 16:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 4/12/2018 11:02 AM, Keith Busch wrote:
> On Thu, Apr 12, 2018 at 08:39:54AM -0600, Keith Busch wrote:
>> On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
>>> On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
>>>>
>>>> I think the scenario you are describing is two systems that are
>>>> identical except that in the first, the endpoint is below a hotplug
>>>> bridge, while in the second, it's below a non-hotplug bridge.  There's
>>>> no physical hotplug (no drive removed or inserted), and DPC is
>>>> triggered in both systems.
>>>>
>>>> I suggest that DPC should be handled identically in both systems:
>>>>
>>>>   - The PCI core should have the same view of the endpoint: it should
>>>>     be removed and re-added in both cases (or in neither case).
>>>>
>>>>   - The endpoint itself should not be able to tell the difference: it
>>>>     should see a link down event, followed by a link retrain, followed
>>>>     by the same sequence of config accesses, etc.
>>>>
>>>>   - The endpoint driver should not be able to tell the difference,
>>>>     i.e., we should be calling the same pci_error_handlers callbacks
>>>>     in both cases.
>>>>
>>>> It's true that in the non-hotplug system, pciehp probably won't start
>>>> re-enumeration, so we might need an alternate path to trigger that.
>>>>
>>>> But that's not what we're doing in this patch.  In this patch we're
>>>> adding a much bigger difference: for hotplug bridges, we stop and
>>>> remove the hierarchy below the bridge; for non-hotplug bridges, we do
>>>> the AER-style flow of calling pci_error_handlers callbacks.
>>>
>>> Our approach on V12 was to go to AER style recovery for all DPC events
>>> regardless of hotplug support or not. 
>>>
>>> Keith was not comfortable with this approach. That's why, we special cased
>>> hotplug.
>>>
>>> If we drop 6/6 on this patch on v13, we achieve this. We still have to
>>> take care of Keith's inputs on individual patches.
>>>
>>> we have been struggling with the direction for a while.
>>>
>>> Keith, what do you think?
>>
>> My only concern was for existing production environments that use DPC
>> for handling surprise removal, and I don't wish to break the existing
>> uses.
> 
> Also, I thought the plan was to keep hotplug and non-hotplug the same,
> except for the very end: if not a hotplug bridge, initiate the rescan
> automatically after releasing from containment, otherwise let pciehp
> handle it when the link reactivates.
> 

Hmm...

AER driver doesn't do stop and rescan approach for fatal errors. AER driver
makes an error callback followed by secondary bus reset and finally driver
the resume callback on the endpoint only if link recovery is successful.
Otherwise, AER driver bails out with recovery unsuccessful message.

Why do we need an additional rescan in the DPC driver if the link is up
and driver resumes operation?

If hotplug is supported and somebody removed the device, link won't come up.
The AER error recovery sequence will fail after timeout.

When the drive is inserted, hotplug driver observes a link up interrupt,
Hotplug driver does a rescan. Drive is functional one more time. 

This should satisfy both use cases, right?


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 16:27               ` Sinan Kaya
@ 2018-04-12 17:09                 ` Keith Busch
  2018-04-12 17:41                   ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2018-04-12 17:09 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Thu, Apr 12, 2018 at 12:27:20PM -0400, Sinan Kaya wrote:
> On 4/12/2018 11:02 AM, Keith Busch wrote:
> > 
> > Also, I thought the plan was to keep hotplug and non-hotplug the same,
> > except for the very end: if not a hotplug bridge, initiate the rescan
> > automatically after releasing from containment, otherwise let pciehp
> > handle it when the link reactivates.
> > 
> 
> Hmm...
> 
> AER driver doesn't do stop and rescan approach for fatal errors. AER driver
> makes an error callback followed by secondary bus reset and finally driver
> the resume callback on the endpoint only if link recovery is successful.
> Otherwise, AER driver bails out with recovery unsuccessful message.

I'm not sure if that's necessarily true. People have reported AER
handling triggers PCIe hotplug events, and creates some interesting race
conditions:

https://marc.info/?l=linux-pci&m=152336615707640&w=2

https://www.spinics.net/lists/linux-pci/msg70614.html

> Why do we need an additional rescan in the DPC driver if the link is up
> and driver resumes operation?

I thought the plan was to have DPC always go through the removal path
to ensure all devices are properly configured when containment is
released. In order to reconfigure those, you'll need to initiate the
rescan from somewhere.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 17:09                 ` Keith Busch
@ 2018-04-12 17:41                   ` Sinan Kaya
  2018-04-14 15:53                     ` Sinan Kaya
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-12 17:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 4/12/2018 1:09 PM, Keith Busch wrote:
> On Thu, Apr 12, 2018 at 12:27:20PM -0400, Sinan Kaya wrote:
>> On 4/12/2018 11:02 AM, Keith Busch wrote:
>>>
>>> Also, I thought the plan was to keep hotplug and non-hotplug the same,
>>> except for the very end: if not a hotplug bridge, initiate the rescan
>>> automatically after releasing from containment, otherwise let pciehp
>>> handle it when the link reactivates.
>>>
>>
>> Hmm...
>>
>> AER driver doesn't do stop and rescan approach for fatal errors. AER driver
>> makes an error callback followed by secondary bus reset and finally driver
>> the resume callback on the endpoint only if link recovery is successful.
>> Otherwise, AER driver bails out with recovery unsuccessful message.
> 
> I'm not sure if that's necessarily true. People have reported AER
> handling triggers PCIe hotplug events, and creates some interesting race
> conditions:

By reading the code, I don't see a stop and rescan in the AER error recovery
path.

As both logs indicate, stop and rescan is initiated in response to link down
and link up interrupts triggered by the secondary bus reset. 
The SW entity handling these is not AER driver. It is the hotplug driver
running asynchronous to the AER driver.

AER driver should have tried a slot reset before attempting to do a secondary
bus reset.

/**
 * pci_reset_slot - reset a PCI slot
 * @slot: PCI slot to reset
 *
 * A PCI bus may host multiple slots, each slot may support a reset mechanism
 * independent of other slots.  For instance, some slots may support slot power
 * control.  In the case of a 1:1 bus to slot architecture, this function may
 * wrap the bus reset to avoid spurious slot related events such as hotplug.
 * Generally a slot reset should be attempted before a bus reset.  All of the
 * function of the slot and any subordinate buses behind the slot are reset
 * through this function.  PCI config space of all devices in the slot and
 * behind the slot is saved before and restored after reset.
 *
 * Return 0 on success, non-zero on error.
 */
int pci_reset_slot(struct pci_slot *slot)

Slot reset is there to mask hotplug interrupts before the reset and unmask them
after reset.

> 
> https://marc.info/?l=linux-pci&m=152336615707640&w=2
> 
> https://www.spinics.net/lists/linux-pci/msg70614.html
> 
>> Why do we need an additional rescan in the DPC driver if the link is up
>> and driver resumes operation?
> 
> I thought the plan was to have DPC always go through the removal path
> to ensure all devices are properly configured when containment is
> released. In order to reconfigure those, you'll need to initiate the
> rescan from somewhere.
> 

This is where the contradiction is. 

Bjorn is asking for a unified error handling for both AER and DPC.

Current AER error recovery framework is error callback + secondary
bus reset + resume callback.

How does this stop + rescan model fit?

Do we want to change the error recovery framework? I suppose this will 
become a bigger conversation as there are more customers of this.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-12 17:41                   ` Sinan Kaya
@ 2018-04-14 15:53                     ` Sinan Kaya
  2018-04-16  3:17                       ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-14 15:53 UTC (permalink / raw)
  To: Keith Busch, Bjorn Helgaas
  Cc: Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

Hi Keith, Bjorn;

On 4/12/2018 1:41 PM, Sinan Kaya wrote:
> On 4/12/2018 1:09 PM, Keith Busch wrote:
>> On Thu, Apr 12, 2018 at 12:27:20PM -0400, Sinan Kaya wrote:
>>> On 4/12/2018 11:02 AM, Keith Busch wrote:
>>>>
>>>> Also, I thought the plan was to keep hotplug and non-hotplug the same,
>>>> except for the very end: if not a hotplug bridge, initiate the rescan
>>>> automatically after releasing from containment, otherwise let pciehp
>>>> handle it when the link reactivates.
>>>>
>>>
>>> Hmm...
>>>
>>> AER driver doesn't do stop and rescan approach for fatal errors. AER driver
>>> makes an error callback followed by secondary bus reset and finally driver
>>> the resume callback on the endpoint only if link recovery is successful.
>>> Otherwise, AER driver bails out with recovery unsuccessful message.
>>
>> I'm not sure if that's necessarily true. People have reported AER
>> handling triggers PCIe hotplug events, and creates some interesting race
>> conditions:
> 
> By reading the code, I don't see a stop and rescan in the AER error recovery
> path.
> 
> As both logs indicate, stop and rescan is initiated in response to link down
> and link up interrupts triggered by the secondary bus reset. 
> The SW entity handling these is not AER driver. It is the hotplug driver
> running asynchronous to the AER driver.
> 
> AER driver should have tried a slot reset before attempting to do a secondary
> bus reset.
> 
> /**
>  * pci_reset_slot - reset a PCI slot
>  * @slot: PCI slot to reset
>  *
>  * A PCI bus may host multiple slots, each slot may support a reset mechanism
>  * independent of other slots.  For instance, some slots may support slot power
>  * control.  In the case of a 1:1 bus to slot architecture, this function may
>  * wrap the bus reset to avoid spurious slot related events such as hotplug.
>  * Generally a slot reset should be attempted before a bus reset.  All of the
>  * function of the slot and any subordinate buses behind the slot are reset
>  * through this function.  PCI config space of all devices in the slot and
>  * behind the slot is saved before and restored after reset.
>  *
>  * Return 0 on success, non-zero on error.
>  */
> int pci_reset_slot(struct pci_slot *slot)
> 
> Slot reset is there to mask hotplug interrupts before the reset and unmask them
> after reset.
> 
>>
>> https://marc.info/?l=linux-pci&m=152336615707640&w=2
>>
>> https://www.spinics.net/lists/linux-pci/msg70614.html
>>
>>> Why do we need an additional rescan in the DPC driver if the link is up
>>> and driver resumes operation?
>>
>> I thought the plan was to have DPC always go through the removal path
>> to ensure all devices are properly configured when containment is
>> released. In order to reconfigure those, you'll need to initiate the
>> rescan from somewhere.
>>
> 
> This is where the contradiction is. 
> 
> Bjorn is asking for a unified error handling for both AER and DPC.
> 
> Current AER error recovery framework is error callback + secondary
> bus reset + resume callback.
> 
> How does this stop + rescan model fit?
> 
> Do we want to change the error recovery framework? I suppose this will 
> become a bigger conversation as there are more customers of this.
> 

I also want to highlight that the PCI Error recovery sequence is well
documented here.

https://www.kernel.org/doc/Documentation/PCI/pci-error-recovery.txt

We don't really have to guess what Linux does. 

IMO, the hotplug issues Keith is seeing are orthogonal and needs to be
addressed independent of this series by following the pci slot reset
procedure.

Hotplug driver handles link up/down events due to insertion/removal.
Hotplug driver is expected to do the re-enumeration.

I don't understand why we need to do another re-enumeration if system
observes a PCIe error handled by the AER/DPC driver. 

These two are independent events.

PCIe error recovery framework does the reset callback + SBR + resume
behavior today.

Bjorn,

You indicated that you want to unify the AER and DPC behavior. Let's
settle on what we want to do one more time. We have been going forth
and back on the direction.

We are on V13. I hope we won't hit V20 :)

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 0/6] Address error and recovery for AER and DPC
  2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (5 preceding siblings ...)
  2018-04-09 14:41 ` [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system Oza Pawandeep
@ 2018-04-16  3:16 ` Bjorn Helgaas
  2018-04-16  3:53   ` Sinan Kaya
  6 siblings, 1 reply; 38+ messages in thread
From: Bjorn Helgaas @ 2018-04-16  3:16 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
> 
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC get
> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> 
> DPC should behave identical to AER as far as error handling is concerned.
> DPC should remove the devices and not to do recovery for hotplug enabled system.

Is there a specific bug that's fixed by these patches?  I didn't see
one mentioned in the changelogs.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-14 15:53                     ` Sinan Kaya
@ 2018-04-16  3:17                       ` Bjorn Helgaas
  2018-04-16  5:33                         ` poza
  2018-04-16 14:46                         ` Sinan Kaya
  0 siblings, 2 replies; 38+ messages in thread
From: Bjorn Helgaas @ 2018-04-16  3:17 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Keith Busch, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote:

> You indicated that you want to unify the AER and DPC behavior. Let's
> settle on what we want to do one more time. We have been going forth
> and back on the direction.

My thinking is that as much as possible, similar events should be
handled similarly, whether the mechanism is AER, DPC, EEH, etc.
Ideally, drivers shouldn't have to be aware of which mechanism is in
use.

Error recovery includes conventional PCI as well, but right now I
think we're only concerned with PCIe.  The following error types are
from PCIe r4.0, sec 6.2.2:

  ERR_COR
    Corrected by hardware with no software intervention.  Software
    involved for logging only.

    Handled by AER via pci_error_handlers; DPC is never involved.

    Link is unaffected.

  ERR_NONFATAL
    A transaction is unreliable but the link is fully functional.

    If DPC is not supported, handled by AER via pci_error_handlers and
    the link is unaffected.

    If DPC supported, handled by DPC (because we set
    PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate.

  ERR_FATAL
    The link is unreliable.

    If DPC is not supported, handled by AER via pci_error_handlers and
    the link is reset.

    If DPC supported, handled by DPC via remove/re-enumerate.

It doesn't seem right to me that we handle both ERR_NONFATAL and
ERR_FATAL events differently if we happen to have DPC support in a
switch.

Maybe we should consider triggering DPC only on ERR_FATAL?  That would
keep DPC out of the ERR_NONFATAL cases.

For ERR_FATAL, maybe we should bite the bullet and use
remove/re-enumerate for AER as well as for DPC.  That would be painful
for higher-level software, but if we're willing to accept that pain
for new systems that support DPC, maybe life would be better overall
if it worked the same way on systems without DPC?

Bjorn

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 0/6] Address error and recovery for AER and DPC
  2018-04-16  3:16 ` [PATCH v13 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
@ 2018-04-16  3:53   ` Sinan Kaya
  2018-04-16  6:03     ` poza
  0 siblings, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-16  3:53 UTC (permalink / raw)
  To: Bjorn Helgaas, Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
> On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
>> This patch set brings in error handling support for DPC
>>
>> The current implementation of AER and error message broadcasting to the
>> EP driver is tightly coupled and limited to AER service driver.
>> It is important to factor out broadcasting and other link handling
>> callbacks. So that not only when AER gets triggered, but also when DPC get
>> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>>
>> DPC should behave identical to AER as far as error handling is concerned.
>> DPC should remove the devices and not to do recovery for hotplug enabled system.
> 
> Is there a specific bug that's fixed by these patches?  I didn't see
> one mentioned in the changelogs.
> 

There is no actual bug. 

We realized that DPC and hotplug is heavily integrated today. We have use
cases for systems without hotplug support but still support DPC. That's the
problem we are trying to solve with this patchset.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-16  3:17                       ` Bjorn Helgaas
@ 2018-04-16  5:33                         ` poza
  2018-04-16  5:51                           ` poza
  2018-04-16 14:46                         ` Sinan Kaya
  1 sibling, 1 reply; 38+ messages in thread
From: poza @ 2018-04-16  5:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sinan Kaya, Keith Busch, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 2018-04-16 08:47, Bjorn Helgaas wrote:
> On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote:
> 
>> You indicated that you want to unify the AER and DPC behavior. Let's
>> settle on what we want to do one more time. We have been going forth
>> and back on the direction.
> 
> My thinking is that as much as possible, similar events should be
> handled similarly, whether the mechanism is AER, DPC, EEH, etc.
> Ideally, drivers shouldn't have to be aware of which mechanism is in
> use.
> 
> Error recovery includes conventional PCI as well, but right now I
> think we're only concerned with PCIe.  The following error types are
> from PCIe r4.0, sec 6.2.2:
> 
>   ERR_COR
>     Corrected by hardware with no software intervention.  Software
>     involved for logging only.
> 
>     Handled by AER via pci_error_handlers; DPC is never involved.
> 
>     Link is unaffected.
> 
>   ERR_NONFATAL
>     A transaction is unreliable but the link is fully functional.
> 
>     If DPC is not supported, handled by AER via pci_error_handlers and
>     the link is unaffected.
> 
>     If DPC supported, handled by DPC (because we set
>     PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate.
> 
>   ERR_FATAL
>     The link is unreliable.
> 
>     If DPC is not supported, handled by AER via pci_error_handlers and
>     the link is reset.
> 
>     If DPC supported, handled by DPC via remove/re-enumerate.
> 
> It doesn't seem right to me that we handle both ERR_NONFATAL and
> ERR_FATAL events differently if we happen to have DPC support in a
> switch.
> 
> Maybe we should consider triggering DPC only on ERR_FATAL?  That would
> keep DPC out of the ERR_NONFATAL cases.
> 
> For ERR_FATAL, maybe we should bite the bullet and use
> remove/re-enumerate for AER as well as for DPC.  That would be painful
> for higher-level software, but if we're willing to accept that pain
> for new systems that support DPC, maybe life would be better overall
> if it worked the same way on systems without DPC?
> 
> Bjorn

This had crossed my mind when I first looked at the code.
DPC is getting triggered for both ERR_NONFATAL and ERR_FATAL case.
I thought the primary purpose of DPC to recover fatal errors, by 
triggering HW recovery.
but what if some platform wants to handle both FATAL and NON_FATAL with 
DPC ?

As you said AER FATAL cases and DPC FATAL cases should be handled 
similarly.
e.g. remove/re-enumerate the devices.

while NON_FATAL case; only AER would come into picture.
if some platform would like to handle DPC NON_FATAL then it should 
follow AER NON_FATAL path  (where it does not do remove/re-enumerate)

And the case where hotplug is enabled, remove/re-enumerate more sense in 
case of ERR_FATAL.
And the case where hotplug is disabled, only re-enumeration is required. 
(no need to remove the devices)
but then do we need to handle this case specifically, what is the harm 
in removing the devices in all the cases followed by re-enumerate ?

Regards,
Oza.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-16  5:33                         ` poza
@ 2018-04-16  5:51                           ` poza
  2018-04-16 14:01                             ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: poza @ 2018-04-16  5:51 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sinan Kaya, Keith Busch, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson, linux-pci-owner

On 2018-04-16 11:03, poza@codeaurora.org wrote:
> On 2018-04-16 08:47, Bjorn Helgaas wrote:
>> On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote:
>> 
>>> You indicated that you want to unify the AER and DPC behavior. Let's
>>> settle on what we want to do one more time. We have been going forth
>>> and back on the direction.
>> 
>> My thinking is that as much as possible, similar events should be
>> handled similarly, whether the mechanism is AER, DPC, EEH, etc.
>> Ideally, drivers shouldn't have to be aware of which mechanism is in
>> use.
>> 
>> Error recovery includes conventional PCI as well, but right now I
>> think we're only concerned with PCIe.  The following error types are
>> from PCIe r4.0, sec 6.2.2:
>> 
>>   ERR_COR
>>     Corrected by hardware with no software intervention.  Software
>>     involved for logging only.
>> 
>>     Handled by AER via pci_error_handlers; DPC is never involved.
>> 
>>     Link is unaffected.
>> 
>>   ERR_NONFATAL
>>     A transaction is unreliable but the link is fully functional.
>> 
>>     If DPC is not supported, handled by AER via pci_error_handlers and
>>     the link is unaffected.
>> 
>>     If DPC supported, handled by DPC (because we set
>>     PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate.
>> 
>>   ERR_FATAL
>>     The link is unreliable.
>> 
>>     If DPC is not supported, handled by AER via pci_error_handlers and
>>     the link is reset.
>> 
>>     If DPC supported, handled by DPC via remove/re-enumerate.
>> 
>> It doesn't seem right to me that we handle both ERR_NONFATAL and
>> ERR_FATAL events differently if we happen to have DPC support in a
>> switch.
>> 
>> Maybe we should consider triggering DPC only on ERR_FATAL?  That would
>> keep DPC out of the ERR_NONFATAL cases.
>> 
>> For ERR_FATAL, maybe we should bite the bullet and use
>> remove/re-enumerate for AER as well as for DPC.  That would be painful
>> for higher-level software, but if we're willing to accept that pain
>> for new systems that support DPC, maybe life would be better overall
>> if it worked the same way on systems without DPC?
>> 
>> Bjorn
> 
> This had crossed my mind when I first looked at the code.
> DPC is getting triggered for both ERR_NONFATAL and ERR_FATAL case.
> I thought the primary purpose of DPC to recover fatal errors, by
> triggering HW recovery.
> but what if some platform wants to handle both FATAL and NON_FATAL with 
> DPC ?
> 
> As you said AER FATAL cases and DPC FATAL cases should be handled 
> similarly.
> e.g. remove/re-enumerate the devices.
> 
> while NON_FATAL case; only AER would come into picture.
> if some platform would like to handle DPC NON_FATAL then it should
> follow AER NON_FATAL path  (where it does not do remove/re-enumerate)
> 
> And the case where hotplug is enabled, remove/re-enumerate more sense
> in case of ERR_FATAL.
> And the case where hotplug is disabled, only re-enumeration is
> required. (no need to remove the devices)
> but then do we need to handle this case specifically, what is the harm
> in removing the devices in all the cases followed by re-enumerate ?

To Clarify the last line, what I meant here was, in case of ERR_FATAL we 
can always remove/re-enumerate the devices irrespective of hotplug is 
enabled or not.

and in case of ERR_NONFATAL, DPC will follow AER path (where it just 
tries to recover)
although I am not very sure that how to handle ERR_NONFATAL case if 
hotplug is enabled. Because as Keith suggested device might have been 
changed run-time.

> 
> Regards,
> Oza.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 0/6] Address error and recovery for AER and DPC
  2018-04-16  3:53   ` Sinan Kaya
@ 2018-04-16  6:03     ` poza
  2018-04-16 13:27       ` Bjorn Helgaas
  0 siblings, 1 reply; 38+ messages in thread
From: poza @ 2018-04-16  6:03 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 2018-04-16 09:23, Sinan Kaya wrote:
> On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
>> On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
>>> This patch set brings in error handling support for DPC
>>> 
>>> The current implementation of AER and error message broadcasting to 
>>> the
>>> EP driver is tightly coupled and limited to AER service driver.
>>> It is important to factor out broadcasting and other link handling
>>> callbacks. So that not only when AER gets triggered, but also when 
>>> DPC get
>>> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>>> 
>>> DPC should behave identical to AER as far as error handling is 
>>> concerned.
>>> DPC should remove the devices and not to do recovery for hotplug 
>>> enabled system.
>> 
>> Is there a specific bug that's fixed by these patches?  I didn't see
>> one mentioned in the changelogs.
>> 
> 
> There is no actual bug.
> 
> We realized that DPC and hotplug is heavily integrated today. We have 
> use
> cases for systems without hotplug support but still support DPC. That's 
> the
> problem we are trying to solve with this patchset.

Adding to what Sinan said;

DPC should handle the error handling and recovery similar to AER, 
because finally both
are attempting recovery in some or the other way,
and for that error handling and recovery framework has to be loosely 
coupled.
It achieves uniformity and transparency to the error handling agents 
such as AER, DPC, with respect to recovery and error handling.

So, this patch-set tries to unify lot of things between error agents and 
make them behave in a well defined way. (be it error (FATAL, NON_FATAL) 
handling or recovery).

Regards,
Oza.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 0/6] Address error and recovery for AER and DPC
  2018-04-16  6:03     ` poza
@ 2018-04-16 13:27       ` Bjorn Helgaas
  2018-04-16 14:12         ` poza
  2018-04-16 14:30         ` Sinan Kaya
  0 siblings, 2 replies; 38+ messages in thread
From: Bjorn Helgaas @ 2018-04-16 13:27 UTC (permalink / raw)
  To: poza
  Cc: Sinan Kaya, Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On Mon, Apr 16, 2018 at 11:33:13AM +0530, poza@codeaurora.org wrote:
> On 2018-04-16 09:23, Sinan Kaya wrote:
> > On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
> > > On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
> > > > This patch set brings in error handling support for DPC
> > > > 
> > > > The current implementation of AER and error message broadcasting
> > > > to the
> > > > EP driver is tightly coupled and limited to AER service driver.
> > > > It is important to factor out broadcasting and other link handling
> > > > callbacks. So that not only when AER gets triggered, but also
> > > > when DPC get
> > > > triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> > > > 
> > > > DPC should behave identical to AER as far as error handling is
> > > > concerned.
> > > > DPC should remove the devices and not to do recovery for hotplug
> > > > enabled system.
> > > 
> > > Is there a specific bug that's fixed by these patches?  I didn't see
> > > one mentioned in the changelogs.
> > > 
> > 
> > There is no actual bug.
> > 
> > We realized that DPC and hotplug is heavily integrated today. We
> > have use cases for systems without hotplug support but still
> > support DPC. That's the problem we are trying to solve with this
> > patchset.

Apparently there's a problem with systems that have DPC but not
hotplug.  It will be extremely helpful if you can articulate what that
problem is and include it in the appropriate changelog.

> Adding to what Sinan said;
> 
> DPC should handle the error handling and recovery similar to AER,
> because finally both are attempting recovery in some or the other
> way, and for that error handling and recovery framework has to be
> loosely coupled.  It achieves uniformity and transparency to the
> error handling agents such as AER, DPC, with respect to recovery and
> error handling.
> 
> So, this patch-set tries to unify lot of things between error agents
> and make them behave in a well defined way. (be it error (FATAL,
> NON_FATAL) handling or recovery).

I totally support this objective.

Bjorn

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-16  5:51                           ` poza
@ 2018-04-16 14:01                             ` Bjorn Helgaas
  0 siblings, 0 replies; 38+ messages in thread
From: Bjorn Helgaas @ 2018-04-16 14:01 UTC (permalink / raw)
  To: poza
  Cc: Sinan Kaya, Keith Busch, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson, linux-pci-owner

On Mon, Apr 16, 2018 at 11:21:15AM +0530, poza@codeaurora.org wrote:
> On 2018-04-16 11:03, poza@codeaurora.org wrote:
> > On 2018-04-16 08:47, Bjorn Helgaas wrote:
> > > On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote:
> > > 
> > > > You indicated that you want to unify the AER and DPC behavior. Let's
> > > > settle on what we want to do one more time. We have been going forth
> > > > and back on the direction.
> > > 
> > > My thinking is that as much as possible, similar events should be
> > > handled similarly, whether the mechanism is AER, DPC, EEH, etc.
> > > Ideally, drivers shouldn't have to be aware of which mechanism is in
> > > use.
> > > 
> > > Error recovery includes conventional PCI as well, but right now I
> > > think we're only concerned with PCIe.  The following error types are
> > > from PCIe r4.0, sec 6.2.2:
> > > 
> > >   ERR_COR
> > >     Corrected by hardware with no software intervention.  Software
> > >     involved for logging only.
> > > 
> > >     Handled by AER via pci_error_handlers; DPC is never involved.
> > > 
> > >     Link is unaffected.
> > > 
> > >   ERR_NONFATAL
> > >     A transaction is unreliable but the link is fully functional.
> > > 
> > >     If DPC is not supported, handled by AER via pci_error_handlers and
> > >     the link is unaffected.
> > > 
> > >     If DPC supported, handled by DPC (because we set
> > >     PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate.
> > > 
> > >   ERR_FATAL
> > >     The link is unreliable.
> > > 
> > >     If DPC is not supported, handled by AER via pci_error_handlers and
> > >     the link is reset.
> > > 
> > >     If DPC supported, handled by DPC via remove/re-enumerate.
> > > 
> > > It doesn't seem right to me that we handle both ERR_NONFATAL and
> > > ERR_FATAL events differently if we happen to have DPC support in a
> > > switch.
> > > 
> > > Maybe we should consider triggering DPC only on ERR_FATAL?  That would
> > > keep DPC out of the ERR_NONFATAL cases.
> > > 
> > > For ERR_FATAL, maybe we should bite the bullet and use
> > > remove/re-enumerate for AER as well as for DPC.  That would be painful
> > > for higher-level software, but if we're willing to accept that pain
> > > for new systems that support DPC, maybe life would be better overall
> > > if it worked the same way on systems without DPC?
> > > 
> > > Bjorn
> > 
> > This had crossed my mind when I first looked at the code.
> > DPC is getting triggered for both ERR_NONFATAL and ERR_FATAL case.
> > I thought the primary purpose of DPC to recover fatal errors, by
> > triggering HW recovery.
> > but what if some platform wants to handle both FATAL and NON_FATAL
> > with DPC ?

AER usage is coordinated between the platform and the OS via _OSC (PCI
Firmware spec r3.2, sec 4.5.1) My understanding is that if the
platform grants the OS permission to control AER, the OS is free to
set the AER control registers however it wishes.  If the platform
depends on certain AER control settings, it must decline to grant AER
control to the OS.

As far as I know, there's nothing new in _OSC related to DPC, but
based on the implementation note in PCIe r4.0, sec 6.2.10, we
currently assume the OS controls DPC if and only if it controls AER.

Therefore, if a platform depends on certain DPC control settings,
e.g., if it wants to handle both ERR_FATAL and ERR_NONFATAL with DPC,
I would argue that the platform needs to retain control over AER.

If the platform grants AER control to the OS, I think the OS is free
to set both AER and DPC control registers as it desires.

> > As you said AER FATAL cases and DPC FATAL cases should be handled
> > similarly.  e.g. remove/re-enumerate the devices.
> > 
> > while NON_FATAL case; only AER would come into picture.  if some
> > platform would like to handle DPC NON_FATAL then it should follow
> > AER NON_FATAL path  (where it does not do remove/re-enumerate)
> > 
> > And the case where hotplug is enabled, remove/re-enumerate more
> > sense in case of ERR_FATAL.  And the case where hotplug is
> > disabled, only re-enumeration is required. (no need to remove the
> > devices) but then do we need to handle this case specifically,
> > what is the harm in removing the devices in all the cases followed
> > by re-enumerate ?
> 
> To Clarify the last line, what I meant here was, in case of
> ERR_FATAL we can always remove/re-enumerate the devices irrespective
> of hotplug is enabled or not.

ERR_FATAL means the link is unreliable.  A hotplug event may have
occurred.  In this case, I think remove/re-enumerate is a safe option.
It might be more than is needed in some cases, but I'm not sure we can
reliably determine when we don't need it, and if we
remove/re-enumerate for some ERR_FATAL errors but not others, I think
it complicates the situation for drivers and other higher-level
software.

> and in case of ERR_NONFATAL, DPC will follow AER path (where it just
> tries to recover) although I am not very sure that how to handle
> ERR_NONFATAL case if hotplug is enabled. Because as Keith suggested
> device might have been changed run-time.

ERR_NONFATAL means a particular transaction might be unreliable but
the link itself is fully functional.  No hotplug is possible if the
link has stayed fully functional.

Bjorn

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 0/6] Address error and recovery for AER and DPC
  2018-04-16 13:27       ` Bjorn Helgaas
@ 2018-04-16 14:12         ` poza
  2018-04-16 14:30         ` Sinan Kaya
  1 sibling, 0 replies; 38+ messages in thread
From: poza @ 2018-04-16 14:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sinan Kaya, Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 2018-04-16 18:57, Bjorn Helgaas wrote:
> On Mon, Apr 16, 2018 at 11:33:13AM +0530, poza@codeaurora.org wrote:
>> On 2018-04-16 09:23, Sinan Kaya wrote:
>> > On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
>> > > On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
>> > > > This patch set brings in error handling support for DPC
>> > > >
>> > > > The current implementation of AER and error message broadcasting
>> > > > to the
>> > > > EP driver is tightly coupled and limited to AER service driver.
>> > > > It is important to factor out broadcasting and other link handling
>> > > > callbacks. So that not only when AER gets triggered, but also
>> > > > when DPC get
>> > > > triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>> > > >
>> > > > DPC should behave identical to AER as far as error handling is
>> > > > concerned.
>> > > > DPC should remove the devices and not to do recovery for hotplug
>> > > > enabled system.
>> > >
>> > > Is there a specific bug that's fixed by these patches?  I didn't see
>> > > one mentioned in the changelogs.
>> > >
>> >
>> > There is no actual bug.
>> >
>> > We realized that DPC and hotplug is heavily integrated today. We
>> > have use cases for systems without hotplug support but still
>> > support DPC. That's the problem we are trying to solve with this
>> > patchset.
> 
> Apparently there's a problem with systems that have DPC but not
> hotplug.  It will be extremely helpful if you can articulate what that
> problem is and include it in the appropriate changelog.
> 
>> Adding to what Sinan said;
>> 
>> DPC should handle the error handling and recovery similar to AER,
>> because finally both are attempting recovery in some or the other
>> way, and for that error handling and recovery framework has to be
>> loosely coupled.  It achieves uniformity and transparency to the
>> error handling agents such as AER, DPC, with respect to recovery and
>> error handling.
>> 
>> So, this patch-set tries to unify lot of things between error agents
>> and make them behave in a well defined way. (be it error (FATAL,
>> NON_FATAL) handling or recovery).
> 
> I totally support this objective.

Thanks Bjorn, I will include this objective in Changelog along with 
Sinan's text.
I am not clear on one last thing Bjorn; which is;
do we need last patch ? patch-6 which handles hotplug case.
Also I think we could take this patch-set as basic changes/attempt to 
unify the code which it does.

And, in the next follow-up patches we can improve upon the things such 
as,
whether to do different actions for FATAL cases and NON_FATAL cases. And 
then I can make needed changes to AER and DPC
Please let me know how this sounds.

> 
> Bjorn

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 0/6] Address error and recovery for AER and DPC
  2018-04-16 13:27       ` Bjorn Helgaas
  2018-04-16 14:12         ` poza
@ 2018-04-16 14:30         ` Sinan Kaya
  1 sibling, 0 replies; 38+ messages in thread
From: Sinan Kaya @ 2018-04-16 14:30 UTC (permalink / raw)
  To: Bjorn Helgaas, poza
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 4/16/2018 9:27 AM, Bjorn Helgaas wrote:
>>> We realized that DPC and hotplug is heavily integrated today. We
>>> have use cases for systems without hotplug support but still
>>> support DPC. That's the problem we are trying to solve with this
>>> patchset.
> Apparently there's a problem with systems that have DPC but not
> hotplug.  It will be extremely helpful if you can articulate what that
> problem is and include it in the appropriate changelog.
> 

At a higher level, the DPC driver performs the stop operation regardless of
hotplug. However, DPC driver relies on hotplug driver observing link up to
re-enumerate. 

Of course, when the system didn't support hotplug; there was nobody to
restore functionality.

Our initial attempt was to also do a re-enumeration in the DPC driver
regardless of hotplug driver in the system or not. 

If hotplug driver is present, it would observe two enumerations. It still
worked as long as these were protected by a mutex.

Then, we got your input that you want DPC and AER to behave the same. We
started converging towards the AER path.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-16  3:17                       ` Bjorn Helgaas
  2018-04-16  5:33                         ` poza
@ 2018-04-16 14:46                         ` Sinan Kaya
  2018-04-16 17:15                           ` poza
  1 sibling, 1 reply; 38+ messages in thread
From: Sinan Kaya @ 2018-04-16 14:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 4/15/2018 11:17 PM, Bjorn Helgaas wrote:
> It doesn't seem right to me that we handle both ERR_NONFATAL and
> ERR_FATAL events differently if we happen to have DPC support in a
> switch.
> 
> Maybe we should consider triggering DPC only on ERR_FATAL?  That would
> keep DPC out of the ERR_NONFATAL cases.
> 
>From reliability perspective, it makes sense. DPC handles NONFATAL errors
by bringing down the link. If error happened behind a switch and root port
is handling DPC, we are impacting a lot of devices operation because of one
faulty device.

Keith, do you have any preference on this direction?

> For ERR_FATAL, maybe we should bite the bullet and use
> remove/re-enumerate for AER as well as for DPC.  That would be painful
> for higher-level software, but if we're willing to accept that pain
> for new systems that support DPC, maybe life would be better overall
> if it worked the same way on systems without DPC?

Sure, we can go to this route as well.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system
  2018-04-16 14:46                         ` Sinan Kaya
@ 2018-04-16 17:15                           ` poza
  0 siblings, 0 replies; 38+ messages in thread
From: poza @ 2018-04-16 17:15 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Keith Busch, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 2018-04-16 20:16, Sinan Kaya wrote:
> On 4/15/2018 11:17 PM, Bjorn Helgaas wrote:
>> It doesn't seem right to me that we handle both ERR_NONFATAL and
>> ERR_FATAL events differently if we happen to have DPC support in a
>> switch.
>> 
>> Maybe we should consider triggering DPC only on ERR_FATAL?  That would
>> keep DPC out of the ERR_NONFATAL cases.
>> 
> From reliability perspective, it makes sense. DPC handles NONFATAL 
> errors
> by bringing down the link. If error happened behind a switch and root 
> port
> is handling DPC, we are impacting a lot of devices operation because of 
> one
> faulty device.
> 
> Keith, do you have any preference on this direction?
> 
>> For ERR_FATAL, maybe we should bite the bullet and use
>> remove/re-enumerate for AER as well as for DPC.  That would be painful
>> for higher-level software, but if we're willing to accept that pain
>> for new systems that support DPC, maybe life would be better overall
>> if it worked the same way on systems without DPC?
> 
> Sure, we can go to this route as well.


ok so finally this is what is being proposed and so far Bjorn, Sinan and 
myself agreed on following:

I need to move the stop and re-enumerate code into the AER path instead 
of patch #6 for both DPC_FATAL and AER_FATAL error types.
Also, I should turn off DPC NON_FATAL error detection.

Keith, please confirm if you are okay with above proposal.

Regards,
Oza.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-04-16 17:15 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-09 14:41 [PATCH v13 0/6] Address error and recovery for AER and DPC Oza Pawandeep
2018-04-09 14:41 ` [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
2018-04-09 23:14   ` Keith Busch
2018-04-09 14:41 ` [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
2018-04-09 23:15   ` Keith Busch
2018-04-10 11:36   ` kbuild test robot
2018-04-09 14:41 ` [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
2018-04-09 23:15   ` Keith Busch
2018-04-09 14:41 ` [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
2018-04-09 23:29   ` Keith Busch
2018-04-09 23:51     ` Sinan Kaya
2018-04-10  0:05       ` Sinan Kaya
2018-04-09 14:41 ` [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
2018-04-09 23:25   ` Keith Busch
2018-04-12  8:40     ` poza
2018-04-09 14:41 ` [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system Oza Pawandeep
2018-04-10 21:03   ` Bjorn Helgaas
2018-04-12  1:41     ` Sinan Kaya
2018-04-12 14:06       ` Bjorn Helgaas
2018-04-12 14:34         ` Sinan Kaya
2018-04-12 14:39           ` Keith Busch
2018-04-12 15:02             ` Keith Busch
2018-04-12 16:27               ` Sinan Kaya
2018-04-12 17:09                 ` Keith Busch
2018-04-12 17:41                   ` Sinan Kaya
2018-04-14 15:53                     ` Sinan Kaya
2018-04-16  3:17                       ` Bjorn Helgaas
2018-04-16  5:33                         ` poza
2018-04-16  5:51                           ` poza
2018-04-16 14:01                             ` Bjorn Helgaas
2018-04-16 14:46                         ` Sinan Kaya
2018-04-16 17:15                           ` poza
2018-04-16  3:16 ` [PATCH v13 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
2018-04-16  3:53   ` Sinan Kaya
2018-04-16  6:03     ` poza
2018-04-16 13:27       ` Bjorn Helgaas
2018-04-16 14:12         ` poza
2018-04-16 14:30         ` Sinan Kaya

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).